bels)`, *optional*):
    Labels for computing the visual question answering loss. This tensor must be either a one-hot encoding of
    all answers that are applicable for a given example in the batch, or a soft encoding indicating which
    answers are applicable, where 1.0 is the highest score.

Returns:

Examples:

```python
>>> from transformers import ViltProcessor, ViltForQuestionAnswering
>>> import requests
>>> from PIL import Image

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text = "How many cats are there?"

>>> processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
>>> model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

>>> # prepare inputs
>>> encoding = processor(image, text, return_tensors="pt")

>>> # forward pass
>>> outputs = model(**encoding)
>>> logits = outputs.logits
>>> idx = logits.argmax(-1).item()
>>> print("Predicted answer:", model.config.id2label[idx])
Predicted answer: 2
```
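
The soft encoding described for the labels can be illustrated with a minimal sketch. The helper `build_soft_labels`, the toy `label2id` vocabulary, and the per-answer scores below are hypothetical and chosen only for illustration; a real checkpoint provides its own answer vocabulary through `model.config.label2id`, and the scores would come from the dataset annotations.

```python
def build_soft_labels(answer_scores, label2id, num_labels):
    """Return one row of soft VQA targets as a list of floats in [0, 1].

    Hypothetical helper: answer_scores maps an answer string to its score,
    where 1.0 marks the most applicable answer.
    """
    row = [0.0] * num_labels
    for answer, score in answer_scores.items():
        if answer not in label2id:
            continue  # answers outside the vocabulary are skipped
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {answer!r} must be in [0, 1], got {score}")
        row[label2id[answer]] = float(score)
    return row


# Hypothetical toy vocabulary; a real model exposes its own mapping.
label2id = {"0": 0, "1": 1, "2": 2, "3": 3}

# Soft encoding: "2" is the best answer, "3" is partially applicable,
# and "many" is dropped because it is not in the vocabulary.
soft_row = build_soft_labels({"2": 1.0, "3": 0.3, "many": 0.6}, label2id, len(label2id))
print(soft_row)  # [0.0, 0.0, 1.0, 0.3]

# One-hot encoding: the special case with a single applicable answer.
one_hot_row = build_soft_labels({"2": 1.0}, label2id, len(label2id))
print(one_hot_row)  # [0.0, 0.0, 1.0, 0.0]
```

To pass such rows to the model, one would presumably stack them into a float tensor with one row per example, for instance `torch.tensor([soft_row], dtype=torch.float)`, and supply it as `labels` alongside the processor outputs.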