e a sequence pair (see `input_ids` docstring) Indices should be in `[0, 1]`: - 0 indicates sequence B is a matching pair of sequence A for the given image, - 1 indicates sequence B is a random sequence w.r.t A for the given image. Returns: Example: ```python # Assumption: *get_visual_embeddings(image)* gets the visual embeddings of the image in the batch. from transformers import AutoTokenizer, VisualBertForPreTraining tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre") inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt") visual_embeds = get_visual_embeddings(image).unsqueeze(0) visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long) visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float) inputs.update( { "visual_embeds": visual_embeds, "visual_token_type_ids": visual_token_type_ids, "visual_attention_mask": visual_attention_mask, } ) max_length = inputs["input_ids"].shape[-1] + visual_embeds.shape[-2] labels = tokenizer( "The capital of France is Paris.", return_tensors="pt", padding="max_length", max_length=max_length )["input_ids"] sentence_image_labels = torch.tensor(1).unsqueeze(0) # Batch_size outputs = model(**inputs, labels=labels, sentence_image_labels=sentence_image_labels) loss = outputs.loss prediction_logits = outputs.prediction_logits seq_relationship_logits = outputs.seq_relationship_logits ```NŠ