ixel_values` are present): The output of the [`FlavaImageModel`]. text_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` are present): The text embeddings which are basically the pooled output of [`FlavaTextModel`]. text_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present): The output of the [`FlavaTextModel`]. multimodal_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` and `pixel_values` are present and `skip_unmasked_multimodal_encoder` is `None` or `False`): The multimodal embeddings which are basically the pooled output of [`FlavaTextModel`]. multimodal_output (`BaseModelOutputWithPooling`, returned when `input_ids` and `pixel_values` are present and `skip_unmasked_multimodal_encoder` is `None` or `False`): The output of the [`FlavaMultimodalModel`]. image_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `pixel_values` are present): The image embeddings which are basically the pooled output of [`FlavaImageModel`]. Uses `bool_masked_pos` to create masked images. image_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present): The output of the [`FlavaImageModel`]. Uses `bool_masked_pos` to create masked images. text_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids_masked` are present): The text embeddings which are basically the pooled output of [`FlavaTextModel`]. text_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids_masked` are present): The output of the [`FlavaTextModel`]. multimodal_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` and `pixel_values` are present): The multimodal embeddings which are basically the pooled output of [`FlavaTextModel`]. multimodal_masked_output (`BaseModelOutputWithPooling`, returned when `input_ids_masked` and `pixel_values` are present): The output of the [`FlavaMultimodalModel`]. mim_logits (`torch.FloatTensor` of shape `(batch_size, num_image_patches, image_vocab_size)` or of shape `(total_masked_patches, image_vocab_size)` , *optional*, returned when `pixel_values` are present and `input_ids_masked` are not): The logits for MIM unimodal loss. Uses `book_masked_pos` to get masked patches. The flattened output is returned when `bool_masked_pos` has some of the patches masked. mlm_logits (`torch.FloatTensor` of shape `(batch_size, text_seq_length, text_vocab_size)` or of shape `(total_masked_seq_length, text_vocab_size)`, *optional*, returned when `input_ids_masked` are present and `pixel_values` are not): The logits for MLM unimodal loss. The flattened output is returned when `input_ids_masked` has some of the tokens masked. itm_logits (`torch.FloatTensor` of shape `(batch_size, 2)`, *optional*, returned when `input_ids_masked` and `pixel_values` are present): The logits for ITM loss. Note that ITM loss is calculated on masked pairs in FLAVA. mmm_image_logits (`torch.FloatTensor` of shape `(batch_size, num_image_patches, image_vocab_size)` or of shape`(total_masked_patches, image_vocab_size)`, *optional*, returned when `pixel_values` and `input_ids_masked` are present): The logits for MMM image multimodal loss. Uses `book_masked_pos` to get masked patches. The flattened output is returned when `bool_masked_pos` has some of the patches masked. mmm_text_logits (`torch.FloatTensor` of shape `(batch_size, text_seq_length, text_vocab_size)` or of shape `(`(total_masked_seq_length, text_vocab_size)`), *optional*, returned when `pixel_values` and `input_ids_masked` are present): The logits for MMM text multimodal loss. The flattened output is returned when `input_ids_masked` has some of the tokens masked. contrastive_logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): The scaled dot product scores between `image_embeddings` and `text_embeddings` but passed through FLAVA's `image_projection` and `text_projection` layers respectively. This represents the image-text similarity scores. This is calculated on unmasked images and texts. contrastive_logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): The scaled dot product scores between `text_embeddings` and `image_embeddings` but passed through FLAVA's `text_projection` and `image_projection` layers respectively. This is calculated on unmasked images and texts. NÚ