torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for image-text similarity.
logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text similarity scores.
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image similarity scores.
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by applying the projection layer to the pooled output of [`AlignTextModel`].
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The output of [`AlignVisionModel`].
text_model_output (`BaseModelOutputWithPoolingAndCrossAttentions`): The output of the [`AlignTextModel`].
vision_model_output (`BaseModelOutputWithPoolingAndNoAttention`): The output of the [`AlignVisionModel`].
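
The relationship between the embeddings, the two logit matrices, and the loss can be sketched in a few lines. This is a dependency-free illustration, not the library's implementation: nested lists stand in for tensors, and the helper names (`l2_normalize`, `similarity_logits`, `contrastive_loss`), the L2 normalization step, the `temperature` scale, and the symmetric cross-entropy form of the loss are all assumptions chosen because they are a common formulation of image-text contrastive training.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embeds, text_embeds, temperature=1.0):
    """Return (logits_per_image, logits_per_text) as nested lists.

    logits_per_text has shape (text_batch_size, image_batch_size);
    logits_per_image is its transpose.
    `temperature` is a hypothetical scale factor for the dot products.
    """
    img = [l2_normalize(v) for v in image_embeds]
    txt = [l2_normalize(v) for v in text_embeds]
    logits_per_text = [
        [sum(a * b for a, b in zip(t, i)) / temperature for i in img]
        for t in txt
    ]
    logits_per_image = [list(col) for col in zip(*logits_per_text)]
    return logits_per_image, logits_per_text


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(logits_per_text):
    """Symmetric loss over a square logit matrix (assumed formulation)."""
    logits_per_image = [list(col) for col in zip(*logits_per_text)]
    text_loss = _diagonal_cross_entropy(logits_per_text)
    image_loss = _diagonal_cross_entropy(logits_per_image)
    return (text_loss + image_loss) / 2.0


# Two images and three texts: the logit matrices need not be square.
image_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
logits_per_image, logits_per_text = similarity_logits(image_embeds, text_embeds)

# The loss pairs item i with item i, so it needs equal batch sizes.
_, paired_logits = similarity_logits(image_embeds, text_embeds[:2])
loss = contrastive_loss(paired_logits)
```

Here `logits_per_image` has two rows of three scores and `logits_per_text` has three rows of two, which mirrors the `(image_batch_size, text_batch_size)` and `(text_batch_size, image_batch_size)` shapes described above. The loss is computed only on the square, paired case because each image is assumed to match the text at the same batch index.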