e scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image similarity scores.
text_embeds(`jnp.ndarray` of shape `(batch_size, output_dim)`):
    The text embeddings obtained by applying the projection layer to the pooled output of [`FlaxCLIPTextModel`].
image_embeds(`jnp.ndarray` of shape `(batch_size, output_dim)`):
    The image embeddings obtained by applying the projection layer to the pooled output of [`FlaxCLIPVisionModel`].
text_model_output(`FlaxBaseModelOutputWithPooling`):
    The output of the [`FlaxCLIPTextModel`].
vision_model_output(`FlaxBaseModelOutputWithPooling`):
    The output of the [`FlaxCLIPVisionModel`].
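
As a rough illustration of how these fields relate, the sketch below computes text-image similarity scores from two embedding matrices. It is an assumption-laden sketch, not the library's implementation: `clip_similarity_logits` is a hypothetical helper, and both the L2 normalization step and the scalar `logit_scale` are assumptions about what "scaled dot product" means here.

```python
import jax.numpy as jnp


def clip_similarity_logits(text_embeds, image_embeds, logit_scale):
    """Hypothetical helper: scaled dot product scores between two embedding sets."""
    # Assumption: embeddings are L2-normalized so the dot product is a cosine similarity.
    text_embeds = text_embeds / jnp.linalg.norm(text_embeds, axis=-1, keepdims=True)
    image_embeds = image_embeds / jnp.linalg.norm(image_embeds, axis=-1, keepdims=True)

    # Rows index texts, columns index images.
    logits_per_text = logit_scale * jnp.matmul(text_embeds, image_embeds.T)
    # The image-text scores are the transpose of the text-image scores.
    logits_per_image = logits_per_text.T
    return logits_per_text, logits_per_image


# Toy inputs: 3 text embeddings and 2 image embeddings, each with output_dim = 2.
text_embeds = jnp.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
image_embeds = jnp.array([[2.0, 0.0], [0.0, 1.0]])

logits_per_text, logits_per_image = clip_similarity_logits(
    text_embeds, image_embeds, logit_scale=10.0
)
print(logits_per_text.shape)   # (3, 2)
print(logits_per_image.shape)  # (2, 3)
```

With these toy inputs, the first text embedding points in the same direction as the first image embedding, so that pair receives the highest score in its row.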