. image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): The image embeddings obtained by applying the projection layer to the pooled output of [`AltCLIPVisionModel`]. text_model_output(`BaseModelOutputWithPooling`): The output of the [`AltCLIPTextModel`]. vision_model_output(`BaseModelOutputWithPooling`): The output of the [`AltCLIPVisionModel`]. NÚ