.
        image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
            The image embeddings obtained by applying the projection layer to the pooled output of
            [`AltCLIPVisionModel`].
        text_model_output(`BaseModelOutputWithPooling`):
            The output of the [`AltCLIPTextModel`].
        vision_model_output(`BaseModelOutputWithPooling`):
            The output of the [`AltCLIPVisionModel`].
    NÚ