pe `(batch_size, output_dim)`): The image embeddings obtained by applying the projection layer to the pooled output of the vision model.