d `text_embeds`. This represents the audio-text similarity scores.
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, audio_batch_size)`):
    The scaled dot product scores between `text_embeds` and `audio_embeds`. This represents the text-audio
    similarity scores.
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`):
    The text embeddings obtained by applying the projection layer to the pooled output of [`ClapTextModel`].
audio_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`):
    The audio embeddings obtained by applying the projection layer to the pooled output of [`ClapAudioModel`].
text_model_output (`BaseModelOutputWithPooling`):
    The output of the [`ClapTextModel`].
audio_model_output (`BaseModelOutputWithPooling`):
    The output of the [`ClapAudioModel`].
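
The relationship between the two logits fields described here can be illustrated with a minimal sketch. It is not the model's implementation: plain Python lists stand in for tensors so it runs without PyTorch, the helper names `l2_normalize` and `similarity_logits` are hypothetical, the L2 normalization step is an assumption the text does not state, and a single `logit_scale` is used for both directions (a real model may learn separate scales, in which case the two matrices are not exact transposes).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, not stated in the docs)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(audio_embeds, text_embeds, logit_scale=1.0):
    """Return (logits_per_audio, logits_per_text) as nested lists.

    logits_per_audio has shape (audio_batch_size, text_batch_size);
    logits_per_text has shape (text_batch_size, audio_batch_size).
    """
    audio = [l2_normalize(a) for a in audio_embeds]
    text = [l2_normalize(t) for t in text_embeds]

    # Row i, column j: scaled dot product of audio clip i with caption j.
    logits_per_audio = [
        [logit_scale * sum(x * y for x, y in zip(a, t)) for t in text]
        for a in audio
    ]
    # With a shared scale, the text-to-audio scores are the transpose.
    logits_per_text = [list(column) for column in zip(*logits_per_audio)]
    return logits_per_audio, logits_per_text


# Toy embeddings with output_dim = 2: three audio clips, two captions.
audio_embeds = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
text_embeds = [[2.0, 0.0], [0.0, 1.0]]

logits_per_audio, logits_per_text = similarity_logits(audio_embeds, text_embeds)
```

Here `logits_per_audio` has 3 rows of 2 scores and `logits_per_text` has 2 rows of 3 scores, matching the `(audio_batch_size, text_batch_size)` and `(text_batch_size, audio_batch_size)` shapes.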