r` of shape `(batch_size, config.speaker_embedding_dim)`, *optional*):
                Tensor containing the speaker embeddings.
            threshold (`float`, *optional*, defaults to 0.5):
                The generated sequence ends when the predicted stop token probability exceeds this value.
            minlenratio (`float`, *optional*, defaults to 0.0):
                Used to calculate the minimum required length for the output sequence.
            maxlenratio (`float`, *optional*, defaults to 20.0):
                Used to calculate the maximum allowed length for the output sequence.
            vocoder (`nn.Module`, *optional*):
                The vocoder that converts the mel spectrogram into a speech waveform. If `None`, the output is the mel
                spectrogram.
            output_cross_attentions (`bool`, *optional*, defaults to `False`):
                Whether or not to return the attentions tensors of the decoder's cross-attention layers.

        Returns:
            `tuple(torch.FloatTensor)` comprising various elements depending on the inputs:
            - **spectrogram** (*optional*, returned when no `vocoder` is provided) `torch.FloatTensor` of shape
              `(output_sequence_length, config.num_mel_bins)` -- The predicted log-mel spectrogram.
            - **waveform** (*optional*, returned when a `vocoder` is provided) `torch.FloatTensor` of shape
              `(num_frames,)` -- The predicted speech waveform.
            - **cross_attentions** (*optional*, returned when `output_cross_attentions` is `True`) `torch.FloatTensor`
              of shape `(config.decoder_layers, config.decoder_attention_heads, output_sequence_length,
              input_sequence_length)` -- The outputs of the decoder's cross-attention layers.
        NŠ