r` of shape `(batch_size, config.speaker_embedding_dim)`, *optional*): Tensor containing the speaker embeddings. threshold (`float`, *optional*, defaults to 0.5): The generated sequence ends when the predicted stop token probability exceeds this value. minlenratio (`float`, *optional*, defaults to 0.0): Used to calculate the minimum required length for the output sequence. maxlenratio (`float`, *optional*, defaults to 20.0): Used to calculate the maximum allowed length for the output sequence. vocoder (`nn.Module`, *optional*): The vocoder that converts the mel spectrogram into a speech waveform. If `None`, the output is the mel spectrogram. output_cross_attentions (`bool`, *optional*, defaults to `False`): Whether or not to return the attentions tensors of the decoder's cross-attention layers. Returns: `tuple(torch.FloatTensor)` comprising various elements depending on the inputs: - **spectrogram** (*optional*, returned when no `vocoder` is provided) `torch.FloatTensor` of shape `(output_sequence_length, config.num_mel_bins)` -- The predicted log-mel spectrogram. - **waveform** (*optional*, returned when a `vocoder` is provided) `torch.FloatTensor` of shape `(num_frames,)` -- The predicted speech waveform. - **cross_attentions** (*optional*, returned when `output_cross_attentions` is `True`) `torch.FloatTensor` of shape `(config.decoder_layers, config.decoder_attention_heads, output_sequence_length, input_sequence_length)` -- The outputs of the decoder's cross-attention layers. NŠ