edicted speech waveform.
            - **cross_attentions** (*optional*, returned when `output_cross_attentions` is `True`) `torch.FloatTensor`
              of shape `(config.decoder_layers, config.decoder_attention_heads, output_sequence_length,
              input_sequence_length)` -- The outputs of the decoder's cross-attention layers.
        Nr