edicted speech waveform.
- **cross_attentions** (*optional*, returned when `output_cross_attentions` is `True`) `torch.FloatTensor` of shape `(config.decoder_layers, config.decoder_attention_heads, output_sequence_length, input_sequence_length)` -- The outputs of the decoder's cross-attention layers. Nr