d (`int`, *optional*, defaults to 0) Beginning of stream token id. eos_token_id (`int`, *optional*, defaults to 2) End of stream token id. ngram (`int`, *optional*, defaults to 2) Number of future tokens to predict. Set to 1 to be same as traditional Language model to predict next first token. num_buckets (`int`, *optional*, defaults to 32) The number of buckets to use for each attention layer. This is for relative position calculation. See the [T5 paper](see https://arxiv.org/abs/1910.10683) for more details. relative_max_distance (`int`, *optional*, defaults to 128) Relative distances greater than this number will be put into the last same bucket. This is for relative position calculation. See the [T5 paper](see https://arxiv.org/abs/1910.10683) for more details. disable_ngram_loss (`bool`, *optional*, defaults to `False`): Whether be trained predicting only the next first token. eps (`float`, *optional*, defaults to 0.0): Controls the `epsilon` parameter value for label smoothing in the loss calculation. If set to 0, no label smoothing is performed. use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). Z prophetnetZpast_key_valuesZ