ional*, defaults to 2): Number of decoder layers. encoder_attention_heads (`int`, *optional*, defaults to 2): Number of attention heads for each attention layer in the Transformer encoder. decoder_attention_heads (`int`, *optional*, defaults to 2): Number of attention heads for each attention layer in the Transformer decoder. encoder_ffn_dim (`int`, *optional*, defaults to 32): Dimension of the "intermediate" (often named feed-forward) layer in encoder. decoder_ffn_dim (`int`, *optional*, defaults to 32): Dimension of the "intermediate" (often named feed-forward) layer in decoder. activation_function (`str` or `function`, *optional*, defaults to `"gelu"`): The non-linear activation function (function or string) in the encoder and decoder. If string, `"gelu"` and `"relu"` are supported. dropout (`float`, *optional*, defaults to 0.1): The dropout probability for all fully connected layers in the encoder, and decoder. encoder_layerdrop (`float`, *optional*, defaults to 0.1): The dropout probability for the attention and fully connected layers for each encoder layer. decoder_layerdrop (`float`, *optional*, defaults to 0.1): The dropout probability for the attention and fully connected layers for each decoder layer. attention_dropout (`float`, *optional*, defaults to 0.1): The dropout probability for the attention probabilities. activation_dropout (`float`, *optional*, defaults to 0.1): The dropout probability used between the two layers of the feed-forward networks. num_parallel_samples (`int`, *optional*, defaults to 100): The number of samples to generate in parallel for each time step of inference. init_std (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated normal weight initialization distribution. use_cache (`bool`, *optional*, defaults to `True`): Whether to use the past key/values attentions (if applicable to the model) to speed up decoding. label_length (`int`, *optional*, defaults to 10): Start token length of the Autoformer decoder, which is used for direct multi-step prediction (i.e. non-autoregressive generation). moving_average (`int`, defaults to 25): The window size of the moving average. In practice, it's the kernel size in AvgPool1d of the Decomposition Layer. autocorrelation_factor (`int`, defaults to 3): "Attention" (i.e. AutoCorrelation mechanism) factor which is used to find top k autocorrelations delays. It's recommended in the paper to set it to a number between 1 and 5. Example: ```python >>> from transformers import AutoformerConfig, AutoformerModel >>> # Initializing a default Autoformer configuration >>> configuration = AutoformerConfig() >>> # Randomly initializing a model (with random weights) from the configuration >>> model = AutoformerModel(configuration) >>> # Accessing the model configuration >>> configuration = model.config ```Z autoformerÚ