aining purposes `hash_seed` should be left as `None` to ensure fully random rotations in local sensitive hashing scheme. hidden_act (`str` or `Callable`, *optional*, defaults to `"relu"`): The non-linear activation function (function or string) in the feed forward layer in the residual attention block. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported. hidden_dropout_prob (`float`, *optional*, defaults to 0.05): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. hidden_size (`int`, *optional*, defaults to 256): Dimensionality of the output hidden states of the residual attention blocks. initializer_range (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. is_decoder (`bool`, *optional*, defaults to `False`): Whether or not to use a causal mask in addition to the `attention_mask` passed to [`ReformerModel`]. When using the Reformer for causal language modeling, this argument should be set to `True`. layer_norm_eps (`float`, *optional*, defaults to 1e-12): The epsilon used by the layer normalization layers. local_chunk_length (`int`, *optional*, defaults to 64): Length of chunk which attends to itself in `LocalSelfAttention`. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention). local_num_chunks_before (`int`, *optional*, defaults to 1): Number of previous neighbouring chunks to attend to in `LocalSelfAttention` layer to itself. local_num_chunks_after (`int`, *optional*, defaults to 0): Number of following neighbouring chunks to attend to in `LocalSelfAttention` layer in addition to itself. local_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): The dropout ratio for the attention probabilities in `LocalSelfAttention`. lsh_attn_chunk_length (`int`, *optional*, defaults to 64): Length of chunk which attends to itself in `LSHSelfAttention`. Chunking reduces memory complexity from sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk length (chunked self attention). lsh_num_chunks_before (`int`, *optional*, defaults to 1): Number of previous neighbouring chunks to attend to in `LSHSelfAttention` layer to itself. lsh_num_chunks_after (`int`, *optional*, defaults to 0): Number of following neighbouring chunks to attend to in `LSHSelfAttention` layer to itself. lsh_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1): The dropout ratio for the attention probabilities in `LSHSelfAttention`. max_position_embeddings (`int`, *optional*, defaults to 4096): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). num_attention_heads (`int`, *optional*, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder. num_buckets (`int` or `List[int]`, *optional*): Number of buckets, the key query vectors can be "hashed into" using the locality sensitive hashing scheme. Each query key vector is hashed into a hash in `1, ..., num_buckets`. The number of buckets can also be factorized into a list for improved memory complexity. In this case, each query key vector is hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if `num_buckets` is factorized into two factors. The number of buckets (or the product the factors) should approximately equal sequence length / lsh_chunk_length. If `num_buckets` not set, a good value is calculated on the fly. num_hashes (`int`, *optional*, defaults to 1): Number of hashing rounds (e.g., number of random rotations) in Local Sensitive Hashing scheme. The higher `num_hashes`, the more accurate the `LSHSelfAttention` becomes, but also the more memory and time intensive the hashing becomes. pad_token_id (`int`, *optional*, defaults to 0): The token id for the padding token. vocab_size (`int`, *optional*, defaults to 320):\ Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling [`ReformerModel`]. tie_word_embeddings (`bool`, *optional*, defaults to `False`): Whether to tie input and output embeddings. use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). classifier_dropout (`float`, *optional*): The dropout ratio for the classification head. Examples: ```python >>> from transformers import ReformerConfig, ReformerModel >>> # Initializing a Reformer configuration >>> configuration = ReformerConfig() >>> # Initializing a Reformer model (with random weights) >>> model = ReformerModel(configuration) >>> # Accessing the model configuration >>> configuration = model.config ``` ZreformerZ