aining purposes `hash_seed` should be left as `None` to
            ensure fully random rotations in the locality sensitive hashing scheme.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"relu"`):
            The non-linear activation function (function or string) in the feed forward layer in the residual attention
            block. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.05):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        hidden_size (`int`, *optional*, defaults to 256):
            Dimensionality of the output hidden states of the residual attention blocks.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        is_decoder (`bool`, *optional*, defaults to `False`):
            Whether or not to use a causal mask in addition to the `attention_mask` passed to [`ReformerModel`]. When
            using the Reformer for causal language modeling, this argument should be set to `True`.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        local_chunk_length (`int`, *optional*, defaults to 64):
            Length of chunk which attends to itself in `LocalSelfAttention`. Chunking reduces memory complexity from
            sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk
            length (chunked self attention).
        local_num_chunks_before (`int`, *optional*, defaults to 1):
            Number of previous neighbouring chunks to attend to in `LocalSelfAttention` layer in addition to itself.
        local_num_chunks_after (`int`, *optional*, defaults to 0):
            Number of following neighbouring chunks to attend to in `LocalSelfAttention` layer in addition to itself.
        local_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities in `LocalSelfAttention`.
        lsh_attn_chunk_length (`int`, *optional*, defaults to 64):
            Length of chunk which attends to itself in `LSHSelfAttention`. Chunking reduces memory complexity from
            sequence length x sequence length (self attention) to chunk length x chunk length x sequence length / chunk
            length (chunked self attention).
        lsh_num_chunks_before (`int`, *optional*, defaults to 1):
            Number of previous neighbouring chunks to attend to in `LSHSelfAttention` layer in addition to itself.
        lsh_num_chunks_after (`int`, *optional*, defaults to 0):
            Number of following neighbouring chunks to attend to in `LSHSelfAttention` layer in addition to itself.
        lsh_attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities in `LSHSelfAttention`.
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_buckets (`int` or `List[int]`, *optional*):
            Number of buckets that the query-key vectors can be "hashed into" using the locality sensitive hashing
            scheme. Each query-key vector is hashed into a hash in `1, ..., num_buckets`. The number of buckets can
            also be factorized into a list for improved memory complexity. In this case, each query-key vector is
            hashed into a hash in `1-1, 1-2, ..., num_buckets[0]-1, ..., num_buckets[0]-num_buckets[1]` if
            `num_buckets` is factorized into two factors. The number of buckets (or the product of the factors)
            should approximately equal sequence length / `lsh_attn_chunk_length`. If `num_buckets` is not set, a
            good value is calculated on the fly.
        num_hashes (`int`, *optional*, defaults to 1):
            Number of hashing rounds (e.g., number of random rotations) in the locality sensitive hashing scheme. The
            higher `num_hashes`, the more accurate `LSHSelfAttention` becomes, but also the more memory and time
            intensive the hashing becomes.
        pad_token_id (`int`, *optional*, defaults to 0):
            The token id for the padding token.
        vocab_size (`int`, *optional*, defaults to 320):
            Vocabulary size of the Reformer model. Defines the number of different tokens that can be represented by
            the `input_ids` passed when calling [`ReformerModel`].
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie input and output embeddings.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models).
        classifier_dropout (`float`, *optional*):
            The dropout ratio for the classification head.

    Examples:

    ```python
    >>> from transformers import ReformerConfig, ReformerModel

    >>> # Initializing a Reformer configuration
    >>> configuration = ReformerConfig()

    >>> # Initializing a Reformer model (with random weights)
    >>> model = ReformerModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```
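
    The chunking arithmetic in the `local_chunk_length` and `lsh_attn_chunk_length` descriptions, and the rule of
    thumb for `num_buckets`, can be sketched in plain Python. This is only an illustration of the formulas stated
    above: the helper names are hypothetical and not part of the library, the count ignores the extra chunks
    selected by the `*_num_chunks_before` and `*_num_chunks_after` arguments, and the value the model calculates on
    the fly when `num_buckets` is not set may differ.

    ```python
    def full_attention_entries(seq_len):
        # Full self attention: sequence length x sequence length score entries.
        return seq_len * seq_len


    def chunked_attention_entries(seq_len, chunk_length):
        # Chunked self attention: chunk length x chunk length x (sequence length / chunk length).
        # Assumption: the sequence length is a multiple of the chunk length.
        if seq_len % chunk_length != 0:
            raise ValueError("seq_len must be a multiple of chunk_length")
        return chunk_length * chunk_length * (seq_len // chunk_length)


    def approx_num_buckets(seq_len, chunk_length):
        # Rule of thumb from the `num_buckets` description: roughly sequence length / chunk length.
        return max(1, seq_len // chunk_length)


    seq_len, chunk_length = 4096, 64
    full = full_attention_entries(seq_len)  # 16777216
    chunked = chunked_attention_entries(seq_len, chunk_length)  # 262144
    buckets = approx_num_buckets(seq_len, chunk_length)  # 64
    ```

    With a sequence length of 4096 and a chunk length of 64, the chunked count is 64 times smaller than the full
    count, and the suggested number of buckets is 64.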
ZreformerZ