e them. max_position_embeddings (`int`, *optional*, defaults to 512): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). embed_init_std (`float`, *optional*, defaults to 2048^-0.5): The standard deviation of the truncated_normal_initializer for initializing the embedding matrices. init_std (`int`, *optional*, defaults to 50257): The standard deviation of the truncated_normal_initializer for initializing all weight matrices except the embedding matrices. layer_norm_eps (`float`, *optional*, defaults to 1e-12): The epsilon used by the layer normalization layers. bos_index (`int`, *optional*, defaults to 0): The index of the beginning of sentence token in the vocabulary. eos_index (`int`, *optional*, defaults to 1): The index of the end of sentence token in the vocabulary. pad_index (`int`, *optional*, defaults to 2): The index of the padding token in the vocabulary. unk_index (`int`, *optional*, defaults to 3): The index of the unknown token in the vocabulary. mask_index (`int`, *optional*, defaults to 5): The index of the masking token in the vocabulary. is_encoder(`bool`, *optional*, defaults to `True`): Whether or not the initialized model should be a transformer encoder or decoder as seen in Vaswani et al. summary_type (`string`, *optional*, defaults to "first"): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models. Has to be one of the following options: - `"last"`: Take the last token hidden state (like XLNet). - `"first"`: Take the first token hidden state (like BERT). - `"mean"`: Take the mean of all tokens hidden states. - `"cls_index"`: Supply a Tensor of classification token position (like GPT/GPT-2). - `"attn"`: Not implemented now, use multi-head attention. summary_use_proj (`bool`, *optional*, defaults to `True`): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models. Whether or not to add a projection after the vector extraction. summary_activation (`str`, *optional*): Argument used when doing sequence summary. Used in the sequence classification and multiple choice models. Pass `"tanh"` for a tanh activation to the output, any other value will result in no activation. summary_proj_to_labels (`bool`, *optional*, defaults to `True`): Used in the sequence classification and multiple choice models. Whether the projection outputs should have `config.num_labels` or `config.hidden_size` classes. summary_first_dropout (`float`, *optional*, defaults to 0.1): Used in the sequence classification and multiple choice models. The dropout ratio to be used after the projection and activation. start_n_top (`int`, *optional*, defaults to 5): Used in the SQuAD evaluation script. end_n_top (`int`, *optional*, defaults to 5): Used in the SQuAD evaluation script. mask_token_id (`int`, *optional*, defaults to 0): Model agnostic parameter to identify masked tokens when generating text in an MLM context. lang_id (`int`, *optional*, defaults to 1): The ID of the language used by the model. This parameter is used when generating text in a given language. Examples: ```python >>> from transformers import XLMConfig, XLMModel >>> # Initializing a XLM configuration >>> configuration = XLMConfig() >>> # Initializing a model (with random weights) from the configuration >>> model = XLMModel(configuration) >>> # Accessing the model configuration >>> configuration = model.config ```Z