normalize_router_prob_before_dropping (`bool`, *optional*, defaults to `True`): Whether or not to normalize the router probabilities before applying a mask based on the experts capacity (capacity dropping). batch_prioritized_routing (`bool`, *optional*, defaults to `True`): Whether or not to orders the tokens by their router probabilities before capacity dropping. This means that the tokens that have the highest probabilities will be routed before other tokens that might be further in the sequence. moe_eval_capacity_token_fraction (`float`, *optional*, defaults to 1.0): Fraction of tokens as capacity during validation, if set to negative, uses the same as training. Should be in range: (0.0, 1.0]. num_experts (`int`, *optional*, defaults to 128): Number of experts for each NllbMoeSparseMlp layer. expert_capacity (`int`, *optional*, defaults to 64): Number of tokens that can be stored in each expert. encoder_sparse_step (`int`, *optional*, defaults to 4): Frequency of the sparse layers in the encoder. 4 means that one out of 4 layers will be sparse. decoder_sparse_step (`int`, *optional*, defaults to 4): Frequency of the sparse layers in the decoder. 4 means that one out of 4 layers will be sparse. router_dtype (`str`, *optional*, default to `"float32"`): The `dtype` used for the routers. It is preferable to keep the `dtype` to `"float32"` as specified in the *selective precision* discussion in [the paper](https://arxiv.org/abs/2101.03961). router_ignore_padding_tokens (`bool`, *optional*, defaults to `False`): Whether to ignore padding tokens when routing. if `False`, the padding tokens are not routed to any experts. router_bias (`bool`, *optional*, defaults to `False`): Whether or not the classifier of the router should have a bias. moe_token_dropout (`float`, *optional*, defualt ot 0.2): Masking rate for MoE expert output masking (EOM), which is implemented via a Dropout2d on the expert outputs. output_router_logits (`bool`, *optional*, defaults to `False`): Whether or not to return the router logits. Only set to `True` to get the auxiliary loss when training. use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). Example: ```python >>> from transformers import NllbMoeModel, NllbMoeConfig >>> # Initializing a NllbMoe facebook/nllb-moe-54b style configuration >>> configuration = NllbMoeConfig() >>> # Initializing a model from the facebook/nllb-moe-54b style configuration >>> model = NllbMoeModel(configuration) >>> # Accessing the model configuration >>> configuration = model.config ```znllb-moeZpast_key_valuesÚ