ts implementation from Module. Args: d_model: the number of expected features in the input (required). nhead: the number of heads in the multiheadattention models (required). dim_feedforward: the dimension of the feedforward network model (default=2048). dropout: the dropout value (default=0.1). activation: the activation function of the intermediate layer, can be a string ("relu" or "gelu") or a unary callable. Default: relu layer_norm_eps: the eps value in layer normalization components (default=1e-5). batch_first: If ``True``, then the input and output tensors are provided as (batch, seq, feature). Default: ``False`` (seq, batch, feature). norm_first: if ``True``, layer norm is done prior to attention and feedforward operations, respectively. Otherwise it's done after. Default: ``False`` (after). bias: If set to ``False``, ``Linear`` and ``LayerNorm`` layers will not learn an additive bias. Default: ``True``. Examples:: >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8) >>> src = torch.rand(10, 32, 512) >>> out = encoder_layer(src) Alternatively, when ``batch_first`` is ``True``: >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) >>> src = torch.rand(32, 10, 512) >>> out = encoder_layer(src) Fast path: forward() will use a special optimized implementation described in `FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness`_ if all of the following conditions are met: - Either autograd is disabled (using ``torch.inference_mode`` or ``torch.no_grad``) or no tensor argument ``requires_grad`` - training is disabled (using ``.eval()``) - batch_first is ``True`` and the input is batched (i.e., ``src.dim() == 3``) - activation is one of: ``"relu"``, ``"gelu"``, ``torch.functional.relu``, or ``torch.functional.gelu`` - at most one of ``src_mask`` and ``src_key_padding_mask`` is passed - if src is a `NestedTensor `_, neither ``src_mask`` nor ``src_key_padding_mask`` is passed - the two ``LayerNorm`` instances have a consistent ``eps`` value (this will naturally be the case unless the caller has manually modified one without modifying the other) If the optimized implementation is in use, a `NestedTensor `_ can be passed for ``src`` to represent padding more efficiently than using a padding mask. In this case, a `NestedTensor `_ will be returned, and an additional speedup proportional to the fraction of the input that is padding can be expected. .. _`FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness`: https://arxiv.org/abs/2205.14135 r7