bound of the *uniform initializer* for initializing all weight matrices in attention layers.
initializer_std (`float`, *optional*):
    The standard deviation of the *normal initializer* for initializing the embedding matrix and the weight of
    linear layers. Will default to 1 for the embedding matrix and the value given by Xavier initialization for
    linear layers.
layer_norm_eps (`float`, *optional*, defaults to 1e-9):
    The epsilon used by the layer normalization layers.
pooling_type (`str`, *optional*, defaults to `"mean"`):
    Possible values are `"mean"` or `"max"`. The way pooling is performed at the beginning of each block.
attention_type (`str`, *optional*, defaults to `"relative_shift"`):
    Possible values are `"relative_shift"` or `"factorized"`. The former is faster on CPU/GPU while the latter
    is faster on TPU.
separate_cls (`bool`, *optional*, defaults to `True`):
    Whether or not to separate the cls token when applying pooling.
truncate_seq (`bool`, *optional*, defaults to `False`):
    When using `separate_cls`, whether or not to truncate the last token when pooling, to avoid getting a
    sequence length that is not a multiple of 2.
pool_q_only (`bool`, *optional*, defaults to `False`):
    Whether or not to apply the pooling only to the query or to query, key and values for the attention layers.
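
To make the interaction between `pooling_type`, `separate_cls` and `truncate_seq` concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the library's implementation: the helper name `pool_sequence` is hypothetical, tokens are plain floats rather than hidden-state vectors, the pooling window and stride are both assumed to be 2, and a trailing incomplete window is assumed to be pooled over the elements it contains.

```python
def pool_sequence(tokens, pooling_type="mean", separate_cls=True, truncate_seq=False):
    """Stride-2 pooling over a list of scalar 'tokens' (hypothetical helper)."""
    if pooling_type == "mean":
        reduce_window = lambda window: sum(window) / len(window)
    elif pooling_type == "max":
        reduce_window = max
    else:
        raise ValueError(f"pooling_type must be 'mean' or 'max', got {pooling_type!r}")

    if separate_cls:
        # Keep the first (cls) token out of the pooling windows.
        head, body = list(tokens[:1]), list(tokens[1:])
        if truncate_seq:
            # Drop the last token so the remaining tokens split evenly into pairs.
            body = body[:-1]
    else:
        head, body = [], list(tokens)

    # Pool consecutive pairs; a trailing single token forms its own window
    # (an assumption of this sketch).
    pooled = [reduce_window(body[i:i + 2]) for i in range(0, len(body), 2)]
    return head + pooled


tokens = [10.0, 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0]  # first entry plays the cls role

print(pool_sequence(tokens))                       # [10.0, 2.0, 6.0, 10.0, 13.0]
print(pool_sequence(tokens, truncate_seq=True))    # [10.0, 2.0, 6.0, 10.0]
print(pool_sequence(tokens, separate_cls=False))   # [5.5, 4.0, 8.0, 12.0]
print(pool_sequence(tokens, pooling_type="max"))   # [10.0, 3.0, 7.0, 11.0, 13.0]
```

With eight input tokens, separating the cls token yields five pooled positions, while additionally setting `truncate_seq=True` yields four, which keeps the length even for the next pooling step.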