defaults to `True`): Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`] and [`TFGPT2DoubleHeadsModel`]. Whether or not to add a projection after the vector extraction. summary_activation (`str`, *optional*): Argument used when doing sequence summary. Used in for the multiple choice head in [`GPT2DoubleHeadsModel`]. Pass `"tanh"` for a tanh activation to the output, any other value will result in no activation. summary_proj_to_labels (`bool`, *optional*, defaults to `True`): Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`] and [`TFGPT2DoubleHeadsModel`]. Whether the projection outputs should have `config.num_labels` or `config.hidden_size` classes. summary_first_dropout (`float`, *optional*, defaults to 0.1): Argument used when doing sequence summary, used in the models [`GPT2DoubleHeadsModel`] and [`TFGPT2DoubleHeadsModel`]. The dropout ratio to be used after the projection and activation. scale_attn_weights (`bool`, *optional*, defaults to `True`): Scale attention weights by dividing by sqrt(hidden_size).. use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). scale_attn_by_inverse_layer_idx (`bool`, *optional*, defaults to `False`): Whether to additionally scale attention weights by `1 / layer_idx + 1`. reorder_and_upcast_attn (`bool`, *optional*, defaults to `False`): Whether to scale keys (K) prior to computing attention (dot-product) and upcast attention dot-product/softmax to float() when training with mixed precision. Example: ```python >>> from transformers import GPT2Config, GPT2Model >>> # Initializing a GPT2 configuration >>> configuration = GPT2Config() >>> # Initializing a model (with random weights) from the configuration >>> model = GPT2Model(configuration) >>> # Accessing the model configuration >>> configuration = model.config ```r