`decoder_input_ids` are provided to the `generate` function. It is used to guide the model`s generation process depending on the task. use_cache (`bool`, *optional*, defaults to `True`): Whether or not the model should return the last key/values attentions (not used by all models). is_encoder_decoder (`bool`, *optional*, defaults to `True`): Whether the model is used as an encoder/decoder or not. activation_function (`str`, *optional*, defaults to `"gelu"`): The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`, `"relu"`, `"silu"` and `"gelu_new"` are supported. d_model (`int`, *optional*, defaults to 256): Dimensionality of the layers. dropout (`float`, *optional*, defaults to 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. attention_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for the attention probabilities. activation_dropout (`float`, *optional*, defaults to 0.0): The dropout ratio for activations inside the fully connected layer. init_std (`float`, *optional*, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices. scale_embedding (`bool`, *optional*, defaults to False): Scale embeddings by diving by sqrt(d_model). max_source_positions (`int`, *optional*, defaults to 1500): The maximum sequence length of log-mel filter-bank features that this model might ever be used with. max_target_positions (`int`, *optional*, defaults to 448): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). pad_token_id (`int`, *optional*, defaults to 50256): Padding token id. bos_token_id (`int`, *optional*, defaults to 50256): Begin of stream token id. eos_token_id (`int`, *optional*, defaults to 50256): End of stream token id. suppress_tokens (`List[int]`, *optional*): A list containing the non-speech tokens that will be used by the logit processor in the `generate` function. NON_SPEECH_TOKENS and NON_SPEECH_TOKENS_MULTI each correspond to the `english-only` and the `multilingual` model. begin_suppress_tokens (`List[int]`, *optional*, defaults to `[220,50256]`): A list containing tokens that will be supressed at the beginning of the sampling process. Initialized as the token for `" "` (`blank_token_id`) and the `eos_token_id` use_weighted_layer_sum (`bool`, *optional*, defaults to `False`): Whether to use a weighted average of layer outputs with learned weights. Only relevant when using an instance of [`WhisperForAudioClassification`]. classifier_proj_size (`int`, *optional*, defaults to 256): Dimensionality of the projection before token mean-pooling for classification. Only relevant when using an instance of [`WhisperForAudioClassification`]. apply_spec_augment (`bool`, *optional*, defaults to `False`): Whether to apply *SpecAugment* data augmentation to the outputs of the feature encoder. For reference see [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779). mask_time_prob (`float`, *optional*, defaults to 0.05): Percentage (between 0 and 1) of all feature vectors along the time axis which will be masked. The masking procecure generates `mask_time_prob*len(time_axis)/mask_time_length` independent masks over the axis. If reasoning from the propability of each feature vector to be chosen as the start of the vector span to be masked, *mask_time_prob* should be `prob_vector_start*mask_time_length`. Note that overlap may decrease the actual percentage of masked vectors. This is only relevant if `apply_spec_augment == True`. mask_time_length (`int`, *optional*, defaults to 10): Length of vector span along the time axis. mask_time_min_masks (`int`, *optional*, defaults to 2),: The minimum number of masks of length `mask_feature_length` generated along the time axis, each time step, irrespectively of `mask_feature_prob`. Only relevant if ''mask_time_prob*len(time_axis)/mask_time_length < mask_time_min_masks'' mask_feature_prob (`float`, *optional*, defaults to 0.0): Percentage (between 0 and 1) of all feature vectors along the feature axis which will be masked. The masking procecure generates `mask_feature_prob*len(feature_axis)/mask_time_length` independent masks over the axis. If reasoning from the propability of each feature vector to be chosen as the start of the vector span to be masked, *mask_feature_prob* should be `prob_vector_start*mask_feature_length`. Note that overlap may decrease the actual percentage of masked vectors. This is only relevant if `apply_spec_augment is True`. mask_feature_length (`int`, *optional*, defaults to 10): Length of vector span along the feature axis. mask_feature_min_masks (`int`, *optional*, defaults to 0),: The minimum number of masks of length `mask_feature_length` generated along the feature axis, each time step, irrespectively of `mask_feature_prob`. Only relevant if `mask_feature_prob*len(feature_axis)/mask_feature_length < mask_feature_min_masks`. median_filter_width (`int`, *optional*, defaults to 7): Width of the median filter used to smoothen to cross-attention outputs when computing token timestamps. Should be an odd number. Example: ```python >>> from transformers import WhisperConfig, WhisperModel >>> # Initializing a Whisper tiny style configuration >>> configuration = WhisperConfig() >>> # Initializing a model (with random weights) from the tiny style configuration >>> model = WhisperModel(configuration) >>> # Accessing the model configuration >>> configuration = model.config ```Z