pProcessor` class.
    spec_size (`int`, *optional*, defaults to 256):
        Desired input size of the spectrogram that the model supports. It can be different from the output of the
        `ClapFeatureExtractor`, in which case the input features will be resized. Corresponds to the `image_size`
        of the audio models.
    hidden_act (`str`, *optional*, defaults to `"gelu"`):
        The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
        `"relu"`, `"silu"` and `"gelu_new"` are supported.
    patch_size (`int`, *optional*, defaults to 4):
        Patch size for the audio spectrogram.
    patch_stride (`list`, *optional*, defaults to `[4, 4]`):
        Patch stride for the audio spectrogram.
    num_classes (`int`, *optional*, defaults to 527):
        Number of classes used for the head training.
    hidden_size (`int`, *optional*, defaults to 768):
        Hidden size of the output of the audio encoder. Corresponds to the dimension of the penultimate layer's
        output, which is sent to the projection MLP layer.
    projection_dim (`int`, *optional*, defaults to 512):
        Hidden size of the projection layer.
    depths (`list`, *optional*, defaults to `[2, 2, 6, 2]`):
        Depths used for the Swin layers of the audio model.
    num_attention_heads (`list`, *optional*, defaults to `[4, 8, 16, 32]`):
        Number of attention heads used for the Swin layers of the audio model.
    enable_fusion (`bool`, *optional*, defaults to `False`):
        Whether or not to enable patch fusion. This is the main contribution of the authors, and should give the
        best results.
    hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
        The dropout probability for all fully connected layers in the encoder.
    fusion_type (`[type]`, *optional*):
        Fusion type used for the patch fusion.
    patch_embed_input_channels (`int`, *optional*, defaults to 1):
        Number of channels used for the input spectrogram.
    flatten_patch_embeds (`bool`, *optional*, defaults to `True`):
        Whether or not to flatten the patch embeddings.
    patch_embeds_hidden_size (`int`, *optional*, defaults to 96):
        Hidden size of the patch embeddings. It is used as the number of output channels.
    enable_patch_layer_norm (`bool`, *optional*, defaults to `True`):
        Whether or not to enable layer normalization for the patch embeddings.
    drop_path_rate (`float`, *optional*, defaults to 0.0):
        Drop path rate for the patch fusion.
    attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
        The dropout ratio for the attention probabilities.
    qkv_bias (`bool`, *optional*, defaults to `True`):
        Whether or not to add a bias to the query, key, value projections.
    mlp_ratio (`float`, *optional*, defaults to 4.0):
        Ratio of the MLP hidden dimension to the embedding dimension.
    aff_block_r (`int`, *optional*, defaults to 4):
        `downsize_ratio` used in the AudioFF block.
    num_hidden_layers (`int`, *optional*, defaults to 4):
        Number of hidden layers in the Transformer encoder.
    projection_hidden_act (`str`, *optional*, defaults to `"relu"`):
        The non-linear activation function (function or string) in the projection layer. If string, `"gelu"`,
        `"relu"`, `"silu"` and `"gelu_new"` are supported.
    layer_norm_eps (`float`, *optional*, defaults to `1e-5`):
        The epsilon used by the layer normalization layers.
    initializer_factor (`float`, *optional*, defaults to 1.0):
        A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
        testing).

Example:

```python
>>> from transformers import ClapAudioConfig, ClapAudioModel

>>> # Initializing a ClapAudioConfig with laion/clap-htsat-fused style configuration
>>> configuration = ClapAudioConfig()

>>> # Initializing a ClapAudioModel (with random weights) from the laion/clap-htsat-fused style configuration
>>> model = ClapAudioModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```
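
The default values documented above are consistent with one another. The sketch below is a rough illustration of how they appear to relate. It assumes the usual Swin-style convention that the channel width doubles at each stage, and that the patch grid is `spec_size` divided by `patch_stride`. The helper `swin_stage_summary` is hypothetical and not part of `transformers`; it only does arithmetic on the documented defaults.

```python
# Hypothetical helper (not a transformers API): derives per-stage sizes from
# the documented ClapAudioConfig defaults under Swin-style assumptions.
def swin_stage_summary(
    spec_size=256,
    patch_stride=(4, 4),
    patch_embeds_hidden_size=96,
    depths=(2, 2, 6, 2),
    num_attention_heads=(4, 8, 16, 32),
):
    # Assumed: the patch embedding yields one token per stride step on each axis.
    grid = (spec_size // patch_stride[0], spec_size // patch_stride[1])

    stages = []
    for i, (depth, heads) in enumerate(zip(depths, num_attention_heads)):
        # Assumed: channel width doubles at every stage, as in Swin Transformers.
        width = patch_embeds_hidden_size * 2**i
        if width % heads != 0:
            raise ValueError(f"stage {i}: width {width} is not divisible by {heads} heads")
        stages.append(
            {"stage": i, "depth": depth, "width": width, "heads": heads, "head_dim": width // heads}
        )
    return grid, stages


grid, stages = swin_stage_summary()
print("patch grid:", grid)
for stage in stages:
    print(stage)
```

Under these assumptions, the width of the last stage matches the default `hidden_size` of 768, and every stage uses the same per-head dimension. This suggests that `depths`, `num_attention_heads`, `patch_embeds_hidden_size` and `hidden_size` should be changed together rather than independently.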