.
        decoder_intermediate_size (`int`, *optional*, defaults to 2048):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the decoder.
        pixel_mask_ratio (`float`, *optional*, defaults to 0.75):
            Fraction of image patches to mask.
        audio_mask_ratio (`float`, *optional*, defaults to 0.15):
            Fraction of audio patches to mask.
        audio_mask_type (`str`, *optional*, defaults to `"frame-level"`):
            Audio patch masking type. Choose between `"frame-level"` and `"patch-level"`.
        task_matching (`bool`, *optional*, defaults to `True`):
            Whether to use the vision-audio matching task in pretraining.
        task_mae (`bool`, *optional*, defaults to `True`):
            Whether to use the masked auto-encoder (MAE) objective in pretraining.
        loss_type (`str`, *optional*, defaults to `"classification"`):
            Loss type; options include `"regression"` and `"classification"`.

    Example:

    ```python
    >>> from transformers import TvltConfig, TvltModel

    >>> # Initializing a TVLT ZinengTang/tvlt-base style configuration
    >>> configuration = TvltConfig()

    >>> # Initializing a model (with random weights) from the ZinengTang/tvlt-base style configuration
    >>> model = TvltModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```
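
    To give a feel for what the masking ratios mean, the sketch below counts how many patches would stay visible for a given ratio. It is a standalone illustration, not part of the library: the helper `count_kept_patches` and the patch count of 196 are hypothetical, and the truncating formula is the usual MAE-style convention, assumed here rather than taken from the TVLT implementation.

    ```python
    def count_kept_patches(num_patches: int, mask_ratio: float) -> int:
        """Return how many patches remain visible after masking.

        Assumes MAE-style truncation: keep int(num_patches * (1 - mask_ratio)).
        This helper is hypothetical and only illustrates the ratio's meaning.
        """
        if not 0.0 <= mask_ratio < 1.0:
            raise ValueError("mask_ratio must be in the range [0, 1)")
        return int(num_patches * (1 - mask_ratio))


    # Hypothetical sequence of 196 image patches with the default pixel_mask_ratio.
    kept_pixel = count_kept_patches(196, 0.75)
    masked_pixel = 196 - kept_pixel
    print(f"visible: {kept_pixel}, masked: {masked_pixel}")
    ```

    With the default `pixel_mask_ratio` of 0.75, three quarters of the image patches are hidden from the encoder. The lower default `audio_mask_ratio` of 0.15 leaves most audio patches visible.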