.
        decoder_intermediate_size (`int`, *optional*, defaults to 2048):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the decoder.
        pixel_mask_ratio (`float`, *optional*, defaults to 0.75):
            Image patch masking ratio.
        audio_mask_ratio (`float`, *optional*, defaults to 0.15):
            Audio patch masking ratio.
        audio_mask_type (`str`, *optional*, defaults to `"frame-level"`):
            Audio patch masking type, choose between "frame-level" and "patch-level".
        task_matching (`bool`, *optional*, defaults to `True`):
            Whether to use vision audio matching task in pretraining.
        task_mae (`bool`, *optional*, defaults to `True`):
            Whether to use the masked auto-encoder (MAE) in pretraining.
        loss_type (`str`, *optional*, defaults to `"classification"`):
            Loss types including regression and classification.

    Example:

    ```python
    >>> from transformers import TvltConfig, TvltModel

    >>> # # Initializing a TVLT ZinengTang/tvlt-base style configuration
    >>> configuration = TvltConfig()

    >>> # # Initializing a model (with random weights) from the ZinengTang/tvlt-base style configuration
    >>> model = TvltModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```Z