use_query_residual (`float`, *optional*, defaults to `True`): Whether to add a query residual in the cross-attention layer of the encoder. vocab_size (`int`, *optional*, defaults to 262): Vocabulary size for the masked language modeling model. max_position_embeddings (`int`, *optional*, defaults to 2048): The maximum sequence length that the masked language modeling model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). image_size (`int`, *optional*, defaults to 56): Size of the images after preprocessing, for [`PerceiverForImageClassificationLearned`]. train_size (`List[int]`, *optional*, defaults to [368, 496]): Training size of the images for the optical flow model. num_frames (`int`, *optional*, defaults to 16): Number of video frames used for the multimodal autoencoding model. audio_samples_per_frame (`int`, *optional*, defaults to 1920): Number of audio samples per frame for the multimodal autoencoding model. samples_per_patch (`int`, *optional*, defaults to 16): Number of audio samples per patch when preprocessing the audio for the multimodal autoencoding model. output_num_channels (`int`, *optional*, defaults to 512): Number of output channels for each modalitiy decoder. output_shape (`List[int]`, *optional*, defaults to `[1, 16, 224, 224]`): Shape of the output (batch_size, num_frames, height, width) for the video decoder queries of the multimodal autoencoding model. This excludes the channel dimension. Example: ```python >>> from transformers import PerceiverModel, PerceiverConfig >>> # Initializing a Perceiver deepmind/language-perceiver style configuration >>> configuration = PerceiverConfig() >>> # Initializing a model from the deepmind/language-perceiver style configuration >>> model = PerceiverModel(configuration) >>> # Accessing the model configuration >>> configuration = model.config ```Z perceiveré