en indices specifically for images. Mask values selected in `[0, 1]`: - 1 for tokens that are **not masked**, - 0 for tokens that are **masked**. [What are attention masks?](../glossary#attention-mask) skip_unmasked_multimodal_encoder (*bool*, *optional*): Skip any calculations for multimodal encoder for unmasked inputs. FLAVA pretraining doesn't need unmasked multimodal embeddings or outputs as of now. mlm_labels (`torch.LongTensor` of shape `(batch_size, text_seq_len)`, *optional*): Labels for computing the left-to-right language and multimodal masked modeling loss (next word prediction). Indices should be in `[-100, 0, ..., text_config.vocab_size - 1]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., text_config.vocab_size - 1]`. mim_labels (`torch.LongTensor` of shape `(batch_size, image_num_patches)`, *optional*): Labels for computing the image and multimodal masked modeling loss. Indices should be in `[-100, 0, ..., image_config.vocab_size - 1]`. Tokens with indices set to `-100` are ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., image_config.vocab_size - 1]`. If not passed, they are generated automatically using the image codebook assigned to the model. By default, it uses [`FlavaImageCodebook`]. See [`FlavaImageCodebook`] to understand how to generate mim_labels. itm_labels (`torch.LongTensor` of shape `(batch_size, 1)`, *optional*): Labels for computing the image-text matching loss. 0 means the pairs don't match and 1 means they match. The pairs with 0 will be skipped for calculation of MMM and global contrastive losses as well. return_loss (`bool`, *optional*, default to None): Whether to return calculated loss or not. zī Parameters: image_codebook ([`nn.Module`]): If passed, the image codebook will be set to this. Otherwise. it will be initialized using the image_codebook_config defined in the config first as the first parameter. c