loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]` obj_labels (`Dict[Str: Tuple[tf.Tensor, tf.Tensor]]`, *optional*, defaults to `None`): each key is named after each one of the visual losses and each element of the tuple is of the shape `(batch_size, num_features)` and `(batch_size, num_features, visual_feature_dim)` for each the label id and the label score respectively matched_label (`tf.Tensor` of shape `(batch_size,)`, *optional*): Labels for computing the whether or not the text input matches the image (classification) loss. Input should be a sequence pair (see `input_ids` docstring) Indices should be in `[0, 1]`: - 0 indicates that the sentence does not match the image, - 1 indicates that the sentence does match the image. ans (`Torch.Tensor` of shape `(batch_size)`, *optional*, defaults to `None`): a one hot representation hof the correct answer *optional* Returns: r