ds` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `decoder_input_ids` of shape `(batch_size, sequence_length)`. labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. use_cache (`bool`, *optional*): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). - 1 for tokens that are **not masked**, - 0 for tokens that are **masked**. output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail. output_hidden_states (`bool`, *optional*): Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail. return_dict (`bool`, *optional*): Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple. Returns: Example: ```python >>> from transformers import ( ... TrOCRConfig, ... TrOCRProcessor, ... TrOCRForCausalLM, ... ViTConfig, ... ViTModel, ... VisionEncoderDecoderModel, ... ) >>> import requests >>> from PIL import Image >>> # TrOCR is a decoder model and should be used within a VisionEncoderDecoderModel >>> # init vision2text model with random weights >>> encoder = ViTModel(ViTConfig()) >>> decoder = TrOCRForCausalLM(TrOCRConfig()) >>> model = VisionEncoderDecoderModel(encoder=encoder, decoder=decoder) >>> # If you want to start from the pretrained model, load the checkpoint with `VisionEncoderDecoderModel` >>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten") >>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten") >>> # load image from the IAM dataset >>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg" >>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB") >>> pixel_values = processor(image, return_tensors="pt").pixel_values >>> text = "industry, ' Mr. Brown commented icily. ' Let us have a" >>> # training >>> model.config.decoder_start_token_id = processor.tokenizer.cls_token_id >>> model.config.pad_token_id = processor.tokenizer.pad_token_id >>> model.config.vocab_size = model.config.decoder.vocab_size >>> labels = processor.tokenizer(text, return_tensors="pt").input_ids >>> outputs = model(pixel_values, labels=labels) >>> loss = outputs.loss >>> round(loss.item(), 2) 5.30 >>> # inference >>> generated_ids = model.generate(pixel_values) >>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] >>> generated_text 'industry, " Mr. Brown commented icily. " Let us have a' ```N)