th)`, *optional*): Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the loss is only computed for the tokens with labels n `[0, ..., config.vocab_size]` past_key_values (`tuple(tuple(torch.FloatTensor))` of length `config.n_layers` with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`): Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding. If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `decoder_input_ids` of shape `(batch_size, sequence_length)`. use_cache (`bool`, *optional*): If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see `past_key_values`). Returns: Examples: Image captioning example: ```python >>> from transformers import AutoProcessor, AutoModelForCausalLM >>> import requests >>> from PIL import Image >>> processor = AutoProcessor.from_pretrained("microsoft/git-base-coco") >>> model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco") >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" >>> image = Image.open(requests.get(url, stream=True).raw) >>> pixel_values = processor(images=image, return_tensors="pt").pixel_values >>> generated_ids = model.generate(pixel_values=pixel_values, max_length=50) >>> generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] >>> print(generated_caption) two cats sleeping on a pink blanket next to remotes. ``` Visual question answering (VQA) example: ```python >>> from transformers import AutoProcessor, AutoModelForCausalLM >>> from huggingface_hub import hf_hub_download >>> from PIL import Image >>> processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa") >>> model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa") >>> file_path = hf_hub_download(repo_id="nielsr/textvqa-sample", filename="bus.png", repo_type="dataset") >>> image = Image.open(file_path).convert("RGB") >>> pixel_values = processor(images=image, return_tensors="pt").pixel_values >>> question = "what does the front of the bus say at the top?" >>> input_ids = processor(text=question, add_special_tokens=False).input_ids >>> input_ids = [processor.tokenizer.cls_token_id] + input_ids >>> input_ids = torch.tensor(input_ids).unsqueeze(0) >>> generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50) >>> print(processor.batch_decode(generated_ids, skip_special_tokens=True)) ['what does the front of the bus say at the top? special'] ``` Video captioning example: ```python >>> import av >>> import numpy as np >>> from PIL import Image >>> from huggingface_hub import hf_hub_download >>> from transformers import AutoProcessor, AutoModelForCausalLM >>> processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex") >>> model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vatex") >>> # set seed for reproducability >>> np.random.seed(45) >>> def read_video_pyav(container, indices): ... ''' ... Decode the video with PyAV decoder. ... Args: ... container (`av.container.input.InputContainer`): PyAV container. ... indices (`List[int]`): List of frame indices to decode. ... Returns: ... result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3). ... ''' ... frames = [] ... container.seek(0) ... start_index = indices[0] ... end_index = indices[-1] ... for i, frame in enumerate(container.decode(video=0)): ... if i > end_index: ... break ... if i >= start_index and i in indices: ... frames.append(frame) ... return np.stack([x.to_ndarray(format="rgb24") for x in frames]) >>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len): ... converted_len = int(clip_len * frame_sample_rate) ... end_idx = np.random.randint(converted_len, seg_len) ... start_idx = end_idx - converted_len ... indices = np.linspace(start_idx, end_idx, num=clip_len) ... indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64) ... return indices >>> # load video >>> file_path = hf_hub_download( ... repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset" ... ) >>> container = av.open(file_path) >>> # sample frames >>> num_frames = model.config.num_image_with_embedding >>> indices = sample_frame_indices( ... clip_len=num_frames, frame_sample_rate=4, seg_len=container.streams.video[0].frames ... ) >>> frames = read_video_pyav(container, indices) >>> pixel_values = processor(images=list(frames), return_tensors="pt").pixel_values >>> generated_ids = model.generate(pixel_values=pixel_values, max_length=50) >>> print("Generated caption:", processor.batch_decode(generated_ids, skip_special_tokens=True)) Generated caption: ['a woman is sitting at a table and she is talking about the food she is holding.'] ``` NF) rw