3184777.jpg" img = processor.image_processor.fetch_images([url])[0] prompts = [ "User:", img, "Describe this image. Assistant: An image of two kittens in grass. ", "User:", "https://hips.hearstapps.com/hmg-prod/images/dog-puns-1581708208.jpg", "Describe this image. Assistant:", ] inputs = processor(prompts, return_tensors="pt") generated_ids = model.generate(**inputs, max_length=100) generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0] ``` In this example the `prompts` will be converted into: ``` User:Describe this image. Assistant: An image of two kittens in grass. User:Describe this image. Assistant:' ``` and the two images will be massaged using [`IdeficsImageProcessor.__call__`] method and placed inside the `pixel_values` dict entry of the return value. This example also examplifies that images can be passed as objects or as text urls. It can be seen that the first image is passed as object and the second one as a url. To do training do: ```python image_transform = transforms.Compose( [ transforms.RandomResizedCrop( (w, h), scale=(0.9, 1.0), interpolation=transforms.InterpolationMode.BICUBIC ), transforms.ToTensor(), transforms.Normalize(mean=self.image_mean, std=self.image_std), ] ) inputs = processor(prompts, transform=image_transform, return_tensors="pt") ``` In order to help debug prompt generation enable `debug=True` which will show you what's happening. Nc