d(): ... for i in range(tl): ... encoded = processor.tokenizer(inferred_token) ... input_ids = torch.tensor(encoded.input_ids) ... encoded = encoded["input_ids"][0][1:-1] ... outputs = model(input_ids=input_ids, pixel_values=encoding.pixel_values) ... mlm_logits = outputs.logits[0] # shape (seq_len, vocab_size) ... # only take into account text features (minus CLS and SEP token) ... mlm_logits = mlm_logits[1 : input_ids.shape[1] - 1, :] ... mlm_values, mlm_ids = mlm_logits.softmax(dim=-1).max(dim=-1) ... # only take into account text ... mlm_values[torch.tensor(encoded) != 103] = 0 ... select = mlm_values.argmax().item() ... encoded[select] = mlm_ids[select].item() ... inferred_token = [processor.decode(encoded)] >>> selected_token = "" >>> encoded = processor.tokenizer(inferred_token) >>> output = processor.decode(encoded.input_ids[0], skip_special_tokens=True) >>> print(output) a bunch of cats laying on a couch. ```N© r“