>>> input_ids = torch.tensor(
...     tokenizer.encode("Hello, my dog is very <mask>", add_special_tokens=False)
... ).unsqueeze(0)  # We will predict the masked token
>>> labels = torch.tensor(tokenizer.encode("cute", add_special_tokens=False)).unsqueeze(0)
>>> assert labels.shape[0] == 1, "only one word will be predicted"

>>> perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
>>> perm_mask[:, :, -1] = 1.0  # Previous tokens can't see the last token, as in standard auto-regressive LM training
>>> target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => predict one token
>>> target_mapping[0, 0, -1] = 1.0  # Our first (and only) prediction is the last token of the sequence (the masked token)

>>> outputs = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping, labels=labels)
>>> loss = outputs.loss
>>> next_token_logits = outputs.logits  # Logits have shape [target_mapping.size(0), target_mapping.size(1), config.vocab_size]
```
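
As a minimal follow-up sketch (not part of the example above; it reuses the `next_token_logits` and `tokenizer` variables defined there), the predicted token can be recovered by taking the argmax over the vocabulary dimension and decoding the resulting id back to text:

```python
>>> # Hypothetical continuation of the example above: decode the single prediction.
>>> predicted_index = torch.argmax(next_token_logits[0, 0, :]).item()  # id of the most likely token
>>> predicted_token = tokenizer.decode([predicted_index])  # convert the id back to a string
```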