exact implementation of the attention mechanism. While merging the TP rank tensors, the slicing operations may make the results slightly different between the model trained on Megatron and our model; please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232). A solution to obtain more accurate results is to enable this feature, at the cost of slower inference. This will probably be resolved in the future once the main model has been fine-tuned with TP_rank=1.

Example:

```python
>>> from transformers import BloomConfig, BloomModel

>>> # Initializing a Bloom configuration
>>> configuration = BloomConfig()

>>> # Initializing a model (with random weights) from the configuration
>>> model = BloomModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```
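The excerpt above does not name the flag itself; in `transformers`, `BloomConfig` exposes this behavior through the `slow_but_exact` parameter (defaulting to `False`). A minimal sketch of enabling it, assuming that parameter name:

```python
>>> from transformers import BloomConfig, BloomModel

>>> # Assumed flag name: `slow_but_exact` toggles the slower but more
>>> # numerically exact attention path (default is False)
>>> configuration = BloomConfig(slow_but_exact=True)

>>> # The model built from this configuration trades inference speed
>>> # for results closer to the original Megatron-trained weights
>>> model = BloomModel(configuration)
>>> model.config.slow_but_exact
True
```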