f an automatically loaded configuration. Configuration can be automatically loaded when: - The model is a model provided by the library (loaded with the *model id* string of a pretrained model). - The model was saved using [`~PreTrainedModel.save_pretrained`] and is reloaded by supplying the save directory. - The model is loaded by supplying a local directory as `pretrained_model_name_or_path` and a configuration JSON file named *config.json* is found in the directory. state_dict (`Dict[str, torch.Tensor]`, *optional*): A state dictionary to use instead of a state dictionary loaded from saved weights file. This option can be used if you want to create a model from a pretrained configuration but load your own weights. In this case though, you should check if using [`~PreTrainedModel.save_pretrained`] and [`~PreTrainedModel.from_pretrained`] is not a simpler option. cache_dir (`Union[str, os.PathLike]`, *optional*): Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used. from_tf (`bool`, *optional*, defaults to `False`): Load the model weights from a TensorFlow checkpoint save file (see docstring of `pretrained_model_name_or_path` argument). from_flax (`bool`, *optional*, defaults to `False`): Load the model weights from a Flax checkpoint save file (see docstring of `pretrained_model_name_or_path` argument). ignore_mismatched_sizes (`bool`, *optional*, defaults to `False`): Whether or not to raise an error if some of the weights from the checkpoint do not have the same size as the weights of the model (if for instance, you are instantiating a model with 10 labels from a checkpoint with 3 labels). force_download (`bool`, *optional*, defaults to `False`): Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist. resume_download (`bool`, *optional*, defaults to `False`): Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists. proxies (`Dict[str, str]`, *optional*): A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. output_loading_info(`bool`, *optional*, defaults to `False`): Whether ot not to also return a dictionary containing missing keys, unexpected keys and error messages. local_files_only(`bool`, *optional*, defaults to `False`): Whether or not to only look at local files (i.e., do not try to download the model). token (`str` or `bool`, *optional*): The token to use as HTTP bearer authorization for remote files. If `True`, or not specified, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`). revision (`str`, *optional*, defaults to `"main"`): The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git. To test a pull request you made on the Hub, you can pass `revision="refs/pr/". mirror (`str`, *optional*): Mirror source to accelerate downloads in China. If you are from China and have an accessibility problem, you can set this option to resolve it. Note that we do not guarantee the timeliness or safety. Please refer to the mirror site for more information. _fast_init(`bool`, *optional*, defaults to `True`): Whether or not to disable fast initialization. One should only disable *_fast_init* to ensure backwards compatibility with `transformers.__version__ < 4.6.0` for seeded model initialization. This argument will be removed at the next major version. See [pull request 11471](https://github.com/huggingface/transformers/pull/11471) for more information. > Parameters for big model inference low_cpu_mem_usage(`bool`, *optional*): Tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. This is an experimental feature and a subject to change at any moment. torch_dtype (`str` or `torch.dtype`, *optional*): Override the default `torch.dtype` and load the model under a specific `dtype`. The different options are: 1. `torch.float16` or `torch.bfloat16` or `torch.float`: load in a specified `dtype`, ignoring the model's `config.torch_dtype` if one exists. If not specified - the model will get loaded in `torch.float` (fp32). 2. `"auto"` - A `torch_dtype` entry in the `config.json` file of the model will be attempted to be used. If this entry isn't found then next check the `dtype` of the first weight in the checkpoint that's of a floating point type and use that as `dtype`. This will load the model using the `dtype` it was saved in at the end of the training. It can't be used as an indicator of how the model was trained. Since it could be trained in one of half precision dtypes, but saved in fp32. For some models the `dtype` they were trained in is unknown - you may try to check the model's paper or reach out to the authors and ask them to add this information to the model's card and to insert the `torch_dtype` entry in `config.json` on the hub. device_map (`str` or `Dict[str, Union[int, str, torch.device]]` or `int` or `torch.device`, *optional*): A map that specifies where each submodule should go. It doesn't need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device. If we only pass the device (*e.g.*, `"cpu"`, `"cuda:1"`, `"mps"`, or a GPU ordinal rank like `1`) on which the model will be allocated, the device map will map the entire model to this device. Passing `device_map = 0` means put the whole model on GPU 0. To have Accelerate compute the most optimized `device_map` automatically, set `device_map="auto"`. For more information about each option see [designing a device map](https://hf.co/docs/accelerate/main/en/usage_guides/big_modeling#designing-a-device-map). max_memory (`Dict`, *optional*): A dictionary device identifier to maximum memory. Will default to the maximum memory available for each GPU and the available CPU RAM if unset. offload_folder (`str` or `os.PathLike`, *optional*): If the `device_map` contains any value `"disk"`, the folder where we will offload weights. offload_state_dict (`bool`, *optional*): If `True`, will temporarily offload the CPU state dict to the hard drive to avoid getting out of CPU RAM if the weight of the CPU state dict + the biggest shard of the checkpoint does not fit. Defaults to `True` when there is some disk offload. load_in_8bit (`bool`, *optional*, defaults to `False`): If `True`, will convert the loaded model into mixed-8bit quantized model. To use this feature please install `bitsandbytes` (`pip install -U bitsandbytes`). load_in_4bit (`bool`, *optional*, defaults to `False`): If `True`, will convert the loaded model into 4bit precision quantized model. To use this feature install the latest version of `bitsandbytes` (`pip install -U bitsandbytes`). quantization_config (`Union[QuantizationConfigMixin,Dict]`, *optional*): A dictionary of configuration parameters or a QuantizationConfigMixin object for quantization (e.g bitsandbytes, gptq) subfolder (`str`, *optional*, defaults to `""`): In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here. variant (`str`, *optional*): If specified load weights from `variant` filename, *e.g.* pytorch_model..bin. `variant` is ignored when using `from_tf` or `from_flax`. use_safetensors (`bool`, *optional*, defaults to `None`): Whether or not to use `safetensors` checkpoints. Defaults to `None`. If not specified and `safetensors` is not installed, it will be set to `False`. kwargs (remaining dictionary of keyword arguments, *optional*): Can be used to update the configuration object (after it being loaded) and initiate the model (e.g., `output_attentions=True`). Behaves differently depending on whether a `config` is provided or automatically loaded: - If a configuration is provided with `config`, `**kwargs` will be directly passed to the underlying model's `__init__` method (we assume all relevant updates to the configuration have already been done) - If a configuration is not provided, `kwargs` will be first passed to the configuration class initialization function ([`~PretrainedConfig.from_pretrained`]). Each key of `kwargs` that corresponds to a configuration attribute will be used to override said attribute with the supplied `kwargs` value. Remaining keys that do not correspond to any configuration attribute will be passed to the underlying model's `__init__` function. Activate the special ["offline-mode"](https://huggingface.co/transformers/installation.html#offline-mode) to use this method in a firewalled environment. Examples: ```python >>> from transformers import BertConfig, BertModel >>> # Download model and configuration from huggingface.co and cache. >>> model = BertModel.from_pretrained("bert-base-uncased") >>> # Model was saved using *save_pretrained('./test/saved_model/')* (for example purposes, not runnable). >>> model = BertModel.from_pretrained("./test/saved_model/") >>> # Update configuration during loading. >>> model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True) >>> assert model.config.output_attentions == True >>> # Loading from a TF checkpoint file instead of a PyTorch model (slower, for example purposes, not runnable). >>> config = BertConfig.from_json_file("./tf_model/my_tf_model_config.json") >>> model = BertModel.from_pretrained("./tf_model/my_tf_checkpoint.ckpt.index", from_tf=True, config=config) >>> # Loading from a Flax checkpoint file instead of a PyTorch model (slower) >>> model = BertModel.from_pretrained("bert-base-uncased", from_flax=True) ``` * `low_cpu_mem_usage` algorithm: This is an experimental function that loads the model using ~1x model size CPU memory Here is how it works: 1. save which state_dict keys we have 2. drop state_dict before the model is created, since the latter takes 1x model size CPU memory 3. after the model has been instantiated switch to the meta device all params/buffers that are going to be replaced from the loaded state_dict 4. load state_dict 2nd time 5. replace the params/buffers from the state_dict Currently, it can't handle deepspeed ZeRO stage 3 and ignores loading errors r«