d* of a predefined tokenizer hosted inside a model repo on huggingface.co.
                  Valid model ids can be located at the root level, like `bert-base-uncased`, or namespaced under a
                  user or organization name, like `dbmdz/bert-base-german-cased`.
                - A path to a *directory* containing vocabulary files required by the tokenizer, for instance saved
                  using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`.
        dataset (`Union[List[str]]`, *optional*):
            The dataset used for quantization. You can provide your own dataset as a list of strings or use one of
            the original datasets from the GPTQ paper: ['wikitext2','c4','c4-new','ptb','ptb-new'].
        group_size (`int`, *optional*, defaults to 128):
            The group size to use for quantization. Recommended value is 128; -1 uses per-column quantization.
        damp_percent (`float`, *optional*, defaults to 0.1):
            The percent of the average Hessian diagonal to use for dampening. Recommended value is 0.1.
        desc_act (`bool`, *optional*, defaults to `False`):
            Whether to quantize columns in order of decreasing activation size. Setting it to `False` can
            significantly speed up inference, but the perplexity may become slightly worse. Also known as act-order.
        sym (`bool`, *optional*, defaults to `True`):
            Whether to use symmetric quantization.
        true_sequential (`bool`, *optional*, defaults to `True`):
            Whether to perform sequential quantization even within a single Transformer block. Instead of quantizing
            the entire block at once, quantization is performed layer by layer. As a result, each layer is quantized
            using inputs that have passed through the previously quantized layers.
        use_cuda_fp16 (`bool`, *optional*, defaults to `False`):
            Whether or not to use the optimized CUDA kernel for fp16 models. Requires the model to be in fp16.
        model_seqlen (`int`, *optional*):
            The maximum sequence length that the model can take.
        block_name_to_quantize (`str`, *optional*):
            The transformers block name to quantize.
        module_name_preceding_first_block (`List[str]`, *optional*):
            The layers that precede the first Transformer block.
        batch_size (`int`, *optional*, defaults to 1):
            The batch size used when processing the dataset.
        pad_token_id (`int`, *optional*):
            The pad token id. Needed to prepare the dataset when `batch_size` > 1.
        disable_exllama (`bool`, *optional*, defaults to `False`):
            Whether to disable the exllama backend. The exllama backend only works with `bits` = 4.
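
As a rough illustration of how `bits`, `group_size`, and `sym` interact, the sketch below applies plain round-to-nearest quantization to a single weight row. It is not the GPTQ algorithm: it omits the Hessian-based error compensation that `damp_percent`, `desc_act`, and `true_sequential` relate to. The helper names `quantize_row` and `dequantize_row` are hypothetical and not part of any library, and the sketch assumes that `group_size=-1` means one group spanning the whole row.

```python
from typing import List, Tuple


def quantize_row(
    weights: List[float], bits: int = 4, group_size: int = 128, sym: bool = True
) -> Tuple[List[int], List[float], List[int]]:
    """Round-to-nearest, group-wise quantization of one weight row (illustrative only)."""
    if group_size == -1:
        group_size = len(weights)  # assumption: -1 means a single group for the row
    max_q = 2**bits - 1  # largest integer code, e.g. 15 for 4 bits
    codes: List[int] = []
    scales: List[float] = []
    zeros: List[int] = []
    for start in range(0, len(weights), group_size):
        group = weights[start : start + group_size]
        # The representable range always includes 0.
        lo, hi = min(min(group), 0.0), max(max(group), 0.0)
        if sym:
            # Symmetric: use the same bound on both sides of zero.
            bound = max(abs(lo), abs(hi))
            lo, hi = -bound, bound
        if lo == hi:
            lo, hi = -1.0, 1.0  # all-zero group: avoid a zero scale
        scale = (hi - lo) / max_q
        zero = (max_q + 1) // 2 if sym else round(-lo / scale)
        scales.append(scale)
        zeros.append(zero)
        # Each group is quantized with its own scale and zero point.
        codes.extend(min(max(round(w / scale) + zero, 0), max_q) for w in group)
    return codes, scales, zeros


def dequantize_row(
    codes: List[int], scales: List[float], zeros: List[int], group_size: int
) -> List[float]:
    """Map integer codes back to approximate float weights."""
    if group_size == -1:
        group_size = len(codes)
    return [
        (q - zeros[i // group_size]) * scales[i // group_size]
        for i, q in enumerate(codes)
    ]


# Small groups adapt the scale to local magnitudes; one large group does not.
row = [0.1, -0.2, 0.3, 4.0, -8.0, 2.0]
codes, scales, zeros = quantize_row(row, bits=4, group_size=3, sym=True)
restored = dequantize_row(codes, scales, zeros, group_size=3)
```

In this sketch, a smaller `group_size` gives each group its own scale, so a group of small weights is not forced onto the coarse grid needed by a group with large weights. The cost is storing more scales and zero points.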