g rate to get the waveform using *ffmpeg*. This requires *ffmpeg* to be installed on the system. - `bytes` it is supposed to be the content of an audio file and is interpreted by *ffmpeg* in the same way. - (`np.ndarray` of shape (n, ) of type `np.float32` or `np.float64`) Raw audio at the correct sampling rate (no further check will be done) - `dict` form can be used to pass raw audio sampled at arbitrary `sampling_rate` and let this pipeline do the resampling. The dict must be in the format `{"sampling_rate": int, "raw": np.array}` with optionally a `"stride": (left: int, right: int)` than can ask the pipeline to treat the first `left` samples and last `right` samples to be ignored in decoding (but used at inference to provide more context to the model). Only use `stride` with CTC models. return_timestamps (*optional*, `str`): Only available for pure CTC models. If set to `"char"`, the pipeline will return `timestamps` along the text for every character in the text. For instance if you get `[{"text": "h", "timestamps": (0.5,0.6), {"text": "i", "timestamps": (0.7, .9)}]`, then it means the model predicts that the letter "h" was pronounced after `0.5` and before `0.6` seconds. If set to `"word"`, the pipeline will return `timestamps` along the text for every word in the text. For instance if you get `[{"text": "hi ", "timestamps": (0.5,0.9), {"text": "there", "timestamps": (1.0, .1.5)}]`, then it means the model predicts that the word "hi" was pronounced after `0.5` and before `0.9` seconds. generate_kwargs (`dict`, *optional*): The dictionary of ad-hoc parametrization of `generate_config` to be used for the generation call. For a complete overview of generate, check the [following guide](https://huggingface.co/docs/transformers/en/main_classes/text_generation). max_new_tokens (`int`, *optional*): The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt. Return: `Dict`: A dictionary with the following keys: - **text** (`str` ) -- The recognized text. - **chunks** (*optional(, `List[Dict]`) When using `return_timestamps`, the `chunks` will become a list containing all the various text chunks identified by the model, *e.g.* `[{"text": "hi ", "timestamps": (0.5,0.9), {"text": "there", "timestamps": (1.0, 1.5)}]`. The original full text can roughly be recovered by doing `"".join(chunk["text"] for chunk in output["chunks"])`. )