Finetuning Speech-to-Text models: a Blueprint by Mozilla.ai for building your own STT/ASR dataset & model
This Blueprint enables you to create your own Speech-to-Text dataset and model, optimizing performance for your specific language and use case. Everything runs locally, even on your laptop, ensuring your data stays private. You can finetune a model using your own data or leverage the Common Voice dataset, which supports a wide range of languages. To see the full list of supported languages, visit the Common Voice website.
Input speech audio: *(demo audio clip)*
Text output:
| Ground Truth | openai/whisper-small | mozilla-ai/whisper-small-gl * |
|---|---|---|
| O Comité Económico e Social Europeo deu luz verde esta terza feira ao uso de galego, euskera e catalán nas súas sesións plenarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo de Uluz Verde está terza feira a Ousse de Gallego e Uskera e Catalan a súas asesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo deu luz verde esta terza feira ao uso de galego e usquera e catalán nas súas sesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. |
\* Finetuned on the Galician set of Common Voice 17.0
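Differences like those in the table above are usually quantified with Word Error Rate (WER), which counts word-level substitutions, deletions and insertions against the reference transcription. Finetuning pipelines in the Hugging Face ecosystem typically compute it with the `evaluate` or `jiwer` packages; the following is only a minimal, dependency-free sketch of what the metric measures:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat sat down")` is 1/3: one insertion over three reference words. Lower is better, and a finetuned model should show a lower WER than the base model on its target language.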
👀 You can find a list of finetuned models, created by this Blueprint, on our HuggingFace collection.
| Finetune an STT model on Google Colab | Transcribe using a HuggingFace model | Explore all the functionality on GitHub Codespaces |
|---|---|---|
The same instructions apply to the GitHub Codespaces option.
- Use a virtual environment and install the dependencies: `pip install -e .`. You will also need ffmpeg, e.g. on Ubuntu: `sudo apt install ffmpeg`, on macOS: `brew install ffmpeg`.
- Simply execute: `python demo/transcribe_app.py`
- Add the HF model id of your choice
- Record a sample of your voice and get the transcribed text back
- Create your own local, custom dataset by running this command and following the instructions: `python src/speech_to_text_finetune/make_custom_dataset_app.py`
- Configure `config.yaml` with the model, custom data directory, and hyperparameters of your choice. Note that if you set `push_to_hub: True`, you need an HF account and must be logged in locally.
- Finetune a model by running: `python src/speech_to_text_finetune/finetune_whisper.py`
- Test the finetuned model in the transcription app: `python demo/transcribe_app.py`
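For reference, here is a sketch of what `config.yaml` might contain. Apart from `push_to_hub: True`, which the steps above mention, the field names below are assumptions for illustration only; consult the `config.yaml` shipped with the repository for the exact schema:

```yaml
model_id: openai/whisper-small      # base model to finetune (field name assumed)
dataset_id: ./my_custom_dataset     # custom data directory (field name assumed)
language: Galician                  # target language (field name assumed)
push_to_hub: False                  # if True, requires an HF account and local login
training_hp:                        # trainer hyperparameters (field names assumed)
  num_train_epochs: 3
  learning_rate: 1.0e-5
```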
There are two ways to download the Common Voice dataset:
- Go to https://commonvoice.mozilla.org/en/datasets, pick your language and dataset version and download the dataset
- Move the zipped file under a directory of your choice and extract it
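After extraction you should see a layout roughly like the following. The top-level folder name encodes the release version and date, `gl` is the Galician locale code used in the example above, and the exact set of TSV files can vary between releases:

```
cv-corpus-<version>/gl/
├── clips/            # the individual audio recordings (mp3)
├── validated.tsv     # contributor-validated clip path + sentence pairs
├── train.tsv         # suggested train split
├── dev.tsv           # suggested validation split
└── test.tsv          # suggested test split
```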
Note: A Hugging Face account is required.
Note 2: The Common Voice dataset is not properly maintained on HuggingFace, and the latest release there is a much older version.
- Go to the Common Voice dataset repo and request explicit access (requests should be approved instantly).
- On Hugging Face, create an Access Token, then in your terminal run `huggingface-cli login` and follow the instructions to log in to your account.
- Configure `config.yaml` with the model, the extracted Common Voice directory OR the HF dataset repo id, and hyperparameters of your choice.
- Finetune a model by running: `python src/speech_to_text_finetune/finetune_whisper.py`
- Test the finetuned model in the transcription app: `python demo/transcribe_app.py`
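Conceptually, the finetuning step turns the Common Voice index into (audio path, sentence) training pairs. The sketch below illustrates that mapping with Python's standard library only; the real loader in `finetune_whisper.py` will differ, but `client_id`, `path` and `sentence` are standard Common Voice TSV columns:

```python
import csv
import io

# A tiny excerpt shaped like Common Voice's validated.tsv (tab-separated).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\n"
    "abc123\tcommon_voice_gl_1.mp3\tdeu luz verde esta terza feira\n"
)

def to_training_pairs(tsv_text: str, clips_dir: str = "clips") -> list[tuple[str, str]]:
    """Map each TSV row to an (audio file path, transcription) pair."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(f"{clips_dir}/{row['path']}", row["sentence"]) for row in reader]

pairs = to_training_pairs(SAMPLE_TSV)
```

Each pair's audio file is then resampled and fed to the model alongside its sentence as the training target.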
Tip: Run `python demo/model_comparison_app.py` to easily compare the performance of two models side by side (example).
If you run into issues or bugs, check our Troubleshooting section before opening a new issue.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.