mozilla-ai/speech-to-text-finetune

Finetuning Speech-to-Text models: a Blueprint by Mozilla.ai for building your own STT/ASR dataset & model

This blueprint enables you to create your own Speech-to-Text dataset and model, optimizing performance for your specific language and use case. Everything runs locally, even on your laptop, so your data stays private. You can finetune a model using your own data or leverage the Common Voice dataset, which supports a wide range of languages. To see the full list of supported languages, visit the Common Voice website.

speech-to-text-finetune Diagram

Example result on Galician

Input Speech audio:

audio.online-video-cutter.com.mp4

Text output:

Ground Truth: O Comité Económico e Social Europeo deu luz verde esta terza feira ao uso de galego, euskera e catalán nas súas sesións plenarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión.

openai/whisper-small: O Comité Económico Social Europeo de Uluz Verde está terza feira a Ousse de Gallego e Uskera e Catalan a súas asesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión.

mozilla-ai/whisper-small-gl *: O Comité Económico Social Europeo deu luz verde esta terza feira ao uso de galego e usquera e catalán nas súas sesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión.

* Finetuned on the Galician set of Common Voice 17.0

👀 You can find a list of finetuned models, created by this Blueprint, on our HuggingFace collection.

Quick-start

Finetune an STT model on Google Colab: Try Finetuning on Colab
Transcribe using a HuggingFace model: Try on Spaces
Explore all the functionality on GitHub Codespaces: Try on Codespaces

Try it locally

The same instructions apply for the GitHub Codespaces option.

Setup

  1. Create and activate a virtual environment, then install the Python dependencies with pip install -e . and install ffmpeg separately, e.g. on Ubuntu: sudo apt install ffmpeg, on Mac: brew install ffmpeg
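
Since ffmpeg is installed outside of pip, it can be worth confirming that it is visible on your PATH before launching the apps. The following is a minimal sketch using only the Python standard library; the helper name find_ffmpeg is hypothetical and not part of this repository.

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable on PATH, or None if it is missing."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        # Install hints mirror the setup step above.
        print("ffmpeg not found. Ubuntu: sudo apt install ffmpeg, Mac: brew install ffmpeg")
    else:
        print(f"ffmpeg found at {path}")
```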

Evaluate existing STT models from the HuggingFace repository.

  1. Simply execute: python demo/transcribe_app.py
  2. Add the HF model id of your choice
  3. Record a sample of your voice and get the transcribed text back
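
If you prefer a script over the app, the same kind of evaluation can be sketched with the transformers automatic-speech-recognition pipeline. This is an illustration under assumptions, not the implementation of demo/transcribe_app.py: the transcribe helper and its asr parameter are hypothetical, the default model id is the one from the Galician example above (assumed to be available on the Hub), and transformers must be installed.

```python
def transcribe(audio_path, model_id="mozilla-ai/whisper-small-gl", asr=None):
    """Transcribe one audio file and return the recognized text.

    `asr` is an optional, already-loaded pipeline (or any callable that returns
    a dict with a "text" key), so the model is not reloaded on every call.
    """
    if asr is None:
        # Imported lazily so this module can be loaded without transformers installed.
        from transformers import pipeline

        asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path)
    return result["text"].strip()


# Example call, assuming a local recording named sample.wav exists:
# print(transcribe("sample.wav"))
```

When given a file path, the pipeline relies on ffmpeg to decode the audio, which is one reason the setup step installs it. Recordings much longer than about 30 seconds may need the pipeline's chunking options.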

Making your own STT model using Custom Data

  1. Create your own, local, custom dataset by running this command and following the instructions: python src/speech_to_text_finetune/make_custom_dataset_app.py
  2. Configure config.yaml with the model, custom data directory and hyperparameters of your choice. Note that if you select push_to_hub: True you need to have an HF account and log in locally.
  3. Finetune a model by running: python src/speech_to_text_finetune/finetune_whisper.py
  4. Test the finetuned model in the transcription app: python demo/transcribe_app.py

Making your own STT model using Common Voice

There are two ways to download the Common Voice dataset:

From Common Voice's website (Recommended)

  1. Go to https://commonvoice.mozilla.org/en/datasets, pick your language and dataset version and download the dataset
  2. Move the zipped file under a directory of your choice and extract it
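
The extraction step can also be scripted. The sketch below uses shutil.unpack_archive from the standard library, which picks the archive format from the file extension (for example .zip or .tar.gz). The function name extract_dataset and the paths in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def extract_dataset(archive_path, target_dir):
    """Extract a downloaded dataset archive into target_dir.

    Returns the sorted names of the top-level entries found in target_dir afterwards.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # The archive format is inferred from the file extension.
    shutil.unpack_archive(str(archive_path), str(target))
    return sorted(entry.name for entry in target.iterdir())


# Example call with placeholder paths (adjust to your own download):
# print(extract_dataset("downloads/common_voice_gl.tar.gz", "data/common_voice_gl"))
```

The extracted directory is what you then point config.yaml at.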

From HuggingFace

Note: A Hugging Face account is required.

Note 2: The Common Voice dataset is not properly maintained on HuggingFace, and the most recent release available there is much older than the one on the Common Voice website.

  1. Go to the Common Voice dataset repo and request access (approval should be instant).
  2. On Hugging Face, create an Access Token. Then, in your terminal, run huggingface-cli login and follow the instructions to log in to your account.

After you have completed the steps above

  1. Configure config.yaml with the model, either the extracted Common Voice directory or the HF dataset repo id, and the hyperparameters of your choice.
  2. Finetune a model by running: python src/speech_to_text_finetune/finetune_whisper.py
  3. Test the finetuned model in the transcription app: python demo/transcribe_app.py

Tip

Run python demo/model_comparison_app.py to easily compare the performance of two models side by side (example).
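
To put a number on such a comparison, a common metric for speech recognition is the word error rate (WER): the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. This README does not say which metric the comparison app reports, so the following is only a minimal standard-library sketch. The function name word_error_rate is hypothetical, and the text normalization is limited to lowercasing and whitespace splitting.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Short fragments taken from the Galician example in this README.
    truth = "deu luz verde esta terza feira"
    print(word_error_rate(truth, "de Uluz Verde está terza feira"))
    print(word_error_rate(truth, "deu luz verde esta terza feira"))
```

A lower value is better, and 0.0 means the output matches the reference word for word. Punctuation is not stripped here, so a production evaluation would likely add a normalization step first.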

Troubleshooting

If you run into issues or bugs, check our Troubleshooting section before opening a new issue.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.

Contributing

Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.