Finetuning Speech-to-Text models: a Blueprint by Mozilla.ai for building your own STT/ASR dataset & model
This Blueprint enables you to create your own Speech-to-Text dataset and model, optimizing performance for your specific language and use case. Everything runs locally, even on your laptop, ensuring your data stays private. You can finetune a model using your own data or leverage the Common Voice dataset, which supports a wide range of languages. To see the full list of supported languages, visit the Common Voice website.
Input speech audio: *(demo audio clip)*
Text output:
| Ground Truth | openai/whisper-small | mozilla-ai/whisper-small-gl * |
|---|---|---|
| O Comité Económico e Social Europeo deu luz verde esta terza feira ao uso de galego, euskera e catalán nas súas sesións plenarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo de Uluz Verde está terza feira a Ousse de Gallego e Uskera e Catalan a súas asesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. | O Comité Económico Social Europeo deu luz verde esta terza feira ao uso de galego e usquera e catalán nas súas sesións planarias, segundo informou o Ministerio de Asuntos Exteriores nun comunicado no que se felicitou da decisión. |
\* Finetuned on the Galician set of Common Voice 17.0
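Differences like those in the table above are usually quantified with Word Error Rate (WER), which counts word-level substitutions, deletions and insertions against the reference transcription. Finetuning pipelines in the Hugging Face ecosystem typically compute it with the `evaluate` or `jiwer` packages; the following is only a minimal, dependency-free sketch of what the metric measures:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the cat sat down")` is 1/3: one insertion over three reference words. Lower is better, and a finetuned model should show a lower WER than the base model on its target language.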
👀 You can find a list of finetuned models, created by this Blueprint, on our HuggingFace collection.
| Finetune an STT model on Google Colab | Transcribe using a HuggingFace model | Explore all the functionality on GitHub Codespaces |
|---|---|---|
The same instructions apply to the GitHub Codespaces option.
- Use a virtual environment and install the dependencies: `pip install -e .`. You will also need ffmpeg, e.g. on Ubuntu: `sudo apt install ffmpeg`, on macOS: `brew install ffmpeg`.
- Simply execute: `python demo/transcribe_app.py`
- Add the HF model id of your choice
- Record a sample of your voice and get the transcribed text back
- Create your own local, custom dataset by running this command and following the instructions: `python src/speech_to_text_finetune/make_custom_dataset_app.py`
- Configure `config.yaml` with the model, custom data directory, and hyperparameters of your choice. Note that if you set `push_to_hub: True`, you need an HF account and must be logged in locally.
- Finetune a model by running: `python src/speech_to_text_finetune/finetune_whisper.py`
- Test the finetuned model in the transcription app: `python demo/transcribe_app.py`
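For reference, here is a sketch of what `config.yaml` might contain. Apart from `push_to_hub: True`, which the steps above mention, the field names below are assumptions for illustration only; consult the `config.yaml` shipped with the repository for the exact schema:

```yaml
model_id: openai/whisper-small      # base model to finetune (field name assumed)
dataset_id: ./my_custom_dataset     # custom data directory (field name assumed)
language: Galician                  # target language (field name assumed)
push_to_hub: False                  # if True, requires an HF account and local login
training_hp:                        # trainer hyperparameters (field names assumed)
  num_train_epochs: 3
  learning_rate: 1.0e-5
```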
There are two ways to download the Common Voice dataset:
- Go to https://commonvoice.mozilla.org/en/datasets, pick your language and dataset version and download the dataset
- Move the zipped file under a directory of your choice and extract it
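After extraction you should see a layout roughly like the following. The top-level folder name encodes the release version and date, `gl` is the Galician locale code used in the example above, and the exact set of TSV files can vary between releases:

```
cv-corpus-<version>/gl/
├── clips/            # the individual audio recordings (mp3)
├── validated.tsv     # contributor-validated clip path + sentence pairs
├── train.tsv         # suggested train split
├── dev.tsv           # suggested validation split
└── test.tsv          # suggested test split
```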
Note: A Hugging Face account is required.
Note 2: The Common Voice dataset is not properly maintained on HuggingFace, and the latest release there is a much older version.
- Go to the Common Voice dataset repo and request explicit access (requests should be approved instantly).
- On Hugging Face, create an Access Token, then in your terminal run `huggingface-cli login` and follow the instructions to log in to your account.
- Configure `config.yaml` with the model, the extracted Common Voice directory OR the HF dataset repo id, and hyperparameters of your choice.
- Finetune a model by running: `python src/speech_to_text_finetune/finetune_whisper.py`
- Test the finetuned model in the transcription app: `python demo/transcribe_app.py`
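Conceptually, the finetuning step turns the Common Voice index into (audio path, sentence) training pairs. The sketch below illustrates that mapping with Python's standard library only; the real loader in `finetune_whisper.py` will differ, but `client_id`, `path` and `sentence` are standard Common Voice TSV columns:

```python
import csv
import io

# A tiny excerpt shaped like Common Voice's validated.tsv (tab-separated).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\n"
    "abc123\tcommon_voice_gl_1.mp3\tdeu luz verde esta terza feira\n"
)

def to_training_pairs(tsv_text: str, clips_dir: str = "clips") -> list[tuple[str, str]]:
    """Map each TSV row to an (audio file path, transcription) pair."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(f"{clips_dir}/{row['path']}", row["sentence"]) for row in reader]

pairs = to_training_pairs(SAMPLE_TSV)
```

Each pair's audio file is then resampled and fed to the model alongside its sentence as the training target.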
Tip: Run `python demo/model_comparison_app.py` to easily compare the performance of two models side by side (example).
If you run into issues or bugs, check our Troubleshooting section before opening a new issue.
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
Contributions are welcome! To get started, you can check out the CONTRIBUTING.md file.