Besides speaker-related tasks, speaker embeddings can be used for many other tasks that require speaker modeling, such as:
- voice conversion
- text-to-speech
- speaker adaptive ASR
- target speaker extraction
For users who would like to verify the speaker verification (SV) performance or extract speaker embeddings for the above tasks without having to train the speaker embedding learner themselves, we provide two types of pretrained models.
- Checkpoint Model, with suffix `.pt`: the model trained and saved as a checkpoint by the WeSpeaker Python code. You can use it to reproduce our published results, or as a starting checkpoint to continue training.
- Runtime Model, with suffix `.onnx`: the runtime model is exported by Onnxruntime from the checkpoint model.
The pretrained models in WeSpeaker follow the license of their corresponding dataset.
For example, the pretrained model on VoxCeleb follows the Creative Commons Attribution 4.0 International License, since that is the license of the VoxCeleb dataset; see https://mm.kaist.ac.kr/datasets/voxceleb/.
To use a pretrained model in PyTorch format, please refer directly to the `run.sh` in the corresponding recipe.
To extract speaker embeddings from the ONNX model, the following is a toy example.
# Download the pretrained model in onnx format and save it as onnx_path
# wav_path is the path to your wave file (16 kHz sampling rate)
python wespeaker/bin/infer_onnx.py --onnx_path $onnx_path --wav_path $wav_path
You can easily adapt `infer_onnx.py` to your application; a speaker diarization example can be found in the voxconverse recipe.
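Once embeddings have been extracted, one common way to use them for speaker verification is to score a pair of embeddings with cosine similarity. The following is a minimal sketch of that idea, not code taken from WeSpeaker: the function name `cosine_score` and the toy vectors are hypothetical, and it assumes each embedding is available as a plain sequence of floats of the same dimension.

```python
import math
from typing import Sequence


def cosine_score(emb_a: Sequence[float], emb_b: Sequence[float]) -> float:
    """Return the cosine similarity between two embeddings of equal length.

    Hypothetical helper for illustration; not part of the WeSpeaker code base.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 4-dimensional vectors standing in for real speaker embeddings.
    enroll = [0.1, 0.3, -0.2, 0.5]
    test = [0.2, 0.6, -0.4, 1.0]
    print(cosine_score(enroll, test))
```

A higher score suggests that the two utterances are more likely to come from the same speaker. Any decision threshold is application-specific and would need to be tuned on held-out data.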
Models with the suffix LM have been further fine-tuned using large-margin fine-tuning, which can improve performance on longer audio, e.g. utterances longer than 3 s.
Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
---|---|---|---|
VoxCeleb | EN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
VoxCeleb | EN | ResNet152_LM | ResNet152_LM |
VoxCeleb | EN | ResNet221_LM | ResNet221_LM |
VoxCeleb | EN | ResNet293_LM | ResNet293_LM |
VoxCeleb | EN | CAM++ / CAM++_LM | CAM++ / CAM++_LM |
VoxCeleb | EN | ECAPA512 / ECAPA512_LM / ECAPA512_DINO | ECAPA512 / ECAPA512_LM |
VoxCeleb | EN | ECAPA1024 / ECAPA1024_LM | ECAPA1024 / ECAPA1024_LM |
VoxCeleb | EN | Gemini_DFResnet114_LM | Gemini_DFResnet114_LM |
CNCeleb | CN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
VoxBlink2 | Multilingual | SimAMResNet34 | SimAMResNet34 |
VoxBlink2 (pretrain) + VoxCeleb2 (finetune) | Multilingual | SimAMResNet34 | SimAMResNet34 |
VoxBlink2 | Multilingual | SimAMResNet100 | SimAMResNet100 |
VoxBlink2 (pretrain) + VoxCeleb2 (finetune) | Multilingual | SimAMResNet100 | SimAMResNet100 |
Datasets | Languages | Checkpoint (pt) | Runtime Model (onnx) |
---|---|---|---|
VoxCeleb | EN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |
VoxCeleb | EN | ResNet152_LM | ResNet152_LM |
VoxCeleb | EN | ResNet221_LM | ResNet221_LM |
VoxCeleb | EN | ResNet293_LM | ResNet293_LM |
VoxCeleb | EN | CAM++ / CAM++_LM | CAM++ / CAM++_LM |
VoxCeleb | EN | ECAPA512 / ECAPA512_LM | ECAPA512 / ECAPA512_LM |
VoxCeleb | EN | ECAPA1024 / ECAPA1024_LM | ECAPA1024 / ECAPA1024_LM |
VoxCeleb | EN | Gemini_DFResnet114_LM | Gemini_DFResnet114_LM |
CNCeleb | CN | ResNet34 / ResNet34_LM | ResNet34 / ResNet34_LM |