Speech To Text

A project that combines voice activity detection (VAD), speaker diarization, and speech recognition.

Getting Started

The DeepSpeech pre-trained English model is too big to commit to git. You can download it from https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/deepspeech-0.6.1-models.tar.gz

Put the downloaded model files into the folder speech-to-text/deepspeech/models

# Create and activate a virtual environment
python3 -m venv speech-to-text/env
source speech-to-text/env/bin/activate

# Install prerequisites
pip3 install -r requirements.txt

# Speech To Text
python3 speech_to_text.py --audio=wavs/test2.wav

It outputs a txt file with the speakers and their speech text, next to the wav file.

Overview

It simply combines speaker diarization and speech recognition.
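As a rough illustration of what combining the two means, the sketch below attaches a speaker label to each recognized segment by picking the speaker turn that overlaps it most in time. The function names, the tuple layouts, and the overlap rule are all assumptions made for this example; they are not taken from speech_to_text.py.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(transcripts, speaker_turns):
    """Attach a speaker to each transcribed segment.

    transcripts:   list of (start, end, text) tuples (hypothetical layout)
    speaker_turns: list of (start, end, speaker) tuples (hypothetical layout)
    Returns a list of (speaker, text); speaker is None when nothing overlaps.
    """
    labeled = []
    for start, end, text in transcripts:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in speaker_turns:
            shared = overlap(start, end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled
```

For example, label_segments([(0.0, 2.0, "hello")], [(0.0, 2.2, 0)]) pairs the text "hello" with speaker 0.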

Only PCM wav files with a 16 kHz sample rate are supported. You can use ffmpeg to convert other audio formats, e.g.

ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 output.wav
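Before running a long job it can help to check that a file really has the expected format. The sketch below uses only the standard library wave module; check_wav is a hypothetical helper name, and the 16-bit check is inferred from the pcm_s16le option in the ffmpeg command above rather than from the project code.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return a list of problems found; an empty list means the file looks usable."""
    try:
        wf = wave.open(path, "rb")
    except wave.Error as exc:
        # The wave module only reads uncompressed PCM wav files.
        return ["not a readable PCM wav file: %s" % exc]
    problems = []
    with wf:
        if wf.getframerate() != expected_rate:
            problems.append(
                "sample rate is %d Hz, expected %d Hz"
                % (wf.getframerate(), expected_rate)
            )
        if wf.getsampwidth() != 2:
            # pcm_s16le means 16-bit samples, i.e. 2 bytes per sample.
            problems.append(
                "sample width is %d bytes, expected 2" % wf.getsampwidth()
            )
    return problems
```

Calling check_wav("output.wav") on a file produced by the ffmpeg command above should return an empty list.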

Main flows

Filter out silent frames and split the audio into segments with webrtcvad.
Generate utterance spectrograms with librosa.
Extract utterance features with ghostvlad.
Classify the features into speakers with the uisrnn model.
Recognize speech segment by segment with deepspeech.
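The first step of the flow can be sketched as follows: cut the 16-bit PCM data into fixed-size frames, ask a detector whether each frame contains speech, and merge consecutive speech frames into segments. This is a simplified sketch, not the logic in wavSplit.py. The function names are hypothetical, and energy_is_speech is a crude stand-in detector so the example runs without extra packages. With webrtcvad installed, a callable such as lambda frame: vad.is_speech(frame, 16000), where vad = webrtcvad.Vad(), could be passed as is_speech instead.

```python
import struct

SAMPLE_RATE = 16000
FRAME_MS = 30          # webrtcvad accepts frames of 10, 20 or 30 ms
BYTES_PER_SAMPLE = 2   # 16-bit PCM


def split_frames(pcm, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Yield (start_seconds, frame_bytes) for each complete frame in pcm."""
    frame_bytes = int(sample_rate * frame_ms / 1000) * BYTES_PER_SAMPLE
    for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        start = offset / (sample_rate * BYTES_PER_SAMPLE)
        yield start, pcm[offset:offset + frame_bytes]


def energy_is_speech(frame, threshold=500):
    """Stand-in detector (assumption): mean absolute amplitude above a threshold."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return sum(abs(s) for s in samples) / len(samples) > threshold


def voiced_segments(pcm, is_speech=energy_is_speech,
                    sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Group consecutive speech frames into (start, end) segments in seconds."""
    segments = []
    frame_len = frame_ms / 1000
    current_start = None
    end = 0.0
    for start, frame in split_frames(pcm, sample_rate, frame_ms):
        if is_speech(frame):
            if current_start is None:
                current_start = start      # a new segment begins here
            end = start + frame_len
        elif current_start is not None:
            segments.append((current_start, end))
            current_start = None
    if current_start is not None:          # audio ended while still in speech
        segments.append((current_start, end))
    return segments
```

Each (start, end) pair can then be used to slice the audio that is passed on to the later steps.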

It might take a long time (tens of minutes) if the wav file is large; the uisrnn step seems to take the longest. The test wavs in the wavs folder are sound clips from movies. The recognition accuracy is not perfect, which may be related to the pre-trained deepspeech model and to background noise.

Prerequisites

pytorch
keras
tensorflow
pyaudio
librosa
webrtcvad
deepspeech

References

Tip

The following are the library versions installed in my environment, for your reference.

absl-py 0.7.1
astor 0.8.0
astroid 2.4.2
audioread 2.1.8
cffi 1.12.3
decorator 4.4.0
deepspeech 0.5.1
gast 0.2.2
google-pasta 0.1.7
grpcio 1.22.0
h5py 2.9.0
isort 5.6.4
joblib 0.13.2
Keras 2.2.4
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.0
lazy-object-proxy 1.4.3
librosa 0.7.0
llvmlite 0.29.0
Markdown 3.1.1
mccabe 0.6.1
numba 0.45.0
numpy 1.16.4
pip 19.2
protobuf 3.9.0
PyAudio 0.2.11
pycparser 2.19
pylint 2.6.0
PyYAML 5.1.1
resampy 0.2.1
scikit-learn 0.21.2
scipy 1.3.0
setuptools 41.0.1
six 1.12.0
SoundFile 0.10.2
tensorboard 1.14.0
tensorflow 1.14.0
tensorflow-estimator 1.14.0
termcolor 1.1.0
toml 0.10.1
torch 1.1.0.post2
typed-ast 1.4.1
webrtcvad 2.0.10
Werkzeug 0.15.5
wheel 0.33.4
wrapt 1.11.2
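To compare your own environment against the list above, a small script along these lines may help. It is only a sketch: REFERENCE_VERSIONS copies a few entries from the list, the function names are hypothetical, and importlib.metadata needs Python 3.8 or later (on older interpreters the importlib_metadata backport is assumed to be installed).

```python
try:
    from importlib import metadata
except ImportError:  # Python < 3.8: assumes the importlib_metadata backport
    import importlib_metadata as metadata

# A few entries copied from the version list above.
REFERENCE_VERSIONS = {
    "deepspeech": "0.5.1",
    "Keras": "2.2.4",
    "librosa": "0.7.0",
    "tensorflow": "1.14.0",
    "torch": "1.1.0.post2",
    "webrtcvad": "2.0.10",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(reference=REFERENCE_VERSIONS):
    """Return {package: (expected, installed)} for every package that differs."""
    report = {}
    for package, expected in reference.items():
        found = installed_version(package)
        if found != expected:
            report[package] = (expected, found)
    return report
```

An empty result from compare_versions() means the checked packages match the reference versions; a None in the installed position means the package is not installed.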

