
Commit 3b2da7d

pavel-esirtsavina and Tatiana Savina authored
[docs] corrected DeepSpeech conversion (#6011)
* corrected output names in DeepSpeech conversion doc
* mo args correction
* changed instruction for DeepSpeech version 0.8.2
* added venv activate; removed redundant ending
* added picture and squashed MO graph input args into one
* Apply suggestions from code review (Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>)
* applied review comments (Co-authored-by: Tatiana Savina <tatiana.savina@intel.com>)
1 parent: 6799a31 · commit: 3b2da7d

File tree

3 files changed: +55 −40 lines changed


docs/MO_DG/img/DeepSpeech-0.8.2.png

+3

docs/MO_DG/img/DeepSpeech.png

-3
This file was deleted.

docs/MO_DG/prepare_model/convert_model/tf_specific/Convert_DeepSpeech_From_Tensorflow.md

+52-37
@@ -2,66 +2,81 @@
 
 [DeepSpeech project](https://github.com/mozilla/DeepSpeech) provides an engine to train speech-to-text models.
 
-## Download the Pre-Trained DeepSpeech Model
+## Download the Pretrained DeepSpeech Model
 
-[Pre-trained English speech-to-text model](https://github.com/mozilla/DeepSpeech#getting-the-pre-trained-model)
-is publicly available. To download the model, please follow the instruction below:
+Create a directory where the model and the metagraph with pretrained weights will be stored:
+```
+mkdir deepspeech
+cd deepspeech
+```
+A [pretrained English speech-to-text model](https://github.com/mozilla/DeepSpeech/releases/tag/v0.8.2) is publicly available.
+To download the model, follow the instructions below:
 
 * For UNIX*-like systems, run the following command:
 ```
-wget -O - https://github.com/mozilla/DeepSpeech/releases/download/v0.3.0/deepspeech-0.3.0-models.tar.gz | tar xvfz -
+wget -O - https://github.com/mozilla/DeepSpeech/archive/v0.8.2.tar.gz | tar xvfz -
+wget -O - https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-checkpoint.tar.gz | tar xvfz -
 ```
 * For Windows* systems:
-  1. Download the archive from the DeepSpeech project repository: [https://github.com/mozilla/DeepSpeech/releases/download/v0.3.0/deepspeech-0.3.0-models.tar.gz](https://github.com/mozilla/DeepSpeech/releases/download/v0.3.0/deepspeech-0.3.0-models.tar.gz).
-  2. Unpack it with a file archiver application.
+  1. Download the archive with the model: [https://github.com/mozilla/DeepSpeech/archive/v0.8.2.tar.gz](https://github.com/mozilla/DeepSpeech/archive/v0.8.2.tar.gz).
+  2. Download the TensorFlow\* MetaGraph with pretrained weights: [https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-checkpoint.tar.gz](https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-checkpoint.tar.gz).
+  3. Unpack both archives with a file archiver application.
+
+## Freeze the Model into a *.pb File
 
-After you unpack the archive with the pre-trained model, you will have the new `models` directory with the
-following files:
+After unpacking the archives above, you have to freeze the model. Note that this requires
+TensorFlow* version 1, which is not available under Python 3.8, so you need Python 3.7 or lower.
+Before freezing, deploy a virtual environment and install the required packages:
 ```
-alphabet.txt
-lm.binary
-output_graph.pb
-output_graph.pbmm
-output_graph.rounded.pb
-output_graph.rounded.pbmm
-trie
+virtualenv --python=python3.7 venv-deep-speech
+source venv-deep-speech/bin/activate
+cd DeepSpeech-0.8.2
+pip3 install -e .
 ```
+Freeze the model with the following command:
+```
+python3 DeepSpeech.py --checkpoint_dir ../deepspeech-0.8.2-checkpoint --export_dir ../
+```
+After that, you get the pretrained frozen model file `output_graph.pb` in the `deepspeech` directory created at
+the beginning. The model consists of a preprocessing part and a main part. The preprocessing part converts the input
+spectrogram into a form useful for speech recognition (mel). This part of the model is not convertible into
+IR because it contains the unsupported operations `AudioSpectrogram` and `Mfcc`.
 
-Pre-trained frozen model file is `output_graph.pb`.
-
-![DeepSpeech model view](../../../img/DeepSpeech.png)
+The main and most computationally expensive part of the model converts the preprocessed audio into text.
+There are two peculiarities of the supported part of the model.
 
-As you can see, the frozen model still has two variables: `previous_state_c` and
-`previous_state_h`. It means that the model keeps training those variables at each inference.
+The first is that the model contains an input with the sequence length, so it can be converted only with
+a fixed input-length shape and is therefore not reshapeable.
+Refer to [Using Shape Inference](../../../../IE_DG/ShapeInference.md) for details.
 
-At the first inference of this graph, the variables are initialized by zero tensors. After executing the `lstm_fused_cell` nodes, cell state and hidden state, which are the results of the `BlockLSTM` execution, are assigned to these two variables.
+The second is that the frozen model still has two variables, `previous_state_c` and `previous_state_h`; the figure
+with the frozen *.pb model is shown below. It means that the model keeps updating these variables at each inference.
 
-With each inference of the DeepSpeech graph, initial cell state and hidden state data for `BlockLSTM` is taken from previous inference from variables. Outputs (cell state and hidden state) of `BlockLSTM` are reassigned to the same variables.
+![DeepSpeech model view](../../../img/DeepSpeech-0.8.2.png)
 
-It helps the model to remember the context of the words that it takes as input.
+At the first inference, the variables are initialized with zero tensors. After execution, the `BlockLSTM` results
+are assigned to the cell state and hidden state, which are kept in these two variables.
 
-## Convert the TensorFlow* DeepSpeech Model to IR
+## Convert the Main Part of DeepSpeech Model into IR
 
-The Model Optimizer assumes that the output model is for inference only. That is why you should cut those variables off and resolve keeping cell and hidden states on the application level.
+Model Optimizer assumes that the output model is for inference only. That is why you should cut the `previous_state_c`
+and `previous_state_h` variables off and keep the cell and hidden states on the application level.
 
 There are certain limitations for the model conversion:
 - Time length (`time_len`) and sequence length (`seq_len`) are equal.
 - Original model cannot be reshaped, so you should keep original shapes.
 
-To generate the DeepSpeech Intermediate Representation (IR), provide the TensorFlow DeepSpeech model to the Model Optimizer with the following parameters:
+To generate the IR, run the Model Optimizer with the following parameters:
 ```sh
-python3 ./mo_tf.py \
---input_model path_to_model/output_graph.pb \
---freeze_placeholder_with_value input_lengths->[16] \
---input input_node,previous_state_h/read,previous_state_c/read \
---input_shape [1,16,19,26],[1,2048],[1,2048] \
---output raw_logits,lstm_fused_cell/GatherNd,lstm_fused_cell/GatherNd_1 \
+python3 {path_to_mo}/mo_tf.py \
+--input_model output_graph.pb \
+--input "input_lengths->[16],input_node[1 16 19 26],previous_state_h[1 2048],previous_state_c[1 2048]" \
+--output "cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/GatherNd_1,cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/GatherNd,logits" \
 --disable_nhwc_to_nchw
 ```
 
 Where:
-* `--freeze_placeholder_with_value input_lengths->[16]` freezes sequence length
-* `--input input_node,previous_state_h/read,previous_state_c/read` and
-`--input_shape [1,16,19,26],[1,2048],[1,2048]` replace the variables with a placeholder
-* `--output raw_logits,lstm_fused_cell/GatherNd,lstm_fused_cell/GatherNd_1` gets data for the next model
-execution.
+* `input_lengths->[16]` replaces the input node named `input_lengths` with a constant tensor of shape [1] containing a
+single integer value of 16. This means that the model can now consume input sequences of length 16 only.
+* `input_node[1 16 19 26],previous_state_h[1 2048],previous_state_c[1 2048]` replaces the variables with placeholders of the given shapes.
+* `--output ".../GatherNd_1,.../GatherNd,logits"` specifies the output node names.
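Before running Model Optimizer, it can be worth confirming the node names that the `--input` and `--output` arguments refer to. A minimal sketch, assuming it runs in the same TensorFlow* 1.x / Python 3.7 environment used for freezing:

```python
# List the nodes of the frozen graph that the Model Optimizer arguments
# reference: placeholders, state variables, the LSTM cell, and the
# preprocessing ops that cannot be converted to IR.
import tensorflow as tf

graph_def = tf.compat.v1.GraphDef()
with open("output_graph.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if node.op in ("Placeholder", "VariableV2", "BlockLSTM",
                   "AudioSpectrogram", "Mfcc", "GatherNd"):
        print(node.op, node.name)
```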

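Because `AudioSpectrogram` and `Mfcc` stay outside the IR, feature extraction has to happen on the application side. A rough sketch with the `python_speech_features` package: the 26 cepstral coefficients and the 19-frame context window follow from the `[1 16 19 26]` input shape above, while the 16 kHz sample rate and the 32 ms / 20 ms windowing are assumptions based on DeepSpeech defaults, not values confirmed by this document:

```python
# Turn raw audio into [1, 16, 19, 26] chunks matching the converted model input.
import numpy as np
from python_speech_features import mfcc

def make_chunks(audio, rate=16000, numcep=26, context=9, steps=16):
    # 26 MFCC features per frame; window/step durations are assumed defaults
    feats = mfcc(audio, samplerate=rate, winlen=0.032, winstep=0.02,
                 numcep=numcep)                          # [frames, 26]
    # Zero padding so edge frames also get a full 19-frame context window
    feats = np.pad(feats, ((context, context), (0, 0)))
    # 2 * 9 + 1 = 19 frames of context around each time step
    windows = np.stack([feats[i:i + 2 * context + 1]
                        for i in range(len(feats) - 2 * context)])
    usable = (len(windows) // steps) * steps             # drop incomplete tail
    chunks = windows[:usable].reshape(-1, steps, 2 * context + 1, numcep)
    return chunks[:, np.newaxis].astype(np.float32)      # [n, 1, 16, 19, 26]
```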

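Finally, since the conversion cuts `previous_state_c` and `previous_state_h` out of the IR, the application must carry the LSTM states between inferences itself. A minimal sketch of that loop with the Inference Engine Python API (OpenVINO 2021.x): the tensor names follow the Model Optimizer command above but should be checked against the generated IR, the mapping of the two `GatherNd` outputs to hidden versus cell state is an assumption, `audio` is assumed to be a 16 kHz waveform loaded elsewhere, and `make_chunks` is the helper sketched above:

```python
# Run the converted DeepSpeech IR chunk by chunk, feeding the LSTM states
# produced by one inference back in as inputs of the next one.
import numpy as np
from openvino.inference_engine import IECore

LSTM_OUT = "cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell"

ie = IECore()
net = ie.read_network(model="output_graph.xml", weights="output_graph.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

state_h = np.zeros((1, 2048), dtype=np.float32)  # zero tensors, as at the
state_c = np.zeros((1, 2048), dtype=np.float32)  # first inference of the .pb

for chunk in make_chunks(audio):                 # [1, 16, 19, 26] per chunk
    res = exec_net.infer(inputs={
        "input_node": chunk,
        "previous_state_h": state_h,
        "previous_state_c": state_c,
    })
    # Reassign the states the way the removed variables did; verify which
    # GatherNd output is the hidden state and which is the cell state.
    state_h = res[LSTM_OUT + "/GatherNd_1"]
    state_c = res[LSTM_OUT + "/GatherNd"]
    logits = res["logits"]  # pass to a CTC decoder (not shown) to get text
```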