Listen, attend and spell

Minimal TensorFlow 2.0 implementation of Listen, attend and spell (https://arxiv.org/abs/1508.01211). The model's variable names follow the paper, so please refer to it for their meaning.

Done:

  • Model architecture looks right to me. If you find an error in the code, please don't hesitate to open an issue 😊

ToDo:

  • Implement data handling for easier training of the model.
  • Train on LibriSpeech 100h
  • Implement SpecAugment features (previous SOTA on LibriSpeech) (https://arxiv.org/abs/1904.08779)
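SpecAugment is still on the ToDo list, so nothing below exists in this repository. As a rough idea of what the masking part could look like, here is a minimal NumPy sketch that applies one frequency mask and one time mask to a single spectrogram of shape (timesteps, f_1). The function name `spec_augment`, its parameters, and the default mask sizes are hypothetical and chosen for illustration; time warping from the paper is left out, and masked regions are simply set to zero, which assumes the features are roughly mean-normalized.

```python
import numpy as np


def spec_augment(spec, max_freq_mask=27, max_time_mask=100, rng=None):
    """Return a masked copy of a (timesteps, f_1) spectrogram.

    Hypothetical helper: applies one frequency mask and one time mask.
    The default mask sizes are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    timesteps, n_features = out.shape

    # Frequency masking: zero out f consecutive feature channels.
    f = rng.integers(0, min(max_freq_mask, n_features) + 1)
    f0 = rng.integers(0, n_features - f + 1)
    out[:, f0:f0 + f] = 0.0

    # Time masking: zero out t consecutive timesteps.
    t = rng.integers(0, min(max_time_mask, timesteps) + 1)
    t0 = rng.integers(0, timesteps - t + 1)
    out[t0:t0 + t, :] = 0.0

    return out
```

Applied to each training example before batching, e.g. `spec_augment(x_1[0])`, this would leave the input shape unchanged, so the model itself would not need to be modified.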

Usage

The file model.py contains the architecture of the model. Example usage below.

"""
def LAS(dim, f_1, no_tokens):
  dim: Number of hidden units used by most of the LSTMs.
  f_1: Number of mel-spectrogram features per timestep. The pBLSTM takes input of shape
       (batch_size, timesteps, f_1), where timesteps is the width of the spectrogram.
  no_tokens: Number of unique tokens for the input and output vectors.
"""

import numpy as np
from model import LAS  # model.py contains the architecture

model = LAS(256, 256, 16)
model.compile(loss="mse", optimizer="adam")

# x_1 should have shape (batch_size, timesteps, f_1)
x_1 = np.random.random((1, 550, 256))

# x_2 should have shape (batch_size, no_prev_tokens, no_tokens). The token vectors should be one-hot encoded.
x_2 = np.zeros((1,12,16))
for n in range(12):
  x_2[0, n, np.random.randint(1, 16)] = 1

# By passing x_1 and x_2, the model will predict the 12th token
# from the spectrogram and the previously predicted tokens.

model.predict([x_1, x_2])
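
Since the model predicts one token from the spectrogram and the tokens predicted so far, a full transcription can be built by feeding each prediction back in. The sketch below shows one way to do that with greedy decoding. It is not part of this repository: `one_hot`, `greedy_decode`, `start_id`, `end_id` and `max_len` are hypothetical names, and `predict_next` is assumed to be a callable that takes `(x_1, x_2)` and returns a 1-D array of `no_tokens` scores for the next token.

```python
import numpy as np


def one_hot(token_ids, no_tokens):
    """Encode token ids as a one-hot array of shape (1, len(token_ids), no_tokens)."""
    x = np.zeros((1, len(token_ids), no_tokens))
    x[0, np.arange(len(token_ids)), token_ids] = 1
    return x


def greedy_decode(predict_next, x_1, no_tokens, start_id, end_id, max_len=50):
    """Greedily pick the highest-scoring token at each step.

    predict_next(x_1, x_2) is assumed to return a 1-D array of no_tokens scores.
    Decoding stops once end_id is produced or max_len tokens have been added.
    """
    tokens = [start_id]
    for _ in range(max_len):
        x_2 = one_hot(tokens, no_tokens)
        scores = predict_next(x_1, x_2)
        next_id = int(np.argmax(scores))
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens
```

To use this with the model above, `predict_next` would wrap `model.predict([x_1, x_2])` and return the scores for the next token; how to slice the output depends on the output shape of the model, which should be checked in model.py.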