Artificial Intelligence Nanodegree

Voice User Interfaces

Project: Speech Recognition with Neural Networks


In this notebook, some template code has already been provided for you, and you will need to implement additional functionality to successfully complete this project. You will not need to modify the included code beyond what is requested. Sections that begin with '(IMPLEMENTATION)' in the header indicate that the following blocks of code will require additional functionality which you must provide. Please be sure to read the instructions carefully!

Note: Once you have completed all of the code implementations, you need to finalize your work by exporting the Jupyter Notebook as an HTML document. Before exporting the notebook to HTML, all of the code cells need to have been run so that reviewers can see the final implementation and output. You can then export the notebook by using the menu above and navigating to File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question X' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. Markdown cells can be edited by double-clicking the cell to enter edit mode.

The rubric contains optional "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. If you decide to pursue the "Stand Out Suggestions", you should include the code in this Jupyter notebook.


Introduction

In this notebook, you will build a deep neural network that functions as part of an end-to-end automatic speech recognition (ASR) pipeline! Your completed pipeline will accept raw audio as input and return a predicted transcription of the spoken language. The full pipeline is summarized in the figure below.

  • STEP 1 is a pre-processing step that converts raw audio to one of two feature representations that are commonly used for ASR.
  • STEP 2 is an acoustic model which accepts audio features as input and returns a probability distribution over all potential transcriptions. After learning about the basic types of neural networks that are often used for acoustic modeling, you will engage in your own investigations, to design your own acoustic model!
  • STEP 3 in the pipeline takes the output from the acoustic model and returns a predicted transcription.

The Data

We begin by investigating the dataset that will be used to train and evaluate your pipeline. LibriSpeech is a large corpus of read English speech, designed for training and evaluating models for ASR. The dataset contains 1000 hours of speech derived from audiobooks. We will work with a small subset in this project, since training on the full dataset would take a long while. However, after completing this project, if you are interested in exploring further, you are encouraged to work with more of the data that is provided online.

In the code cells below, you will use the vis_train_features module to visualize a training example. The supplied argument index=0 tells the module to extract the first example in the training set. (You are welcome to change index=0 to point to a different training example, if you like, but please DO NOT amend any other code in the cell.) The returned variables are:

  • vis_text - transcribed text (label) for the training example.
  • vis_raw_audio - raw audio waveform for the training example.
  • vis_mfcc_feature - mel-frequency cepstral coefficients (MFCCs) for the training example.
  • vis_spectrogram_feature - spectrogram for the training example.
  • vis_audio_path - the file path to the training example.
In [1]:
from data_generator import vis_train_features

# extract label and audio features for a single training example
vis_text, vis_raw_audio, vis_mfcc_feature, vis_spectrogram_feature, vis_audio_path = vis_train_features()
There are 2023 total training examples.

The following code cell visualizes the audio waveform for your chosen example, along with the corresponding transcript. You also have the option to play the audio in the notebook!

In [2]:
from IPython.display import Markdown, display
from data_generator import vis_train_features, plot_raw_audio
from IPython.display import Audio
%matplotlib inline

# plot audio signal
plot_raw_audio(vis_raw_audio)
# print length of audio signal
display(Markdown('**Shape of Audio Signal** : ' + str(vis_raw_audio.shape)))
# print transcript corresponding to audio clip
display(Markdown('**Transcript** : ' + str(vis_text)))
# play the audio file
Audio(vis_audio_path)

Shape of Audio Signal : (129103,)

Transcript : mister quilter is the apostle of the middle classes and we are glad to welcome his gospel

Out[2]:

STEP 1: Acoustic Features for Speech Recognition

For this project, you won't use the raw audio waveform as input to your model. Instead, we provide code that first performs a pre-processing step to convert the raw audio to a feature representation that has historically proven successful for ASR models. Your acoustic model will accept the feature representation as input.

In this project, you will explore two possible feature representations. After completing the project, if you'd like to read more about deep learning architectures that can accept raw audio input, you are encouraged to explore this research paper.

Spectrograms

The first option for an audio feature representation is the spectrogram. In order to complete this project, you will not need to dig deeply into the details of how a spectrogram is calculated; but, if you are curious, the code for calculating the spectrogram was borrowed from this repository. The implementation appears in the utils.py file in your repository.

The code that we give you returns the spectrogram as a 2D tensor, where the first (vertical) dimension indexes time, and the second (horizontal) dimension indexes frequency. To speed the convergence of your algorithm, we have also normalized the spectrogram. (You can see this quickly in the visualization below by noting that the mean value hovers around zero, and most entries in the tensor assume values close to zero.)
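For intuition, the kind of normalization described above can be sketched as simple mean/variance standardization; the supplied code in data_generator.py may compute its statistics differently (e.g., from the training set), so treat this as illustrative only:

import numpy as np

# illustrative only: standardize a feature matrix to zero mean, unit variance
def normalize_sketch(feature, eps=1e-14):
    return (feature - np.mean(feature)) / (np.std(feature) + eps)

# the visualized spectrogram should already be roughly normalized
print(np.mean(vis_spectrogram_feature), np.std(vis_spectrogram_feature))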

In [3]:
from data_generator import plot_spectrogram_feature

# plot normalized spectrogram
plot_spectrogram_feature(vis_spectrogram_feature)
# print shape of spectrogram
display(Markdown('**Shape of Spectrogram** : ' + str(vis_spectrogram_feature.shape)))

Shape of Spectrogram : (584, 161)

Mel-Frequency Cepstral Coefficients (MFCCs)

The second option for an audio feature representation is MFCCs. You do not need to dig deeply into the details of how MFCCs are calculated, but if you would like more information, you are welcome to peruse the documentation of the python_speech_features Python package. Just as with the spectrogram features, the MFCCs are normalized in the supplied code.

The main idea behind MFCC features is the same as spectrogram features: at each time window, the MFCC feature yields a feature vector that characterizes the sound within the window. Note that the MFCC feature is much lower-dimensional than the spectrogram feature, which could help an acoustic model to avoid overfitting to the training dataset.
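As a rough illustration, 13-dimensional MFCCs for a single clip could be computed with the python_speech_features package mentioned above (here using scipy's WAV reader as one possible loader; the project's data_generator handles feature extraction and normalization for you, so this is only a sketch):

from scipy.io import wavfile
from python_speech_features import mfcc

rate, signal = wavfile.read(vis_audio_path)           # load the raw audio clip
mfcc_feat = mfcc(signal, samplerate=rate, numcep=13)  # 13 coefficients per time window
print(mfcc_feat.shape)                                # (number of time windows, 13)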

In [4]:
from data_generator import plot_mfcc_feature

# plot normalized MFCC
plot_mfcc_feature(vis_mfcc_feature)
# print shape of MFCC
display(Markdown('**Shape of MFCC** : ' + str(vis_mfcc_feature.shape)))

Shape of MFCC : (584, 13)

When you construct your pipeline, you will be able to choose to use either spectrogram or MFCC features. If you would like to see different implementations that make use of MFCCs and/or spectrograms, you are encouraged to explore other open-source ASR projects.

STEP 2: Deep Neural Networks for Acoustic Modeling

In this section, you will experiment with various neural network architectures for acoustic modeling.

You will begin by training five relatively simple architectures. Model 0 is provided for you. You will write code to implement Models 1, 2, 3, and 4. If you would like to experiment further, you are welcome to create and train more models under the Models 5+ heading.

All models will be specified in the sample_models.py file. After importing the sample_models module, you will train your architectures in the notebook.

After experimenting with the five simple architectures, you will have the opportunity to compare their performance. Based on your findings, you will construct a deeper architecture that is designed to outperform all of the shallow models.

For your convenience, we have designed the notebook so that each model can be specified and trained on separate occasions. That is, say you decide to take a break from the notebook after training Model 1. Then, you need not re-execute all prior code cells in the notebook before training Model 2. You need only re-execute the code cell below, which is marked with RUN THIS CODE CELL IF YOU ARE RESUMING THE NOTEBOOK AFTER A BREAK, before transitioning to the code cells corresponding to Model 2.

In [7]:
#####################################################################
# RUN THIS CODE CELL IF YOU ARE RESUMING THE NOTEBOOK AFTER A BREAK #
#####################################################################

# allocate 50% of GPU memory (if you like, feel free to change this)
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf 
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
set_session(tf.Session(config=config))

# watch for any changes in the sample_models module, and reload it automatically
%load_ext autoreload
%autoreload 2
# import NN architectures for speech recognition
from sample_models import *
# import function for training acoustic model
from train_utils import train_model

Model 0: RNN

Given their effectiveness in modeling sequential data, the first acoustic model you will use is an RNN. As shown in the figure below, the RNN we supply to you will take the time sequence of audio features as input.

At each time step, the speaker pronounces one of 28 possible characters, including each of the 26 letters in the English alphabet, along with a space character (" "), and an apostrophe (').

The output of the RNN at each time step is a vector of probabilities with 29 entries, where the $i$-th entry encodes the probability that the $i$-th character is spoken in the time sequence. (The extra 29th character is an empty "character" used to pad training examples within batches containing uneven lengths.) If you would like to peek under the hood at how characters are mapped to indices in the probability vector, look at the char_map.py file in the repository. The figure below shows an equivalent, rolled depiction of the RNN that shows the output layer in greater detail.
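For intuition only, a character-to-index mapping of this shape could look like the sketch below; the project's actual mapping lives in char_map.py and may use different indices:

import string

# illustrative mapping only -- see char_map.py for the real one
chars = ["'", ' '] + list(string.ascii_lowercase)     # 28 characters that can be spoken
char_to_index = {c: i for i, c in enumerate(chars)}   # indices 0 through 27
blank_index = len(chars)                              # index 28: the extra "empty" character
print(len(chars) + 1)                                 # 29 entries in each output vector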

The model has already been specified for you in Keras. To import it, you need only run the code cell below.

In [8]:
model_0 = simple_rnn_model(input_dim=13) # 13 for MFCC features; change to 161 if you would like to use spectrogram features

As explored in the lesson, you will train the acoustic model with the CTC loss criterion. Custom loss functions take a bit of hacking in Keras, and so we have implemented the CTC loss function for you, so that you can focus on trying out as many deep learning architectures as possible :). If you'd like to peek at the implementation details, look at the add_ctc_loss function within the train_utils.py file in the repository.
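For the curious, a CTC loss layer in Keras is typically wired up with a Lambda layer around K.ctc_batch_cost, roughly as sketched below; see add_ctc_loss in train_utils.py for the project's actual implementation (tensor names here are illustrative):

from keras import backend as K
from keras.layers import Lambda

def ctc_lambda_sketch(args):
    # y_pred: softmax output of the acoustic model
    # labels: true transcriptions as integer sequences
    # input_length / label_length: sequence lengths before padding
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# usage (shapes illustrative):
# loss_out = Lambda(ctc_lambda_sketch, output_shape=(1,), name='ctc')(
#     [y_pred, labels, input_lengths, label_lengths])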

To train your architecture, you will use the train_model function within the train_utils module; it has already been imported in one of the above code cells. The train_model function takes three required arguments:

  • input_to_softmax - a Keras model instance.
  • pickle_path - the name of the pickle file where the loss history will be saved.
  • save_model_path - the name of the HDF5 file where the model will be saved.

If we have already supplied values for input_to_softmax, pickle_path, and save_model_path, please DO NOT modify these values.

There are several optional arguments that allow you to have more control over the training process. You are welcome to, but not required to, supply your own values for these arguments.

  • minibatch_size - the size of the minibatches that are generated while training the model (default: 20).
  • spectrogram - Boolean value dictating whether spectrogram (True) or MFCC (False) features are used for training (default: True).
  • mfcc_dim - the size of the feature dimension to use when generating MFCC features (default: 13).
  • optimizer - the Keras optimizer used to train the model (default: SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5)).
  • epochs - the number of epochs to use to train the model (default: 20). If you choose to modify this parameter, make sure that it is at least 20.
  • verbose - controls the verbosity of the training output in the model.fit_generator method (default: 1).
  • sort_by_duration - Boolean value dictating whether the training and validation sets are sorted by (increasing) duration before the start of the first epoch (default: False).

The train_model function defaults to using spectrogram features; if you choose to use these features, note that the acoustic model in simple_rnn_model should have input_dim=161. Otherwise, if you choose to use MFCC features, the acoustic model should have input_dim=13.

We have chosen to use GRU units in the supplied RNN. If you would like to experiment with LSTM or SimpleRNN cells, feel free to do so here. If you change the GRU units to SimpleRNN cells in simple_rnn_model, you may notice that the loss quickly becomes undefined (nan) - you are strongly encouraged to check this for yourself! This is due to the exploding gradients problem. We have already implemented gradient clipping in your optimizer to help you avoid this issue.

IMPORTANT NOTE: If you notice that your gradient has exploded in any of the models below, feel free to explore more with gradient clipping (the clipnorm argument in your optimizer) or swap out any SimpleRNN cells for LSTM or GRU cells. You can also try restarting the kernel to restart the training process.
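For example, a tighter gradient-norm clip could be passed in through the optional optimizer argument of train_model (the values below are purely illustrative):

from keras.optimizers import SGD

# same settings as the default optimizer, but with a smaller gradient-norm clip
safer_optimizer = SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=2)
# train_model(input_to_softmax=model_0, pickle_path='model_0.pickle',
#             save_model_path='model_0.h5', optimizer=safer_optimizer)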

In [9]:
train_model(input_to_softmax=model_0, 
            pickle_path='model_0.pickle', 
            save_model_path='model_0.h5',
            spectrogram=False) # False: use MFCC features (change to True for spectrogram features)
Epoch 1/20
101/101 [==============================] - 119s - loss: 837.7633 - val_loss: 756.5092
Epoch 2/20
101/101 [==============================] - 115s - loss: 779.7239 - val_loss: 758.3753
Epoch 3/20
101/101 [==============================] - 115s - loss: 779.5841 - val_loss: 751.7412
Epoch 4/20
101/101 [==============================] - 117s - loss: 779.3612 - val_loss: 760.3783
Epoch 5/20
101/101 [==============================] - 117s - loss: 779.2541 - val_loss: 758.1526
Epoch 6/20
101/101 [==============================] - 115s - loss: 779.1985 - val_loss: 754.7404
Epoch 7/20
101/101 [==============================] - 115s - loss: 779.3162 - val_loss: 752.1416
Epoch 8/20
101/101 [==============================] - 115s - loss: 779.1746 - val_loss: 763.1058
Epoch 9/20
101/101 [==============================] - 117s - loss: 779.4733 - val_loss: 763.4400
Epoch 10/20
101/101 [==============================] - 120s - loss: 779.2467 - val_loss: 761.2202
Epoch 11/20
101/101 [==============================] - 117s - loss: 779.3238 - val_loss: 758.5069
Epoch 12/20
101/101 [==============================] - 117s - loss: 778.9843 - val_loss: 757.9038
Epoch 13/20
101/101 [==============================] - 113s - loss: 779.0450 - val_loss: 754.7018
Epoch 14/20
101/101 [==============================] - 118s - loss: 779.1672 - val_loss: 755.7240
Epoch 15/20
101/101 [==============================] - 118s - loss: 778.9910 - val_loss: 756.8546
Epoch 16/20
101/101 [==============================] - 125s - loss: 779.3150 - val_loss: 771.5823
Epoch 17/20
101/101 [==============================] - 122s - loss: 779.2270 - val_loss: 743.7648
Epoch 18/20
101/101 [==============================] - 114s - loss: 779.5980 - val_loss: 765.3291
Epoch 19/20
101/101 [==============================] - 117s - loss: 779.5590 - val_loss: 755.4187
Epoch 20/20
101/101 [==============================] - 119s - loss: 779.4739 - val_loss: 756.5679

(IMPLEMENTATION) Model 1: RNN + TimeDistributed Dense

Read about the TimeDistributed wrapper and the BatchNormalization layer in the Keras documentation. For your next architecture, you will add batch normalization to the recurrent layer to reduce training times. The TimeDistributed layer will be used to find more complex patterns in the dataset. The unrolled snapshot of the architecture is depicted below.

The next figure shows an equivalent, rolled depiction of the RNN that shows the (TimeDistributed) dense and output layers in greater detail.

Use your research to complete the rnn_model function within the sample_models.py file. The function should specify an architecture that satisfies the following requirements:

  • The first layer of the neural network should be an RNN (SimpleRNN, LSTM, or GRU) that takes the time sequence of audio features as input. We have added GRU units for you, but feel free to change GRU to SimpleRNN or LSTM, if you like!
  • Whereas the architecture in simple_rnn_model treated the RNN output as the final layer of the model, you will use the output of your RNN as a hidden layer. Use TimeDistributed to apply a Dense layer to each of the time steps in the RNN output. Ensure that each Dense layer has output_dim units.
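For reference, one way to satisfy these requirements is sketched below; this is a minimal sketch only, where the layer names, the choice of GRU, and the activation are illustrative, and your own implementation belongs in sample_models.py:

from keras.models import Model
from keras.layers import Input, GRU, BatchNormalization, TimeDistributed, Dense, Activation

def rnn_model_sketch(input_dim, units, activation, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # recurrent layer over the sequence of audio features
    simp_rnn = GRU(units, activation=activation, return_sequences=True, name='rnn')(input_data)
    bn_rnn = BatchNormalization(name='bn_rnn')(simp_rnn)
    # apply a Dense layer with output_dim units to every time step
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: x  # recurrent layers preserve the temporal length
    return model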

Use the code cell below to load your model into the model_1 variable. Use a value for input_dim that matches your chosen audio features, and feel free to change the values for units and activation to tweak the behavior of your recurrent layer.

In [10]:
model_1 = rnn_model(input_dim=161, # change to 13 if you would like to use MFCC features
                    units=200,
                    activation='relu')
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
rnn (LSTM)                   (None, None, 200)         289600    
_________________________________________________________________
bn_rnn_1d (BatchNormalizatio (None, None, 200)         800       
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 29)          5829      
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 296,229
Trainable params: 295,829
Non-trainable params: 400
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_1.h5. The loss history is saved in model_1.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.

In [11]:
train_model(input_to_softmax=model_1, 
            pickle_path='model_1.pickle', 
            save_model_path='model_1.h5',
            spectrogram=True) # change to False if you would like to use MFCC features
Epoch 1/20
101/101 [==============================] - 117s - loss: 300.2976 - val_loss: 225.7929
Epoch 2/20
101/101 [==============================] - 117s - loss: 233.6712 - val_loss: 222.8865
Epoch 3/20
101/101 [==============================] - 116s - loss: 233.3889 - val_loss: 222.9376
Epoch 4/20
101/101 [==============================] - 117s - loss: 233.0243 - val_loss: 223.8781
Epoch 5/20
101/101 [==============================] - 119s - loss: 233.1200 - val_loss: 223.6895
Epoch 6/20
101/101 [==============================] - 114s - loss: 233.0251 - val_loss: 221.9075
Epoch 7/20
101/101 [==============================] - 116s - loss: 233.0817 - val_loss: 226.5832
Epoch 8/20
101/101 [==============================] - 123s - loss: 232.9440 - val_loss: 222.6163
Epoch 9/20
101/101 [==============================] - 116s - loss: 232.9801 - val_loss: 222.6744
Epoch 10/20
101/101 [==============================] - 115s - loss: 232.8376 - val_loss: 223.2700
Epoch 11/20
101/101 [==============================] - 115s - loss: 232.9068 - val_loss: 224.0518
Epoch 12/20
101/101 [==============================] - 113s - loss: 232.6422 - val_loss: 222.5316
Epoch 13/20
101/101 [==============================] - 114s - loss: 232.6875 - val_loss: 224.4348
Epoch 14/20
101/101 [==============================] - 117s - loss: 232.8250 - val_loss: 220.8547
Epoch 15/20
101/101 [==============================] - 114s - loss: 233.0361 - val_loss: 226.2597
Epoch 16/20
101/101 [==============================] - 120s - loss: 232.8660 - val_loss: 221.1238
Epoch 17/20
101/101 [==============================] - 123s - loss: 232.8365 - val_loss: 223.1869
Epoch 18/20
101/101 [==============================] - 120s - loss: 232.8119 - val_loss: 221.8838
Epoch 19/20
101/101 [==============================] - 120s - loss: 232.8611 - val_loss: 222.8693
Epoch 20/20
101/101 [==============================] - 114s - loss: 232.8867 - val_loss: 221.4095

(IMPLEMENTATION) Model 2: CNN + RNN + TimeDistributed Dense

The architecture in cnn_rnn_model adds an additional level of complexity, by introducing a 1D convolution layer.

This layer incorporates many arguments that can be (optionally) tuned when calling the cnn_rnn_model function. We provide sample starting parameters, which you might find useful if you choose to use spectrogram audio features.

If you instead want to use MFCC features, these arguments will have to be tuned. Note that the current architecture only supports values of 'same' or 'valid' for the conv_border_mode argument.

When tuning the parameters, be careful not to choose settings that make the output of the convolutional layer overly short. If the temporal length of the CNN output is shorter than the length of the transcribed text label, your code will throw an error.

Before running the code cell below, you must modify the cnn_rnn_model function in sample_models.py. Please add batch normalization to the recurrent layer, and provide the same TimeDistributed layer as before.
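As a rough guide, the requested additions (a batch-normalized recurrent layer followed by the same TimeDistributed dense layer) might look like the sketch below; names and the choice of GRU are illustrative, and remember that the real cnn_rnn_model must also set model.output_length using cnn_output_length:

from keras.models import Model
from keras.layers import Input, Conv1D, GRU, BatchNormalization, TimeDistributed, Dense, Activation

def cnn_rnn_sketch(input_dim, filters, kernel_size, conv_stride, conv_border_mode, units, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # 1D convolution over the time dimension of the audio features
    conv_1d = Conv1D(filters, kernel_size, strides=conv_stride, padding=conv_border_mode,
                     activation='relu', name='conv1d')(input_data)
    bn_cnn = BatchNormalization(name='bn_conv_1d')(conv_1d)
    # batch-normalized recurrent layer, as requested
    simp_rnn = GRU(units, activation='relu', return_sequences=True, name='rnn')(bn_cnn)
    bn_rnn = BatchNormalization(name='bn_rnn_1d')(simp_rnn)
    # TimeDistributed dense layer, as in the previous model
    time_dense = TimeDistributed(Dense(output_dim))(bn_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    return Model(inputs=input_data, outputs=y_pred)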

In [12]:
model_2 = cnn_rnn_model(input_dim=161, # change to 13 if you would like to use MFCC features
                        filters=200,
                        kernel_size=11, 
                        conv_stride=2,
                        conv_border_mode='valid',
                        units=200)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 200)         354400    
_________________________________________________________________
bn_conv_1d (BatchNormalizati (None, None, 200)         800       
_________________________________________________________________
rnn (GRU)                    (None, None, 200)         240600    
_________________________________________________________________
bn_rnn_1d (BatchNormalizatio (None, None, 200)         800       
_________________________________________________________________
time_distributed_2 (TimeDist (None, None, 29)          5829      
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 602,429
Trainable params: 601,629
Non-trainable params: 800
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_2.h5. The loss history is saved in model_2.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.

In [13]:
train_model(input_to_softmax=model_2, 
            pickle_path='model_2.pickle', 
            save_model_path='model_2.h5', 
            spectrogram=True) # change to False if you would like to use MFCC features
Epoch 1/20
101/101 [==============================] - 63s - loss: 249.1161 - val_loss: 279.6265
Epoch 2/20
101/101 [==============================] - 62s - loss: 192.2664 - val_loss: 181.4677
Epoch 3/20
101/101 [==============================] - 60s - loss: 156.6647 - val_loss: 154.3840
Epoch 4/20
101/101 [==============================] - 60s - loss: 140.1063 - val_loss: 144.7106
Epoch 5/20
101/101 [==============================] - 60s - loss: 129.1700 - val_loss: 135.3142
Epoch 6/20
101/101 [==============================] - 60s - loss: 121.0549 - val_loss: 132.8940
Epoch 7/20
101/101 [==============================] - 61s - loss: 114.1555 - val_loss: 131.7514
Epoch 8/20
101/101 [==============================] - 66s - loss: 108.3093 - val_loss: 132.3798
Epoch 9/20
101/101 [==============================] - 65s - loss: 102.7633 - val_loss: 130.5701
Epoch 10/20
101/101 [==============================] - 65s - loss: 98.1989 - val_loss: 130.0429
Epoch 11/20
101/101 [==============================] - 64s - loss: 93.3816 - val_loss: 128.8333
Epoch 12/20
101/101 [==============================] - 63s - loss: 89.0302 - val_loss: 129.7670
Epoch 13/20
101/101 [==============================] - 61s - loss: 85.0271 - val_loss: 130.5819
Epoch 14/20
101/101 [==============================] - 64s - loss: 81.2161 - val_loss: 131.1650
Epoch 15/20
101/101 [==============================] - 64s - loss: 77.4708 - val_loss: 133.3010
Epoch 16/20
101/101 [==============================] - 63s - loss: 73.7549 - val_loss: 138.3304
Epoch 17/20
101/101 [==============================] - 62s - loss: 70.4167 - val_loss: 136.0437
Epoch 18/20
101/101 [==============================] - 62s - loss: 67.1312 - val_loss: 144.0295
Epoch 19/20
101/101 [==============================] - 62s - loss: 64.0253 - val_loss: 144.0692
Epoch 20/20
101/101 [==============================] - 62s - loss: 60.9438 - val_loss: 147.9649

(IMPLEMENTATION) Model 3: Deeper RNN + TimeDistributed Dense

Review the code in rnn_model, which makes use of a single recurrent layer. Now, specify an architecture in deep_rnn_model that utilizes a variable number recur_layers of recurrent layers. The figure below shows the architecture that should be returned if recur_layers=2. In the figure, the output sequence of the first recurrent layer is used as input for the next recurrent layer.

Feel free to change the supplied values of units to whatever you think performs best. You can change the value of recur_layers, as long as your final value is greater than 1. (As a quick check that you have implemented the additional functionality in deep_rnn_model correctly, make sure that the architecture that you specify here is identical to rnn_model if recur_layers=1.)
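One way to loop over a variable number of recurrent layers is sketched below (illustrative only; your deep_rnn_model may use a different cell type, layer names, or activation):

from keras.models import Model
from keras.layers import Input, GRU, BatchNormalization, TimeDistributed, Dense, Activation

def deep_rnn_sketch(input_dim, units, recur_layers, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    layer = input_data
    # stack recur_layers batch-normalized recurrent layers
    for i in range(recur_layers):
        layer = GRU(units, activation='relu', return_sequences=True,
                    name='rnn_{}'.format(i + 1))(layer)
        layer = BatchNormalization(name='bn_rnn_{}'.format(i + 1))(layer)
    time_dense = TimeDistributed(Dense(output_dim))(layer)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: x
    return model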

In [14]:
model_3 = deep_rnn_model(input_dim=161, # change to 13 if you would like to use MFCC features
                         units=200,
                         recur_layers=2) 
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 200)         289600    
_________________________________________________________________
bt_rnn_1 (BatchNormalization (None, None, 200)         800       
_________________________________________________________________
lstm_2 (LSTM)                (None, None, 200)         320800    
_________________________________________________________________
bt_rnn_last_rnn (BatchNormal (None, None, 200)         800       
_________________________________________________________________
time_distributed_3 (TimeDist (None, None, 29)          5829      
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 617,829
Trainable params: 617,029
Non-trainable params: 800
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_3.h5. The loss history is saved in model_3.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.

In [15]:
train_model(input_to_softmax=model_3, 
            pickle_path='model_3.pickle', 
            save_model_path='model_3.h5', 
            spectrogram=True) # change to False if you would like to use MFCC features
Epoch 1/20
101/101 [==============================] - 283s - loss: 299.6555 - val_loss: 223.3386
Epoch 2/20
101/101 [==============================] - 263s - loss: 234.0082 - val_loss: 223.4137
Epoch 3/20
101/101 [==============================] - 260s - loss: 233.3815 - val_loss: 222.8910
Epoch 4/20
101/101 [==============================] - 264s - loss: 233.2315 - val_loss: 226.2374
Epoch 5/20
101/101 [==============================] - 261s - loss: 233.0499 - val_loss: 223.1657
Epoch 6/20
101/101 [==============================] - 262s - loss: 232.9850 - val_loss: 222.8622
Epoch 7/20
101/101 [==============================] - 264s - loss: 233.0063 - val_loss: 224.3826
Epoch 8/20
101/101 [==============================] - 261s - loss: 233.0306 - val_loss: 222.3476
Epoch 9/20
101/101 [==============================] - 262s - loss: 232.7922 - val_loss: 222.7110
Epoch 10/20
101/101 [==============================] - 261s - loss: 232.7517 - val_loss: 222.6574
Epoch 11/20
101/101 [==============================] - 262s - loss: 233.0781 - val_loss: 223.4762
Epoch 12/20
101/101 [==============================] - 263s - loss: 232.8227 - val_loss: 222.2776
Epoch 13/20
101/101 [==============================] - 261s - loss: 233.0050 - val_loss: 225.1248
Epoch 14/20
101/101 [==============================] - 260s - loss: 233.0632 - val_loss: 223.2428
Epoch 15/20
101/101 [==============================] - 261s - loss: 232.7974 - val_loss: 221.2334
Epoch 16/20
101/101 [==============================] - 261s - loss: 232.8635 - val_loss: 223.2759
Epoch 17/20
101/101 [==============================] - 262s - loss: 232.9474 - val_loss: 220.7706
Epoch 18/20
101/101 [==============================] - 263s - loss: 232.8470 - val_loss: 223.5131
Epoch 19/20
101/101 [==============================] - 261s - loss: 232.8623 - val_loss: 223.4732
Epoch 20/20
101/101 [==============================] - 262s - loss: 232.8314 - val_loss: 222.8387

(IMPLEMENTATION) Model 4: Bidirectional RNN + TimeDistributed Dense

Read about the Bidirectional wrapper in the Keras documentation. For your next architecture, you will specify an architecture that uses a single bidirectional RNN layer, before a (TimeDistributed) dense layer. The added value of a bidirectional RNN is described well in this paper.

One shortcoming of conventional RNNs is that they are only able to make use of previous context. In speech recognition, where whole utterances are transcribed at once, there is no reason not to exploit future context as well. Bidirectional RNNs (BRNNs) do this by processing the data in both directions with two separate hidden layers which are then fed forwards to the same output layer.

Before running the code cell below, you must complete the bidirectional_rnn_model function in sample_models.py. Feel free to use SimpleRNN, LSTM, or GRU units. When specifying the Bidirectional wrapper, use merge_mode='concat'.
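A minimal sketch of such a model is shown below; the GRU cell and layer names are illustrative, and your own implementation belongs in sample_models.py:

from keras.models import Model
from keras.layers import Input, GRU, Bidirectional, TimeDistributed, Dense, Activation

def bidir_rnn_sketch(input_dim, units, output_dim=29):
    input_data = Input(name='the_input', shape=(None, input_dim))
    # forward and backward passes over the sequence, concatenated at each time step
    bidir_rnn = Bidirectional(GRU(units, return_sequences=True),
                              merge_mode='concat')(input_data)
    time_dense = TimeDistributed(Dense(output_dim))(bidir_rnn)
    y_pred = Activation('softmax', name='softmax')(time_dense)
    model = Model(inputs=input_data, outputs=y_pred)
    model.output_length = lambda x: x
    return model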

In [16]:
model_4 = bidirectional_rnn_model(input_dim=161, # change to 13 if you would like to use MFCC features
                                  units=200)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 161)         0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 400)         579200    
_________________________________________________________________
time_distributed_4 (TimeDist (None, None, 29)          11629     
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 590,829
Trainable params: 590,829
Non-trainable params: 0
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_4.h5. The loss history is saved in model_4.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.

In [17]:
train_model(input_to_softmax=model_4, 
            pickle_path='model_4.pickle', 
            save_model_path='model_4.h5', 
            spectrogram=True) # change to False if you would like to use MFCC features
Epoch 1/20
101/101 [==============================] - 260s - loss: 385.6249 - val_loss: 228.0993
Epoch 2/20
101/101 [==============================] - 264s - loss: 232.9343 - val_loss: 224.7966
Epoch 3/20
101/101 [==============================] - 265s - loss: 233.0141 - val_loss: 223.8822
Epoch 4/20
101/101 [==============================] - 263s - loss: 232.8795 - val_loss: 221.4249
Epoch 5/20
101/101 [==============================] - 263s - loss: 232.7045 - val_loss: 222.0958
Epoch 6/20
101/101 [==============================] - 264s - loss: 233.0212 - val_loss: 224.5067
Epoch 7/20
101/101 [==============================] - 263s - loss: 232.8923 - val_loss: 224.2478
Epoch 8/20
101/101 [==============================] - 266s - loss: 232.8256 - val_loss: 222.8118
Epoch 9/20
101/101 [==============================] - 265s - loss: 232.8941 - val_loss: 222.3292
Epoch 10/20
101/101 [==============================] - 265s - loss: 232.6294 - val_loss: 224.3733
Epoch 11/20
101/101 [==============================] - 262s - loss: 232.8698 - val_loss: 222.3533
Epoch 12/20
101/101 [==============================] - 265s - loss: 232.8434 - val_loss: 222.2620
Epoch 13/20
101/101 [==============================] - 262s - loss: 232.9374 - val_loss: 222.3328
Epoch 14/20
101/101 [==============================] - 264s - loss: 232.8260 - val_loss: 223.6158
Epoch 15/20
101/101 [==============================] - 267s - loss: 232.9704 - val_loss: 223.5902
Epoch 16/20
101/101 [==============================] - 261s - loss: 232.7302 - val_loss: 221.7624
Epoch 17/20
101/101 [==============================] - 265s - loss: 232.7211 - val_loss: 224.3096
Epoch 18/20
101/101 [==============================] - 266s - loss: 232.5758 - val_loss: 223.4360
Epoch 19/20
101/101 [==============================] - 265s - loss: 233.0511 - val_loss: 221.9609
Epoch 20/20
101/101 [==============================] - 264s - loss: 232.8166 - val_loss: 222.9856

(OPTIONAL IMPLEMENTATION) Models 5+

If you would like to try out more architectures than the ones above, please use the code cell below. Please continue to follow the same naming convention: for the $i$-th sample model, save the loss history to model_i.pickle and the trained model to model_i.h5.

In [ ]:
## (Optional) TODO: Try out some more models!
### Feel free to use as many code cells as needed.

Compare the Models

Execute the code cell below to evaluate the performance of the drafted deep learning models. The training and validation loss are plotted for each model.

In [18]:
from glob import glob
import numpy as np
import _pickle as pickle
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style(style='white')

# obtain the paths for the saved model history
all_pickles = sorted(glob("results/*.pickle"))
# extract the name of each model
model_names = [item[8:-7] for item in all_pickles]
# extract the loss history for each model
valid_loss = [pickle.load( open( i, "rb" ) )['val_loss'] for i in all_pickles]
train_loss = [pickle.load( open( i, "rb" ) )['loss'] for i in all_pickles]
# save the number of epochs used to train each model
num_epochs = [len(valid_loss[i]) for i in range(len(valid_loss))]

fig = plt.figure(figsize=(16,5))

# plot the training loss vs. epoch for each model
ax1 = fig.add_subplot(121)
for i in range(len(all_pickles)):
    ax1.plot(np.linspace(1, num_epochs[i], num_epochs[i]), 
            train_loss[i], label=model_names[i])
# clean up the plot
ax1.legend()  
ax1.set_xlim([1, max(num_epochs)])
plt.xlabel('Epoch')
plt.ylabel('Training Loss')

# plot the validation loss vs. epoch for each model
ax2 = fig.add_subplot(122)
for i in range(len(all_pickles)):
    ax2.plot(np.linspace(1, num_epochs[i], num_epochs[i]), 
            valid_loss[i], label=model_names[i])
# clean up the plot
ax2.legend()  
ax2.set_xlim([1, max(num_epochs)])
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.show()

Question 1: Use the plot above to analyze the performance of each of the attempted architectures. Which performs best? Provide an explanation regarding why you think some models perform better than others.

Answer: Interesting results overall. All models except model_0 were trained on spectrogram features rather than MFCCs. Looking at the left graph (training loss), we can see that most models behave in a similar way: at the very beginning of training the loss drops dramatically, and then around the third epoch it plateaus and does not decrease any further.

If we take both graphs (training and validation loss) into account, the best model is the CNN model (Model 2, green lines). This is because it provides the most information to the RNN part of the network: the CNN layer extracts the most useful features from each input and uses them to 'help' the RNN during further training.

Analysis of each model:

Model 0 (the supplied model): This was the only model trained on MFCC features, and it performed the worst of all according to the training and validation loss. It is also a very shallow model, and for a complicated task such as speech recognition we should be using deeper models.

Model 1 (TimeDistributed dense layer after the RNN): This model is a small upgrade over Model 0. In Model 0 we take the RNN outputs directly and use them to compute the loss and the character probabilities, which, as we have seen, does not work well in practice. The idea behind Model 1 is to add a (TimeDistributed) dense layer between the RNN and the output probabilities, giving the model more capacity to produce better predictions. This addition really helped to reduce the loss: as the left graph shows, the model immediately started with a loss of about 300. However, neither the training loss nor the validation loss improved much over time.

Model 2 (CNN + RNN): In this model we added a CNN layer at the beginning (before the RNN layer), so the spectrogram or MFCC features are first analysed with convolutional kernels. With this strategy we extract only the important features and feed them to the RNN part of the network. This showed the best results of all the models tested: the training loss kept decreasing over time, and the validation loss also decreased considerably. These results helped me decide which layers and architecture to use in the final model.

Model 3 (deep RNN): This model stacks several recurrent layers. In the test above I used 2 RNN layers followed by a (TimeDistributed) dense layer after the last RNN layer. This model showed the same training behaviour as Model 1: at the beginning of training the loss decreased, but after about 2.5 epochs it stopped improving and stayed at that level until the end of training. The validation loss stayed roughly constant throughout the whole training process.

Model 4 (bidirectional RNN): This model uses a bidirectional RNN layer, which is a natural choice for this use case: a bidirectional RNN analyses the input signal both from the beginning to the end and from the end to the beginning. This model showed results similar to Models 1 and 3.

(IMPLEMENTATION) Final Model

Now that you've tried out many sample models, use what you've learned to draft your own architecture! While your final acoustic model should not be identical to any of the architectures explored above, you are welcome to merely combine the explored layers above into a deeper architecture. It is NOT necessary to include new layer types that were not explored in the notebook.

However, if you would like some ideas for even more layer types, check out these ideas for some additional, optional extensions to your model:

  • If you notice your model is overfitting to the training dataset, consider adding dropout! To add dropout to recurrent layers, pay special attention to the dropout_W and dropout_U arguments. This paper may also provide some interesting theoretical background.
  • If you choose to include a convolutional layer in your model, you may get better results by working with dilated convolutions. If you choose to use dilated convolutions, make sure that you are able to accurately calculate the length of the acoustic model's output in the model.output_length lambda function. You can read more about dilated convolutions in Google's WaveNet paper. For an example of a speech-to-text system that makes use of dilated convolutions, check out this GitHub repository. You can work with dilated convolutions in Keras by paying special attention to the padding argument when you specify a convolutional layer.
  • If your model makes use of convolutional layers, why not also experiment with adding max pooling? Check out this paper for an example architecture that makes use of max pooling in an acoustic model.
  • So far, you have experimented with a single bidirectional RNN layer. Consider stacking the bidirectional layers, to produce a deep bidirectional RNN!

All models that you specify in this repository should have output_length defined as an attribute. This attribute is a lambda function that maps the (temporal) length of the input acoustic features to the (temporal) length of the output softmax layer. This function is used in the computation of CTC loss; to see this, look at the add_ctc_loss function in train_utils.py. To see where the output_length attribute is defined for the models in the code, take a look at the sample_models.py file. You will notice this line of code within most models:

model.output_length = lambda x: x

The acoustic model that incorporates a convolutional layer (cnn_rnn_model) has a line that is a bit different:

model.output_length = lambda x: cnn_output_length(
        x, kernel_size, conv_border_mode, conv_stride)

In the case of models that use purely recurrent layers, the lambda function is the identity function, as the recurrent layers do not modify the (temporal) length of their input tensors. However, convolutional layers are more complicated and require a specialized function (cnn_output_length in sample_models.py) to determine the temporal length of their output.

You will have to add the output_length attribute to your final model before running the code cell below. Feel free to use the cnn_output_length function, if it suits your model.
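For reference, helpers like cnn_output_length typically follow standard 1D-convolution output-length arithmetic; a sketch under that assumption is shown below (check the actual implementation in sample_models.py before relying on it):

def conv_output_length_sketch(input_length, filter_size, border_mode, stride, dilation=1):
    # illustrative only: standard output-length arithmetic for a 1D convolution
    if input_length is None:
        return None
    dilated_filter_size = filter_size + (filter_size - 1) * (dilation - 1)
    if border_mode == 'same':
        output_length = input_length
    elif border_mode == 'valid':
        output_length = input_length - dilated_filter_size + 1
    return (output_length + stride - 1) // stride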

In [62]:
# specify the model
model_end = final_model(input_dim=13,
                        filters=200,
                        kernel_size=11, 
                        conv_stride=2,
                        conv_border_mode='valid',
                        units=250,
                        activation='relu',
                        cell=GRU,
                        dropout_rate=1,
                        number_of_layers=2)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
the_input (InputLayer)       (None, None, 13)          0         
_________________________________________________________________
layer_1_conv (Conv1D)        (None, None, 200)         28800     
_________________________________________________________________
conv_batch_norm (BatchNormal (None, None, 200)         800       
_________________________________________________________________
rnn_1 (GRU)                  (None, None, 250)         338250    
_________________________________________________________________
bt_rnn_1 (BatchNormalization (None, None, 250)         1000      
_________________________________________________________________
final_layer_of_rnn (GRU)     (None, None, 250)         375750    
_________________________________________________________________
bt_rnn_final (BatchNormaliza (None, None, 250)         1000      
_________________________________________________________________
time_distributed_28 (TimeDis (None, None, 29)          7279      
_________________________________________________________________
softmax (Activation)         (None, None, 29)          0         
=================================================================
Total params: 752,879
Trainable params: 751,479
Non-trainable params: 1,400
_________________________________________________________________
None

Please execute the code cell below to train the neural network you specified in input_to_softmax. After the model has finished training, the model is saved in the HDF5 file model_end.h5. The loss history is saved in model_end.pickle. You are welcome to tweak any of the optional parameters while calling the train_model function, but this is not required.

In [63]:
from keras.optimizers import RMSprop
train_model(input_to_softmax=model_end, 
            pickle_path='model_end.pickle',
            save_model_path='model_end.h5', 
            spectrogram=False)
Epoch 1/20
101/101 [==============================] - 145s - loss: 246.7676 - val_loss: 215.9699
Epoch 2/20
101/101 [==============================] - 144s - loss: 187.5169 - val_loss: 179.8064
Epoch 3/20
101/101 [==============================] - 152s - loss: 151.9236 - val_loss: 146.9361
Epoch 4/20
101/101 [==============================] - 152s - loss: 133.9369 - val_loss: 135.4676
Epoch 5/20
101/101 [==============================] - 148s - loss: 122.4069 - val_loss: 134.8893
Epoch 6/20
101/101 [==============================] - 148s - loss: 113.4147 - val_loss: 128.9967
Epoch 7/20
101/101 [==============================] - 142s - loss: 106.1782 - val_loss: 128.1952
Epoch 8/20
101/101 [==============================] - 152s - loss: 99.3979 - val_loss: 126.2738
Epoch 9/20
101/101 [==============================] - 155s - loss: 93.1851 - val_loss: 126.5522
Epoch 10/20
101/101 [==============================] - 148s - loss: 87.5127 - val_loss: 124.9496
Epoch 11/20
101/101 [==============================] - 150s - loss: 81.7004 - val_loss: 125.9966
Epoch 12/20
101/101 [==============================] - 145s - loss: 76.3927 - val_loss: 125.7233
Epoch 13/20
101/101 [==============================] - 151s - loss: 71.1534 - val_loss: 129.8055
Epoch 14/20
101/101 [==============================] - 150s - loss: 66.3418 - val_loss: 134.3553
Epoch 15/20
101/101 [==============================] - 150s - loss: 61.6126 - val_loss: 137.9712
Epoch 16/20
101/101 [==============================] - 150s - loss: 57.3928 - val_loss: 137.4264
Epoch 17/20
101/101 [==============================] - 147s - loss: 53.5605 - val_loss: 142.9806
Epoch 18/20
101/101 [==============================] - 148s - loss: 49.3938 - val_loss: 147.0497
Epoch 19/20
101/101 [==============================] - 150s - loss: 46.2359 - val_loss: 153.1326
Epoch 20/20
101/101 [==============================] - 150s - loss: 43.3314 - val_loss: 157.6372

Question 2: Describe your final model architecture and your reasoning at each step.

Answer: While creating the final model I tested many different architectures. I started with a deep bidirectional RNN, using first 2 and then 3 bidirectional layers, but this architecture did not work well for me: the loss stayed around 500 throughout training. My second attempt was to add a CNN layer in front of that architecture, which improved the model a bit: it got down to a loss of about 230 (both training and validation loss).

My third attempt was to add a dilation parameter to the convolutional layer, which required setting conv_stride=1. This did not improve the network.

The next step was to build a plain deep RNN with 5 GRU layers and 40% dropout. This network was really hard to train; it did improve, but only by a small amount (from a loss of about 760 to 710 over 20 epochs).

I then swapped the SGD optimizer for the RMSprop optimizer with a learning rate of 0.02 (the same as we use with SGD). This change did not show any overall improvement.

Finally, I looked at the results of the architectures tested above (Models 1-4) and combined a few of their parts to arrive at the architecture I am using now.

I took the convolutional layer from Model 2 and the structure of the deep RNN model, combined them, and tried the same model with and without dropout. It seemed to work better without dropout, so I trained it without any dropout at all.

Analysis of training the final model: The model trained nicely, with both the training and validation loss decreasing over time. There is one caveat, however: the validation loss eventually stopped decreasing and started to increase, which means the network had started to overfit to the training set.

There are a few things we could do to prevent this. One is early stopping: we could have stopped training the network after the 12th epoch.
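For example, if the training utility were extended to pass extra callbacks to model.fit_generator, a patience-based early-stopping rule could be expressed as follows (illustrative only; train_model does not necessarily expose this option):

from keras.callbacks import EarlyStopping

# stop training once the validation loss has not improved for 3 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3, verbose=1)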

Regularization techniques such as dropout would also help. However, dropout with the same parameter setup did not show good results, so using dropout effectively would also require tweaking the other parameters.

STEP 3: Obtain Predictions

We have written a function for you to decode the predictions of your acoustic model. To use the function, please execute the code cell below.

In [79]:
import numpy as np
from data_generator import AudioGenerator
from keras import backend as K
from utils import int_sequence_to_text
from IPython.display import Audio

def get_predictions(index, partition, input_to_softmax, model_path):
    """ Print a model's decoded predictions
    Params:
        index (int): The example you would like to visualize
        partition (str): One of 'train' or 'validation'
        input_to_softmax (Model): The acoustic model
        model_path (str): Path to saved acoustic model's weights
    """
    # load the train and test data
    data_gen = AudioGenerator(spectrogram=False)
    data_gen.load_train_data()
    data_gen.load_validation_data()
    
    # obtain the true transcription and the audio features 
    if partition == 'validation':
        transcr = data_gen.valid_texts[index]
        audio_path = data_gen.valid_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    elif partition == 'train':
        transcr = data_gen.train_texts[index]
        audio_path = data_gen.train_audio_paths[index]
        data_point = data_gen.normalize(data_gen.featurize(audio_path))
    else:
        raise Exception('Invalid partition!  Must be "train" or "validation"')
        
    # obtain and decode the acoustic model's predictions
    input_to_softmax.load_weights(model_path)
    prediction = input_to_softmax.predict(np.expand_dims(data_point, axis=0))
    output_length = [input_to_softmax.output_length(data_point.shape[0])] 
    pred_ints = (K.eval(K.ctc_decode(
                prediction, output_length)[0][0])+1).flatten().tolist()
    
    # play the audio file, and display the true and predicted transcriptions
    print('-'*80)
    Audio(audio_path)
    print('True transcription:\n' + '\n' + transcr)
    print('-'*80)
    print('Predicted transcription:\n' + '\n' + ''.join(int_sequence_to_text(pred_ints)))
    print('-'*80)

Use the code cell below to obtain the transcription predicted by your final model for the first example in the training dataset.

In [80]:
get_predictions(index=0, 
                partition='train',
                input_to_softmax=model_end, 
                model_path='results/model_end.h5')
--------------------------------------------------------------------------------
True transcription:

mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
--------------------------------------------------------------------------------
Predicted transcription:

mister quilter is the apoustle of the midtle classes and wear glad to weltem his gospbe
--------------------------------------------------------------------------------

Use the next code cell to visualize the model's prediction for the first example in the validation dataset.

In [82]:
get_predictions(index=0, 
                partition='validation',
                input_to_softmax=model_end, 
                model_path='results/model_end.h5')
--------------------------------------------------------------------------------
True transcription:

stuff it into you his belly counselled him
--------------------------------------------------------------------------------
Predicted transcription:

stof itin to yu his pely comntiat  im
--------------------------------------------------------------------------------

One standard way to improve the results of the decoder is to incorporate a language model. We won't pursue this in the notebook, but you are welcome to do so as an optional extension.

If you are interested in creating models that provide improved transcriptions, you are encouraged to download more data and train bigger, deeper models. But beware - the model will likely take a long while to train. For instance, training this state-of-the-art model would take 3-6 weeks on a single GPU!