Hands-On AI Part 23: Deep Learning for Music Generation 2—Implementing the Model

Published: 10/30/2017   Last Updated: 10/30/2017

A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers

At this point in the tutorial, all the relevant datasets have been found, collected, and preprocessed. For more information about these steps, see the earlier articles in this series. The BachBot [1] model was used to harmonize the melody. This article describes the process of defining, training, testing, and modifying BachBot.

Defining a Model

The previous article (Deep Learning for Music Generation 1-Choosing a Model and Data Preprocessing) explained that the problem of automatic composition can be reduced to a problem of sequence prediction. In particular, the model should predict the most probable next note, given the previous notes. This type of problem is well suited to a long short-term memory (LSTM) neural network. Formally, the model should predict P(x_{t+1} | x_t, h_{t-1}), a probability distribution over the possible next notes (x_{t+1}) given the current token (x_t) and the previous hidden state (h_{t-1}). Interestingly, this is exactly the operation performed by recurrent neural network (RNN) language models.

In composition, the model is initialized by the START token (see the previous article for more about the encoding scheme), and then picks the next most-likely token to follow it. After this, it continues to pick the most probable next token using the current note and the previous hidden state until it generates the END token. There are temperature controls, which introduce a degree of randomness to prevent BachBot from composing the same piece over and over again.
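The effect of a temperature control can be illustrated with a minimal sketch. This is not BachBot's code (which is written in Lua for Torch); the function name, the use of raw scores (logits), and the convention that a temperature of zero means greedy selection are all assumptions made for illustration.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample a token index from unnormalized scores (logits).

    Low temperatures concentrate probability on the top-scoring token;
    high temperatures flatten the distribution and add randomness.
    """
    if temperature <= 0:
        # Assumed convention: zero temperature means greedy (argmax) choice.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

With a temperature close to zero the most likely token is chosen almost every time, which reproduces the "same piece over and over" behavior; raising the temperature makes less likely tokens appear more often.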


In training a prediction model, there is typically a function to be minimized, called the loss, that describes the difference between the model’s prediction and the ground truth. BachBot minimizes the cross entropy loss between the predicted distribution over x_{t+1} and the actual target distribution. Cross entropy loss is a good starting point for a wide range of tasks, but in some cases you may have your own loss function. Another valid approach is to try different loss functions and keep the model that performs best on the validation set, measured by the metric you actually care about.
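As a rough illustration of cross entropy loss for next-token prediction, the sketch below assumes the model outputs one raw score (logit) per possible token and that the target is a single correct token index. The function names are hypothetical and are not taken from the BachBot code.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability the model assigns to the correct token."""
    return -math.log(softmax(logits)[target_index])

def mean_cross_entropy(batch_logits, targets):
    """Average the per-step loss over a sequence or batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)
```

The loss is zero only when the model puts all of its probability on the correct token, and it grows as that probability shrinks. A model that guesses uniformly between two tokens, for example, has a loss of ln 2.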


In training the RNN, BachBot fed in the correct token as x_{t+1}, instead of the model’s prediction. This process, known as teacher forcing, is used to aid convergence, as the model’s predictions will naturally be poor in the beginning of training. In contrast, during validation and composition, the model’s prediction (x_{t+1}) should be reused as the input for the next prediction.
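The difference between the two modes can be sketched with a toy stand-in for the network. Here toy_step is a hypothetical placeholder for one RNN step (it is simple arithmetic, not a trained model), and both loop functions are illustrative names rather than BachBot functions.

```python
def toy_step(token, hidden):
    """Hypothetical stand-in for one RNN step: returns (prediction, new_hidden)."""
    new_hidden = (hidden + token) % 7
    prediction = (new_hidden * 3 + 1) % 5
    return prediction, new_hidden

def run_teacher_forced(step, tokens, hidden=0):
    """Training-style pass: the input at every step is the true token."""
    predictions = []
    for true_token in tokens[:-1]:
        prediction, hidden = step(true_token, hidden)
        predictions.append(prediction)
    return predictions  # the loss compares these against tokens[1:]

def run_free(step, start_token, length, hidden=0):
    """Generation-style pass: each prediction becomes the next input."""
    token, generated = start_token, []
    for _ in range(length):
        token, hidden = step(token, hidden)
        generated.append(token)
    return generated
```

In the teacher-forced loop a bad prediction never influences the following step, whereas in the free-running loop every mistake is fed back into the model.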

Other Considerations

Practical techniques that were used in this model to improve performance, and are common in LSTM networks, are gradient norm clipping, dropout, batch normalization, and truncated backpropagation through time (BPTT).

Gradient norm clipping mitigates the exploding gradient problem (the counterpart to the vanishing gradient problem, which the LSTM memory cell architecture is designed to address). When gradient norm clipping is used, gradients whose norm exceeds a certain threshold are clipped, typically by rescaling them so that the norm equals the threshold.
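A minimal sketch of clipping by rescaling is shown below. It assumes the gradients have been flattened into a single list of numbers and uses the L2 norm; the function name and the max_norm parameter are illustrative, not taken from BachBot.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so that their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm  # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grads], norm
```

Rescaling keeps the direction of the update unchanged and only limits its size, so a single unusually large gradient cannot throw the parameters far off course.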

Dropout is a technique that randomly turns off (drops out) certain neurons during training. This helps prevent overfitting and improves generalization. Overfitting occurs when the model becomes optimized for the training dataset and is less applicable to samples outside it. Dropout often worsens training loss but improves validation loss (more on this later).
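The sketch below shows one common variant, usually called inverted dropout, in which the surviving activations are scaled up during training so that nothing needs to change at test time. Whether BachBot's Torch layers use exactly this scaling is an assumption; the function name and arguments are illustrative.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training.

    Surviving units are scaled by 1 / (1 - p) so the expected activation
    is unchanged. At validation or composition time the input passes through.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # rng.random() is uniform on [0, 1), so a unit is dropped with probability p.
    return [a / keep if rng.random() >= p else 0.0 for a in activations]
```

Because a different random subset of units is silenced at every update, the network cannot rely on any single unit, which is the mechanism behind the improved generalization described above.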

Computing the gradient of an RNN on a sequence of length 1000 costs the equivalent of a forward and backward pass on a 1000-layer feedforward network. Truncated BPTT is used to reduce the cost of updating parameters in the training process. This means that errors are only propagated a fixed number of time steps backward. Note that learning long-term dependencies is still possible when using truncated BPTT, as the hidden states have already been exposed to many previous time steps.
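One way to picture truncated BPTT is as a split of the long token sequence into fixed-length windows. The sketch below only shows the data side of the technique; the function name is hypothetical, and the handling of the hidden state is described in the comments rather than implemented.

```python
def bptt_chunks(tokens, seq_length):
    """Yield (inputs, targets) windows of at most seq_length time steps.

    Targets are the inputs shifted by one token. In a training loop the
    hidden state would be carried from one window to the next, while
    gradients would stop at the window boundary.
    """
    last_input = len(tokens) - 1  # the final token is only ever a target
    for start in range(0, last_input, seq_length):
        end = min(start + seq_length, last_input)
        yield tokens[start:end], tokens[start + 1:end + 1]
```

For example, a sequence of 10 tokens with seq_length = 4 produces windows of 4, 4, and 1 steps, so each backward pass covers at most four time steps instead of nine.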


The parameters that are relevant in RNN/LSTM models are:

  • The number of layers. As this increases, the model may become more powerful but slower to train. Also, having too many layers may result in overfitting.
  • The hidden state dimension. Increasing this may improve model capacity, but can cause overfitting.
  • The dimension of the vector embeddings.
  • The sequence length (the number of frames before truncating BPTT).
  • Dropout probability. The probability that a neuron drops out at each update cycle.

Finding the optimal set of parameters will be discussed later in the article.

Implementation, Training and Testing

Choosing a Framework

Nowadays, there are many frameworks that help to implement machine learning models in a variety of languages (even JavaScript*!). Some popular frameworks are scikit-learn*, TensorFlow*, and Torch*.

Torch [2] was selected as the framework for the BachBot project. TensorFlow was tried first; however, at the time it used unrolled RNNs, which overflowed the graphics processing unit’s (GPU’s) RAM. Torch is a scientific computing framework that runs on the fast LuaJIT* language. Torch has strong neural network and optimization libraries.

Implementing and Training the Model

Implementation will clearly vary depending on the language and framework you choose. To see how LSTMs were implemented using Torch in BachBot, check out the scripts used to define and train BachBot. These are available on Feynman Liang’s GitHub* site [1].

A good starting place in navigating the repository is 1-train.zsh. From there you should be able to find your way to bachbot.py.

Specifically, the essential script that defines the model is LSTM.lua. The script that trains the model is train.lua.

Hyperparameter Optimization

To find the best hyperparameter settings, a grid search was used on the following grid.

Table 1. Parameter grid used in the BachBot grid search [1].

Parameter                        Values
Number of layers                 1, 2, 3, 4
Hidden state dimension           128, 256, 384, 512
Dimension of vector embeddings   16, 32, 64
Sequence length                  64, 128, 256
Dropout probability              0.0, 0.1, 0.2, 0.3, 0.4, 0.5

A grid search is an exhaustive search over all the possible combinations of parameters. Other suggested hyperparameter optimization methods are random search and Bayesian optimization.
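To make "exhaustive" concrete, the sketch below enumerates the grid from Table 1, which contains 4 × 4 × 3 × 3 × 6 = 864 combinations. The dictionary keys and the evaluate callback are hypothetical names; in practice evaluate would train a model with the given settings and return its validation loss, which this sketch does not attempt.

```python
import itertools

# Values from Table 1; the key names are illustrative.
PARAM_GRID = {
    "num_layers": [1, 2, 3, 4],
    "hidden_dim": [128, 256, 384, 512],
    "embedding_dim": [16, 32, 64],
    "seq_length": [64, 128, 256],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}

def iter_grid(grid):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def grid_search(grid, evaluate):
    """Return (best_config, best_score); lower scores are better."""
    best_config, best_score = None, float("inf")
    for config in iter_grid(grid):
        score = evaluate(config)  # e.g. validation loss after training
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Because every combination requires a full training run, the cost grows multiplicatively with each added parameter, which is the main reason random search and Bayesian optimization are often suggested as alternatives.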

The optimal hyperparameter set found by the grid search was: number of layers = 3, hidden state dimension = 256, dimension of vector embeddings = 32, sequence length = 128, and dropout = 0.3.

This model achieved 0.324 cross entropy loss in training, and 0.477 cross entropy loss in validation. Plotting the training curve shows that training converges after 30 iterations (≈28.5 minutes on a single GPU) [1].

Plotting training and validation losses can also illustrate the effect of each hyperparameter. Of particular interest is dropout probability:

Figure 2. Training curves for various dropout settings [1].

Figure 2 shows that dropout does help prevent overfitting: although dropout = 0.0 has the lowest training loss, it has the highest validation loss, whereas higher dropout probabilities lead to higher training losses but lower validation losses. The lowest validation loss in BachBot’s case occurred at a dropout probability of 0.3.

Alternate Evaluation (optional)

For some models, especially for creative applications such as music composition, loss may not be the most appropriate measure of success. Instead, a better measure could be subjective human evaluation.

The goal of the BachBot project was to automatically compose music that is indistinguishable from Bach’s own compositions. To evaluate this, an online survey was conducted. The survey was framed as a challenge to see whether the user could distinguish between BachBot’s and Bach’s compositions.

The results showed that people who took the challenge (759 participants, varying skill levels) could correctly discriminate between the two samples only 59 percent of the time. This is only 9 percentage points above random guessing! Take The BachBot Challenge yourself!

Adapting the Model to Harmonization

BachBot can now compute P(x_{t+1} | x_t, h_{t-1}), the probability distribution of the possible next notes given the current note and the previous hidden state. This sequential prediction model can then be adapted into one that harmonizes a melody. This adapted harmonization model is required for harmonizing the emotion-modulated melody for the slideshow music project.

In harmonization, a predefined melody is provided (typically the soprano line), and the model must then compose music for the other parts. A greedy best-first search under the constraint that melody notes are fixed is used for this task. Greedy algorithms involve making choices that are locally optimal. Thus, the simple strategy used for harmonization is described as follows:

Let x_t be the tokens in the proposed harmonization. At time step t, if the note is given as the melody, x_t equals the given note. Otherwise x_t is the most likely next note as predicted by the model. The code for this adaptation can be found on Feynman Liang’s GitHub: HarmModel.lua, harmonize.lua.
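The strategy above can be sketched in a few lines. This is not a port of HarmModel.lua or harmonize.lua: the next_token_scores interface, the use of None to mark positions the model must fill, and the toy transition table are all assumptions made for illustration.

```python
def harmonize_greedy(melody, next_token_scores, initial_state=None):
    """Fill the gaps (None) in `melody` with the model's most likely token.

    `next_token_scores(state, prev_token)` is a hypothetical interface that
    returns (scores, new_state), where scores maps token -> probability.
    """
    state, prev, output = initial_state, "START", []
    for given in melody:
        # The model is always stepped so its state sees every token.
        scores, state = next_token_scores(state, prev)
        token = given if given is not None else max(scores, key=scores.get)
        output.append(token)
        prev = token
    return output

# Toy stand-in for a trained model: a fixed transition table, no real state.
TOY_TABLE = {
    "START": {"C": 0.6, "G": 0.4},
    "C": {"E": 0.7, "G": 0.3},
    "E": {"G": 0.9, "C": 0.1},
    "G": {"C": 0.8, "E": 0.2},
}

def toy_model(state, prev_token):
    return TOY_TABLE[prev_token], state

result = harmonize_greedy([None, None, "C", None], toy_model)
```

Note that the fixed melody token at the third position overrides whatever the model would have chosen, and the model only learns about it once it arrives. This is the limitation that motivates looking ahead with a wider search.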

Below is an example of BachBot’s harmonization of Twinkle, Twinkle, Little Star, using the above strategy.

Figure 3. The BachBot harmonization of Twinkle, Twinkle, Little Star (in the soprano line). Alto, tenor and bass parts were filled in by BachBot [1].

In this example, the melody to Twinkle, Twinkle, Little Star is provided in the soprano line. The alto, tenor and bass parts are then filled by BachBot using the harmonization strategy.

Despite BachBot’s decent performance on this task, there are certain limitations to this model. Specifically, it doesn’t look ahead in the melody and uses only the current melody note and past context to generate notes. When people harmonize melodies, they can examine the whole melody, which makes it easier to infer appropriate harmonizations. The fact that this model can’t do that may result in surprises from future constraints, which cause mistakes. To solve this, a beam search may be used.

Beam searches explore multiple trajectories. For example, instead of only taking the most probable note (what is currently being done) it may take the four or five most probable, and explore each of these notes. Exploring multiple options can help the model recover from mistakes. Beam searches are commonly used in natural language processing applications to generate sentences.
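A generic beam search can be sketched as follows. It is not adapted to the fixed-melody constraint and is not part of BachBot; the next_token_probs interface is hypothetical, and sequences of different lengths are compared by raw log-probability, which is a simplification.

```python
import math

def beam_search(next_token_probs, start, end, beam_width=4, max_len=20):
    """Return the highest-scoring token sequence found with a fixed-width beam.

    `next_token_probs(sequence)` is a hypothetical interface returning a
    dict of token -> probability for the token that follows `sequence`.
    """
    beams = [(0.0, [start])]  # (log-probability, sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for token, p in next_token_probs(seq).items():
                if p > 0:
                    candidates.append((logp + math.log(p), seq + [token]))
        if not candidates:
            break
        # Keep only the best `beam_width` extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:
            (finished if seq[-1] == end else beams).append((logp, seq))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]

# Toy distribution in which the locally best first choice ("A") is a trap.
TOY_PROBS = {
    "START": {"A": 0.6, "B": 0.4},
    "A": {"X": 0.5, "Y": 0.5},
    "B": {"Z": 1.0},
    "X": {"END": 1.0}, "Y": {"END": 1.0}, "Z": {"END": 1.0},
}

def toy_probs(seq):
    return TOY_PROBS[seq[-1]]

greedy = beam_search(toy_probs, "START", "END", beam_width=1)
wider = beam_search(toy_probs, "START", "END", beam_width=2)
```

In this toy case a beam width of 1 behaves like the greedy strategy and commits to "A" (total probability 0.3), while a width of 2 keeps "B" alive long enough to find the more probable sequence (0.4).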

Emotion-modulated melodies can now be put through this harmonization model to be completed. The way this is done is detailed in the final article describing application deployment.


This article used BachBot as a case study in discussing the considerations of building a creative deep learning model. Specifically, it discussed techniques that improve generalization and accelerate training for RNN/LSTM models, hyperparameter optimization, evaluation of the model, and ways to adapt a sequence prediction model for completion (or generation).

All of the parts of the Slideshow Music project are now complete. The final articles in this series will discuss how these parts are put together to form the final product. That is, they will discuss how the emotion-modulated melodies may be provided as an input to BachBot’s harmonization model, and deployment of the completed application.

References and Links

  1. Liang, F. (2016). BachBot. Available at GitHub*.
  2. Collobert, R., Farabet, C., Kavukcuoglu, K., & Chintala, S. (2017). Torch (Version 7). Retrieved from Torch.

