The purpose of this project is to use deep learning techniques to build a music generator that can produce unique and melodious music. Many different formats of training data (MIDI, .mp3, .wav) have been used for this task, as well as many different types of models, such as GANs, Google’s WaveNet, and various kinds of RNNs. For the sake of simplicity, the focus was narrowed specifically to LSTM-based model architectures generating music in ABC notation. ABC notation is a text-based music notation format. Each entry begins with header fields for the index (X), title (T), author/s (S), meter/time signature (M), key (K), and unit note length (L). This is followed by the actual transcription, which uses the seven letters A to G plus symbols corresponding to features such as key, note length, raised or lowered octave, flats, and sharps to represent the song. Multiple examples of this notation are provided in the EDA. The data for this project came from the Nottingham Music Database, whose “Nottingham Collection” contains 1037 different folk tunes.
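As a quick illustration (a made-up fragment in the same style, not an actual tune from the collection), an ABC entry looks roughly like this:

```
X: 1
T: Example Tune
S: Traditional
M: 4/4
L: 1/4
K: G
G2 AB | c2 BA | G2 G2 | A4 |
```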
In this paper, the researchers built two different types of RNN (with the same architecture): char-RNN, which operates over a vocabulary of single characters, and folk-RNN, which operates over a vocabulary of transcription tokens and was trained on single complete transcriptions. They trained these RNNs on roughly 23,000 different transcriptions in ABC notation. To generate samples, a “seed” value was fed to the network, the network’s output probability distribution was sampled to choose the next value, and that value was fed back in as the next input, and so on. Unlike previous works, the networks created here contain thousands of units and generated thousands of transcriptions for evaluation. The generated transcriptions were evaluated by comparing descriptive statistics of the training transcriptions and the generated samples (including comparing distributions), by examining how well the generated transcriptions adhered to music-theoretic conventions found in the training transcriptions, and by sharing the transcriptions on an online forum and asking people to rate them and provide feedback. This paper was chosen because it uses a similar dataset in the same format, albeit with far more samples for training. The authors have a good understanding of musical theory and used statistical analysis of the generated samples to conclude that their experiment produced very good results.
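To make the sampling procedure concrete, the following is a minimal sketch of how such seeded, autoregressive generation is typically implemented; the char-level framing and all names (`model`, `char_to_idx`, `idx_to_char`) are illustrative assumptions, not code from the paper.

```python
# Seeded autoregressive sampling sketch: feed a seed string to the model,
# sample the next character from its output distribution, append it, repeat.
import numpy as np

def generate(model, seed, char_to_idx, idx_to_char, length=500):
    generated = list(seed)
    for _ in range(length):
        # One-hot encode the current context and ask the model for the
        # probability distribution over the next character.
        x = np.zeros((1, len(generated), len(char_to_idx)), dtype=np.float32)
        for t, ch in enumerate(generated):
            x[0, t, char_to_idx[ch]] = 1.0
        probs = np.asarray(model.predict(x, verbose=0)[0], dtype=np.float64)
        probs /= probs.sum()                       # guard against float rounding
        # Sample the next character and feed it back in as the next input.
        next_idx = np.random.choice(len(probs), p=probs)
        generated.append(idx_to_char[next_idx])
    return "".join(generated)
```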
In this paper, three-layered LSTM networks were trained on ABC notation data. The data used in this research was monophonic, meaning that all the transcriptions were composed for a single instrument. The model(s) were evaluated using accuracy and loss, and by how the generated samples sounded to the authors. While this evaluation strategy is lacklustre compared to the other papers, the work is worth replicating because the authors did mention that they expect the model to perform better if trained on polyphonic, multi-instrument data, which is what the Nottingham Collection used here consists of.
This paper is similar to Paper 1, except it focuses specifically on generating jazz chord progressions and rock drum tracks. Just as with the research done as part of this project, the authors simplify the process by using text data rather than representations of musical symbols or numeric values (such as binary vectors for pitch and chords). The data format differed for each task: an .xlab format for the chord progressions and MIDI for the drum tracks. Examples of each can be found in the paper. Both tasks were evaluated by assessing whether the generated samples had learned the relationships they were meant to learn from the training data. For chord progressions, the authors found that the model was able to learn local structures of chords and bars, and local relationships between flags and chords; they also found that the generated samples lie well within jazz grammar. For drum tracks, however, this was not the case, and the authors felt a more complex network would be required. This paper was chosen because it uses text formats different from the one used in this project, so it is worth exploring whether its findings generalize to other similar problems. Moreover, the way their model captured intricate music-theoretic relationships makes the approach worth applying here.
The success of the model depends on generated samples that cannot be verified against test or validation data, because we want them to be unique while still maintaining the format and not generating garbage values. Moreover, we want the generated samples to sound melodious; however, since music is subjective, it is difficult to quantify what is “good” or “bad” without having it rated by a sample of people or experts. Nonetheless, we can analyse the generated samples on some key basics: whether or not they adhere to the correct format, how the distributions of different elements of the ABC notation compare statistically against the training data, and how unique each generated sample is. These basics guided the three main pillars of evaluation for this project, which are as follows:
- Error Rate: the number of erroneous rows divided by the total number of rows (which is 100). Erroneous rows are determined by checking for null or empty strings in ‘M’, ‘K’ and the transcription, checking for any characters other than letters in ‘K’, and checking for letters erroneously present in ‘M’ (see the sketch after this list).
- Distribution Similarity: the percentages of the different unique value counts in both the training and generated datasets are printed and plotted side by side for visual comparison.
- Average Maximum Similarity Score: the similarity score is calculated using the LCS algorithm (with some edits in sequence matching). Each generated sample is compared against every sample in the training set, the maximum similarity from that comparison is taken, and these maxima are averaged across all the generated samples to give the average maximum similarity score. The maximum is taken for each sample because we want to see whether a generated sequence is overfitting to any specific sequence in the training data; averaging those maxima then summarizes this across the entire generated dataset. This is the main performance metric of the project; the other two are there to guide tuning and to ensure there is no underfitting or overfitting. A sketch of the error-rate and similarity calculations follows this list.
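As referenced above, here is a minimal sketch of how the error rate and average maximum similarity score could be computed. The column names (‘M’, ‘K’, ‘transcription’) are assumptions about the generated-sample DataFrame, and difflib’s SequenceMatcher is used here as a stand-in for the LCS-style comparison with edits in sequence matching.

```python
# Sketch of the error-rate check and the average maximum similarity score.
from difflib import SequenceMatcher
import pandas as pd

def error_rate(gen_df: pd.DataFrame) -> float:
    """Fraction of generated rows with a malformed 'M', 'K' or transcription."""
    def is_bad(row):
        # Null or empty 'M', 'K' or transcription fields count as errors.
        if any(pd.isna(row[c]) or not str(row[c]).strip() for c in ("M", "K", "transcription")):
            return True
        if not str(row["K"]).isalpha():                 # 'K' should contain letters only
            return True
        if any(ch.isalpha() for ch in str(row["M"])):   # 'M' (e.g. 4/4) should contain no letters
            return True
        return False
    return float(gen_df.apply(is_bad, axis=1).mean())

def avg_max_similarity(generated, training) -> float:
    """For each generated transcription, keep its highest similarity to any
    training transcription, then average those maxima over all samples."""
    maxima = [max(SequenceMatcher(None, g, t).ratio() for t in training) for g in generated]
    return sum(maxima) / len(maxima)
```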
The EDA guided the evaluation framework for this project, as outlined above. The models were evaluated on generated samples. To ensure a level of consistency across the sequences generated by all the models, the seed index value and the input sequence length were kept constant at 76 and 500 respectively. 100 samples were generated for each model, and the various evaluation functions were then run to ascertain how the model performed with respect to the pillars of evaluation outlined above. The models were subsequently tuned to achieve the lowest average maximum similarity score, while keeping the error rate low to ensure the models do not generate rubbish. More information can be found in the notebooks.
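A rough sketch of how this evaluation run could be wired together, reusing the generate, error_rate and avg_max_similarity functions sketched earlier; parse_abc_samples is a hypothetical helper that splits each generated string into its ‘M’, ‘K’ and transcription fields, and the constants mirror those stated above.

```python
SEED_INDEX = 76   # index of the training tune used as the seed, constant for all models
SEQ_LEN = 500     # input sequence length, also constant
N_SAMPLES = 100   # samples generated per model

def evaluate_model(model, train_df, char_to_idx, idx_to_char):
    seed = train_df.iloc[SEED_INDEX]["transcription"][:SEQ_LEN]
    samples = [generate(model, seed, char_to_idx, idx_to_char, length=SEQ_LEN)
               for _ in range(N_SAMPLES)]
    gen_df = parse_abc_samples(samples)   # hypothetical helper: split each sample into 'M', 'K', transcription
    return {
        "error_rate": error_rate(gen_df),
        "avg_max_similarity": avg_max_similarity(gen_df["transcription"], train_df["transcription"]),
    }
```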
A great deal of experimentation and tuning was conducted for this project, most of it with the first model. The lessons learnt from each model then helped shape and improve the tuning procedure for the subsequent ones.
For Model 1, we experimented with reducing the batch size to give training a stronger regularizing effect, which improved performance and reduced overfitting. Since Paper 1 used a learning-rate schedule for its optimizer, we tried Adam and RMSProp without specifying a learning rate, as they tend to be adaptive, but the results were not much better. Different regularization techniques were then tried: first increasing the dropout, then applying L1 and L2 regularization with various values of lambda. Kernel regularization was not fruitful at all, but increasing dropout to a maximum of 0.7 did improve performance. The ‘poisson’ loss was also tried; while the results were positive in terms of uniqueness, it generated lots of garbage values and many of the generated samples were out of tune. Combinations of reducing the complexity and increasing dropout were tried, but the results kept alternating between a very high error rate with a small average maximum similarity score and vice versa.
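For concreteness, here is a minimal Keras-style sketch of the kinds of knobs tuned for Model 1 (dropout rate, L1/L2 kernel regularization, optimizer, batch size); the layer sizes, vocabulary size and exact stack are illustrative assumptions rather than the actual Model 1 configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

VOCAB_SIZE = 90   # assumed size of the ABC character vocabulary

def build_model(dropout=0.7, l2_lambda=0.0):
    # Optional L2 kernel regularization (L1 can be swapped in via regularizers.l1).
    reg = regularizers.l2(l2_lambda) if l2_lambda else None
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, VOCAB_SIZE)),        # variable-length one-hot character sequences
        layers.LSTM(256, return_sequences=True, kernel_regularizer=reg),
        layers.Dropout(dropout),                         # dropout of up to 0.7 helped in our runs
        layers.LSTM(256, kernel_regularizer=reg),
        layers.Dropout(dropout),
        layers.Dense(VOCAB_SIZE, activation="softmax"),  # distribution over the next character
    ])
    # Adam/RMSProp were tried without an explicit learning rate since they adapt it
    # themselves; the 'poisson' loss was also tested but produced many malformed samples.
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

model = build_model(dropout=0.7, l2_lambda=1e-4)
# A smaller batch size had a regularizing effect, e.g.:
# model.fit(X_train, y_train, batch_size=32, epochs=50)
```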
For Model 2, we tried increasing the number of epochs, then increasing the complexity with more hidden layers. We also experimented with the dropout and with doubling the hidden units, something we did not do with Model 1. On the whole, the performance improvements from this experimentation were average.
The baseline performance for Model 3 was quite poor, with a high error rate and a high average maximum similarity score, so we started by increasing the model’s complexity, then tuned the dropout, and finally reduced the model’s complexity (in terms of hidden units) to obtain the lowest average maximum similarity score.
The final model chosen with the best performance is the tuned version of Model 3, which, although it had a high error rate of 49%, achieved the lowest average maximum similarity score of 0.22. The generated samples for this model can be found under Data/Generated_Samples/Model_3/Dropout0.4_DecreaseComplextyHiddenUnits/. The biggest drawback of this ultimate judgment is that the model producing the most unique (and, in my opinion, the best-sounding) tracks is erroneous half the time. For our particular use case, which is simply generating something that sounds good and unique, this is not a problem; however, it would be problematic if this were used commercially or in industry.

There were many limitations to this project, the most important being the number of samples used for training. Projects such as Paper 1 used ~23,000 samples, whereas we only used 1,036. While more ABC notation files are available online, only this collection was chosen in the interest of time. Furthermore, the tunes in the Nottingham Collection are all of the same genre, making it easier for the model to learn and for us to assess its performance. A larger number of training samples could, in theory, result in a smaller error rate for the final model. Paper 1 and Paper 3 were written by authors with a good understanding of musical theory, who could therefore evaluate the generated samples on more than just statistical quantifications and similarity scores. With more musical theory knowledge, or someone to consult with, we could assess how well the models learned things like chord progressions, tonal consistency, pauses between notes, etc.

The error rate derived from the generated samples treats a row as erroneous if it contains even a single error; we could go deeper and quantify the number of erroneous ‘M’, ‘K’ and transcription values separately to better understand the source of the errors, which in turn could improve the model tuning approach. The generated samples were also very sensitive to the seed index value; while it was kept constant for all the models, this means that, given the right seed value, some models may perform better than others. It would be interesting to see how the models stack up when 100 samples are generated for each of several seeds, but this was too time-consuming to consider for this project.
- Sturm, B., Santos, J., Ben-Tal, O. and Korshunova, I., 2021. Music transcription modelling and composition using deep learning. [online] arXiv.org. Available at: https://arxiv.org/abs/1604.08723
- Ingale, V., Mohan, A., Adlakha, D., Kumar, K. and Gupta, M., 2021. Music Generation using Three-layered LSTM. [online] arXiv.org. Available at: https://arxiv.org/abs/2105.09046.
- Choi, K., Fazekas, G. and Sandler, M., 2021. Text-based LSTM networks for Automatic Music Composition. [online] arXiv.org. Available at: https://arxiv.org/abs/1604.05358 [Accessed 19 October 2021].
- Allwright, J., 2021. The ABC Music project - The Nottingham Music Database. [online] abc.sourceforge.net. Available at: http://abc.sourceforge.net/NMD/