Automatic rap lyric generation tool
- invoke
- tensorflow
- juman
- kytea
- chainer
# pip install beautifulsoup
python getlyrics.py -v > output.tsv
- Extract the lyrics archive, then run the following command to obtain the file data/juman_input.txt:
python preprocess.py -crawl data/lyrics_shonan_s27_raw.tsv
- Feed the cleaned crawled corpus to juman:
juman < data/juman_input.txt > data/juman_out.txt
- Process the juman output file:
python preprocess.py -juman data/juman_out.txt
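The three preprocessing commands above can also be chained from a single script. The following is only a minimal sketch: the helper names `preprocessing_steps` and `run_preprocessing` are hypothetical and not part of this repository, and it assumes that `juman` is on the PATH and that the script is started from the repository root.

```python
import subprocess
import sys
from contextlib import ExitStack

# Paths taken from the commands above.
RAW_TSV = "data/lyrics_shonan_s27_raw.tsv"
JUMAN_IN = "data/juman_input.txt"
JUMAN_OUT = "data/juman_out.txt"


def preprocessing_steps(python=sys.executable):
    """Return the steps as (argv, stdin_path, stdout_path) tuples (hypothetical helper)."""
    return [
        ([python, "preprocess.py", "-crawl", RAW_TSV], None, None),
        (["juman"], JUMAN_IN, JUMAN_OUT),  # same as: juman < in > out
        ([python, "preprocess.py", "-juman", JUMAN_OUT], None, None),
    ]


def run_preprocessing():
    """Run each step in order and stop at the first failing command."""
    for argv, stdin_path, stdout_path in preprocessing_steps():
        with ExitStack() as stack:
            stdin = stack.enter_context(open(stdin_path, "rb")) if stdin_path else None
            stdout = stack.enter_context(open(stdout_path, "wb")) if stdout_path else None
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
```

Calling `run_preprocessing()` executes the steps one after another; `check=True` raises an exception as soon as one of them exits with a non-zero status.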
The preprocessing step is finished. You will have three files in the data/ folder:
- string_corpus.txt: a string corpus file for LSTM training (one sentence per line); each song is separated from the previous one by one line
- hiragana_corpus.txt: a hiragana corpus file for FFNN training (one sentence per line); each song is separated from the previous one by one line
- daihyou_vocab.p: a vocabulary file (keys correspond to surface forms, values to 代表表記, the representative form); it is used to look up the embeddings during LSTM training
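To inspect these files before training, something like the sketch below may help. It rests on assumptions not confirmed by this README: that the corpus files are UTF-8, that the separator between songs is an empty line, and that daihyou_vocab.p is a pickled dict (a pickle written under Python 2 may need extra arguments to `pickle.load`). The function names are hypothetical.

```python
import pickle


def read_songs(path):
    """Group corpus lines into songs; an empty line is assumed to end a song."""
    songs, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                current.append(line)
            elif current:
                songs.append(current)
                current = []
    if current:  # last song when the file has no trailing empty line
        songs.append(current)
    return songs


def load_vocab(path):
    """Load the surface form -> 代表表記 mapping, assumed to be a pickled dict."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `read_songs("data/string_corpus.txt")` returns one list of sentences per song, and `load_vocab("data/daihyou_vocab.p")` returns the lookup table.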
- Training
inv train model
- Testing
inv test model
- Training
Run the command below in the chainer_model directory:
python train_lstm_lm.py (--gpu 0)
You should train on a GPU (this code is very slow on a CPU).
- Generating lines
python generate_seq.py --model trained_model -O output_file N 10000
Make the term-rhyme table using data/string_corpus.txt and data/hiragana_corpus.txt:
python features/make_term_vowel_table.py -v --unknown-terms <path-to-unknown-terms:optional> > <path-to-output-table>
- data/term_vowel_table.csv: term-to-vowel table (each row has term,vowels)
- data/unknown_terms.txt: terms that did not have a hiragana form in data/hiragana_corpus.txt. Currently they are filtered out from the table above.
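The term-to-vowel table can be loaded with the standard csv module. This is a sketch under assumptions: the file is taken to be UTF-8 with no header row, and the vowels column is treated as a plain string. `shared_vowel_suffix` is a hypothetical illustration of comparing two terms by their trailing vowels; it is not the rhyme scoring used by this project.

```python
import csv


def load_term_vowel_table(path):
    """Read term,vowels rows into a dict (assumes UTF-8 and no header row)."""
    table = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 2:  # skip empty or malformed rows
                table[row[0]] = row[1]
    return table


def shared_vowel_suffix(a, b):
    """Count how many trailing symbols two vowel strings have in common."""
    count = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        count += 1
    return count
```

With the table loaded as `table = load_term_vowel_table("data/term_vowel_table.csv")`, two terms can be compared via `shared_vowel_suffix(table[t1], table[t2])`; a larger count means a longer shared vowel ending.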
python NextLine.py -f data/sample_nextline_prediction_candidates.txt
After processing, you will have the result in test_lyrics.txt.
Note: You may need to comment out the lines below in NextLine.py:
if __name__ == "__main__":
...
temp.pop(0)
temp.pop(-1)