Tutorial: Your first nalpline

Gustavo Rosa edited this page Apr 13, 2021 · 3 revisions

Every script starts with some imports, correct?

import tensorflow as tf

from nalp.corpus import TextCorpus
from nalp.datasets import LanguageModelingDataset
from nalp.encoders import IntegerEncoder
from nalp.models.generators import RNNGenerator

Next, we take the first step of the pipeline: loading the input data, which is represented by a corpus.

# Creating a character TextCorpus from file
corpus = TextCorpus(from_file='data/text/chapter1_harry.txt', corpus_type='char')
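
To make the idea of a character-level corpus concrete, here is a minimal pure-Python sketch of what such a corpus conceptually holds: a list of tokens plus two lookup tables. The helper `build_char_corpus` is hypothetical and is not NALP's implementation; only the names `vocab_index` and `index_vocab` mirror the attributes used later in this tutorial.

```python
def build_char_corpus(text):
    """Split text into character tokens and build vocabulary lookups.

    Illustrative only: this is an assumption about what a character
    corpus contains, not the code behind nalp.corpus.TextCorpus.
    """
    # Every character (including spaces) becomes one token
    tokens = list(text)

    # Sorted set of unique characters gives a deterministic vocabulary
    vocab = sorted(set(tokens))

    # Token -> index and index -> token mappings
    vocab_index = {ch: i for i, ch in enumerate(vocab)}
    index_vocab = {i: ch for i, ch in enumerate(vocab)}

    return tokens, vocab_index, index_vocab


tokens, vocab_index, index_vocab = build_char_corpus("harry potter")
vocab_size = len(vocab_index)
```

In this toy input there are 12 tokens but only 9 distinct characters, so the vocabulary is much smaller than the text itself.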

With the corpus in hand, we can now create an encoder and encode the tokens.

# Creating an IntegerEncoder, learning encoding and encoding tokens
encoder = IntegerEncoder()
encoder.learn(corpus.vocab_index, corpus.index_vocab)
encoded_tokens = encoder.encode(corpus.tokens)
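
As a rough illustration of integer encoding, the sketch below maps tokens to integers through a learned lookup table. `ToyIntegerEncoder` is a hypothetical stand-in written for this example; it copies the `learn`/`encode` call shape shown above, while the `decode` method is an assumption added only to show the round trip.

```python
class ToyIntegerEncoder:
    """Hypothetical stand-in for an integer encoder (not NALP's class)."""

    def __init__(self):
        self.encoder = None  # token -> integer
        self.decoder = None  # integer -> token

    def learn(self, vocab_index, index_vocab):
        # "Learning" here is just storing copies of the two mappings
        self.encoder = dict(vocab_index)
        self.decoder = dict(index_vocab)

    def encode(self, tokens):
        if self.encoder is None:
            raise RuntimeError("Call learn() before encode().")
        return [self.encoder[t] for t in tokens]

    def decode(self, ids):
        # Assumed helper for the round trip; not shown in the tutorial
        if self.decoder is None:
            raise RuntimeError("Call learn() before decode().")
        return [self.decoder[i] for i in ids]


# Tiny two-character vocabulary for demonstration
toy_vocab_index = {"a": 0, "b": 1}
toy_index_vocab = {0: "a", 1: "b"}

toy_encoder = ToyIntegerEncoder()
toy_encoder.learn(toy_vocab_index, toy_index_vocab)

toy_encoded = toy_encoder.encode(list("abba"))
toy_decoded = "".join(toy_encoder.decode(toy_encoded))
```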

From the encoded tokens, we will build the language modelling dataset, since language modelling is the task we are aiming to accomplish.

# Creating Language Modeling Dataset
dataset = LanguageModelingDataset(encoded_tokens, max_contiguous_pad_length=10, batch_size=64, shuffle=True)
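
In language modelling, each training example usually pairs an input sequence with the same sequence shifted one token ahead, so the model learns to predict the next token. The sketch below shows that idea with non-overlapping windows. The helper `make_lm_pairs` and its `seq_length` parameter are hypothetical; this page does not describe how `LanguageModelingDataset` slices its tokens or what exactly `max_contiguous_pad_length` controls, so treat this as an illustration of the task rather than of the library.

```python
def make_lm_pairs(encoded_tokens, seq_length):
    """Cut a token stream into (input, target) pairs for next-token prediction.

    Each window holds seq_length + 1 tokens; the input drops the last
    token and the target drops the first, so target[i] follows input[i].
    """
    window = seq_length + 1
    pairs = []

    # Non-overlapping windows; a trailing partial window is discarded
    for start in range(0, len(encoded_tokens) - window + 1, window):
        chunk = encoded_tokens[start:start + window]
        pairs.append((chunk[:-1], chunk[1:]))

    return pairs


# Ten dummy token ids, windows of three input tokens each
pairs = make_lm_pairs(list(range(10)), seq_length=3)
```

Here the ten tokens yield two pairs, and the last two tokens are dropped because they cannot fill a complete window.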

Finally, we can create the recurrent-based model, compile it with an optimizer, loss and metrics, and train it on the instantiated dataset.

# Creating the RNN
rnn = RNNGenerator(encoder=encoder, vocab_size=corpus.vocab_size, embedding_size=256, hidden_size=512)

# As NALP's RNNs are stateful, we need to build it with a fixed batch size
# (64 here, matching the batch_size given to the dataset)
rnn.build((64, None))

# Compiling the RNN
rnn.compile(optimizer=tf.optimizers.Adam(learning_rate=0.001),
            loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=[tf.metrics.SparseCategoricalAccuracy(name='accuracy')])

# Fitting the RNN
rnn.fit(dataset.batches, epochs=200)
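
The loss used in the compile step, sparse categorical cross-entropy with `from_logits=True`, takes raw unnormalized scores and an integer target, and returns the negative log of the softmax probability assigned to that target. The pure-Python function below is a minimal sketch of that formula for a single prediction; `sparse_ce_from_logits` is a hypothetical name, and the real TensorFlow loss additionally handles batching and reduction.

```python
import math


def sparse_ce_from_logits(logits, target):
    """Cross-entropy for one prediction, given raw logits and an integer target.

    Computes -log(softmax(logits)[target]) using the log-sum-exp trick
    for numerical stability.
    """
    m = max(logits)
    # log(sum(exp(logits))), shifted by the maximum to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# With uniform logits over 4 classes, every class has probability 1/4
uniform_loss = sparse_ce_from_logits([0.0, 0.0, 0.0, 0.0], target=2)

# A confident, correct prediction gives a much smaller loss
confident_loss = sparse_ce_from_logits([5.0, 0.0, 0.0, 0.0], target=0)
```

For a character-level model, `logits` would have one entry per vocabulary item and `target` would be the integer id of the true next character.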

There you go! Just put these instructions sequentially in a single file and run it. You can also find the complete file at examples/models/generators/train_rnn.py. Stay focused and you should be ready for anything.