Bottom-Up Approach of Learning
- Text Preprocessing Level 1 - Tokenization, Lemmatization, Stop Words, POS
- Text Preprocessing Level 2 - Bag of Words, TF-IDF, Unigrams, Bigrams, n-grams
- Text Preprocessing - Gensim, Word2Vec, AvgWord2Vec
- Solve Machine Learning Use Cases
- Get an Understanding of Artificial Neural Networks
- Understanding Recurrent Neural Networks, LSTM, GRU
- Text Preprocessing Level 3 - Word Embeddings, Word2Vec
- Bidirectional LSTM RNN, Encoders and Decoders, Attention Models
- Transformers
- BERT
Tokenization is the process of breaking a body of text, such as a paragraph, into sentences and words.
Library used:
NLTK
Paragraph -> Sentences
sentences = nltk.sent_tokenize(paragraph)
Paragraph -> Words
words = nltk.word_tokenize(paragraph)
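A minimal runnable sketch, assuming NLTK is installed and the punkt tokenizer data has been downloaded:

```python
import nltk
nltk.download('punkt')  # sentence/word tokenizer models, needed once

paragraph = "NLP is fun. It has many real world applications."

sentences = nltk.sent_tokenize(paragraph)
words = nltk.word_tokenize(paragraph)

print(sentences)  # ['NLP is fun.', 'It has many real world applications.']
print(words)      # ['NLP', 'is', 'fun', '.', 'It', 'has', ...]
```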
"Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language" - Wikipedia
From the definition we learn that stemming derives a root form from words, even though that root itself may not be a valid word.
Example:
Trouble, Troubled, Troubling
When we stem the above words, the output will be troubl. Sounds strange, doesn't it?
Stemming simply strips common prefixes and suffixes and gives us the stem, which may or may not be a valid word.
Libraries used
NLTK, PorterStemmer
Stop words are common words that do not add much meaning to a sentence, so they can safely be ignored without sacrificing the meaning of the sentence. Examples include the, he, have, etc.
Libraries used
NLTK, stopwords
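A quick sketch of loading the stop word list, assuming the NLTK stopwords corpus has been downloaded:

```python
import nltk
nltk.download('stopwords')  # stop word lists, needed once
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print('the' in stop_words)   # True
print('dog' in stop_words)   # False
```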
With the knowledge we have gained so far we can tokenize a paragraph into sentences, convert those sentences into words, and then perform stemming, as sketched below.
Look at the Stemming notebook.
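A minimal sketch of that pipeline (tokenize into sentences, split into words, drop stop words, then stem), assuming a toy paragraph string:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

paragraph = "He was troubled. The troubling news kept troubling him."

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

sentences = nltk.sent_tokenize(paragraph)
for i, sentence in enumerate(sentences):
    words = nltk.word_tokenize(sentence)
    stemmed = [stemmer.stem(word) for word in words if word.lower() not in stop_words]
    sentences[i] = ' '.join(stemmed)

print(sentences)  # e.g. ['troubl .', 'troubl news kept troubl .']
```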
"Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form" - Wikipedia
In simpler words, it can be seen as a refinement of stemming: the output of lemmatization is always a meaningful, valid word.
Libraries used
NLTK, WordNetLemmatizer
Check the implementation in the Lemmatization notebook.
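A minimal sketch, assuming the NLTK wordnet corpus has been downloaded:

```python
import nltk
nltk.download('wordnet')  # dictionary used by the lemmatizer, needed once
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('histories'))            # history
print(lemmatizer.lemmatize('troubling', pos='v'))   # trouble
print(lemmatizer.lemmatize('better', pos='a'))      # good
```

Unlike the stemmer, every output here is a valid dictionary word; passing the part of speech (pos) helps the lemmatizer pick the right lemma.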
Suppose we have 3 sentences:
Sentence I: good boy
Sentence II: good girl
Sentence III: boy girl
Considering the above 3 sentences, we count the frequency of each word and then convert the sentences into vectors.
| word | frequency |
|---|---|
| good | 2 |
| boy | 2 |
| girl | 2 |
Converting to Vector
| | good | boy | girl |
|---|---|---|---|
| Sentence I | 1 | 1 | 0 |
| Sentence II | 1 | 0 | 1 |
| Sentence III | 0 | 1 | 1 |
Libraries used
NLTK, stopwords, re, sklearn
Process
- Get the input
- Tokenize the paragraph into sentences
- Lemmatize the words in each sentence
- Use regular expressions to eliminate symbols and numbers
- Pass the processed data to CountVectorizer
- The final output will be an array like the one we have seen above.
Check the implementation in the Bag of Words notebook.
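A minimal sketch of the vectorizing step with sklearn's CountVectorizer, using the three example sentences (the cleaning steps above are skipped here since the sentences are already clean):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['good boy', 'good girl', 'boy girl']

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

print(cv.get_feature_names_out())  # ['boy' 'girl' 'good'] (columns are sorted alphabetically)
print(X)
# [[1 0 1]
#  [0 1 1]
#  [1 1 0]]
```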
TF - Term Frequency
IDF - Inverse Document Frequency
TF
Refer to the above example in Bag of Words.
The main disadvantage of BOW is that it does not tell us which words are the important ones; it simply gives us a count vector (which is still powerful).
TF = (number of times the word appears in the sentence) / (total number of words in the sentence)
IDF
IDF = log(total number of sentences / number of sentences that contain the word)
The final table is the product TF * IDF.
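As a quick worked example with the three sentences above: for the word good in Sentence I, TF = 1/2 (it appears once among 2 words) and IDF = log(3/2) (3 sentences, 2 of them contain good), so its TF-IDF value is 0.5 * log(3/2). A word that appeared in every sentence would get IDF = log(3/3) = 0, which is how TF-IDF automatically down-weights very common words.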
Libraries used
NLTK, stopwords, re, sklearn
Process
- Get the input
- Tokenize the paragraph into sentences
- Lemmatize the words in each sentence
- Use regular expressions to eliminate symbols and numbers
- Pass the processed data to TfidfVectorizer
- The final output will be an array like the one we have seen above.
Check the implementation in the TFIDF notebook.
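A minimal sketch with sklearn's TfidfVectorizer on the same three sentences (note that sklearn uses a smoothed IDF and L2-normalises each row, so the exact numbers differ slightly from the hand formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['good boy', 'good girl', 'boy girl']

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()

print(tfidf.get_feature_names_out())  # ['boy' 'girl' 'good']
print(X.round(2))  # one TF-IDF vector per sentence
```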
The main problem with BOW and TF-IDF is that they do not capture any semantic information. To overcome this we use Word2Vec.
In BOW and TF-IDF the words are represented by counts in an array, while in Word2Vec each word is represented as a vector of 32 or more dimensions.
These dimensions capture semantic relationships with other words, e.g. Man and Woman will end up near each other.
In simpler words, related words are placed as close together as possible in the vector space.
It's a broad topic; if you want to learn more, please refer to Google.
Libraries used
NLTK, stopwords, re, sklearn, gensim
Process
- Cleaning the data
- Tokenizing
- Building the model
- Exploring the vocabulary and comparing word vectors (e.g. finding similar words)
Check the implementation in the Word2Vec notebook.
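A minimal sketch of training a small model with gensim (assuming gensim 4.x, where the size parameter is called vector_size; the toy corpus is far too small to learn meaningful similarities):

```python
from gensim.models import Word2Vec

# each document is a list of cleaned, tokenized words
sentences = [['king', 'man', 'strong'],
             ['queen', 'woman', 'strong'],
             ['man', 'woman', 'person']]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vector = model.wv['man']                # 100-dimensional vector for 'man'
similar = model.wv.most_similar('man')  # words closest to 'man' in the vector space
print(vector.shape, similar[:3])
```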
Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.
From the above statement there arises a question: what type of representation? Of course the answer is a vector representation, as we have seen earlier.
In a word embedding each word is represented by a vector of up to 300 dimensions, where each dimension captures some aspect of meaning. Google's pre-trained Word2Vec model, for example, was trained on a huge news corpus and provides 300-dimensional vectors for around 3 million words and phrases.
This is how similar words are grouped together as much as possible. With this understanding let's dive deeper.
One Hot Encoding
Like Bag of Words, every word is represented with respect to a vocabulary; this is called one-hot encoding. The size of the encoding depends on the vocabulary size, which we can define ourselves. Keras' one_hot maps every word in a sentence to an integer index within that vocabulary size.
Unlike Bag of Words, where we got a sparse matrix, here we get a number representing each word in the vocabulary. Once we are done with this step we are ready for the Keras word embedding layer.
from tensorflow.keras.preprocessing.text import one_hot

vocab_size = 1000
# `sentences` is the list of cleaned sentences; one_hot hashes each word to an index below vocab_size
onehot_representation = [one_hot(sent, vocab_size) for sent in sentences]
Neural networks require inputs that have the same shape and size. However, when we pre-process text and feed the sentences to our model, some sentences are naturally longer or shorter than others.
This is where padding comes in: we define a maximum number of words per sentence, shorter sentences are padded with zeros, and longer sentences have some words dropped.
# import pad_sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad the integer sequences (here the one-hot representation from above) to a common length;
# sequences produced by a fitted Tokenizer's texts_to_sequences() can be padded the same way
padded = pad_sequences(onehot_representation, padding="post", truncating="post", maxlen=8)
padding="post"
Adds zeros at the end of each sequence so that all samples have the same size. The argument can be either "pre" or "post"; depending on its value the 0's are added at the beginning or at the end of the sequence.
maxlen=8
This defines the maximum number of words per sentence. By default the maximum length is that of the longest sentence. When a sentence exceeds this number of words, the extra words are dropped; with the default setting they are dropped from the beginning of the sentence.
truncating="post"
Setting truncating to "post" means that when a sentence exceeds the maximum number of words, the words at the end of the sentence are dropped, instead of the default behaviour of dropping words from the beginning.
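A tiny illustration of how these arguments behave:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]

print(pad_sequences(seqs, maxlen=5, padding="post", truncating="post"))
# [[1 2 3 0 0]
#  [4 5 6 7 8]]

print(pad_sequences(seqs, maxlen=5))  # defaults: padding="pre", truncating="pre"
# [[0 0 1 2 3]
#  [5 6 7 8 9]]
```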
Keras offers an Embedding layer that can be used for neural networks on text data.
It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
The Embedding layer is defined as the first hidden layer of a network. It must specify 3 arguments:
input_dim
This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
output_dim
This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
input_length
This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

sent_len = 8
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=sent_len))  # 10-dimensional embedding per word
model.compile("adam", "mse")
Once the model is built we can get the embedding vectors for the padded sentences with:
print(model.predict(padded))
Each sentence becomes an 8 x 10 array, i.e. one 10-dimensional vector per word position.
To read more about word embeddings, please refer to Jason Brownlee's blog.
Check the implementation in the Word Embedding notebook.