- In this project, I tried to train a Sentiment Analyzer/Text Classifier for Beyazperde movie reviews.
There are many resources for English Sentiment Analysis, but resources for Turkish Sentiment
Analysis are limited. I started this project to help increase the resources available for Turkish
Sentiment Analysis.
- Sentiment Analysis is a branch of Natural Language Processing. Projects in this field usually
start with some unlabeled data, and you try to predict which class each item belongs to. This process
is carried out in several steps.
- The first step is to get an adequate dataset so that the model can be trained effectively. Here, I have
sample movie reviews from Beyazperde. You can find them at this link.
- This is the crucial step in training any kind of Machine Learning model. Real-life data is
not always clean, so you must clean your dataset as much as possible. A common rule of thumb in
Machine Learning is that data preprocessing/cleaning takes about 80% of the overall work and
modelling about 20%. I therefore split data preprocessing into sub-steps.
- NaN values are not useful for training the model
- If the class sizes are unbalanced, the model tends to favor the larger class when predicting
- Stopwords and punctuation are unnecessary for training the model. For more stopwords, please contact me
- This corrects misspelled words and discards meaningless ones
- This removes the suffixes and gives us the root of each word
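The first three cleaning steps above can be sketched with Pandas as follows. This is only a minimal illustration: the column names, the toy reviews and the short stopword set are assumptions, not the project's actual data. Normalization and stemming are left out because they rely on the external tools mentioned at the end of this document.

```python
import string

import pandas as pd

# Hypothetical toy data; the real Beyazperde columns may be named differently.
df = pd.DataFrame({
    "review": [
        "Bu film çok güzeldi!",
        None,
        "Berbat bir filmdi...",
        "Harika ve etkileyici.",
        "Çok sıkıcı, hiç beğenmedim.",
        "Oyunculuk muhteşemdi.",
    ],
    "label": ["pos", "pos", "neg", "pos", "neg", "pos"],
})

# Hypothetical, deliberately tiny stopword set.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "çok", "hiç"}


def turkish_lower(text):
    # Python's default lower() maps "I" to "i"; Turkish needs "I" -> "ı" and "İ" -> "i".
    return text.replace("I", "ı").replace("İ", "i").lower()


def clean(text):
    text = turkish_lower(text)
    # Strip ASCII punctuation, then drop stopwords.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in TURKISH_STOPWORDS)


# 1) Drop rows with missing values.
df = df.dropna(subset=["review", "label"])

# 2) Balance the classes by downsampling every class to the smallest class size.
smallest = df["label"].value_counts().min()
df = df.groupby("label").sample(n=smallest, random_state=0)

# 3) Remove punctuation and stopwords.
df["clean_review"] = df["review"].apply(clean)
print(df[["clean_review", "label"]])
```

Downsampling is just one way to balance the classes; it is used here because it needs no extra library.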
- In this step, you choose a Machine Learning algorithm for Sentiment Analysis/Text Classification.
Many algorithms can be used, but I chose the Multinomial Naive Bayes algorithm because it tends to
give good accuracy scores on Sentiment Analysis/Text Classification. This algorithm assumes that,
within a class, the presence of a particular feature is unrelated to the presence of any other feature.
I also split data classification into sub-steps.
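For reference, the independence assumption leads to the following decision rule. The notation is introduced here for illustration only: $x_i$ is the count (or weight) of word $w_i$ in the review, $V$ is the vocabulary, $N_{ci}$ is the total count of $w_i$ in class $c$, $N_c = \sum_i N_{ci}$, and $\alpha$ is the smoothing hyperparameter.

```latex
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{|V|} x_i \log P(w_i \mid c) \right],
\qquad
P(w_i \mid c) = \frac{N_{ci} + \alpha}{N_c + \alpha \, |V|}
```

Because the features are treated as independent given the class, the per-word log-probabilities simply add up.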
2.3.1. SPLITTING TRAIN AND TEST DATA:
- Usually, we partition the dataset into 80% training data and 20% testing data
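A minimal sketch of an 80/20 split with scikit-learn's `train_test_split`. The review texts and labels are placeholders, and stratifying on the labels is an assumption added here to keep the class ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 10 reviews, 5 per class.
reviews = [f"review {i}" for i in range(10)]
labels = ["pos", "neg"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    reviews,
    labels,
    test_size=0.2,      # 20% of the data is held out for testing
    stratify=labels,    # keep the pos/neg ratio equal in both parts
    random_state=42,    # make the split reproducible
)

print(len(X_train), len(X_test))  # 8 2
```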
- There are several methods for turning text into numeric features, such as Bag Of Words, Count Vectorizer and Tfidf Vectorizer. I chose Tfidf Vectorizer.
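As a small illustration of what the Tfidf Vectorizer produces, the sketch below fits scikit-learn's `TfidfVectorizer` with default settings on three made-up documents. The settings used in the project itself are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already cleaned documents.
docs = ["film harika", "film berbat", "oyunculuk harika"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Each column corresponds to one vocabulary term.
print(sorted(vectorizer.vocabulary_))
print(X.toarray().round(2))
```

Words that appear in many documents (here "film" and "harika") get a lower weight than words that appear in only one.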
2.3.3. GRIDSEARCHCV:
- This method enables us to find the best hyperparameters for the model
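A hedged sketch of a grid search over a Tfidf Vectorizer plus Multinomial Naive Bayes pipeline. The parameter grid, the toy reviews and `cv=2` are assumptions chosen so that the example runs quickly; they are not the values used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data.
reviews = [
    "harika film", "çok güzel oyunculuk", "muhteşem hikaye", "harika oyunculuk",
    "berbat film", "çok sıkıcı hikaye", "kötü oyunculuk", "berbat hikaye",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Hypothetical grid: every combination below is tried with cross-validation.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(reviews, labels)

print(search.best_params_)
print(search.best_score_)
```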
- The model learns by fitting and analyzes the sentiment by predicting
2.3.5. OBSERVING ACCURACY, F1, PRECISION AND RECALL SCORES:
- These scores are useful for comparing the success of different models
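The four scores can be computed with `sklearn.metrics`. The labels below are invented purely to show the calls (1 = positive, 0 = negative); they are not results from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```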
2.3.6. OBSERVING CONFUSION MATRIX AND PREDICTION PROBABILITIES
- This gives us an intuition of how confidently the model makes the predictions
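A minimal sketch of both outputs on made-up data: the model learns with `fit`, predicts with `predict`, and `predict_proba` returns one probability per class for each review. The tiny training set is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training and test data.
train_reviews = ["harika film", "çok güzel", "berbat film", "çok sıkıcı"]
train_labels = ["pos", "pos", "neg", "neg"]
test_reviews = ["harika güzel", "berbat sıkıcı"]
test_labels = ["pos", "neg"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_reviews, train_labels)

predicted = model.predict(test_reviews)

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(test_labels, predicted, labels=["neg", "pos"])
print(cm)

# One row per review, one column per class (same order as model.classes_).
proba = model.predict_proba(test_reviews)
print(model.classes_)
print(proba.round(3))
```

A probability close to 0.5 means the model is unsure, while a value close to 1.0 means it is confident.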
2.3.7. TEN-FOLD CROSS VALIDATION:
- This enables us to train and test our model on different samples of the same dataset, so that we can
check whether it learned correctly
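Ten-fold cross validation can be sketched with `cross_val_score` and `cv=10`. The 20 toy reviews below are assumptions; they exist only so that each of the 10 folds contains both classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up data: 10 positive and 10 negative reviews.
positive = ["harika film", "çok güzel oyunculuk", "muhteşem bir hikaye",
            "çok beğendim", "etkileyici sahneler"]
negative = ["berbat film", "çok sıkıcı oyunculuk", "kötü bir hikaye",
            "hiç beğenmedim", "vasat sahneler"]
reviews = positive * 2 + negative * 2
labels = ["pos"] * 10 + ["neg"] * 10

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# The data is split into 10 folds; each fold is used once for testing
# while the model is trained on the other 9.
scores = cross_val_score(model, reviews, labels, cv=10)

print(scores)
print(scores.mean(), scores.std())
```

A large spread between the fold scores would suggest that the result depends strongly on which samples end up in the training part.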
- In this step, I create a pipeline for the model. Pipelining saves us from repeating all the steps again
and again. With the help of pipelining, when I give the model any raw unlabeled data, it first preprocesses
the data and then makes a prediction. This makes our model reusable.
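One possible way to build such a pipeline with scikit-learn is sketched below. The `clean` helper and its stopword set are hypothetical stand-ins for the real preprocessing steps, and the training reviews are made up.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical, deliberately tiny stopword set.
STOPWORDS = {"ve", "bir", "bu", "çok"}


def clean(text):
    # Hypothetical stand-in for the project's preprocessing:
    # Turkish-aware lowercasing, punctuation removal, stopword removal.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOPWORDS)


sentiment_pipeline = Pipeline([
    # The vectorizer calls clean() on every raw review before tokenizing it.
    ("tfidf", TfidfVectorizer(preprocessor=clean)),
    ("clf", MultinomialNB()),
])

# Made-up training data.
sentiment_pipeline.fit(
    ["Harika bir film!", "Çok güzel oyunculuk.", "Berbat bir film...", "Çok sıkıcı ve kötü."],
    ["pos", "pos", "neg", "neg"],
)

# Raw, uncleaned text goes straight in; preprocessing happens inside the pipeline.
prediction = sentiment_pipeline.predict(["Bu film HARİKA!"])
print(prediction)
```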
- Pickling a model means serializing it into a binary form, which makes our model portable. When you want to
use the model in a different project, you can simply load the pickled file and get predictions
wherever you want.
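A minimal sketch of saving and loading a model with Python's `pickle` module. The file name and the toy model are assumptions; the project's real file may be named differently.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up model so that the example is self-contained.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(["harika film", "çok güzel", "berbat film", "çok sıkıcı"],
          ["pos", "pos", "neg", "neg"])

# Hypothetical file location.
model_path = Path(tempfile.gettempdir()) / "sentiment_model.pkl"

# Save the fitted model in binary form.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another project: load it and predict.
# Only unpickle files from a source you trust.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

samples = ["harika güzel", "berbat sıkıcı"]
print(loaded_model.predict(samples))
```

The environment that loads the file needs a compatible scikit-learn version installed, since the pickle stores the model's objects rather than the library code.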
- The model score can be improved by enlarging the "Turkish Stopwords" list or the "Beyazperde Dataset".
In both cases, the model will be trained more effectively.
- In this project, I learned the concepts of Text Classification/Sentiment Analysis. It also gave
me a knowledge base for Natural Language Processing, and showed me that getting and preprocessing the
dataset is the crucial part of training any Machine Learning model.
- I completed this project in cooperation with Verius Technology Company. The training dataset (Beyazperde)
and the data preprocessing tools (normalization, stemming) were provided by them. I also used Python libraries
such as Sklearn, Pandas, Numpy and Nltk.