This is a program that takes an input and tries it's best to determine if the input is negative or positive.
Accuracy is now 89.7% which is quite good. Dataset size approx 2000 reviews in total.
The dataset used is not my own and belongs to the Association of Computational Linguistics (ACL), 2007
More detailed readme also coming up!
This is a ongoing part-time project so it might take a while to update this.
In statistics, Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features.
Where:
- A is our word.
- B is the either positive or negative.
In short, this program takes a string and tries to determine if it is positive or negative, based on probability. For each word in a sentence, it calculates the probability of the word being positive or negative. And the highest probability wins.
The naive part is the assumption that each word is examined as an independent probability.
This is a supervised machine learning model.
Supervised means that it will not learn on it's own, it can only learn by data it has been fitted to.
This classifier is based on electronics reviews from amazon.
- Maybe binomial approach (Single word bad, 2 word sequence good?)
- Dataset improvements, is there a better one?
- Language processing improvements
- Better regexing or parsing
- Better lemmatization (with tags?)
To build database, run:
$ python3 build_dataset.py
Once the database has been built, you don't have to build it again.
Then you can run the classifier:
$ python3 byers.py "text-to-classify"
(Option --proba) Displays probabilities of text being neg / pos.
(Option --benchmark) Displays classifier accuracy in % measured by an independent dataset.
You will need python3
$ pip install python3
And NLTK and nltk.wordnet
$ pip3 install nltk
$ python3
>>> import nltk
>>> nltk.download("wordnet")
And scikit-learn
$ pip3 install scikit