Bayesian Classifier

This program takes a text input and does its best to determine whether the input is negative or positive.
Accuracy is currently 89.7%, which is quite good. The dataset contains approximately 2000 reviews in total.

The dataset used is not my own; it belongs to the Association for Computational Linguistics (ACL), 2007.

A more detailed readme is also coming up!
This is an ongoing part-time project, so it might take a while to update this.

Bayesian What?

In statistics, Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features.

Bayes' theorem states that P(A|B) = P(B|A) * P(A) / P(B).

Where:

A is our word.
B is the class, either positive or negative.

In short, this program takes a string and tries to determine whether it is positive or negative, based on probability. For each word in a sentence, it calculates the probability of the word being positive or negative, and the class with the highest overall probability wins.
The naive part is the assumption that each word contributes its probability independently of the other words.
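To make the idea concrete, here is a minimal from-scratch sketch of a Naive Bayes text classifier. It is not the code in byers.py: the function names (`train`, `log_scores`, `classify`), the whitespace tokenizer, the add-one smoothing, and the toy reviews are all assumptions made for illustration. It sums log probabilities instead of multiplying raw probabilities, which is a common way to avoid numeric underflow.

```python
import math
from collections import Counter

def train(samples):
    """samples: list of (text, label) pairs. Returns the counts used for scoring."""
    word_counts = {}          # label -> Counter of words seen under that label
    label_counts = Counter()  # label -> number of training texts
    vocab = set()
    for text, label in samples:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def log_scores(text, model):
    """Return a log-probability score per label (up to a shared constant)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing: unseen words get a small
            # non-zero probability instead of wiping out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores

def classify(text, model):
    """The label with the highest score wins."""
    scores = log_scores(text, model)
    return max(scores, key=scores.get)

# Tiny made-up training set, for illustration only.
toy_samples = [
    ("great sound and great battery", "pos"),
    ("works great love it", "pos"),
    ("terrible battery broke fast", "neg"),
    ("bad sound terrible quality", "neg"),
]
toy_model = train(toy_samples)
print(classify("great battery", toy_model))     # pos
print(classify("terrible quality", toy_model))  # neg
```

Each word adds its own term to the score without looking at its neighbours, which is exactly the independence assumption described above.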

This is a supervised machine learning model.
Supervised means that it will not learn on its own; it can only learn from the labeled data it has been fitted to.
This classifier is based on electronics reviews from Amazon.

Improvements

Maybe a bigram approach (single words bad, two-word sequences good?)
Dataset improvements, is there a better one?
Language processing improvements
Better regexing or parsing
Better lemmatization (with tags?)
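As a rough illustration of the two-word-sequence idea from the list above, the sketch below turns a text into single-word features plus features for each adjacent word pair. The function name `unigrams_and_bigrams` and the underscore-joined feature format are hypothetical, and nothing here is taken from the project's code.

```python
def unigrams_and_bigrams(text):
    """Return single words plus adjacent word pairs joined by an underscore."""
    words = text.lower().split()
    # zip(words, words[1:]) pairs every word with the word that follows it.
    bigrams = [f"{first}_{second}" for first, second in zip(words, words[1:])]
    return words + bigrams

print(unigrams_and_bigrams("not good at all"))
# ['not', 'good', 'at', 'all', 'not_good', 'good_at', 'at_all']
```

With pairs as features, a phrase such as "not good" can be counted as one unit instead of as a negative word next to a positive one.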

How to run

To build the database, run:

$ python3 build_dataset.py

Once the database has been built, you don't have to build it again.

Then you can run the classifier:

$ python3 byers.py "text-to-classify"
(Option --proba) Displays the probabilities of the text being neg / pos.
(Option --benchmark) Displays the classifier accuracy in %, measured on an independent dataset.
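In general terms, an accuracy benchmark like this is the share of held-out, labeled texts that the classifier gets right. The sketch below shows that calculation only; the names `accuracy`, `toy_predict`, and `held_out` are hypothetical, the rule-based predictor is a stand-in for a real classifier, and the code is not how byers.py implements --benchmark.

```python
def accuracy(predict, labeled_samples):
    """Percentage of (text, label) pairs for which predict(text) == label."""
    correct = sum(1 for text, label in labeled_samples if predict(text) == label)
    return 100.0 * correct / len(labeled_samples)

def toy_predict(text):
    # Stand-in classifier for the example: flags a text as negative
    # only when it contains the word "bad".
    return "neg" if "bad" in text.lower() else "pos"

# Made-up held-out samples that were not used for training.
held_out = [
    ("bad battery", "neg"),
    ("love it", "pos"),
    ("not great", "neg"),
    ("works fine", "pos"),
]
print(f"{accuracy(toy_predict, held_out):.1f}%")  # 75.0%
```

Measuring on data the classifier has never seen matters because accuracy on the training set itself tends to look better than it really is.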

Requirements

You will need Python 3, installed from python.org or through your system's package manager.

And NLTK and nltk.wordnet

$ pip3 install nltk
$ python3
>>> import nltk
>>> nltk.download("wordnet")

And scikit-learn

$ pip3 install scikit-learn
