GitHub - cbd/classifier: An Erlang Bayesian Filter and Text Classifier

Overview

Classifier is a Bayesian text analyzer and classifier. Its goal is to provide a simple way to decide whether a given text should be considered legit or spam, based on a sample pool of texts you provide to the program.

Its logic is based on Paul Graham's "A Plan for Spam".
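
For readers unfamiliar with the essay, the step that combines per-token probabilities into a single score can be sketched as follows. This is only an illustration of the formula described in "A Plan for Spam", not classifier's actual code. The module name graham_sketch and the function combine/1 are hypothetical, and every input is assumed to lie strictly between 0 and 1.

```erlang
-module(graham_sketch).
-export([combine/1]).

%% Hypothetical helper, not part of classifier's API.
%% Combines per-token spam probabilities P1..Pn into one score:
%%   (P1 * ... * Pn) / (P1 * ... * Pn + (1 - P1) * ... * (1 - Pn))
%% Assumes every probability is strictly between 0 and 1, so the
%% denominator can never be zero.
-spec combine([float()]) -> float().
combine(Probs) ->
    Spam = lists:foldl(fun(P, Acc) -> P * Acc end, 1.0, Probs),
    Ham  = lists:foldl(fun(P, Acc) -> (1.0 - P) * Acc end, 1.0, Probs),
    Spam / (Spam + Ham).
```

In the essay, a text is treated as spam when this score exceeds 0.9, and a token that has never been seen is assigned 0.4.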

Contact Us

For questions or general comments regarding the use of this library, please use our public HipChat room.

If you find any bugs or have a problem while using this library, please open an issue in this repo (or a pull request :)).

You can also check out all of our open-source projects at inaka.github.io.

Features

You can tweak the parameters.
You can manually flag spam and false positives.
The program learns over time from new texts and updates the sample pools.

Configuration

You need to define classifier as a rebar dep or have some other way of including it in Erlang's code path.
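
As a rough sketch, a rebar dependency entry might look like the one below. It assumes the rebar 2 deps format, a repository URL derived from this repository's name, and a branch called master. Adjust all three to your setup.

```erlang
%% rebar.config
%% Assumptions: rebar 2 style deps, URL taken from the repository name,
%% and "master" as the branch to track.
{deps, [
  {classifier, ".*",
   {git, "https://github.com/cbd/classifier.git", "master"}}
]}.
```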

To configure classifier, set its application environment variables (probably in your app.config):

```
{classifier, [
  {update_probabilities_timeout, 300000}, %% milliseconds
  {default_probability, 0.4},
  {threshold_probability, 0.9},
  {max_text_tokens, 5},
  {minimun_appearances, 5}
]}
```

All the config params have a default value, so you can skip some or all of them in your config.
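
For instance, a complete app.config that overrides a single parameter and leaves the rest at their defaults could look like this. The value 0.95 is purely illustrative.

```erlang
%% app.config
%% Only threshold_probability is overridden here; the value 0.95 is an
%% arbitrary example. Every other parameter keeps its default.
[
  {classifier, [
    {threshold_probability, 0.95}
  ]}
].
```

A file like this is typically loaded by starting the node with `erl -config app`.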

Usage

First of all you need to start the app:

```
application:start(classifier)
```

The next step is training the classifier. You can train it whenever you want and as many times as you want, but it must be trained at least once before your app first uses it.
There are three ways to train it:

Passing a dir

```
classifier:train(Dir)
```

Where Dir is the path to a folder containing two subfolders, pos and neg, each holding files with texts to be analyzed. You can find an example in the priv/test dir.

Passing a text
```
classifier:train({Tag, text, Text})
```
Where Tag is 'pos' or 'neg' and Text is a string to be added to the Tag side.

Passing a text list

```
classifier:train({Tag, text_list, Texts})
```

Where Tag is 'pos' or 'neg' and Texts is a list of strings to be added to the Tag side.

Now you can ask the classifier to analyze and classify some text:

```
1> classifier:classify(Text).
acceptable
2> classifier:classify(AnotherText).
unacceptable
```

Every time the classifier classifies a text, it learns from the result by adding the analyzed text to its pool.
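
Putting the steps above together, a minimal end-to-end sketch might look like this. The module classifier_demo, its functions, the sample strings, and the deliver/quarantine atoms are all hypothetical. The sketch also assumes that pos holds legit texts and neg holds spam, and it ignores the return values of train, since they are not documented here.

```erlang
-module(classifier_demo).
-export([run/1, route/1]).

%% Hypothetical demo module. Assumes classifier and its dependencies are
%% on the code path, and that 'pos' means legit and 'neg' means spam.
run(Text) ->
    %% Start the app, tolerating the case where it is already running.
    case application:start(classifier) of
        ok -> ok;
        {error, {already_started, classifier}} -> ok
    end,
    %% Seed both sides of the sample pool (made-up sample texts).
    classifier:train({pos, text_list, ["see you at lunch tomorrow",
                                       "meeting notes attached"]}),
    classifier:train({neg, text, "cheap pills buy now"}),
    %% Classify the text and decide what to do with it.
    route(classifier:classify(Text)).

%% Map the classifier's verdict onto an application-level action.
route(acceptable)   -> deliver;
route(unacceptable) -> quarantine.
```

With so few training texts the verdict is not meaningful. A real sample pool needs many more entries on each side.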

Future Features

Persist the info
Use multiple processes to classify text and to update the state
