Created a static site for documentation #115

Merged 3 commits on Jan 12, 2017
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
Gemfile.lock
pkg
*.rdb
docs/_site/
docs/.sass-cache/
docs/.jekyll-metadata
2 changes: 1 addition & 1 deletion Dockerfile
@@ -4,7 +4,7 @@ MAINTAINER Sawood Alam <https://github.com/ibnesayeed>
ENV LANG C.UTF-8

RUN apt update && apt install -y libgsl0-dev && rm -rf /var/lib/apt/lists/*
-RUN gem install narray nmatrix gsl
+RUN gem install narray nmatrix gsl jekyll github-pages

RUN cd /tmp \
&& wget http://download.redis.io/redis-stable.tar.gz \
3 changes: 3 additions & 0 deletions docs/_config.yml
@@ -0,0 +1,3 @@
title: Classifier Reborn
description: Classifier Reborn is a general classifier module to allow Bayesian and other types of classifications.
theme: jekyll-theme-cayman
198 changes: 198 additions & 0 deletions docs/bayes.md
@@ -0,0 +1,198 @@
---
layout: default
---

# Bayesian Classifier

Bayesian Classifiers are accurate, fast, and have modest memory requirements.

**Note:** *Classifier only supports UTF-8 characters.*

## Basic Usage

```ruby
require 'classifier-reborn'

classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
classifier.train_interesting "Here are some good words. I hope you love them."
classifier.train_uninteresting "Here are some bad words, I hate you."
classifier.classify "I hate bad words and you." #=> 'Uninteresting'

classifier_snapshot = Marshal.dump classifier
# This is a string of bytes, you can persist it anywhere you like

# Marshal output is binary, so write and read the file in binary mode
File.open("classifier.dat", "wb") {|f| f.write(classifier_snapshot) }

# This is now saved to a file, and you can safely restart the application
data = File.binread("classifier.dat")
trained_classifier = Marshal.load data
trained_classifier.classify "I love" #=> 'Interesting'
```

## Redis Backend

Alternatively, a [Redis](https://redis.io/) backend can be used for persistence. The Redis backend has some advantages over the default Memory backend.

* The training data remains safe in case of application crash.
* A shared model can be trained and used for classification by more than one application (on one or more hosts).
* It scales better than local Memory.

These advantages come with an inherent performance cost though.
In our benchmarks we found the Redis backend (running on the same machine) about 40 times slower for training and classification than the default Memory backend (see [the benchmarks](https://github.com/jekyll/classifier-reborn/pull/98) for more details).

To enable the Redis backend, inject it when initializing the classifier, as illustrated below.

```ruby
require 'classifier-reborn'

redis_backend = ClassifierReborn::BayesRedisBackend.new
classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', backend: redis_backend

# Perform training and classification using the classifier instance
```

The above code will connect to the local Redis instance with the default configurations.
The Redis backend accepts the same arguments for initialization as the [redis-rb](https://github.com/redis/redis-rb) library.
The following example illustrates connection to a Redis instance with custom configurations.

```ruby
require 'classifier-reborn'

redis_backend = ClassifierReborn::BayesRedisBackend.new(host: "10.0.1.1", port: 6380, db: 2)
# Or
# redis_backend = ClassifierReborn::BayesRedisBackend.new url: "redis://user:secret@10.0.1.1:6380/2"
classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', backend: redis_backend

# Perform training and classification using the classifier instance
```

## Beyond the Basics

Beyond the basic example, the constructor and trainer can be used in a more flexible way to accommodate non-trivial applications.
Consider the following program.

```ruby
#!/usr/bin/env ruby

require 'classifier-reborn'

training_set = DATA.read.split("\n")
categories = training_set.shift.split(',').map{|c| c.strip}

# pass :auto_categorize option to allow feeding previously unknown categories
classifier = ClassifierReborn::Bayes.new categories, auto_categorize: true

training_set.each do |a_line|
next if a_line.empty? || '#' == a_line.strip[0]
parts = a_line.strip.split(':')
classifier.train(parts.first, parts.last)
end

puts classifier.classify "I hate bad words and you" #=> 'Uninteresting'
puts classifier.classify "I hate javascript" #=> 'Uninteresting'
puts classifier.classify "javascript is bad" #=> 'Uninteresting'

puts classifier.classify "all you need is ruby" #=> 'Interesting'
puts classifier.classify "i love ruby" #=> 'Interesting'

puts classifier.classify "which is better dogs or cats" #=> 'dog'
puts classifier.classify "what do I need to kill rats and mice" #=> 'cat'

__END__
Interesting, Uninteresting
interesting: here are some good words. I hope you love them
interesting: all you need is love
interesting: the love boat, soon we will be taking another ride
interesting: ruby don't take your love to town

uninteresting: here are some bad words, I hate you
uninteresting: bad bad leroy brown badest man in the darn town
uninteresting: the good the bad and the ugly
uninteresting: java, javascript, css front-end html
#
# train categories that were not pre-described
#
dog: dog days of summer
dog: a man's best friend is his dog
dog: a good hunting dog is a fine thing
dog: man my dogs are tired
dog: dogs are better than cats in soooo many ways

cat: the fuzz ball spilt the milk
cat: got rats or mice get a cat to kill them
cat: cats never come when you call them
cat: That dang cat keeps scratching the furniture
```
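
The program above splits each training line on every `:` and keeps only the last part, so a line whose text itself contains a colon would lose everything before that colon. A minimal sketch of a stricter parser follows; `parse_training_line` is a hypothetical helper written for this illustration and is not part of `classifier-reborn`.

```ruby
# Hypothetical helper: returns [category, text], or nil for lines to skip.
def parse_training_line(line)
  stripped = line.strip
  # Skip blank lines and comment lines starting with '#'.
  return nil if stripped.empty? || stripped.start_with?('#')

  # Split on the first colon only, so colons inside the text are preserved.
  category, text = stripped.split(':', 2).map(&:strip)
  # Skip lines that have no text after the category.
  return nil if text.nil? || text.empty?

  [category, text]
end

pair = parse_training_line('dog: note: dogs are loyal')
# pair is ['dog', 'note: dogs are loyal']
```

Each non-nil pair could then be passed to `classifier.train(category, text)` in the training loop.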

## Knowing the Score

When you ask a Bayesian classifier to classify text against a set of trained categories it does so by generating a score (as a Float) for each possible category.
The higher the score, the more closely your text fits that category.
The category with the highest score is returned as the best matching category.
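
To make the idea of per-category scores concrete, here is a toy scorer in plain Ruby. It is an illustration under simplifying assumptions (add-one smoothing, no category priors, naive tokenization) and is not how `ClassifierReborn` is implemented; `ToyBayes` and its methods are hypothetical names.

```ruby
# Toy naive Bayes scorer, for illustration only.
class ToyBayes
  def initialize
    # category => { word => count }
    @word_counts = Hash.new { |hash, category| hash[category] = Hash.new(0) }
  end

  def train(category, text)
    tokenize(text).each { |word| @word_counts[category][word] += 1 }
  end

  # Returns { category => score }. Each score is a sum of log probabilities,
  # so scores are negative Floats and the highest one is the best fit.
  def scores(text)
    vocabulary_size = @word_counts.values.flat_map(&:keys).uniq.size
    words = tokenize(text)
    @word_counts.each_with_object({}) do |(category, counts), result|
      total = counts.values.sum
      result[category] = words.sum(0.0) do |word|
        # Add-one smoothing keeps unseen words from producing log(0).
        Math.log((counts[word] + 1.0) / (total + vocabulary_size))
      end
    end
  end

  def classify(text)
    scores(text).max_by { |_category, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

toy = ToyBayes.new
toy.train('Interesting', 'good words love')
toy.train('Uninteresting', 'bad words hate')
toy_scores = toy.scores('I hate bad words')
best = toy.classify('I hate bad words') # 'Uninteresting'
```

Here `toy_scores` holds one Float per trained category, and `best` is simply the key with the largest value.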

In `ClassifierReborn` the methods `classifications` and `classify_with_score` give you access to the calculated scores.
The method `classify` only returns the best matching category.

Knowing the score allows you to do some interesting things.
For example, if your application is to generate tags for a blog post you could use the `classifications` method to get a hash of the categories and their scores.
You would sort on score and take only the top three or four categories as your tags for the blog post.

Alternatively, your application could establish a smallest acceptable score and use as tags only those categories whose score is greater than or equal to it.
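
Both tagging strategies can be sketched in a few lines. The example below assumes a Hash of category names to Float scores, as described for `classifications`; the `scores` values are made up for illustration and `top_tags` is a hypothetical helper, not part of the library.

```ruby
# Hypothetical scores, standing in for the Hash returned by `classifications`.
scores = {
  'Ruby'      => -12.4,
  'Databases' => -15.1,
  'Cooking'   => -31.8,
  'Devops'    => -14.0,
  'Travel'    => -29.3
}

# Keep categories scoring at least min_score, best first, at most `limit` of them.
def top_tags(scores, limit: 3, min_score: -Float::INFINITY)
  scores
    .select { |_category, score| score >= min_score }
    .sort_by { |_category, score| -score }
    .first(limit)
    .map(&:first)
end

top_by_rank   = top_tags(scores, limit: 3)
# ['Ruby', 'Devops', 'Databases']
top_by_cutoff = top_tags(scores, limit: 4, min_score: -15.0)
# ['Ruby', 'Devops']
```

A suitable `min_score` depends on your training data, so it is worth inspecting real scores before fixing a cutoff.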

What if you only use the `classify` method?
It does not show you the score of the best category.
How do you know that the best category is really any good?

You can use the threshold.

## Using the Threshold

Some applications can have only one category.
The application wants to know if the text being classified is of that category or not.
For example, consider a list of normal free-text responses to some question, or a URL string coming to your web application.
You know what a normal response looks like, but you have no idea how people might misuse the response.
So what you want to do is create a Bayesian classifier that just has one category, for example, `Good` and you want to know whether your text is classified as `Good` or `Not Good`.
Or suppose you just want the ability to have multiple categories and a `None of the Above` as a possibility.

### Setting up a Threshold

When you initialize the `ClassifierReborn::Bayes` classifier there are several options which can be set that control threshold processing.

```ruby
b = ClassifierReborn::Bayes.new(
'good', # one or more categories
enable_threshold: true, # default: false
threshold: -10.0 # default: 0.0
)
b.train_good 'Good stuff from Dobie Gillis'
# ...
text = 'Bad junk from Maynard G. Krebs'
result = b.classify text
if result.nil?
STDERR.puts "ALERT: This is not good: #{text}"
let_loose_the_dogs_of_war! # method definition left to the reader
end
```

In the `classify` method, when the best category for the text has a score that is either less than the established threshold or is `Float::INFINITY`, a nil category is returned.
When you see a nil value returned from the `classify` method, it means that none of the trained categories (regardless of how many categories were trained) has a score that is above or equal to the established threshold.
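
The rule described above can be written out as a small standalone function. This is a sketch of the described behavior only, not the library's source code; `apply_threshold` is a hypothetical name.

```ruby
# Returns the category when its score meets the threshold, otherwise nil.
def apply_threshold(category, score, threshold)
  # Reject scores below the threshold, and the special value Float::INFINITY.
  return nil if score < threshold || score == Float::INFINITY

  category
end

accepted = apply_threshold('Good', -4.2, -10.0)   # 'Good'
rejected = apply_threshold('Good', -25.0, -10.0)  # nil
```

Note that a score exactly equal to the threshold is accepted.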

### Threshold-related Convenience Methods

```ruby
b.threshold # get the current threshold
b.threshold = -10.0 # set the threshold
b.threshold_enabled? # Boolean: is the threshold enabled?
b.threshold_disabled? # Boolean: is the threshold disabled?
b.enable_threshold # enables threshold processing
b.disable_threshold # disables threshold processing
```

Using these convenience methods your applications can dynamically adjust threshold processing as required.

## References

* [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
* [Introduction to Bayesian Filtering](http://web.archive.org/web/20131205153329/http://www.process.com/precisemail/bayesian_filtering.htm)
* [Bayesian filtering](http://en.wikipedia.org/wiki/Bayesian_filtering)
* [A Plan for Spam](http://www.paulgraham.com/spam.html)
91 changes: 91 additions & 0 deletions docs/development.md
@@ -0,0 +1,91 @@
---
layout: default
---

# Development and Contributions

This library is released under the terms of the [LGPL-2.1](https://github.com/jekyll/classifier-reborn/blob/master/LICENSE).
Any derivative work or usage of the library needs to be compatible with LGPL-2.1; otherwise, write to the author to ask for permission.
See LICENSE text for more details.

## Code of Conduct

In order to have a more open and welcoming community, `Classifier Reborn` adheres to the `Jekyll`
[code of conduct](https://github.com/jekyll/jekyll/blob/master/CONDUCT.markdown) adapted from the `Ruby on Rails` code of conduct.

Please adhere to this code of conduct in any interactions you have in the `Classifier` community.
If you encounter someone violating these terms, please let [Chase Gilliam](https://github.com/Ch4s3) know and we will address it as soon as possible.

## Development Environment

To make changes to the gem locally, clone the repository or your fork.

```bash
$ git clone git@github.com:jekyll/classifier-reborn.git
$ cd classifier-reborn
$ bundle install
$ gem install redis
$ rake # To run tests
```

Some tests will be skipped if the Redis server is not running on the development machine.
To run all the test cases, first [install Redis](https://redis.io/topics/quickstart), then start the server and run the tests.

```bash
$ redis-server --daemonize yes
$ rake # To run tests
$ rake bench # To run benchmarks
```

Kill the `redis-server` daemon when done.

## Development using Docker

Provided that [Docker](https://docs.docker.com/engine/installation/) is installed on the development machine, clone the repository or your fork.
From the directory of the local clone, build a Docker image locally to set up the environment loaded with all the dependencies.

```bash
$ git clone git@github.com:jekyll/classifier-reborn.git
$ cd classifier-reborn
$ docker build -t classifier-reborn .
```

To run tests on the local code (before or after any changes) mount the current working directory inside the container at `/usr/src/app` and run the container without any arguments.
This step should be repeated each time a change in the code is made and a test is desired.

```bash
$ docker run --rm -it -v "$PWD":/usr/src/app classifier-reborn
```

A rebuild of the image would be needed only if the `Gemfile` or other dependencies change.
To run tasks other than test or to run other commands access the Bash prompt of the container.

```bash
$ docker run --rm -it -v "$PWD":/usr/src/app classifier-reborn bash
root@[container-id]:/usr/src/app# redis-server --daemonize yes
root@[container-id]:/usr/src/app# rake # To run tests
root@[container-id]:/usr/src/app# rake bench # To run benchmarks
root@[container-id]:/usr/src/app# pry
[1] pry(main)> require 'classifier-reborn'
[2] pry(main)> classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
```

## Documentation

To make changes to this documentation and to run it locally, run a Docker container with the following command.

```bash
$ docker run --rm -it -v "$PWD":/usr/src/app -w /usr/src/app/docs -p 4000:4000 classifier-reborn jekyll s -H 0.0.0.0
```

If the server runs as expected then the documentation should be available at [http://localhost:4000/](http://localhost:4000/).

## Authors and Contributors

* [Lucas Carlson](mailto:lucas@rufy.com)
* [David Fayram II](mailto:dfayram@gmail.com)
* [Cameron McBride](mailto:cameron.mcbride@gmail.com)
* [Ivan Acosta-Rubio](mailto:ivan@softwarecriollo.com)
* [Parker Moore](mailto:email@byparker.com)
* [Chase Gilliam](mailto:chase.gilliam@gmail.com)
* and [many more](https://github.com/jekyll/classifier-reborn/graphs/contributors)...
75 changes: 75 additions & 0 deletions docs/index.md
@@ -0,0 +1,75 @@
---
layout: default
---

# Getting Started

Classifier Reborn is a fork of [cardmagic/classifier](https://github.com/cardmagic/classifier) under more active development.
The library is released under the [LGPL-2.1](https://github.com/jekyll/classifier-reborn/blob/master/LICENSE).
Currently, it has [Bayesian](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Latent Semantic Indexer (LSI)](https://en.wikipedia.org/wiki/Latent_semantic_analysis) classifiers implemented.

Here is a quick example to illustrate the usage.

```bash
$ gem install classifier-reborn
$ irb
irb(main):001:0> require 'classifier-reborn'
irb(main):002:0> classifier = ClassifierReborn::Bayes.new 'Ham', 'Spam'
irb(main):003:0> classifier.train_ham "Sunday is a holiday. Say no to work on Sunday!"
irb(main):004:0> classifier.train_spam "You are the lucky winner! Claim your holiday prize."
irb(main):005:0> classifier.classify "What's the plan for Sunday?"
#=> "Ham"
```

Here is a line-by-line explanation of what we just did:

* Installed the `classifier-reborn` gem (assuming that [Ruby](https://www.ruby-lang.org/en/) is installed already).
* Started the Interactive Ruby Shell (IRB).
* Loaded the `classifier-reborn` gem in the interactive Ruby session.
* Created an instance of `Bayesian` classifier with two classes `Ham` and `Spam`.
* Trained the classifier with an example of `Ham`.
* Trained the classifier with an example of `Spam`.
* Asked the classifier to classify a text and got the response as `Ham`.

## Installation

To use `classifier-reborn` in your Ruby application add the following line into your application's `Gemfile`.

```ruby
gem 'classifier-reborn'
```

Then from your application's folder run the following command to install the gem and its dependencies.

```bash
$ bundle install
```

Alternatively, run the following command to manually install the gem.

```bash
$ gem install classifier-reborn
```

## Dependencies

The only runtime dependency of this gem is Roman Shterenzon's `fast-stemmer` gem. RubyGems should install it automatically. Otherwise, install it manually as follows.

```bash
$ gem install fast-stemmer
```

To speed up `LSI` classification by at least 10x, consider installing the following libraries.

* [GSL - GNU Scientific Library](http://www.gnu.org/software/gsl)
* [Ruby/GSL Gem](https://rubygems.org/gems/gsl)

Note that `LSI` will work without these libraries, but as soon as they are installed, the classifier will make use of them. No configuration changes are needed; we like to keep things ridiculously easy for you.

## Further Readings

For more information read the following documentation topics.

* [Bayesian Classifier](bayes)
* [Latent Semantic Indexer (LSI)](lsi)
* [Development and Contributions](development)