Although the standard Perceptron gives good results despite its simplicity, it has some critical weaknesses. Suppose, for example, that we want to classify the XOR function. It is immediately evident that no plane can separate the positive examples from the negative ones without making an error. Since the data are not linearly separable, the algorithm keeps producing a different plane at every update, and the final one is essentially determined at random by the moment the iteration limit stops the training.
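A minimal sketch of this behaviour in plain NumPy (not the project's implementation): the update loop below never reaches an error-free pass on XOR, so the returned plane is whatever the epoch cap leaves behind.

```python
import numpy as np

# XOR: no single hyperplane separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

w, b = np.zeros(2), 0.0
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:  # misclassified: move the plane
            w, b = w + yi * xi, b + yi
            errors += 1
    if errors == 0:
        break

# errors never reaches 0 here: the final (w, b) depends only on where we stop.
print(epoch + 1, errors, w, b)
```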
Suppose now that we train the Perceptron and obtain, after a few iterations, a satisfactory classifier that correctly predicts the next 5000 submitted data points. If the very next example is then misclassified, the plane must be updated despite its previous accuracy. To limit these situations, whenever a plane has to change, the number c of consecutive correct classifications it achieved is saved. In this way, at test time, the sign of an example can be determined by weighing the contribution of each stored plane, according to the formula:
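$$\hat{y}(\mathbf{x}) = \operatorname{sign}\left(\sum_{k=1}^{K} c_k \, \operatorname{sign}(\mathbf{w}_k \cdot \mathbf{x})\right)$$

where $K$ is the number of planes produced during training and $c_k$ is the number of consecutive correct classifications achieved by the $k$-th plane before it was replaced; this is the prediction rule described in the Freund and Schapire paper listed in the references. A minimal sketch of that prediction step, assuming the planes are stored as `(w, b, c)` triples (the project's internal representation may differ):

```python
import numpy as np

def voted_predict(planes, x):
    # planes: list of (w, b, c) triples; c is the number of consecutive
    # correct classifications the plane (w, b) achieved before being replaced.
    total = sum(c * np.sign(np.dot(w, x) + b) for w, b, c in planes)
    return 1 if total >= 0 else -1
```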
As expected, the experiments revealed a dependence on the order in which the data are presented as input. For the same problem, different seeds can therefore produce very different performance with the standard version, while the voted one remains stable. More details, summarized in a table, can be found in the final report.
The project relies on the following libraries:
- Scikit-Learn, to obtain the 20 Newsgroups dataset and to transform the text into numeric input (see the sketch after this list).
- NumPy, to perform vectorized operations.
- Memory Profiler, to keep track of memory usage.
- PrettyTable, for nicely formatted confusion matrices.
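As an illustration of that text-to-numbers step, one possible Scikit-Learn pipeline looks like this (the vectorizer choice and the category pair below are only an assumption, not necessarily what util.py does):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Download two categories and turn the raw posts into a sparse numeric matrix.
train = fetch_20newsgroups(subset='train',
                           categories=['rec.autos', 'rec.motorcycles'],
                           remove=('headers', 'footers', 'quotes'))
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)  # shape: (n_documents, n_terms)
y_train = train.target                          # 0/1 category labels
```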
Experiments can be launched from the `test.py` file, which contains three category pairs as an example. In general, the categories can be chosen from the following list (an example selection is shown right after it):
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- misc.forsale
- talk.politics.misc
- talk.politics.guns
- talk.politics.mideast
- talk.religion.misc
- alt.atheism
- soc.religion.christian
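For instance, a pair could be defined as follows; the two names are an arbitrary choice from the list above, and the `categories` variable is the one passed to the calls shown below:

```python
categories = ['rec.sport.baseball', 'rec.sport.hockey']
```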
In the two main functions it is possible to change the `max_iter` and `seed` parameters, in order to control the number of passes over the training data and to obtain different scenarios depending on how the data are shuffled.
```python
perceptron.test_default(categories, max_iter=10, seed=8)
perceptron.test_voted(categories, max_iter=10, seed=8)
```
Although it is not recommended, within the `util.py` file it is possible to include additional elements of the original text, such as headers, footers and quotes, by dropping the last attribute from the calls below:
```python
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True,
                           random_state=seed, remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', categories=categories,
                          remove=('headers', 'footers', 'quotes'))
```
To get a graphical overview of memory usage, run:

```bash
mprof run test.py
mprof plot
```
- *Large Margin Classification Using the Perceptron Algorithm*, Y. Freund and R. E. Schapire
- *Working with Text Data*, Scikit-Learn documentation
- *A Course in Machine Learning*, Hal Daumé III