This repository contains the practical assignment for the "Machine Learning" course. The project investigates how well various classification algorithms suit the spam email detection problem, using the Ling-Spam dataset available here.
- Document the attributes and labels of the dataset, as well as the process of extracting them from the textual representation. Point out the cue in the file names: the "spm" prefix marks spam messages.
- Use the nine folders part1 to part9 for training and hold out one folder (part10) for testing, within each category (bare, lemm, lemm_stop, stop).
- Choose and implement an algorithm, among those studied, that you consider suitable for solving the spam classification problem.
- Justify the algorithm choice in a LaTeX report, both theoretically and experimentally. Include a comparison with other candidate algorithms.
- Implement and present results using the Leave-One-Out cross-validation strategy, including a statistical graph.
- Add to the report a graph illustrating the algorithm's accuracy on the test dataset. The accuracy should be significantly better than that of trivial strategies (random guessing or always predicting the same class). Include comparative graphs if you tested multiple algorithms.
- Explain any relevant experiment details, either in text or through graphs. Investigate and implement improved variants of the algorithm studied in the seminar in order to raise accuracy.
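
The data-handling requirements above (labels from file names, the part1 to part9 / part10 split, a trivial baseline, and Leave-One-Out evaluation) can be sketched as follows. This is a minimal illustration, not the assignment's reference solution: the function names, the `.txt` extension, the `latin-1` encoding, and the `fit_predict` callback are all assumptions made for the example, and the classifier itself is left to the reader.

```python
from collections import Counter
from pathlib import Path


def label_from_filename(name: str) -> int:
    """Return 1 (spam) if the file name starts with 'spm', else 0 (legitimate)."""
    return 1 if name.startswith("spm") else 0


def load_split(category_dir: Path, test_part: str = "part10"):
    """Read the messages of one category folder and split them by part folder.

    Assumes a layout of category_dir/partN/*.txt (hypothetical; adjust to
    the actual archive). Returns two lists of (text, label) pairs.
    """
    train, test = [], []
    for part_dir in sorted(category_dir.glob("part*")):
        if not part_dir.is_dir():
            continue
        target = test if part_dir.name == test_part else train
        for path in sorted(part_dir.glob("*.txt")):
            # latin-1 is an assumption: it decodes any byte sequence without error.
            text = path.read_text(encoding="latin-1")
            target.append((text, label_from_filename(path.name)))
    return train, test


def majority_baseline_accuracy(train_labels, test_labels) -> float:
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


def leave_one_out_accuracy(samples, labels, fit_predict) -> float:
    """Leave-One-Out accuracy: hold out each sample once, train on the rest.

    fit_predict(train_x, train_y, x) is a hypothetical callback that trains
    on the given data and returns the predicted label for x.
    """
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)
```

Any classifier whose test accuracy does not clearly exceed `majority_baseline_accuracy` has not beaten the constant-class strategy mentioned above. Note that Leave-One-Out retrains once per sample, so it is best suited to algorithms that are cheap to fit.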