Mushroom Edibility Classifier

Authors: Benjamin Frizzell, Hankun Xiao, Yichi Zhang, Mingyang Zhang

This project is part of a data analysis demonstration for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.

Our aim is to use machine learning to classify mushrooms’ edibility, i.e., whether they are poisonous or edible.

About

In this project, we built and tuned a Support Vector Classifier to predict mushroom edibility. Each mushroom is classified as edible or poisonous based on attributes such as color, habitat, class, and others. On unseen test data, the final classifier achieved an overall accuracy of 0.99 and an F2 score (F-beta with beta = 2) of 0.99.

We also used a confusion matrix to break down the errors by class. The model made 12,174 correct predictions out of 12,214 test observations. The remaining 40 errors comprised 17 false negatives (a poisonous mushroom predicted as edible) and 23 false positives (an edible mushroom predicted as poisonous).

These results show promise for practical use. False negatives are the dangerous error, since they could result in someone consuming a poisonous mushroom; the F2 score weights recall more heavily than precision, which reflects this priority. False positives may lead to safe mushrooms being discarded unnecessarily, but they pose no safety risk. Further development is needed to improve the model's utility, focusing on enhancing performance and analyzing the incorrectly predicted cases.
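To make the metrics above concrete, here is a minimal sketch that recomputes the number of correct predictions and the implied accuracy from the reported error counts, and shows how the F-beta formula with beta = 2 penalizes false negatives more than false positives. It treats poisonous as the positive class, which matches the definition of a false negative used above. The README does not report the true-positive count, so the F2 inputs below are hypothetical and are not the project's results.

```python
# Counts reported in this README for the held-out test set.
N_TEST = 12_214
FALSE_NEGATIVES = 17  # poisonous mushrooms predicted as edible
FALSE_POSITIVES = 23  # edible mushrooms predicted as poisonous


def accuracy(n_total: int, fp: int, fn: int) -> float:
    """Fraction of observations that were classified correctly."""
    return (n_total - fp - fn) / n_total


def fbeta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from confusion-matrix counts (beta > 1 favours recall)."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


correct = N_TEST - FALSE_POSITIVES - FALSE_NEGATIVES
print(correct)                                                      # 12174
print(round(accuracy(N_TEST, FALSE_POSITIVES, FALSE_NEGATIVES), 4))  # 0.9967

# Hypothetical counts (NOT from the report): the same 20 errors cost more
# under F2 when they are false negatives than when they are false positives.
print(round(fbeta(tp=80, fp=20, fn=0), 4))  # 0.9524
print(round(fbeta(tp=80, fp=0, fn=20), 4))  # 0.8333
```

In the project itself these metrics would normally come from the fitted model's predictions; the hand-written functions here only illustrate the arithmetic.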

The dataset used in this project is the Secondary Mushroom Dataset created by Wagner, D., Heider, D., and Hattab, G., from the UCI Machine Learning Repository. This dataset contains 61,069 hypothetical mushrooms with caps based on 173 species (353 mushrooms per species). Each mushroom is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (the latter class was combined with the poisonous class).
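The label handling described above can be sketched as a simple mapping from three edibility categories to a binary target. The label strings below are hypothetical placeholders chosen for illustration; the dataset's actual encoding may differ. The last lines check that 173 species with 353 mushrooms each gives the stated 61,069 rows.

```python
# Hypothetical label strings (assumed for illustration only).
EDIBLE = "edible"
POISONOUS = "poisonous"
UNKNOWN = "unknown_not_recommended"


def to_binary_target(label: str) -> str:
    """Collapse the three edibility categories into edible vs. poisonous."""
    if label == EDIBLE:
        return EDIBLE
    if label in (POISONOUS, UNKNOWN):
        # Unknown edibility is treated as poisonous, the cautious choice.
        return POISONOUS
    raise ValueError(f"unexpected label: {label!r}")


labels = [EDIBLE, UNKNOWN, POISONOUS, EDIBLE]
binary = [to_binary_target(label) for label in labels]
print(binary)  # ['edible', 'poisonous', 'poisonous', 'edible']

# Sanity check on the dataset size stated above.
N_SPECIES, PER_SPECIES = 173, 353
print(N_SPECIES * PER_SPECIES)  # 61069
```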

Report

The final report can be found here.

Dependencies

Usage

Note: This project is fully reproducible using our Docker container.

Before you start

Clone the repository to your local machine from GitHub.
Launch Docker Desktop and ensure it is running in the background (for Windows and macOS users).

To replicate the analysis

Navigate to the root directory of this project in your terminal.
Launch the JupyterLab environment using the following command:

docker compose up

In the terminal output, find a URL that starts with http://127.0.0.1:8888/lab?token=... (see the highlighted text in the terminal below for an example). Copy and paste this URL into your browser. Ensure no other JupyterLab environment is running at the same time.

To run the analysis, open a terminal in the JupyterLab environment and run the following command:

make all

By default, the report is generated from the existing tables, models, and other intermediate results. To run the analysis from scratch, run the following command before make all:

make clean-all

NOTE: Models will be trained and produced from scratch, which can take some time.

To exit and clean up the environment

Press Ctrl + C in the terminal where you launched the container to shut it down, then run docker compose rm in the same terminal to remove the stopped container and clean up its resources.

Developer notes

Developer dependencies

conda (version 24.11.0 or higher)
conda-lock (version 2.5.7 or higher)

Adding a new dependency

Add the required dependency to the environment.yml file in a new branch.
Use the following command to regenerate the conda-linux-64.lock file:

conda-lock -k explicit --file environment.yml -p linux-64

Locally rebuild the Docker image to confirm that it builds and functions as expected.
Push your updates to GitHub. This will trigger an automated build and push of the new Docker image to Docker Hub, tagged with the SHA of the commit containing the changes.
Update the docker-compose.yml file in your branch to reference the new container image, ensuring the tag is correctly updated.
Open a pull request to merge your changes into the main branch.
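The tag update in docker-compose.yml is easy to get wrong by hand, so it could be scripted. The following is a minimal sketch, not part of this repository: the function name retag_image, the image name example/mushroom_classifier, and the compose layout shown are all hypothetical, since the actual docker-compose.yml is not reproduced here.

```python
import re


def retag_image(compose_text: str, image: str, new_tag: str) -> str:
    """Return compose_text with the tag on `image: <image>:<tag>` replaced."""
    pattern = re.compile(
        r"^([ \t]*image:[ \t]*" + re.escape(image) + r"):\S+[ \t]*$",
        re.MULTILINE,
    )
    # A function replacement avoids backslash-escape surprises in new_tag.
    updated, count = pattern.subn(lambda m: f"{m.group(1)}:{new_tag}", compose_text)
    if count == 0:
        raise ValueError(f"no 'image: {image}:<tag>' line found")
    return updated


# Hypothetical compose file contents, for illustration only.
compose = (
    "services:\n"
    "  analysis:\n"
    "    image: example/mushroom_classifier:abc1234\n"
    "    ports:\n"
    '      - "8888:8888"\n'
)

new_compose = retag_image(compose, "example/mushroom_classifier", "def5678")
print(new_compose)
```

The function raises an error rather than returning the text unchanged when no matching image line is found, so a mistyped image name does not silently leave the old tag in place.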

License

The project is licensed under:

Codebase, Reports, and Visualizations: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0). See the LICENSE file.

References

Dheeru, D., & Karra Taniskidou, E. (2017). Secondary Mushroom Dataset. UCI Machine Learning Repository. Retrieved from https://archive.ics.uci.edu/dataset/848/secondary+mushroom+dataset

Scikit-learn developers. (n.d.). QuantileTransformer. Scikit-learn. Retrieved November 21, 2024, from https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.QuantileTransformer.html

Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3), 90–95.

McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 51–56.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., … van der Walt, S. J. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17, 261–272.
