In this project, a Support Vector Classifier was built and tuned to identify mushroom edibility. A mushroom is classified as edible or poisonous from features such as color, shape, habitat, and season. The final classifier performed well on unseen test data, with an overall accuracy of 0.99 and an $F_{\beta}$ score (with $\beta = 2$) of 0.99. A confusion matrix shows how the classifier's errors are distributed between the two classes. The model makes 12174 correct predictions out of 12214 test observations. 17 mistakes were predicting a poisonous mushroom as edible (false negatives), while 23 mistakes were predicting an edible mushroom as poisonous (false positives). The model’s performance shows promise for implementation because it keeps false negatives, which could result in someone consuming a poisonous mushroom, to a small number. False positives may lead to unnecessarily discarding safe mushrooms, but they pose no safety risk. Further development is needed before this model is useful in practice. Future work should focus on improving performance and analyzing the incorrectly predicted cases.
Mushrooms are a common food that is rich in vitamins and minerals. However, not all mushrooms are safe to eat: many are poisonous, and distinguishing edible from poisonous mushrooms with the naked eye is difficult. Our aim is to use machine learning to identify mushroom edibility. In this project, three methods are used to predict the edibility of mushrooms: Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), and Logistic Regression.
The dataset used in this project is the Secondary Mushroom Dataset created by Wagner, D., Heider, D., & Hattab, G., from the UCI Machine Learning Repository. This dataset contains 61069 hypothetical mushrooms with caps based on 173 species (353 mushrooms per species). Each mushroom is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (the latter class was combined with the poisonous class).
The mushroom dataset is roughly balanced, with 56% poisonous and 44% edible mushrooms. Variables with more than 15% missing values were dropped, because imputing a variable with a large proportion of missing data might introduce too much noise or bias, making it unreliable. The remaining numeric variables were transformed toward a normal distribution and the categorical variables were one-hot encoded. The data was split with 80% partitioned into the training set and 20% into the test set. Three classification models, Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), and Logistic Regression, were used to predict whether a mushroom is edible or poisonous. The fine-tuned Support Vector Classifier had the best overall performance. Hyperparameters were chosen using 5-fold cross-validation with the $F_{\beta}$ score as the classification metric. $\beta$ was set to 2 to increase the weight on recall during model selection, because predicting a mushroom to be edible when it is in fact poisonous could have severe health consequences. The goal is therefore to prioritize minimizing false negatives. The Python programming language (Van Rossum and Drake, 2009) and the following Python packages were used to perform the analysis: Matplotlib (Hunter, 2007), Pandas (McKinney, 2010), Scikit-learn (Pedregosa et al., 2011), NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and UCIMLRepo.
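To make the role of $\beta$ concrete, the sketch below computes $F_{\beta}$ directly from confusion-matrix counts using the standard definition $F_{\beta} = (1+\beta^2)\,TP / ((1+\beta^2)\,TP + \beta^2\,FN + FP)$, with "poisonous" as the positive class. The helper name `f_beta` and the counts are hypothetical and chosen only for illustration; they are not results from this analysis.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta from confusion-matrix counts (positive class = poisonous)."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical counts: the same number of errors, placed on different sides.
ten_false_negatives = f_beta(tp=90, fp=0, fn=10)   # poisonous called edible
ten_false_positives = f_beta(tp=90, fp=10, fn=0)   # edible called poisonous

# With beta = 2 the false negatives cost more than the false positives.
print(ten_false_negatives, ten_false_positives)

# With beta = 1 (the usual F1 score) both error types are weighted equally.
print(f_beta(90, 0, 10, beta=1.0), f_beta(90, 10, 0, beta=1.0))
```

With $\beta = 2$, ten false negatives lower the score more than ten false positives do, which is the behavior wanted when missing a poisonous mushroom is the costlier mistake.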
The EDA shows that all numeric columns in the mushroom dataset are right-skewed, some of them heavily. QuantileTransformer is used as a robust preprocessing step because it can map skewed or heavy-tailed distributions to a more Gaussian-like shape and reduce the impact of outliers.
+OneHotEncoder is applied to the categorical features in the mushroom dataset, because each feature has relatively few categories and the categories are unordered. One-hot encoding keeps all of the information in these features. Since the ring-type feature has missing values, they were filled in with a "missing" class. Treating missing values as a distinct category provides a way to model the absence of data directly, which can be valuable because missingness itself might carry information.
# fetch dataset as pandas DataFrames
+secondary_mushroom=fetch_ucirepo(id=848)
+X=secondary_mushroom.data.features
+y=secondary_mushroom.data.targets
+
Before splitting the data into test and training sets, we want to check for missing values in each column to determine whether they can be used in our model.
+
+
+
+
+
+
+
+
+
In [3]:
+
+
+
# Check the missing values
+missing_values=X.isnull().sum().reset_index()
+missing_values.columns=['Column','Missing Count']
+
+# Highlight values with a gradient
+styled_missing=missing_values.style.format(
+ precision=0
+).background_gradient(
+ subset=['Missing Count'],
+ cmap='YlOrRd'
+).set_caption("Missing Values by Column")
+
+# Display the styled DataFrame
+display(styled_missing)
+
After examining the data set, we decided to drop columns with a high proportion of missing values (over 15%), which include cap-surface, gill-attachment, gill-spacing, stem-root, stem-surface, veil-type, veil-color, and spore-print-color.
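The drop step itself is not shown in this export, so the following is only a minimal sketch of how a 15% missing-value threshold could be applied with pandas. The toy frame, its values, and the names `toy`, `THRESHOLD`, `to_drop`, and `reduced` are assumptions for illustration, not the notebook's actual code.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the feature table X (values are made up).
toy = pd.DataFrame({
    "cap-diameter": [5.0, 7.2, 3.1, 9.4, 6.6],
    "veil-type": [np.nan, np.nan, "u", np.nan, np.nan],  # 80% missing
    "habitat": ["d", "g", "d", "d", "l"],                # complete
})

THRESHOLD = 0.15  # drop columns with more than 15% missing values

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Columns whose missing share exceeds the threshold.
to_drop = missing_share[missing_share > THRESHOLD].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

Working with the missing share rather than the raw count keeps the rule independent of the number of rows.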
+
+
+
+
+
+
+
+
+
In [5]:
+
+
+
# Split the data into training and test sets
+
+X_train,X_test,y_train,y_test=train_test_split(
+ X,y,test_size=0.2,random_state=123
+)
+
----------------------------------------
+
+Frequency and Percentage for 'cap-color':
+cap-color  Frequency  Percentage
+n              19399       39.71
+y               6841       14.00
+w               6149       12.59
+g               3490        7.14
+e               3225        6.60
+o               2943        6.02
+r               1429        2.92
+p               1391        2.85
+u               1370        2.80
+k                999        2.04
+b                969        1.98
+l                650        1.33
+
----------------------------------------
+
+Frequency and Percentage for 'does-bruise-or-bleed':
+does-bruise-or-bleed  Frequency  Percentage
+f                         40424       82.74
+t                          8431       17.26
+
----------------------------------------
+
+Frequency and Percentage for 'gill-color':
+gill-color  Frequency  Percentage
+w               14878       30.45
+n                7716       15.79
+y                7655       15.67
+p                4738        9.70
+g                3310        6.78
+f                2825        5.78
+o                2312        4.73
+k                1909        3.91
+r                1134        2.32
+e                 825        1.69
+u                 807        1.65
+b                 746        1.53
+
----------------------------------------
+
+Frequency and Percentage for 'stem-color':
+stem-color  Frequency  Percentage
+w               18445       37.75
+n               14410       29.50
+y                6283       12.86
+g                2071        4.24
+o                1762        3.61
+e                1603        3.28
+u                1167        2.39
+f                 849        1.74
+p                 837        1.71
+k                 676        1.38
+r                 430        0.88
+l                 182        0.37
+b                 140        0.29
+
----------------------------------------
+
+Frequency and Percentage for 'has-ring':
+has-ring  Frequency  Percentage
+f             36564       74.84
+t             12291       25.16
+
The initial assessment of X_train showed no missing values in the remaining features except for ring-type (1,998 missing values). The proportion of missing values in this feature is small (about 4% of the training rows), and dropping the column could lose potentially valuable information and introduce bias, which might reduce the accuracy of the classifier. Therefore, we decided to retain this column and impute ring-type in the data preprocessing phase.
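As a small illustration of the imputation described above, the sketch below fills missing ring-type values with a constant "missing" category using scikit-learn's `SimpleImputer`. The toy column and the names `toy`, `imputer`, and `filled` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column standing in for ring-type (values are made up).
toy = pd.DataFrame({"ring-type": ["f", np.nan, "e", "f", np.nan]}, dtype=object)

# Replace every missing entry with the literal category "missing".
imputer = SimpleImputer(strategy="constant", fill_value="missing")
filled = imputer.fit_transform(toy)

print(filled.ravel().tolist())
```

After this step the placeholder behaves like any other category, so a one-hot encoder can give it its own indicator column.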
To understand the numeric features in the data set, we plotted histograms for each numeric column in X_train, which helps identify distribution patterns and detect skewness or outliers. The numeric columns plotted are cap-diameter, stem-height, and stem-width.
----------------------------------------
+
+Frequency and Percentage for 'ring-type':
+ring-type  Frequency  Percentage
+f              38562       82.30
+e               1968        4.20
+z               1735        3.70
+r               1145        2.44
+l               1127        2.41
+p               1031        2.20
+g               1003        2.14
+m                286        0.61
+
----------------------------------------
+
+Frequency and Percentage for 'habitat':
+habitat  Frequency  Percentage
+d            35401       72.46
+g             6327       12.95
+l             2508        5.13
+m             2346        4.80
+h             1611        3.30
+w              290        0.59
+p              279        0.57
+u               93        0.19
+
----------------------------------------
+
+Frequency and Percentage for 'season':
+season  Frequency  Percentage
+a           24116       49.36
+u           18322       37.50
+w            4255        8.71
+s            2162        4.43
+
Based on the histograms, here are our findings for each feature being plotted.
+
+
cap-diameter: The distribution is highly skewed to the right, with most values concentrated between 0 and 10 cm. There are also some outliers sitting at around 40 to 60 cm.
+
+
stem-height: Slightly right-skewed distribution. The majority of mushrooms have stem heights between 4 and 10 cm, with few having stem heights over 20 cm.
+
+
stem-width: Another heavily right-skewed distribution, with the majority of mushrooms having a stem width below 20 mm, and some rare cases exceeding 50 mm.
+
+
+
The skewness observed across the 3 numeric features will be addressed in the preprocessing phase with QuantileTransformer from sklearn.preprocessing which maps data to a normal distribution while retaining the relative rank of values, making them more suitable for models sensitive to feature distributions, such as SVC and LogisticRegression.
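The effect described above can be checked on synthetic data. The sketch below draws a right-skewed log-normal sample (an assumption standing in for a column like cap-diameter, not the real data), applies `QuantileTransformer` with `output_distribution='normal'`, and compares skewness before and after while confirming that the ordering of values is unchanged.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed feature (log-normal), one column, 1000 rows.
rng = np.random.default_rng(123)
skewed = rng.lognormal(mean=1.0, sigma=0.8, size=(1000, 1))

# Map the empirical quantiles onto a standard normal distribution.
qt = QuantileTransformer(output_distribution="normal", random_state=123)
transformed = qt.fit_transform(skewed)

print("skewness before:", skew(skewed.ravel()))
print("skewness after: ", skew(transformed.ravel()))

# The mapping is monotone: sorting by the original values also sorts the
# transformed values, so relative ranks are retained.
order = np.argsort(skewed.ravel())
print(np.all(np.diff(transformed.ravel()[order]) >= 0))
```

Because only ranks matter to the transformer, extreme values are pulled in toward the tails of the normal curve instead of dominating the scale of the feature.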
To understand the categorical features in the data set, we analyzed their frequency and percentage distributions, providing insights into the variability and class imbalance that might occur for each feature.
+
+
+
+
+
+
+
+
+
In [8]:
+
+
+
categorical_columns = X_train.select_dtypes(include='object')  # Select only categorical columns
+
+# Calculate frequency and percentage for each categorical feature
+for column in categorical_columns.columns:
+    print(f"Frequency and Percentage for '{column}':")
+
+    # Frequency
+    frequency = X_train[column].value_counts()
+    # Percentage
+    percentage = round(X_train[column].value_counts(normalize=True) * 100, 2)
+
+    # Combine into one DataFrame
+    freq_percent_df = pd.DataFrame({
+        "Frequency": frequency,
+        "Percentage": percentage
+    })
+
+    # Highlight values with a gradient
+    styled_df = freq_percent_df.style.format(
+        precision=2
+    ).background_gradient(
+        subset=['Percentage'],
+        cmap='YlOrRd'
+    )
+
+    # Display the styled DataFrame
+    display(styled_df)
+    print("-" * 40, '\n')
+
+
+
+
+
+
+
+
+
+
+
+
+
----------------------------------------
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Based on the Frequency and Percentage distributions, here are our findings:
+
+
cap-shape: The most common cap shape is x (convex), comprising 43.97% of the data. Other shapes like f (flat) and s (sunken) are also prevalent, while c (conical) is the least common at 2.95%.
+
+
cap-color: The most frequent color is n (brown), with 39.71% of the data. Other colors like y (yellow), w (white), and g (gray) are also well represented, while rare colors like b (buff) and l (blue) each appear in less than 2% of the data.
+
+
does-bruise-or-bleed: The majority of the mushrooms are f (do not bruise or bleed), at 82.74%, while those that do (t) make up 17.26% of the data.
+
+
gill-color: The most common gill color is w (white), with 30.45% of the data. Other colors such as n (brown) and y (yellow) are also frequent, while rare gill colors like e (red), b (buff) and u (purple) appear in less than 2% of the data.
+
+
stem-color: w (white) and n (brown) are the dominating stem colors, accounting for 37.75% and 29.5% of the data, respectively. Other colors like r (green), l (blue) and b (buff) are less frequent, appearing in less than 1% of the observations.
+
+
has-ring: Most mushrooms are f (do not have a ring), at 74.84% of observations. The remaining 25.16% are t (have a ring).
+
+
ring-type: f (none) is the most common ring type, accounting for 82.3% of the data. Other types like e (evanescent) and z (zone) are less frequent, while rare types like m (movable) occur in less than 1% of the data.
+
+
habitat: The predominant habitat is d (woods), at 72.46% of the data. Other habitats such as g (grasses) and l (leaves) are less common, while w (waste), p (paths), and u (urban) each make up less than 1% of the data.
+
+
season: Most mushrooms grow in a (autumn), comprising 49.36% of the data, followed by u (summer) at 37.5%. The other two seasons w (winter) and s (spring) are less frequent.
+
+
+
Categorical features will be one-hot encoded in the preprocessing phase with OneHotEncoder. The data contains a mix of binary and non-binary categorical features. Features with exactly two unique values, such as does-bruise-or-bleed and has-ring, are encoded with the drop='if_binary' argument so that each becomes a single 0/1 column, which removes redundancy without losing information.
The target variable class represents whether a mushroom is p (poisonous) or e (edible). Understanding the distribution of the target helps assess class balance, which might affect the models' performance.
+
+
+
+
+
+
+
+
+
In [9]:
+
+
+
# Frequency
+frequency=y_train.value_counts()
+# Percentage
+percentage=round(y_train.value_counts(normalize=True)*100,2)
+
+# Combine into one DataFrame
+freq_percent_df=pd.DataFrame({
+ "Frequency":frequency,
+ "Percentage":percentage
+})
+freq_percent_df
+
+
+
+
+
+
+
+
+
+
+
Out[9]:
+class  Frequency  Percentage
+p          27143       55.56
+e          21712       44.44
+
Based on the Frequency and Percentage distribution, here are our findings:
+
+
p (Poisonous): There are 27,143 instances of poisonous mushrooms, accounting for 55.56% of the data.
+
+
e (Edible): There are 21,712 instances of edible mushrooms, constituting 44.44% of the data.
+
+
+
Because the two types of error have very different costs, the models will be evaluated with $F_{\beta}$, precision, recall, and the confusion matrix rather than with accuracy alone.
Three classification models including Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), and Logistic Regression are used to predict whether a mushroom is edible or poisonous. Predicting a mushroom to be edible when it is in fact poisonous could have severe health consequences. Therefore the best model should prioritize the minimization of this error. To do this, we can evaluate models on an $F_{\beta}$ score with $\beta = 2$.
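The tuning cells are not included in this export, so the following is only a sketch of how an $F_2$ scorer could be wired into cross-validation with scikit-learn. The scoring key `f2_score` is an assumption inferred from the `mean_test_f2_score` column reported later; the synthetic data and the names `X_toy`, `y_toy`, and `cv` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_validate

# Synthetic two-class data with string labels like the mushroom target.
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 3))
y_toy = np.where(X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0, "p", "e")

# F-beta with beta = 2, treating 'p' (poisonous) as the positive class.
scoring = {
    "accuracy": "accuracy",
    "f2_score": make_scorer(fbeta_score, beta=2, pos_label="p"),
}

# 5-fold cross-validation reporting both metrics.
cv = cross_validate(LogisticRegression(), X_toy, y_toy, cv=5, scoring=scoring)
mean_f2 = cv["test_f2_score"].mean()
print(cv["test_accuracy"].mean(), mean_f2)
```

The same `scoring` dictionary can be passed to a hyperparameter search, in which case the metric used to pick the best candidate has to be named through the search's `refit` argument.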
+
+
+
+
+
+
+
+
+
In [10]:
+
+
+
# loading in some models
+from sklearn.neighbors import KNeighborsClassifier
+from sklearn.svm import SVC
+from sklearn.linear_model import LogisticRegression
+
+
+
+
+
+
+
+
+
+
+
In [11]:
+
+
+
# importing required preprocessors, pipelines, etc.
+from sklearn.impute import SimpleImputer
+from sklearn.preprocessing import QuantileTransformer, OneHotEncoder
+from sklearn.compose import make_column_transformer
+from sklearn.pipeline import make_pipeline
+
+# converting targets to Series objects to avoid warnings
+y_train=y_train.squeeze()
+y_test=y_test.squeeze()
+
+# random state for reproducibility
+SEED=123
+
+# feature sets for each transformation
+numeric_cols=['cap-diameter','stem-height','stem-width']
+categorical_cols=['does-bruise-or-bleed','has-ring','cap-shape','cap-color','gill-color','stem-color','habitat','season']
+impute_cols=['ring-type']
+
+# creating transformers
+numeric_transformer=QuantileTransformer(output_distribution='normal',random_state=SEED)
+categorical_transformer=OneHotEncoder(drop='if_binary',handle_unknown='ignore',sparse_output=False)
+impute_transformer=make_pipeline(
+ SimpleImputer(strategy='constant',fill_value='missing'),
+ categorical_transformer
+)
+
+# final preprocessor
+preprocessor=make_column_transformer(
+ (numeric_transformer,numeric_cols),
+ (impute_transformer,impute_cols),
+ (categorical_transformer,categorical_cols)
+)
+preprocessor
+
# compiling hyperparameters and scores of best models into one dataframe
+cols=['params','mean_fit_time','mean_test_accuracy','std_test_accuracy','mean_test_f2_score','std_test_f2_score']
+final_results=pd.concat(
+    [pd.DataFrame(result.cv_results_).query('rank_test_f2_score == 1')[cols] for _, result in cv_results.items()]
+)
+# row labels must follow the order of the searches in cv_results (see the params column)
+final_results.index=['Logistic Regression','SVC','KNN']
+final_results
+
+
+
+
+
+
+
+
+
+
+
Out[19]:
+                     params                                              mean_fit_time  mean_test_accuracy  std_test_accuracy  mean_test_f2_score  std_test_f2_score
+Logistic Regression  {'logisticregression__C': 0.05784745785308777}          0.353165            0.747313           0.002863            0.780611           0.003412
+SVC                  {'svc__C': 20.74024196289186, 'svc__gamma': 0....      41.681297            0.996582           0.000226            0.997112           0.000218
+KNN                  {'kneighborsclassifier__n_neighbors': 327}              0.128557            0.932310           0.001731            0.937919           0.001838
+
After tuning the hyperparameters, the Logistic Regression model has a mean accuracy of 0.75 and a mean $F_{\beta}$ score of 0.78 in cross-validation. The KNN model (n_neighbors = 327) has a mean accuracy of 0.93 and a mean $F_{\beta}$ score of 0.94. The SVC (C ≈ 20.74) outperforms both, with a mean accuracy and a mean $F_{\beta}$ score of 0.99 (about 0.997 for each), at the cost of a much longer mean fit time (about 42 seconds, versus under 1 second for the other two models). Since recall on poisonous mushrooms is the highest priority, SVC is the preferred model for identifying edible or poisonous mushrooms.
+
+
+
+
+
+
+
+
+
In [20]:
+
+
+
best_model=cv_results['svc'].best_estimator_
+best_model.fit(X_train,y_train)
+
+# confusion matrix on the training set
+ConfusionMatrixDisplay.from_estimator(
+ best_model,
+ X_train,
+ y_train
+)
+
+
+
+
+
+
+
+
+
+
+
Out[20]:
+
+
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x30728d220>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
In [21]:
+
+
+
# Finally, report the test score and confusion matrix
+y_test_predict=best_model.predict(X_test)
+
+test_f2_score=fbeta_score(y_test,y_test_predict,beta=2,pos_label='p')
+test_accuracy=accuracy_score(y_test,y_test_predict)
+print(f'Test F2-Score: {test_f2_score}\nTest Accuracy: {test_accuracy}')
+
+
+
+
+
+
+
+
+
+
+
+
+
Test F2-Score: 0.9973021849337405
+Test Accuracy: 0.9967250695922711
+
+
+
+
+
+
+
+
+
+
+
In [22]:
+
+
+
# Plotting confusion matrix for test set
+ConfusionMatrixDisplay.from_predictions(
+ y_test,
+ y_test_predict
+)
+
+
+
+
+
+
+
+
+
+
+
Out[22]:
+
+
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x309e63dd0>
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
The prediction model performed well on the test data, with an overall accuracy of 0.99 and an $F_{\beta}$ score of 0.99. The model makes only 40 mistakes out of 12214 test samples: 17 were poisonous mushrooms predicted as edible (false negatives), and 23 were edible mushrooms predicted as poisonous (false positives). This performance is promising for implementation. False negatives are the safety-critical errors, since they could lead to someone consuming a poisonous mushroom, and the model keeps them to 17 cases. False positives are less harmful: they may lead to discarding safe mushrooms unnecessarily, but they do not endanger anyone.
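The headline numbers can be reproduced from the reported counts alone. The short check below uses only the figures quoted in the text (12214 test samples, 17 false negatives, 23 false positives); the variable names are illustrative.

```python
# Counts reported for the test set.
total = 12214
false_negatives = 17   # poisonous predicted as edible
false_positives = 23   # edible predicted as poisonous

errors = false_negatives + false_positives
correct = total - errors
accuracy = correct / total

print(errors, correct, round(accuracy, 4))
```

This gives 40 errors, 12174 correct predictions, and an accuracy of about 0.9967, matching the printed test accuracy.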
+
While the overall performance of the SVC model is impressive, future work could focus on further reducing false negatives to make predictions safer. A useful first step is a closer look at the 40 misclassified observations to identify which features contribute to the errors. Feature engineering on those features, such as encoding rare categories differently, may improve the model and reduce misclassifications. Additionally, other classifiers such as Decision Tree and Random Forest, which are less sensitive to feature scaling and irrelevant features, might improve predictions.
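One possible starting point for that error analysis is sketched below: collect the misclassified rows, separate out the false negatives, and look at how categories are distributed among the errors. The toy features, labels, and predictions are invented for illustration; in the notebook the analogous inputs would be `X_test`, `y_test`, and `y_test_predict`.

```python
import pandas as pd

# Toy test features, true labels, and predictions (all made up).
X_toy = pd.DataFrame({
    "habitat": ["d", "g", "d", "l"],
    "season": ["a", "u", "a", "w"],
})
y_true = pd.Series(["p", "e", "p", "e"], name="class")
y_pred = pd.Series(["p", "p", "e", "e"], name="predicted")

# Rows where the prediction disagrees with the true label.
errors = X_toy.assign(actual=y_true, predicted=y_pred)[y_true != y_pred]

# False negatives: poisonous mushrooms predicted as edible.
false_negatives = errors[(errors["actual"] == "p") & (errors["predicted"] == "e")]

# Share of each habitat among the errors, to compare with the overall shares.
habitat_share = errors["habitat"].value_counts(normalize=True)

print(errors)
print(false_negatives)
print(habitat_share)
```

Comparing category shares among the errors with the shares in the full training set is a simple way to spot rare categories that are over-represented in the mistakes.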
Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3), 90–95.
+
McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 51–56.
+
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
+
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362.
+
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., … van der Walt, S. J. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17, 261–272.
+
+
+
+
+
+
+
diff --git a/notebooks/.ipynb_checkpoints/Load_Data_and_EDA-checkpoint.ipynb b/notebooks/.ipynb_checkpoints/Load_Data_and_EDA-checkpoint.ipynb
new file mode 100644
index 0000000..c1c34a3
--- /dev/null
+++ b/notebooks/.ipynb_checkpoints/Load_Data_and_EDA-checkpoint.ipynb
@@ -0,0 +1,3289 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "1b40693f-6d55-4951-96a5-48a10ccb6773",
+ "metadata": {},
+ "source": [
+ "# Mushroom Edibility Classification Using Feature-Based Machine Learning Approach"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "18590e2e-138a-4ed1-821b-8a1850fdce9b",
+ "metadata": {},
+ "source": [
+ "by Benjamin Frizzell, Hankun Xiao, Essie Zhang, Mason Zhang 2024/11/23"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "81a65442-e81c-4e9d-9755-885bb2aebac9",
+ "metadata": {},
+ "source": [
+ "#### Import Library"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "0d2500b5-022e-4ff0-818c-ad1013efb69d",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "from ucimlrepo import fetch_ucirepo \n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import pandera as pa\n",
+ "from pandera import Check\n",
+ "from deepchecks import Dataset\n",
+ "import json\n",
+ "import logging\n",
+ "import matplotlib.pyplot as plt\n",
+ "from sklearn.model_selection import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f49175e2-5eab-4b03-816c-f20995c50c96",
+ "metadata": {},
+ "source": [
+ "## Summary"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1f39a05-24c5-4a5e-a34c-830e8efeee78",
+ "metadata": {},
+ "source": [
+ "In this project, a Support Vector Classifier was built and tuned to identify mushroom edibility. A mushroom is classified as edible or poisonous from features such as color, shape, habitat, and season. The final classifier performed well on unseen test data, with an overall accuracy of 0.99 and an $F_{\\beta}$ score (with $\\beta = 2$) of 0.99. A confusion matrix shows how the classifier's errors are distributed between the two classes. The model makes 12174 correct predictions out of 12214 test observations. 17 mistakes were predicting a poisonous mushroom as edible (false negatives), while 23 mistakes were predicting an edible mushroom as poisonous (false positives). The model’s performance shows promise for implementation because it keeps false negatives, which could result in someone consuming a poisonous mushroom, to a small number. False positives may lead to unnecessarily discarding safe mushrooms, but they pose no safety risk. Further development is needed before this model is useful in practice. Future work should focus on improving performance and analyzing the incorrectly predicted cases."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5626172b-fdbc-486d-ba3d-550432375290",
+ "metadata": {},
+ "source": [
+ "## Introduction"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d6ca1f13-3cd2-4bc6-bfae-4d0d6440e1b7",
+ "metadata": {},
+ "source": [
+ "Mushrooms are a common food that is rich in vitamins and minerals. However, not all mushrooms are safe to eat: many are poisonous, and distinguishing edible from poisonous mushrooms with the naked eye is difficult. Our aim is to use machine learning to identify mushroom edibility. In this project, three methods are used to predict the edibility of mushrooms: Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), and Logistic Regression. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "921597ef-c12e-4c4c-b8bf-b1eb20e90814",
+ "metadata": {},
+ "source": [
+ "## Methods"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a0920cdf-10c1-4151-bbf3-a689486257dd",
+ "metadata": {},
+ "source": [
+ "### Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "549db552-150b-4744-ba90-497604b5b601",
+ "metadata": {},
+ "source": [
+ "The dataset used in this project is the Secondary Mushroom Dataset created by Wagner, D., Heider, D., & Hattab, G., from the UCI Machine Learning Repository. This dataset contains 61069 hypothetical mushrooms with caps based on 173 species (353 mushrooms per species). Each mushroom is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (the latter class was combined with the poisonous class)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "809c0908-7030-437e-bb4c-56bdf0066119",
+ "metadata": {},
+ "source": [
+ "### Analysis"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "614cbeca-8401-49b3-8eed-cce9dc41d292",
+ "metadata": {},
+ "source": [
+ "The mushroom dataset is roughly balanced, with 56% poisonous and 44% edible mushrooms. Variables with more than 15% missing values were dropped, because imputing a variable with a large proportion of missing data might introduce too much noise or bias, making it unreliable. The remaining numeric variables were transformed toward a normal distribution and the categorical variables were one-hot encoded. The data was split with 80% partitioned into the training set and 20% into the test set. Three classification models, Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), and Logistic Regression, were used to predict whether a mushroom is edible or poisonous. The fine-tuned Support Vector Classifier had the best overall performance. Hyperparameters were chosen using 5-fold cross-validation with the $F_{\\beta}$ score as the classification metric. $\\beta$ was set to 2 to increase the weight on recall during model selection, because predicting a mushroom to be edible when it is in fact poisonous could have severe health consequences. The goal is therefore to prioritize minimizing false negatives. The Python programming language (Van Rossum and Drake, 2009) and the following Python packages were used to perform the analysis: Matplotlib (Hunter, 2007), Pandas (McKinney, 2010), Scikit-learn (Pedregosa et al., 2011), NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and UCIMLRepo."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "58647071-18ff-44cb-9f2d-fd243555cff0",
+ "metadata": {},
+ "source": [
+ "## Results & Discussion"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f94469b3-143e-4a67-8c22-5bfa73baeccc",
+ "metadata": {},
+ "source": [
+ "The EDA shows that all numeric columns in the mushroom dataset are right-skewed, some of them heavily. `QuantileTransformer` is used as a robust preprocessing step because it can map skewed or heavy-tailed distributions to a more Gaussian-like shape and reduce the impact of outliers.\n",
+ "`OneHotEncoder` is applied to the categorical features in the mushroom dataset, because each feature has relatively few categories and the categories are unordered. One-hot encoding keeps all of the information in these features. Since the ring-type feature has missing values, they were filled in with a \"missing\" class. Treating missing values as a distinct category provides a way to model the absence of data directly, which can be valuable because missingness itself might carry information."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7558f2ed-854e-492b-8a71-7e37cdecf1f3",
+ "metadata": {},
+ "source": [
+ "#### Load Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "20a3d74b-174b-420e-9745-6d68f9d7da5f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# fetch dataset as pandas DataFrames\n",
+ "secondary_mushroom = fetch_ucirepo(id=848) \n",
+ "X = secondary_mushroom.data.features \n",
+ "y = secondary_mushroom.data.targets "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e4fa6da5-7876-43ba-b5ff-e12a66c78c75",
+ "metadata": {},
+ "source": [
+ "##### Before splitting the data into test and training sets, we want to check for missing values in each column to determine whether they can be used in our model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "eeee4dd9-6fa9-47d3-b86f-e128e791a96e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "
\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Check the missing values\n",
+ "missing_values = X.isnull().sum().reset_index()\n",
+ "missing_values.columns = ['Column', 'Missing Count']\n",
+ "\n",
+ "# Highlight values with a gradient\n",
+ "styled_missing = missing_values.style.format(\n",
+ " precision=0\n",
+ ").background_gradient(\n",
+ " subset=['Missing Count'],\n",
+ " cmap='YlOrRd'\n",
+ ").set_caption(\"Missing Values by Column\")\n",
+ "\n",
+ "# Display the styled DataFrame\n",
+ "display(styled_missing)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "302deb34-f082-41b2-bd10-7b20ba0b3dbd",
+ "metadata": {},
+ "source": [
+ "The initial assessment of `X_train` showed no missing values in the remaining features except for `ring-type`. The proportion of missing values in this feature is small, and dropping the column could lose potentially valuable information and introduce bias, which might reduce the accuracy of the classifier. Therefore, we decided to retain this column and impute `ring-type` in the data preprocessing phase. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9dd6a42f-b507-4d81-aa68-1274d20872c1",
+ "metadata": {},
+ "source": [
+ "##### Part 2: The distribution of numeric features"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dc6a5a2b-84f8-402a-ac8c-0bb727ec5c13",
+ "metadata": {},
+ "source": [
+ "To understand the numeric features in the data set, we plotted histograms for each numeric column in `X_train`, which helps identify the distribution patterns as well as detecting any skewness or outliers. The numeric columns being plotted are `cap-diameter`, `stem-height`, and `stem-width`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "c7c4cb2e-0f89-44fd-8faa-11088dd290e2",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "",
+ "text/plain": [
+ "
"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "numeric_columns = X_train.select_dtypes(include='number') # Select only numeric columns\n",
+ "\n",
+ "for column in numeric_columns.columns:\n",
+ " plt.figure(figsize=(5,5))\n",
+ " plt.hist(X_train[column], bins=15, edgecolor='black', alpha=0.7)\n",
+ " plt.title(f'Histogram of {column}')\n",
+ " plt.xlabel(column)\n",
+ " plt.ylabel('Frequency')\n",
+ " plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "717bbe9d-18ee-4463-9a52-9128566851d5",
+ "metadata": {},
+ "source": [
+ "Based on the histograms, here are our findings for each feature being plotted.\n",
+ "\n",
+ "1. `cap-diameter`: The distribution is highly skewed to the right, with most values concentrated between 0 and 10 cm. There are also some outliers sitting at around 40 to 60 cm. \n",
+ "\n",
+ "2. `stem-height`: Slightly right-skewed distribution. The majority of mushrooms have stem heights between 4 and 10 cm, with few having stem heights over 20 cm.\n",
+ "\n",
+ "3. `stem-width`: Another heavily right-skewed distribution, with the majority of mushrooms having a stem width below 20 mm and some rare cases exceeding 50 mm.\n",
+ "\n",
+ "The skewness observed across the 3 numeric features will be addressed in the preprocessing phase with `QuantileTransformer` from `sklearn.preprocessing`, which maps each feature to a target distribution (a normal distribution when `output_distribution='normal'` is set) while preserving the rank order of values. This makes the features more suitable for models that are sensitive to feature distributions, such as `SVC` and `LogisticRegression`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1b80b242-d2c4-48ba-9124-0f1a75233cf7",
+ "metadata": {},
+ "source": [
+ "##### Part 3: The distribution of categorical features"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc0e83a8-e064-416d-9043-c9a340d24f18",
+ "metadata": {},
+ "source": [
+ "To understand the categorical features in the data set, we analyzed their frequency and percentage distributions, which gives insight into the variability and category imbalance within each feature."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "c5d9010f-313a-4abf-a50c-e55c3b64b186",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Frequency and Percentage for 'cap-shape':\n"
+ ]
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n"
+ ],
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "---------------------------------------- \n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "categorical_columns = X_train.select_dtypes(include='object') # Select only categorical columns\n",
+ "\n",
+ "# Calculate frequency and percentage for each categorical feature\n",
+ "for column in categorical_columns.columns:\n",
+ " print(f\"Frequency and Percentage for '{column}':\")\n",
+ " \n",
+ " # Frequency\n",
+ " frequency = X_train[column].value_counts()\n",
+ " # Percentage\n",
+ " percentage = round(X_train[column].value_counts(normalize=True) * 100, 2)\n",
+ " \n",
+ " # Combine into one DataFrame\n",
+ " freq_percent_df = pd.DataFrame({\n",
+ " \"Frequency\": frequency,\n",
+ " \"Percentage\": percentage\n",
+ " })\n",
+ "\n",
+ " # Highlight values with a gradient\n",
+ " styled_df = freq_percent_df.style.format(\n",
+ " precision=2\n",
+ " ).background_gradient(\n",
+ " subset=['Percentage'],\n",
+ " cmap='YlOrRd'\n",
+ " )\n",
+ "\n",
+ " # Display the styled DataFrame\n",
+ " display(styled_df)\n",
+ " print(\"-\" * 40, '\\n')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "12823bac-ddf4-47d7-bda9-785b424a1837",
+ "metadata": {},
+ "source": [
+ "Based on the Frequency and Percentage distributions, here are our findings:\n",
+ "\n",
+ "1. `cap-shape`: The most common cap shape is `x` (convex), comprising 43.97% of the data. Other shapes like `f` (flat) and `s` (sunken) are also prevalent, while `c` (conical) is the least common at 2.95% of the data.\n",
+ "\n",
+ "2. `cap-color`: The most frequent color is `n` (brown), with 39.71% of the data. Other colors like `y` (yellow), `w` (white), and `g` (gray) are also well represented, while rare colors like `b` (buff) and `l` (blue) each appear in less than 2% of the data.\n",
+ "\n",
+ "3. `does-bruise-or-bleed`: The majority of the mushrooms are `f` (do not bruise or bleed), while those that do bruise or bleed make up 17.26% of the data.\n",
+ "\n",
+ "4. `gill-color`: The most common gill color is `w` (white), with 30.45% of the data. Other colors such as `n` (brown) and `y` (yellow) are also frequent, while rare gill colors like `e` (red), `b` (buff) and `u` (purple) appear in less than 2% of the data.\n",
+ "\n",
+ "5. `stem-color`: `w` (white) and `n` (brown) are the dominating stem colors, accounting for 37.75% and 29.5% of the data, respectively. Other colors like `r` (green), `l` (blue) and `b` (buff) are less frequent, appearing in less than 1% of the observations.\n",
+ "\n",
+ "6. `has-ring`: Most mushrooms are `f` (do not have a ring), accounting for 74.84% of observations. The remaining 25.16% of mushrooms are `t` (have a ring).\n",
+ "\n",
+ "7. `ring-type`: `f` (none) is the most common ring type, accounting for 82.3% of the data. Other types like `e` (evanescent) and `z` (zone) are less frequent, while rare types like `m` (movable) occur in less than 1% of the data.\n",
+ "\n",
+ "8. `habitat`: The predominant habitat is `d` (woods), at 72.46% of the data. Other habitats such as `g` (grasses) and `l` (leaves) are less common, while `w` (waste), `p` (paths), and `u` (urban) each make up less than 1% of the data.\n",
+ "\n",
+ "9. `season`: Most mushrooms grow in `a` (autumn), comprising 49.36% of the data, followed by `u` (summer) at 37.5%. The other two seasons, `w` (winter) and `s` (spring), are less frequent.\n",
+ "\n",
+ "Categorical features will be one-hot encoded in the preprocessing phase with `OneHotEncoder`. Because the data contains a mix of binary and non-binary categorical features, we pass the `drop='if_binary'` argument so that features with only two unique values, such as `does-bruise-or-bleed` and `has-ring`, are encoded as a single column. This reduces redundancy while still capturing the information."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a8a13abe-906b-4230-8772-2d799a51a857",
+ "metadata": {},
+ "source": [
+ "##### Part 4: The distribution of the target"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e175de25-1893-40f0-a794-1153470d7230",
+ "metadata": {},
+ "source": [
+ "The target variable `class` represents whether a mushroom is `p` (poisonous) or `e` (edible). Understanding the distribution of the target helps assess class balance, which can affect model performance."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "9e47fdb0-f94a-4777-a3c1-954e7d62202a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th>target</th><th>Frequency</th><th>Percentage</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>p</th><td>26996</td><td>55.41</td></tr>\n",
+ "    <tr><th>e</th><td>21726</td><td>44.59</td></tr>\n",
+ "  </tbody>\n",
+ "</table>"
+ ],
+ "text/plain": [
+ " Frequency Percentage\n",
+ "target \n",
+ "p 26996 55.41\n",
+ "e 21726 44.59"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Frequency\n",
+ "frequency = y_train.value_counts()\n",
+ "# Percentage\n",
+ "percentage = round(y_train.value_counts(normalize=True) * 100, 2)\n",
+ "\n",
+ "# Combine into one DataFrame\n",
+ "freq_percent_df = pd.DataFrame({\n",
+ " \"Frequency\": frequency,\n",
+ " \"Percentage\": percentage\n",
+ "})\n",
+ "freq_percent_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "10f25100-5adc-46c3-8503-6faebcc29100",
+ "metadata": {},
+ "source": [
+ "Based on the Frequency and Percentage distribution, here are our findings:\n",
+ "\n",
+ "1. `p` (Poisonous): There are 26,996 instances of poisonous mushrooms, accounting for 55.41% of the training data.\n",
+ "\n",
+ "2. `e` (Edible): There are 21,726 instances of edible mushrooms, constituting 44.59% of the training data.\n",
+ "\n",
+ "Because the two classes are not perfectly balanced, it is advisable to evaluate model performance with $F_{\\beta}$, precision, recall, or a confusion matrix rather than accuracy alone."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e6000a8d-7e7b-4fd1-aa27-b3c8074e6d91",
+ "metadata": {},
+ "source": [
+ "#### Preprocessing and Model Building\n",
+ "\n",
+ "Three classification models, Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), and Logistic Regression, are used to predict whether a mushroom is edible or poisonous. Predicting that a mushroom is edible when it is in fact poisonous could have severe health consequences, so the best model should prioritize minimizing this error. To do this, we evaluate models on an $F_{\\beta}$ score with $\\beta = 2$, which weights recall more heavily than precision and therefore penalizes missed poisonous mushrooms more than false alarms."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "2b26607e-2448-452d-8e0b-e25c56ceba44",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Import the three candidate classifiers\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "from sklearn.svm import SVC\n",
+ "from sklearn.linear_model import LogisticRegression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "32e69dc9-c9c6-4734-8f01-37a6813ef1d8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.