Merge pull request #41 from UBC-MDS/create-qmd
Create literate documents
Abdul-Rahmann authored Dec 8, 2024
2 parents a7674ff + 783b3a0 commit 5420ee6
Showing 9 changed files with 1,979 additions and 86 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -154,6 +154,9 @@ dmypy.json
# Cython debug symbols
cython_debug/

# Ignore the libs directory in wine_quality_files
report/wine_quality_eda_files/libs/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
File renamed without changes.
251 changes: 165 additions & 86 deletions conda-linux-64.lock

Large diffs are not rendered by default.

775 changes: 775 additions & 0 deletions docs/index.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions environment.yml
@@ -17,5 +17,6 @@ dependencies:
- mamba
- pandera=0.21.0
- click==8.1.7
- quarto==1.5.56
- pip:
- deepchecks==0.18.1
21 changes: 21 additions & 0 deletions references.bib
@@ -0,0 +1,21 @@
@article{cortez2009modeling,
title={Modeling wine preferences by data mining from physicochemical properties},
author={Cortez, Paulo and Cerdeira, Ant{\'o}nio and Almeida, Fernando and Matos, Telmo and Reis, Jos{\'e}},
journal={Decision Support Systems},
volume={47},
number={4},
pages={547--553},
year={2009},
publisher={Elsevier}
}

@article{boulesteix2007partial,
title={Partial least squares: A versatile tool for the analysis of high-dimensional genomic data},
author={Boulesteix, Anne-Laure and Strimmer, Korbinian},
journal={Briefings in Bioinformatics},
volume={8},
number={1},
pages={32--44},
year={2007},
publisher={Oxford University Press}
}
775 changes: 775 additions & 0 deletions report/wine_quality_eda.html

Large diffs are not rendered by default.

Binary file added report/wine_quality_eda.pdf
Binary file not shown.
239 changes: 239 additions & 0 deletions report/wine_quality_eda.qmd
@@ -0,0 +1,239 @@
---
title: "Wine Quality Prediction"
author:
- name: Chukwunonso Ebele-Muolokwu
- name: Ci Xu
- name: Samuel Adetsi
- name: Shashank Hosahalli Shivamurthy
format:
html:
toc: true
toc-depth: 2
number-sections: true
pdf:
toc: true
toc-depth: 2
number-sections: true
output-dir: "../"
bibliography: ../references.bib
crossref:
fig-title: Figure
tbl-title: Table
execute:
echo: false
warning: false
message: false
jupyter: python3
---

![Such an adorable couple](../img/wine2.jpg){#fig-wine width=100%}

# Summary

This project analyzes patterns in wine data through exploratory data analysis (EDA) and develops a predictive model for wine quality. The analysis uncovers relationships between key features and their influence on wine quality, visualizes distributions and correlations, and identifies significant predictors. A decision tree classifier is developed and optimized using cross-validation and hyperparameter tuning.

By leveraging machine learning techniques, we evaluated model performance with metrics like accuracy and F1-score, providing actionable insights for enhancing wine quality. The results offer a data-driven approach to understanding wine characteristics and their impact on quality, benefiting decision-making in winemaking and marketing.

# Introduction

## Background Information

The quality of wine plays a crucial role in the wine industry, as it directly affects consumer satisfaction, pricing, and demand. Traditionally, wine quality is determined through sensory analysis by trained experts, who evaluate factors such as taste, aroma, and texture. However, these evaluations are inherently subjective, costly, and time-consuming.

With advancements in data analysis and machine learning, it is now possible to model and predict wine quality using objective, measurable features. These features include chemical and physical attributes such as acidity, sugar levels, alcohol content, and more, which directly influence the sensory properties of wine.

## Research Question

The primary question we sought to answer in this project is: "Can the quality of wine be effectively predicted based on its measurable physicochemical properties? Additionally, which features are most influential in determining wine quality?"

This project aimed to explore whether measurable data about wine's chemical and physical properties could provide a reliable means of assessing its quality. By identifying the most important predictors of wine quality, we can gain insights into the production processes that have the greatest impact on consumer satisfaction.

## Methodology Overview

We utilized the Wine Quality Dataset from the UCI Machine Learning Repository, which contains information about Portuguese "Vinho Verde" wine. [@cortez2009modeling]

Our analysis involved:
- Data cleaning and preprocessing
- Exploratory data analysis
- Classification modeling using Decision Tree
- Hyperparameter tuning
- Model evaluation and feature importance analysis

# Data Preparation and Exploration

```{python}
import pandas as pd
import numpy as np
import altair as alt
import janitor
from ucimlrepo import fetch_ucirepo
import pandera as pa
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import os
alt.data_transformers.enable("vegafusion")
# Load and prepare data
wine_df = pd.read_csv('../data/raw/wine_quality_combined.csv')
wine_df = wine_df.clean_names().drop_duplicates()
```
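For readers unfamiliar with pyjanitor, `clean_names()` standardizes column headers to snake_case. A minimal pandas-only approximation of what it does (the toy column names below are illustrative, not the raw file's exact headers):

```python
import pandas as pd

# Toy frame with messy headers; pyjanitor's clean_names() roughly does this:
raw = pd.DataFrame({"Fixed Acidity": [7.4], "Volatile Acidity": [0.7], "pH": [3.0]})
cleaned = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
print(list(cleaned.columns))  # ['fixed_acidity', 'volatile_acidity', 'ph']
```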

## Dataset Characteristics

```{python}
total_samples = len(wine_df)
features = wine_df.columns.drop('quality').tolist()
unique_quality_levels = sorted(wine_df['quality'].unique())
```

Our dataset contains `{python} total_samples` wine samples after de-duplication, where:

- 4,898 raw observations are of white wines
- 1,599 raw observations are of red wines
- `{python} len(features)` numerical input features represent physicochemical attributes

![Distribution of all the features](../data/img/eda.png){#fig-distributions width=100%}

```{python}
# Visualization of feature distributions
columns = wine_df.columns.to_list()
dist_chart = alt.Chart(wine_df).mark_bar().encode(
    x=alt.X(alt.repeat('repeat'), bin=alt.Bin(maxbins=40)),
    y=alt.Y('count()')
).repeat(
    repeat=columns,
    columns=3
).properties(
    title="Feature Distributions"
)
```

@fig-distributions shows the distribution of various features in our dataset.

# Model Development

```{python}
# Train-test split
train_df, test_df = train_test_split(wine_df, test_size=0.2, random_state=123)
X_train = train_df.drop(columns='quality')
y_train = train_df['quality']
X_test = test_df.drop(columns='quality')
y_test = test_df['quality']
# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'max_features': [None, 'sqrt', 'log2']
}
tree_model = DecisionTreeClassifier(random_state=16)
grid_search = GridSearchCV(
    estimator=tree_model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_tree_model = grid_search.best_estimator_
# Predictions and evaluation
y_test_pred = best_tree_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
```

To develop the decision tree classifier, we initialized a base model using `DecisionTreeClassifier` with a fixed random seed (`random_state=16`) to ensure reproducibility. A hyperparameter tuning process was conducted using `GridSearchCV` to identify the optimal configuration. The grid search evaluated various combinations of hyperparameters, including `max_depth`, `max_features`, `min_samples_leaf`, and `min_samples_split`, over a 5-fold cross-validation.

The best-performing hyperparameters identified were:
- `max_depth`: `{python} grid_search.best_params_['max_depth']`
- `max_features`: `{python} grid_search.best_params_['max_features']`
- `min_samples_leaf`: `{python} grid_search.best_params_['min_samples_leaf']`
- `min_samples_split`: `{python} grid_search.best_params_['min_samples_split']`

The model was optimized using the accuracy metric (`scoring='accuracy'`) and leveraged parallel processing for efficiency (`n_jobs=-1`).
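Beyond the best parameters, `GridSearchCV` also exposes per-candidate cross-validation scores through its `cv_results_` attribute, which is useful for inspecting how sensitive the model is to each setting. A self-contained sketch on synthetic data (not the wine dataset; the reduced grid is only for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features
X, y = make_classification(n_samples=200, n_features=5, random_state=16)

gs = GridSearchCV(
    DecisionTreeClassifier(random_state=16),
    param_grid={"max_depth": [3, 5, None]},
    cv=5,
    scoring="accuracy",
)
gs.fit(X, y)

# One row per candidate, with its mean CV accuracy and rank
results = pd.DataFrame(gs.cv_results_)[
    ["param_max_depth", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```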


## Model Performance

The Decision Tree model [@boulesteix2007partial] achieved a test accuracy of `{python} f"{test_accuracy:.2%}"`.

```{python}
# Classification report
report = classification_report(y_test, y_test_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose().round(2)
```

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 3 | 0.00 | 0.00 | 0.00 | 4 |
| 4 | 0.29 | 0.24 | 0.26 | 51 |
| 5 | 0.63 | 0.66 | 0.65 | 413 |
| 6 | 0.66 | 0.65 | 0.65 | 567 |
| 7 | 0.61 | 0.58 | 0.60 | 228 |
| 8 | 0.36 | 0.43 | 0.40 | 37 |
| 9 | 0.00 | 0.00 | 0.00 | 0 |
| **Accuracy** | | | 0.62 | 1300 |
| **Macro Avg** | 0.37 | 0.37 | 0.36 | 1300 |
| **Weighted Avg** | 0.62 | 0.62 | 0.62 | 1300 |

: Classification report {#tbl-classification}

![Confusion Matrix](../data/img/confusion.png){#fig-classification}

@fig-classification provides the confusion matrix of the model.

The classification report in @tbl-classification summarizes the model's performance across quality classes. The highest F1-scores are for classes 5 and 6, indicating the model performs well in these majority categories. Performance is poor for classes 3 and 9, with precision, recall, and F1-score all at zero, which points to class imbalance and inadequate representation of rare quality levels in the dataset. Overall accuracy is 62%, with weighted averages for precision, recall, and F1-score also at 62%.
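The imbalance is easy to quantify directly from the support column of the classification report above:

```python
import pandas as pd

# Support counts copied from the classification report table (not recomputed)
support = pd.Series({3: 4, 4: 51, 5: 413, 6: 567, 7: 228, 8: 37, 9: 0})
share = (support / support.sum()).round(3)
print(share)  # classes 5 and 6 together cover roughly 75% of the test split
```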

## Feature Importance

```{python}
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_tree_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

importance_chart = alt.Chart(feature_importances).mark_bar().encode(
    x=alt.X('Importance:Q', title='Importance'),
    y=alt.Y('Feature:N', sort='-x', title='Feature'),
    tooltip=['Feature', 'Importance']
).properties(
    title='Feature Importance'
)
```

![The most important features](../data/img/features.png){#fig-feature-importance}

The feature importance plot highlights the relative significance of each feature in the model. The most influential feature is `alcohol`, followed by `volatile_acidity` and `sulphates`. These features contribute significantly to the predictive performance of the model, while other features like `fixed_acidity` and `pH` have minimal impact. This information can be used to focus on the most important variables for further analysis or model refinement.


# Discussion

## Key Findings

Our analysis revealed that:
- The top predictive features are `alcohol`, `volatile_acidity`, and `sulphates`, as seen in @fig-feature-importance
- The model achieved moderate predictive performance (62% test accuracy)
- Physicochemical properties provide meaningful, objective insight into wine quality

## Limitations and Future Work

Future research could explore:

1. Ensemble methods for improved accuracy
2. Incorporating sensory attributes
3. Investigating additional domain-specific features
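As a sketch of the first direction, a random forest could be swapped in for the single decision tree with minimal code changes. The snippet below uses synthetic data standing in for the wine features, and the hyperparameters are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data shaped roughly like the wine feature matrix (11 predictors)
X, y = make_classification(n_samples=500, n_features=11, n_informative=6, random_state=123)

forest = RandomForestClassifier(n_estimators=200, random_state=123)
scores = cross_val_score(forest, X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Because a forest averages many decorrelated trees, it typically reduces the variance that limits a single tuned decision tree.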

# Conclusion

This project demonstrates the potential of machine learning in understanding wine quality through objective, data-driven analysis. While our model provides valuable insights, there remains significant opportunity for refinement and deeper exploration.

# References
