Wine Quality Prediction

Project Board

Summary

This project explores the relationship between the physicochemical properties of wines and their quality ratings. The aim is to predict wine quality and identify the key factors that influence it using machine learning models such as Decision Trees. Through exploratory data analysis (EDA), we examine patterns, distributions, and correlations, and address challenges such as class imbalance in the quality ratings. The Decision Tree model is evaluated with accuracy, precision, and recall, and its feature importances are used to identify significant predictors such as density, alcohol, and volatile_acidity. The primary goal is an interpretable machine learning pipeline that gives winemakers actionable insights for optimizing production and helps consumers make informed choices. The project also lays the foundation for future work, including incorporating sensory attributes, addressing dataset imbalance, and using more advanced ensemble methods to improve predictions.
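To illustrate why accuracy alone can mislead when quality classes are imbalanced, here is a minimal sketch that computes per-class precision and recall in plain Python. The labels are made up for illustration and are not results from this project; the helper name `per_class_precision_recall` is hypothetical and not part of the repository.

```python
from collections import Counter


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for every label seen in either list."""
    true_positives = Counter()
    predicted = Counter(y_pred)  # how often each label was predicted
    actual = Counter(y_true)     # how often each label actually occurs
    for t, p in zip(y_true, y_pred):
        if t == p:
            true_positives[t] += 1
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = true_positives[label] / predicted[label] if predicted[label] else 0.0
        recall = true_positives[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up quality labels: class 6 dominates, class 8 is rare.
y_true = [6, 6, 6, 6, 6, 6, 5, 5, 8, 8]
y_pred = [6, 6, 6, 6, 6, 6, 6, 5, 6, 6]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")
for label, (precision, recall) in per_class_precision_recall(y_true, y_pred).items():
    print(f"quality {label}: precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy example the overall accuracy looks reasonable, yet the rare class 8 is never predicted correctly, which is the kind of effect per-class metrics are meant to expose.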

Contributors:

Chukwunonso Ebele-Muolokwu
Samuel Adetsi
Shashank Hosahalli Shivamurthy
Ci Xu

Reproducible Computational Environment

This project provides a reproducible computational environment through Conda or, alternatively, Docker. Follow the steps below to set up the environment for this project.

Prerequisites

Install Miniconda or Anaconda.
Clone this repository:

git clone https://github.com/UBC-MDS/522-wine-quality-32.git
cd 522-wine-quality-32

Setting Up the Environment

Option 1: Using `environment.yml`

This is the recommended method to set up the environment.

Create the Conda environment:

conda env create -f environment.yml

Activate the environment:

conda activate 522_milestone_env

Verify the environment setup:

python -c "import pandas as pd; print('Environment set up successfully!')"
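For a slightly more thorough check than importing pandas alone, a short script can report which packages are missing from the active environment. This is a sketch under assumptions: only pandas is named in this README, and the other entries in `REQUIRED` are hypothetical placeholders that should be replaced with the packages listed in environment.yml.

```python
import importlib.util

# Assumed package list: only pandas is named in this README. The other
# entries are hypothetical; replace them with those in environment.yml.
REQUIRED = ["pandas", "sklearn", "matplotlib"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("Environment set up successfully!")
```

`importlib.util.find_spec` looks a package up without importing it, so the check stays fast even for large libraries.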

Option 2: Using Platform-Specific Lock Files

If you want to ensure reproducibility across different operating systems, use platform-specific lock files.

Install conda-lock:

pip install conda-lock

Create the environment using the lock file for your platform:
- For Linux/macOS/Windows:

conda-lock install --name 522_milestone_env conda-lock.yml

Activate the environment:

conda activate 522_milestone_env

Option 3: Using Docker Container

Running the analysis

Navigate to the root of this project on your computer using the command line and enter the following command:

docker compose up

In the terminal, look for a URL that starts with http://127.0.0.1:8888/lab?token= (for an example, see the highlighted text in the terminal below). Copy and paste that URL into your browser.

To run the analysis, open analysis.ipynb in the Jupyter Lab session you just launched and, under the "Kernel" menu, click "Restart Kernel and Run All Cells...".

Pipeline Steps

Each pipeline step is defined in the Makefile. Below are the individual targets and how to use them:

1. Download Dataset

Download the raw wine quality dataset:

make data

Output: data/raw/wine_data.csv

2. Process and Validate Data

Process the raw data and generate the processed training and testing datasets, along with a validation report:

make process

Inputs: data/raw/wine_data.csv
Outputs:
- data/processed/wine_train.csv
- data/processed/wine_test.csv
- report/validation_report.html
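The checks behind the validation report are not described in this README. As a rough illustration of what row-level validation can look like, the sketch below flags missing and non-numeric values using only the standard library. The function `validate_rows`, the `quality` column name, and the sample rows are assumptions made up for this example; the project's actual validation step may use different checks and tooling.

```python
import csv
import io


def validate_rows(rows, required_columns):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for i, row in enumerate(rows, start=1):
        for col in required_columns:
            value = (row.get(col) or "").strip()
            if value == "":
                problems.append(f"row {i}: missing value for {col}")
                continue
            try:
                float(value)
            except ValueError:
                problems.append(f"row {i}: non-numeric value {value!r} for {col}")
    return problems


# Made-up sample data, not taken from the real dataset.
sample = io.StringIO(
    "density,alcohol,volatile_acidity,quality\n"
    "0.9978,9.4,0.70,5\n"
    "0.9968,,0.88,5\n"
    "0.9970,9.8,high,6\n"
)
rows = list(csv.DictReader(sample))
columns = ["density", "alcohol", "volatile_acidity", "quality"]
problems = validate_rows(rows, columns)
for problem in problems:
    print(problem)
```

Collecting problems in a list, instead of stopping at the first one, makes it easier to produce a complete report in a single pass.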

3. Train the Model

Train a Decision Tree model on the processed data:

make train

Inputs:
- data/processed/wine_train.csv
- data/processed/wine_test.csv
Output: data/model/wine_model.pkl
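Assuming the .pkl file is a standard Python pickle (the extension suggests this, but the README does not confirm it), saving and reloading a model could look like the sketch below. A plain dictionary stands in for the trained Decision Tree so the example runs without any third-party packages; `save_model`, `load_model`, and the stand-in contents are hypothetical.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize the model to path, creating parent directories if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object (hypothetical values); the real file would hold the
# trained Decision Tree instead.
stand_in_model = {"kind": "decision_tree", "max_depth": 5}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model", "wine_model.pkl")
    save_model(stand_in_model, path)
    restored = load_model(path)

print(restored == stand_in_model)
```

Loading the real data/model/wine_model.pkl would additionally require the library that defined the model class to be installed in the active environment.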

4. Generate Plots

Create visualizations for feature importance and wine quality distribution:

make plot

Inputs:
- data/model/wine_model.pkl
- data/processed/wine_train.csv
- data/processed/wine_test.csv
Outputs:
- data/img/feature_importance.png
- data/img/quality_distribution.png

5. Generate the Final Report

Render the analysis report using Quarto:

make report

Inputs:
- data/img/feature_importance.png
- data/img/quality_distribution.png
- report/wine_quality_eda.qmd
Output: report/wine_quality_eda.html

6. Run the Entire Pipeline

Run all steps in the pipeline:

make all

This command ensures that all intermediate files are created and up to date.
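Make decides what to rebuild by comparing file modification times: a target is rebuilt when it is missing or older than any of its prerequisites. The sketch below mimics that rule in Python. The file names mirror the pipeline's data files, but the dependency shown is an assumption, since the Makefile itself is not reproduced here; `is_stale` is a hypothetical helper.

```python
import os
import tempfile


def is_stale(target, prerequisites):
    """True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


checks = []
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "wine_data.csv")
    train = os.path.join(tmp, "wine_train.csv")

    with open(raw, "w"):
        pass
    checks.append(is_stale(train, [raw]))  # target does not exist yet

    with open(train, "w"):
        pass
    os.utime(raw, (1000, 1000))    # raw data is older ...
    os.utime(train, (2000, 2000))  # ... than the processed file
    checks.append(is_stale(train, [raw]))

    os.utime(raw, (3000, 3000))    # raw data changed after processing
    checks.append(is_stale(train, [raw]))

print(checks)
```

The three checks correspond to a missing target, an up-to-date target, and a target invalidated by a newer input.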

7. Clean Up Generated Files

Remove all generated files to reset the pipeline:

make clean

8. Retrain and Regenerate Everything

Clean the pipeline and rerun all steps:

make retrain

Updating the Environment

If you add new dependencies:

Update environment.yml.
Rebuild the environment:

conda env update -f environment.yml --prune

For Docker, rebuild the container:

docker compose build

Cleaning Up

Remove the Conda Environment:

conda env remove -n 522_milestone_env

Remove Docker Resources:

docker compose down --remove-orphans

