- Yixuan Gao
- Bryan Lee
- Wangkai Zhu
- Timothy Singh
The aim of this project is to build a classification model that predicts the quality of red wine from its physicochemical properties. This is a multi-class classification problem in which the target variable, wine quality, is an integer score ranging from 0 (poor quality) to 10 (high quality).
Several models were evaluated, including:
- K-Nearest Neighbors (KNN)
- Support Vector Machine with Radial Basis Function kernel (SVM RBF)
- Naive Bayes
- Logistic Regression
- Decision Tree
The methodology included hyperparameter tuning and 5-fold cross-validation, and the best-performing model was selected based on accuracy and other relevant metrics.
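As a rough illustration of the comparison step (not the project's exact code), the sketch below runs 5-fold cross-validation over the candidate models with scikit-learn; the file paths, scaler choice, and default model settings are assumptions.

```python
# Illustrative 5-fold cross-validation over the candidate models.
# Paths and preprocessing are assumptions, not the project's actual code.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X_train = pd.read_csv("data/processed/X_train.csv")
y_train = pd.read_csv("data/processed/y_train.csv").squeeze()

models = {
    "KNN": KNeighborsClassifier(),
    "SVM RBF": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "Decision Tree": DecisionTreeClassifier(random_state=123),
}

for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    print(f"{name}: mean validation accuracy = {scores.mean():.3f}")
```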
The best-performing model was the RBF SVM, which achieved a validation score of approximately 66% and a test-set accuracy of around 58%. While the model predicted wines with mid-range quality ratings (5 or 6) reasonably well, its performance dropped sharply for wines of higher or lower quality. The confusion matrices point to difficulty separating certain classes, with class imbalance likely hurting performance (e.g., classes 3 and 8 have many true negatives but no true positives). This indicates that the model struggles to handle rare and extreme quality scores effectively.
The dataset used for this project is the Red Wine Quality Dataset from the UC Irvine Machine Learning Repository. It consists of 1,599 observations with 11 continuous features, such as fixed acidity, volatile acidity, citric acid, and alcohol content.
The dataset is referenced from the work of Paulo Cortez et al. (details here).
The final report can be found here.
- Clone the repository: https://github.com/UBC-MDS/wine_quality_predictor_group1
- Make sure the Docker application is open and that you are logged in with the correct credentials.
- In your terminal, run `cd wine_quality_predictor_group1` to switch into the newly cloned repository.
- With your terminal in the repository, run `docker-compose up`. The first time, the image needs to be pulled, which may take a few minutes.
- Once the image loads, click the link in your terminal that starts with `http://127.0.0.1`; it contains the token for the Docker container.
- Make sure no other instance of Jupyter Lab is open on port 8888, as clicking this link will open Jupyter Lab on that port.
- To run the analysis and generate all necessary files, open a terminal within Jupyter Lab and run `make all`. This takes a minute or two; when it completes, the final reports can be found in the `reports` directory.
- Should all generated files need to be cleared, run `make clean`.
- The `pytest` command can also be used within the terminal to ensure all scripts and functions run as intended.
- After closing the container, run `docker-compose rm` in your local (desktop) terminal to clean up the container.
The following are the scripts in this project:
This script downloads or reads data stored in a `.zip` file and saves it locally (see the sketch below).

- `<url>`: URL of the `.zip` file to download (e.g. https://archive.ics.uci.edu/static/public/186/wine+quality.zip).
- `<write_to>`: Path to save the downloaded data (e.g. `data/raw`).
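A minimal sketch of what this download step might look like; the function name `download_zip`, the use of `requests`, and the default paths are illustrative assumptions rather than the script's actual implementation.

```python
# Hypothetical sketch: fetch a .zip from <url> and extract it into <write_to>.
import io
import os
import zipfile

import requests


def download_zip(url: str, write_to: str) -> None:
    """Download a .zip archive from `url` and extract its contents into `write_to`."""
    os.makedirs(write_to, exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        archive.extractall(write_to)


if __name__ == "__main__":
    download_zip(
        "https://archive.ics.uci.edu/static/public/186/wine+quality.zip",
        "data/raw",
    )
```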
This script cleans the dataset by removing duplicates and handling missing values (see the sketch below).

- `<input_path>`: Path to the raw data file (e.g. `data/raw/raw_data.csv`).
- `<output_path>`: Path to save the cleaned data (e.g. `data/processed/cleaned_data.csv`).
- `<log-path>`: Path to save the results/logs of data cleaning.
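A hedged sketch of the cleaning step; whether the real script drops or imputes missing values, and the exact file names, are assumptions.

```python
# Illustrative cleaning step: drop duplicate rows and rows with missing values.
# Paths follow the example arguments above; the drop-vs-impute choice is assumed.
import os

import pandas as pd

raw = pd.read_csv("data/raw/raw_data.csv")
cleaned = raw.drop_duplicates().dropna()

os.makedirs("data/processed", exist_ok=True)
cleaned.to_csv("data/processed/cleaned_data.csv", index=False)
print(f"{len(raw) - len(cleaned)} rows removed during cleaning")
```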
This script validates the data against the predefined schema (see the sketch below).

- `<input_path>`: Path to the cleaned data (e.g. `data/processed/cleaned_data.csv`).
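As an illustration only, the sketch below uses a `pandera`-style schema with a few of the dataset's columns; the library choice, column names, and checks are assumptions and may differ from the project's actual schema.

```python
# Illustrative validation step: check a handful of columns against simple rules.
# The pandera library and these particular checks are assumptions.
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "fixed acidity": pa.Column(float, pa.Check.ge(0)),
    "volatile acidity": pa.Column(float, pa.Check.ge(0)),
    "alcohol": pa.Column(float, pa.Check.ge(0)),
    "quality": pa.Column(int, pa.Check.in_range(0, 10)),
})

df = pd.read_csv("data/processed/cleaned_data.csv")
schema.validate(df)  # raises a SchemaError if any check fails
```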
This script takes the cleaned data and applies a train-test split (see the sketch below). Four CSV files are created in a new `train_test_path`:

- `X_train.csv`
- `X_test.csv`
- `y_train.csv`
- `y_test.csv`
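A minimal sketch of the splitting step with scikit-learn's `train_test_split`; the 80/20 split ratio, random seed, and output directory are assumptions.

```python
# Illustrative train-test split producing the four CSV files listed above.
import os

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/cleaned_data.csv")
X = df.drop(columns=["quality"])
y = df["quality"]

# Assumed 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

os.makedirs("data/processed", exist_ok=True)
splits = {"X_train": X_train, "X_test": X_test, "y_train": y_train, "y_test": y_test}
for name, part in splits.items():
    part.to_csv(f"data/processed/{name}.csv", index=False)
```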
The EDA plots are saved as individual `.png` files (see the sketch below). The charts should appear in the order below:

- `target_distribution_plot.png`
- `correlation_heatmap.png`
- `feature_distributions.png`
- `feature_pairplots.png`

- `<clean_data_path>`: Path to the cleaned data (e.g. `data/processed/cleaned_data.csv`).
- `<train_test_path>`: Path to save the train-test splits of the data set (e.g. `data/processed/`).
- `<figures_path>`: Path to save the figures generated from EDA (e.g. `results/figures/`).
- `<tables_path>`: Path to save the tables generated from EDA (e.g. `results/tables/`).
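An illustrative sketch of how these figures could be produced with pandas and seaborn; the actual script may use a different plotting library, and column names and styling here are assumptions.

```python
# Illustrative EDA figures matching the file names listed above.
import os

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/processed/cleaned_data.csv")
figures_path = "results/figures"
os.makedirs(figures_path, exist_ok=True)

# Distribution of the target variable (wine quality).
sns.countplot(x="quality", data=df)
plt.savefig(f"{figures_path}/target_distribution_plot.png")
plt.close()

# Correlation heatmap of the numeric features.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.savefig(f"{figures_path}/correlation_heatmap.png")
plt.close()

# Histograms of every feature.
df.hist(figsize=(12, 10))
plt.savefig(f"{figures_path}/feature_distributions.png")
plt.close()

# Pairwise scatter plots coloured by quality.
sns.pairplot(df, hue="quality")
plt.savefig(f"{figures_path}/feature_pairplots.png")
plt.close()
```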
This script creates a preprocessor and performs 5-fold cross-validation on different models (see the sketch below). The scores from this cross-validation are saved, along with the model that achieves the best validation score.

- `<train_data_path>`: Relative path to retrieve the training data.
- `<scores_path>`: Relative path to save training and validation scores.
- `<preprocessor_path>`: Relative path to save the preprocessor as a `.pickle` file.
- `<model_path>`: Relative path to save the best-performing model as a `.pickle` file.
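A hedged sketch of this step; the candidate models shown, the scaler-only preprocessor, and the output paths are assumptions, not the project's exact code.

```python
# Illustrative model selection: cross-validate candidate pipelines,
# save the score table, and pickle the preprocessor and best model.
import os
import pickle

import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train = pd.read_csv("data/processed/X_train.csv")
y_train = pd.read_csv("data/processed/y_train.csv").squeeze()

# All 11 features are numeric, so a single scaler is assumed to suffice here.
preprocessor = StandardScaler()

candidates = {"knn": KNeighborsClassifier(), "svm_rbf": SVC(kernel="rbf")}
results = {}
for name, model in candidates.items():
    pipe = make_pipeline(preprocessor, model)
    cv = cross_validate(pipe, X_train, y_train, cv=5, return_train_score=True)
    results[name] = pd.DataFrame(cv).agg("mean")

for d in ("results/tables", "results/models"):
    os.makedirs(d, exist_ok=True)

scores = pd.DataFrame(results)
scores.to_csv("results/tables/cross_val_scores.csv")

# Keep the pipeline with the highest mean validation accuracy.
best_name = scores.loc["test_score"].idxmax()
best_pipe = make_pipeline(preprocessor, candidates[best_name]).fit(X_train, y_train)

with open("results/models/preprocessor.pickle", "wb") as f:
    pickle.dump(preprocessor, f)
with open("results/models/best_model.pickle", "wb") as f:
    pickle.dump(best_pipe, f)
```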
This script takes an SVC pipeline and tunes the model with RandomizedSearchCV (see the sketch below).

- `<model_path>`: Path to retrieve the pre-trained model file (`.pickle`).
- `<best_model_path>`: Path to save the fine-tuned model (`.pickle`).
- `<X_train_path>`: Path to the training features (`.csv`).
- `<y_train_path>`: Path to the training labels (`.csv`).
- `<X_test_path>`: Path to the testing features (`.csv`).
- `<y_test_path>`: Path to the testing labels (`.csv`).
- `<params_output_path>`: Path to save the best parameters (`.csv`).
This script computes the accuracy of the model's predictions on the test set (see the sketch below). It also creates and saves confusion matrices using a One-vs-Rest approach.

- `<tuned_model_path>`: Relative path to the tuned model after hyperparameter tuning.
- `<test_split_path>`: Relative path to the testing split of the data set.
- `<test_accuracy_path>`: Relative path to save the test accuracy.
- `<figures_path>`: Path to save any figures from the evaluation.
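A minimal sketch of the evaluation step; the One-vs-Rest confusion matrices are built here with `multilabel_confusion_matrix`, and the paths and figure names are assumptions.

```python
# Illustrative evaluation: test-set accuracy plus one 2x2 confusion matrix
# per quality class (One-vs-Rest). Paths and file names are assumed.
import os
import pickle

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    multilabel_confusion_matrix,
)

with open("results/models/tuned_model.pickle", "rb") as f:
    model = pickle.load(f)

X_test = pd.read_csv("data/processed/X_test.csv")
y_test = pd.read_csv("data/processed/y_test.csv").squeeze()
y_pred = model.predict(X_test)

for d in ("results/tables", "results/figures"):
    os.makedirs(d, exist_ok=True)

accuracy = accuracy_score(y_test, y_pred)
pd.DataFrame({"test_accuracy": [accuracy]}).to_csv(
    "results/tables/test_accuracy.csv", index=False
)

# One-vs-Rest: a separate binary confusion matrix for each quality class.
labels = sorted(set(y_test) | set(y_pred))
for cls, cm in zip(labels, multilabel_confusion_matrix(y_test, y_pred, labels=labels)):
    ConfusionMatrixDisplay(cm, display_labels=[f"not {cls}", str(cls)]).plot()
    plt.savefig(f"results/figures/confusion_matrix_class_{cls}.png")
    plt.close()
```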
Python and the packages listed in the `environment.yml` file. This file was used to create the `conda-linux-64.lock` file, which in turn is used to build the Docker container.
This project is licensed under the terms described in the `LICENSE.md` file, under the MIT License and a Creative Commons license.