All the content in this repository is available as a static website.
Authors: Sam Tan, Bruce Xu, Duy Anh Dang, Donghoon Shin
This project is dedicated to identifying predictive factors for the development of diabetes. The raw dataset is accessible on Kaggle.
We evaluated four models for predicting whether an individual would develop diabetes: OLS, Decision Tree, Random Forest classification, and K-Nearest Neighbors (KNN). The Random Forest model achieved the highest accuracy, 0.866. The KNN model was notably less accurate than the other three, reaching only 0.837 on the test set.
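For reference, accuracy here means the fraction of predictions that match the true labels. The following is a minimal sketch of that metric in plain Python. The function name and the toy labels are assumptions made for illustration; they are not taken from the project's code or from the diabetes dataset.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that equal the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (hypothetical, not taken from the diabetes dataset):
# 1 = diabetes, 0 = no diabetes.
toy_true = [0, 0, 1, 1, 0, 1, 0, 0]
toy_pred = [0, 0, 1, 0, 0, 1, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)  # 6 of 8 predictions match
```

Accuracy alone can hide poor performance on the rarer class, so it may be worth reading it alongside the other results in the notebooks.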
Because Random Forest performed best, we examined it further in an attempt to improve its accuracy. We tested the effect of removing duplicate records and found that doing so did not improve test accuracy. This matches our expectation: the duplicates in the original dataset are not the result of manual errors. Rather, they reflect combinations of values that genuinely occur more than once in the data.
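The duplicate-removal step can be sketched as follows. This is an illustration using only the Python standard library, not the code used in the notebooks; the helper names and the toy rows are assumptions.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]


def count_duplicates(rows: Sequence[Row]) -> int:
    """Count rows that repeat an earlier, identical row."""
    counts = Counter(rows)
    return sum(n - 1 for n in counts.values())


def drop_duplicates(rows: Sequence[Row]) -> List[Row]:
    """Keep the first occurrence of each distinct row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Hypothetical rows; the column meanings are invented for illustration.
toy_rows = [
    (1, 0, 27),
    (0, 0, 31),
    (1, 0, 27),  # exact repeat of the first row
    (0, 1, 24),
    (1, 0, 27),  # another repeat
]
n_duplicates = count_duplicates(toy_rows)
deduplicated = drop_duplicates(toy_rows)
```

The sketch covers only the deduplication itself. Comparing test accuracy before and after would additionally require retraining the model on each version of the data.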
Installation Instructions
To set up a new conda environment with the necessary dependencies, run make env. Activate the environment with conda activate diabetes_analysis. Use the diabetes-analysis kernel to run the Jupyter Notebook.
To use the custom functions written for this project, install the diabetes_analysis_tools package. After setting the working directory to this repository using cd .., run pip install -e diabetes_analysis_pkg/. Note: restart the kernel after installation, because a running kernel does not pick up changes to installed packages.
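To check from a notebook cell whether the package is visible to the current interpreter, something like the following sketch may help. The helper name is_importable is an assumption; importlib.invalidate_caches and importlib.util.find_spec are standard-library functions.

```python
import importlib
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module with this name."""
    # Clear the import system's cached directory listings so that packages
    # installed after the interpreter started can be found.
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


# "diabetes_analysis_tools" is the package name given in this README.
tools_available = is_importable("diabetes_analysis_tools")
print("diabetes_analysis_tools importable:", tools_available)
```

If this prints False right after installation, the running kernel may simply not have picked up the editable install yet, and restarting the kernel as described above remains the reliable fix.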
data/: contains the datasets in CSV format
  diabetes_012_health_indicators_BRFSS2015.csv
  diabetes_binary_5050split_health_indicators_BRFSS2015.csv
  diabetes_binary_health_indicators_BRFSS2015.csv
  clean.csv: the cleaned version of the data, ready for analysis
  xtrain.csv: the training dataset with features
  ytrain.csv: the training dataset with labels
  xtest.csv: the testing dataset with features
  ytest.csv: the testing dataset with labels
figures/: contains figures generated by running the notebooks in notebooks/
notebooks/: contains the Jupyter notebooks for data analysis
  main.ipynb: summarizes and discusses the findings and outcomes of our analysis
  EDA.ipynb: conducts exploratory data analysis and data visualization
diabetes_analysis_pkg: contains the package, which has all the tailor-made functions for this project
  diabetes_analysis_tools: the name of the package
  setup.py: required to create the Python package
  pyproj.tml: required to create the Python package
  setup.cfg: required to create the Python package
_config.yml: required for JupyterBook
conf.py: required for JupyterBook
_toc.yml: the table of contents for JupyterBook
environment-jupyterbook.txt: packages for the book build in GitHub Actions
environment.yml: the conda environment for this repo
Makefile: make commands for easy execution
LICENSE: contains the license used by the repo
README.md: the current document
make
  env: creates and configures the environment
  remove-env: removes the environment
  update-env: updates the environment
  html: builds the JupyterBook normally
  html-hub: builds the JupyterBook so that you can view it on the hub with the URL proxy trick: https://stat159.datahub.berkeley.edu/user-redirect/proxy/8000/index.html
  clean: cleans up the generated figures and _build folders
  all: runs all the notebooks (*.ipynb in notebooks/ and main.ipynb)