I completed the EDA for the binary diabetes dataset: I cleaned the data, dropped duplicate records, and created several visualizations to help the audience better understand the dataset. I also wrote the section that builds different classification models on the binary dataset, compared their accuracies, and optimized the random forest model. Finally, I reorganized our repo structure to separate data, figures, notebooks, etc.
I started main.ipynb and ran an OLS regression on the variable of interest, diabetes. I explained the OLS regression and its significance for the dataset and model.
I made our work visible by publishing it online as a Jupyter Book with the help of a GitHub workflow; composed the README file with detailed descriptions of the project and the structure of the repository; created the package; improved the Makefile commands, building on Donghoon's work, so the environment can be installed in one line; and updated Environment.yml with the correct version of numpy.
I made our codebase reproducible by adding an environment.yml and a Makefile that creates the conda environment and ipykernel. I also added a logistic regression analysis of which features predict diabetes.