diff --git a/README.md b/README.md
index c828e64..31562c8 100644
--- a/README.md
+++ b/README.md
@@ -61,9 +61,10 @@ python scripts/fit_classifier.py \
 --x_training_data=data/processed/X_train.csv \
 --y_training_data=data/processed/y_train.csv \
 --pipeline_to=results/models \
+ --preprocessor_to=results/models \
 --results_to=results/tables
-python scripts/evaluation.py \
+python scripts/evaluate.py \
 --x_test_data=data/processed/X_test.csv \
 --y_test_data=data/processed/y_test.csv \
 --pipeline_from=results/models/LogisticRegression_classifier_pipeline.pickle \
@@ -72,11 +73,9 @@ python scripts/evaluation.py \
 ```
-3. To run the analysis, open src/age_group_classification.ipynb in Jupyter Lab you just launched and under the "Kernel" menu click "Restart Kernel and Run All Cells..."
-
 ### Clean up
-1. To shut down the container and clean up the resources, type `Cntrl` + `C` in the terminal where you launched the container, and then type `docker compose rm`
+1. To shut down the container and clean up the resources, type `Ctrl` + `C` in the terminal where you launched the container, and then type `docker compose rm`
 ## Developer notes
@@ -85,15 +84,15 @@ python scripts/evaluation.py \
 - `conda-lock` (version 2.5.7 or higher)
 ### Adding a new dependency
-1. Add the dependency to the environment.yml file on a new branch.
+1. Add the dependency to the `environment.yml` file on a new branch.
-2. Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file.
+2. Run `conda-lock -k explicit --file environment.yml -p linux-64` to update the conda-linux-64.lock file.
 3. Re-build the Docker image locally to ensure it builds and runs properly.
 4. Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA for the commit that changed the file.
-5. Update the docker-compose.yml file on your branch to use the new container image (make sure to update the tag specifically).
+5. Update the `docker-compose.yml` file on your branch to use the new container image (make sure to update the tag specifically).
 6. Send a pull request to merge the changes into the main branch.
diff --git a/logs/validation_errors.log b/logs/validation_errors.log
index d66f167..8be2f5d 100644
--- a/logs/validation_errors.log
+++ b/logs/validation_errors.log
@@ -1,4 +1,4 @@
-2024-12-04 15:20:05,096 -
+2024-12-08 16:45:22,956 -
 {
 "SCHEMA": {
 "COLUMN_NOT_IN_SCHEMA": [
diff --git a/notebooks/age_group_classification.ipynb b/notebooks/age_group_classification.ipynb
index 6c5bd57..20774ad 100644
--- a/notebooks/age_group_classification.ipynb
+++ b/notebooks/age_group_classification.ipynb
@@ -1776,7 +1776,7 @@
 ],
 "metadata": {
 "kernelspec": {
- "display_name": "Python [conda env:522_fmj] *",
+ "display_name": "Python [conda env:522_fmj]",
 "language": "python",
 "name": "conda-env-522_fmj-py"
 },
diff --git a/reports/age_group_classification.pdf b/reports/age_group_classification.pdf
index 37e6346..0bcbbbd 100644
Binary files a/reports/age_group_classification.pdf and b/reports/age_group_classification.pdf differ
diff --git a/reports/age_group_classification.qmd b/reports/age_group_classification.qmd
index 115f14c..8b63b33 100644
--- a/reports/age_group_classification.qmd
+++ b/reports/age_group_classification.qmd
@@ -31,7 +31,7 @@ While taking care of elders is a core value of many cultures, this is not a hall
 Formally, the question this project seeks to answer is: Can information about the health and nutritional status of Americans be used to predict whether they are adults or seniors?
-The dataset used to answer this question is the National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset (@national_health_and_nutrition_health_survey_2013-2014_(nhanes)_age_prediction_subset_887). It was originally prepared for a research paper on predicting diabetes and cardiovascular disease in patients (@DinhMiertschin2016 and @MukhtarAzwari2021). The dataset's stated purpose was to assess the health and nutritional status of adults and children in the United States (@Papazafiropoulou2024), however respondents were classified as either Adults (respondents under 65 years of age) or Seniors (respondents 65 years of age or older). Respondents were located in the United States and provided data through interviews, physical examinations, and laboratory tests to the National Center for Health Statistics (NCHS) (part of the Centers for Disease Control and Prevention (CDC)).
+The dataset used to answer this question is the National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset (@NHANES2019). It was originally prepared for a research paper on predicting diabetes and cardiovascular disease in patients (@DinhMiertschin2016 and @MukhtarAzwari2021). The dataset's stated purpose was to assess the health and nutritional status of adults and children in the United States (@Papazafiropoulou2024), however respondents were classified as either Adults (respondents under 65 years of age) or Seniors (respondents 65 years of age or older). Respondents were located in the United States and provided data through interviews, physical examinations, and laboratory tests to the National Center for Health Statistics (NCHS) (part of the Centers for Disease Control and Prevention (CDC)).
 The dataset has 10 variables and 2278 rows, with each row representing a respondent. The variables are:
@@ -119,13 +119,10 @@ We one-hot encoded categorical features (gender, physical_activity, and diabetic
 We compared a dummy classifier, logistic regression, and SVC model by mean cross validation score. The cross validation scores for each are below.
-#### Mean Cross Validation Score for all 3 models
-
 ```{python}
 #| label: tbl-cv-dummy
-#| tbl-cap: Dummy classifier cross validation scores
+#| tbl-cap: Mean cross validation scores
-# FORGIVE TO ADD FILE NAME THEN DELETE THIS LINE
 results = pd.read_csv('../results/tables/model_cv_score.csv')
 Markdown(results.to_markdown())
@@ -133,12 +130,9 @@ Markdown(results.to_markdown())
 ### Testing Best Model on Test Data
-Since Logistic Regression had the best mean Cross Validation score, we selected it as our final model.
+Since logistic regression had the best mean cross validation score, we selected it as our final model.
 ```{python}
-# TODO: FORGIVE TO ADD FILE NAME, UNCOMMENT CODE BELOW, THEN DELETE THIS LINE AND THE ONE BELOW IT
-test_score = 'PLACEHOLDER'
-
 best = pickle.load(open('../results/models/LogisticRegression_classifier_pipeline.pickle', 'rb'))
 X_test = pd.read_csv('../data/processed/X_test.csv')
@@ -154,12 +148,10 @@ The model's accuracy on test data was `{python} test_score`.
 ![Confusion matrix of the best model on test data](../results/figures/Confusion_matrix.png){#fig-confusion-matrix}
-
 The confusion matrix (@fig-confusion-matrix) showed that while the model score is `{python} test_score`, it did very poorly at recall and quite poorly at precision.
 ![ROC curve of the best model on test data](../results/figures/ROC.png){#fig-roc}
-
 This performance was reflected in the ROC curve above (@fig-roc). While it could differentiate the positive class "Senior" from the negative class to some extent, the model struggled to achieve both high true positive rates and low false positive rates.
 ## Discussion
diff --git a/reports/references.bib b/reports/references.bib
index 5941445..54bf01f 100644
--- a/reports/references.bib
+++ b/reports/references.bib
@@ -29,10 +29,10 @@ @article{Papazafiropoulou2024
 pages = {e122--e128}
 }
-@misc{national_health_and_nutrition_health_survey_2013-2014_(nhanes)_age_prediction_subset_887,
- author = {NA, NA},
- title = {{National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset}},
+@misc{NHANES2019,
+ author = {NHANES},
 year = {2019},
+ title = {National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset},
 howpublished = {UCI Machine Learning Repository},
 note = {{DOI}: https://doi.org/10.24432/C5BS66}
 }
\ No newline at end of file
diff --git a/results/figures/Confusion_matrix.png b/results/figures/Confusion_matrix.png
index 2064455..000a725 100644
Binary files a/results/figures/Confusion_matrix.png and b/results/figures/Confusion_matrix.png differ
diff --git a/results/figures/ROC.png b/results/figures/ROC.png
index f1ac4fd..cfba621 100644
Binary files a/results/figures/ROC.png and b/results/figures/ROC.png differ
diff --git a/results/figures/eda_histogram.png b/results/figures/eda_histogram.png
index ae0e4c0..9a300ff 100644
Binary files a/results/figures/eda_histogram.png and b/results/figures/eda_histogram.png differ
diff --git a/results/models/LogisticRegression_classifier_pipeline.pickle b/results/models/LogisticRegression_classifier_pipeline.pickle
index 5f72ef0..882905b 100644
Binary files a/results/models/LogisticRegression_classifier_pipeline.pickle and b/results/models/LogisticRegression_classifier_pipeline.pickle differ