Merge pull request #90 from UBC-MDS/final-checks

fixed errors in readme bash commands, reference error in qmd, table title in mean cv scores
UBC-MDS · Dec 8, 2024 · 1bd43e7
2 parents 6cdd5bd + 2310a49

Showing 10 changed files with 14 additions and 23 deletions.
diff --git a/README.md b/README.md
@@ -61,9 +61,10 @@ python scripts/fit_classifier.py \
    --x_training_data=data/processed/X_train.csv \
    --y_training_data=data/processed/y_train.csv \
    --pipeline_to=results/models \
+   --preprocessor_to=results/models \
    --results_to=results/tables
 
-python scripts/evaluation.py \
+python scripts/evaluate.py \
    --x_test_data=data/processed/X_test.csv \
    --y_test_data=data/processed/y_test.csv \
    --pipeline_from=results/models/LogisticRegression_classifier_pipeline.pickle \
@@ -72,11 +73,9 @@ python scripts/evaluation.py \
 
 ```
 
-3. To run the analysis, open src/age_group_classification.ipynb in Jupyter Lab you just launched and under the "Kernel" menu click "Restart Kernel and Run All Cells..."
-
 ### Clean up
 
-1. To shut down the container and clean up the resources, type `Cntrl` + `C` in the terminal where you launched the container, and then type `docker compose rm`
+1. To shut down the container and clean up the resources, type `Ctrl` + `C` in the terminal where you launched the container, and then type `docker compose rm`
 
 ## Developer notes
 
@@ -85,15 +84,15 @@ python scripts/evaluation.py \
 - `conda-lock` (version 2.5.7 or higher)
 
 ### Adding a new dependency
-1. Add the dependency to the environment.yml file on a new branch.
+1. Add the dependency to the `environment.yml` file on a new branch.
 
-2. Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file.
+2. Run `conda-lock -k explicit --file environment.yml -p linux-64` to update the conda-linux-64.lock file.
 
 3. Re-build the Docker image locally to ensure it builds and runs properly.
 
 4. Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA for the commit that changed the file.
 
-5. Update the docker-compose.yml file on your branch to use the new container image (make sure to update the tag specifically).
+5. Update the `docker-compose.yml` file on your branch to use the new container image (make sure to update the tag specifically).
 
 6. Send a pull request to merge the changes into the main branch.
 

diff --git a/logs/validation_errors.log b/logs/validation_errors.log
@@ -1,4 +1,4 @@
-2024-12-04 15:20:05,096 - 
+2024-12-08 16:45:22,956 - 
 {
   "SCHEMA": {
     "COLUMN_NOT_IN_SCHEMA": [

diff --git a/notebooks/age_group_classification.ipynb b/notebooks/age_group_classification.ipynb
@@ -1776,7 +1776,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python [conda env:522_fmj] *",
+   "display_name": "Python [conda env:522_fmj]",
    "language": "python",
    "name": "conda-env-522_fmj-py"
   },

diff --git a/reports/age_group_classification.pdf b/reports/age_group_classification.pdf
diff --git a/reports/age_group_classification.qmd b/reports/age_group_classification.qmd
@@ -31,7 +31,7 @@ While taking care of elders is a core value of many cultures, this is not a hall
 
 Formally, the question this project seeks to answer is: Can information about the health and nutritional status of Americans be used to predict whether they are adults or seniors?
 
-The dataset used to answer this question is the National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset (@national_health_and_nutrition_health_survey_2013-2014_(nhanes)_age_prediction_subset_887). It was originally prepared for a research paper on predicting diabetes and cardiovascular disease in patients (@DinhMiertschin2016 and @MukhtarAzwari2021). The dataset's stated purpose was to assess the health and nutritional status of adults and children in the United States (@Papazafiropoulou2024), however respondents were classified as either Adults (respondents under 65 years of age) or Seniors (respondents 65 years of age or older). Respondents were located in the United States and provided data through interviews, physical examinations, and laboratory tests to the National Center for Health Statistics (NCHS) (part of the Centers for Disease Control and Prevention (CDC)).
+The dataset used to answer this question is the National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset (@NHANES2019). It was originally prepared for a research paper on predicting diabetes and cardiovascular disease in patients (@DinhMiertschin2016 and @MukhtarAzwari2021). The dataset's stated purpose was to assess the health and nutritional status of adults and children in the United States (@Papazafiropoulou2024), however respondents were classified as either Adults (respondents under 65 years of age) or Seniors (respondents 65 years of age or older). Respondents were located in the United States and provided data through interviews, physical examinations, and laboratory tests to the National Center for Health Statistics (NCHS) (part of the Centers for Disease Control and Prevention (CDC)).
 
 The dataset has 10 variables and 2278 rows, with each row representing a respondent. The variables are:
 
@@ -119,26 +119,20 @@ We one-hot encoded categorical features (gender, physical_activity, and diabetic
 
 We compared a dummy classifier, logistic regression, and SVC model by mean cross validation score. The cross validation scores for each are below.
 
-#### Mean Cross Validation Score for all 3 models
-
 ```{python}
 #| label: tbl-cv-dummy
-#| tbl-cap: Dummy classifier cross validation scores
+#| tbl-cap: Mean cross validation scores
 
-# FORGIVE TO ADD FILE NAME THEN DELETE THIS LINE
 results = pd.read_csv('../results/tables/model_cv_score.csv')
 Markdown(results.to_markdown())
 
 ```
 
 ### Testing Best Model on Test Data
 
-Since Logistic Regression had the best mean Cross Validation score, we selected it as our final model.
+Since logistic regression had the best mean cross validation score, we selected it as our final model.
 
 ```{python}
-# TODO: FORGIVE TO ADD FILE NAME, UNCOMMENT CODE BELOW, THEN DELETE THIS LINE AND THE ONE BELOW IT
-test_score = 'PLACEHOLDER'
-
 best = pickle.load(open('../results/models/LogisticRegression_classifier_pipeline.pickle', 'rb'))
 
 X_test = pd.read_csv('../data/processed/X_test.csv')
@@ -154,12 +148,10 @@ The model's accuracy on test data was `{python} test_score`.
 
 ![Confusion matrix of the best model on test data](../results/figures/Confusion_matrix.png){#fig-confusion-matrix}
 
-
 The confusion matrix (@fig-confusion-matrix) showed that while the model score is `{python} test_score`, it did very poorly at recall and quite poorly at precision.
 
 ![ROC curve of the best model on test data](../results/figures/ROC.png){#fig-roc}
 
-
 This performance was reflected in the ROC curve above (@fig-roc). While it could differentiate the positive class "Senior" from the negative class to some extent, the model struggled to achieve both high true positive rates and low false positive rates.
 
 ## Discussion
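
Review note on the test-data chunk in this file: `pickle.load(open(..., 'rb'))` leaves closing the file handle to the garbage collector. A minimal sketch of a context-manager alternative is below. `load_pickle` and `save_pickle` are hypothetical helper names, not functions from this repository, and the sketch assumes the pickle comes from a trusted location such as the project's own `results/models` directory, since unpickling untrusted data can execute arbitrary code.

```python
import pickle
from pathlib import Path


def save_pickle(obj, path):
    """Write obj to path with pickle, closing the file when done."""
    with Path(path).open("wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object; the file is closed even if unpickling fails."""
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

In the report chunk this could read `best = load_pickle('../results/models/LogisticRegression_classifier_pipeline.pickle')`, with the rest of the chunk unchanged.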

diff --git a/reports/references.bib b/reports/references.bib
@@ -29,10 +29,10 @@ @article{Papazafiropoulou2024
   pages        = {e122--e128}
 }
 
-@misc{national_health_and_nutrition_health_survey_2013-2014_(nhanes)_age_prediction_subset_887,
-  author       = {NA, NA},
-  title        = {{National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset}},
+@misc{NHANES2019,
+  author       = {NHANES},
   year         = {2019},
+  title        = {National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset},
   howpublished = {UCI Machine Learning Repository},
   note         = {{DOI}: https://doi.org/10.24432/C5BS66}
 }
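
Review note on the citation-key change in this file: the old key contains parentheses, which Pandoc's bare `@key` syntax does not accept inside a key, so that is a plausible cause of the reference error (an inference, not something the commit states). Below is a rough sketch of a pre-render check that every `@key` cited in the `.qmd` has an entry in `references.bib`. The function names, the regexes, and the cross-reference prefix list are assumptions for illustration and only approximate Pandoc's and Quarto's rules; for instance, `@` decorators inside Python chunks would show up as false positives.

```python
import re

# Quarto cross-references share the '@' syntax with citations. This prefix
# list is an assumption and may need extending for a given project.
CROSSREF_PREFIXES = ("fig-", "tbl-", "sec-", "eq-", "lst-")

# Approximation of Pandoc's bare citation-key rule: runs of alphanumerics or
# underscores, joined by single internal punctuation characters.
CITATION_RE = re.compile(
    r"(?<!\w)@([A-Za-z0-9_]+(?:[:.#$%&\-+?<>~/][A-Za-z0-9_]+)*)"
)

# Captures the key in BibTeX entry headers such as '@misc{NHANES2019,'.
BIB_ENTRY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(qmd_text):
    """Return citation keys used in the text, excluding cross-references."""
    keys = set(CITATION_RE.findall(qmd_text))
    return {k for k in keys if not k.startswith(CROSSREF_PREFIXES)}


def bib_keys(bib_text):
    """Return the entry keys declared in BibTeX source."""
    return set(BIB_ENTRY_RE.findall(bib_text))


def missing_citations(qmd_text, bib_text):
    """Return cited keys with no matching BibTeX entry, sorted."""
    return sorted(cited_keys(qmd_text) - bib_keys(bib_text))
```

To use it, read `reports/age_group_classification.qmd` and `reports/references.bib` as text and pass both strings to `missing_citations`; an empty list means every citation resolved under these approximate rules.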
diff --git a/results/figures/Confusion_matrix.png b/results/figures/Confusion_matrix.png
diff --git a/results/figures/ROC.png b/results/figures/ROC.png
diff --git a/results/figures/eda_histogram.png b/results/figures/eda_histogram.png
diff --git a/results/models/LogisticRegression_classifier_pipeline.pickle b/results/models/LogisticRegression_classifier_pipeline.pickle