Commit

Merge pull request #90 from UBC-MDS/final-checks
fixed errors in readme bash commands, reference error in qmd, table title in mean cv scores
mdahewlett authored Dec 8, 2024
2 parents 6cdd5bd + 2310a49 commit 1bd43e7
Showing 10 changed files with 14 additions and 23 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -61,9 +61,10 @@ python scripts/fit_classifier.py \
--x_training_data=data/processed/X_train.csv \
--y_training_data=data/processed/y_train.csv \
--pipeline_to=results/models \
--preprocessor_to=results/models \
--results_to=results/tables

python scripts/evaluation.py \
python scripts/evaluate.py \
--x_test_data=data/processed/X_test.csv \
--y_test_data=data/processed/y_test.csv \
--pipeline_from=results/models/LogisticRegression_classifier_pipeline.pickle \
@@ -72,11 +73,9 @@ python scripts/evaluation.py \

```

3. To run the analysis, open src/age_group_classification.ipynb in Jupyter Lab you just launched and under the "Kernel" menu click "Restart Kernel and Run All Cells..."

### Clean up

1. To shut down the container and clean up the resources, type `Cntrl` + `C` in the terminal where you launched the container, and then type `docker compose rm`
1. To shut down the container and clean up the resources, type `Ctrl` + `C` in the terminal where you launched the container, and then type `docker compose rm`

## Developer notes

@@ -85,15 +84,15 @@ python scripts/evaluation.py \
- `conda-lock` (version 2.5.7 or higher)

### Adding a new dependency
1. Add the dependency to the environment.yml file on a new branch.
1. Add the dependency to the `environment.yml` file on a new branch.

2. Run conda-lock -k explicit --file environment.yml -p linux-64 to update the conda-linux-64.lock file.
2. Run `conda-lock -k explicit --file environment.yml -p linux-64` to update the conda-linux-64.lock file.

3. Re-build the Docker image locally to ensure it builds and runs properly.

4. Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA for the commit that changed the file.

5. Update the docker-compose.yml file on your branch to use the new container image (make sure to update the tag specifically).
5. Update the `docker-compose.yml` file on your branch to use the new container image (make sure to update the tag specifically).

6. Send a pull request to merge the changes into the main branch.

2 changes: 1 addition & 1 deletion logs/validation_errors.log
@@ -1,4 +1,4 @@
2024-12-04 15:20:05,096 -
2024-12-08 16:45:22,956 -
{
"SCHEMA": {
"COLUMN_NOT_IN_SCHEMA": [
2 changes: 1 addition & 1 deletion notebooks/age_group_classification.ipynb
@@ -1776,7 +1776,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:522_fmj] *",
"display_name": "Python [conda env:522_fmj]",
"language": "python",
"name": "conda-env-522_fmj-py"
},
Binary file modified reports/age_group_classification.pdf
Binary file not shown.
14 changes: 3 additions & 11 deletions reports/age_group_classification.qmd
@@ -31,7 +31,7 @@ While taking care of elders is a core value of many cultures, this is not a hall

Formally, the question this project seeks to answer is: Can information about the health and nutritional status of Americans be used to predict whether they are adults or seniors?

The dataset used to answer this question is the National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset (@national_health_and_nutrition_health_survey_2013-2014_(nhanes)_age_prediction_subset_887). It was originally prepared for a research paper on predicting diabetes and cardiovascular disease in patients (@DinhMiertschin2016 and @MukhtarAzwari2021). The dataset's stated purpose was to assess the health and nutritional status of adults and children in the United States (@Papazafiropoulou2024), however respondents were classified as either Adults (respondents under 65 years of age) or Seniors (respondents 65 years of age or older). Respondents were located in the United States and provided data through interviews, physical examinations, and laboratory tests to the National Center for Health Statistics (NCHS) (part of the Centers for Disease Control and Prevention (CDC)).
The dataset used to answer this question is the National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset (@NHANES2019). It was originally prepared for a research paper on predicting diabetes and cardiovascular disease in patients (@DinhMiertschin2016 and @MukhtarAzwari2021). The dataset's stated purpose was to assess the health and nutritional status of adults and children in the United States (@Papazafiropoulou2024), however respondents were classified as either Adults (respondents under 65 years of age) or Seniors (respondents 65 years of age or older). Respondents were located in the United States and provided data through interviews, physical examinations, and laboratory tests to the National Center for Health Statistics (NCHS) (part of the Centers for Disease Control and Prevention (CDC)).

The dataset has 10 variables and 2278 rows, with each row representing a respondent. The variables are:

@@ -119,26 +119,20 @@ We one-hot encoded categorical features (gender, physical_activity, and diabetic

We compared a dummy classifier, logistic regression, and SVC model by mean cross validation score. The cross validation scores for each are below.

#### Mean Cross Validation Score for all 3 models

```{python}
#| label: tbl-cv-dummy
#| tbl-cap: Dummy classifier cross validation scores
#| tbl-cap: Mean cross validation scores
# FORGIVE TO ADD FILE NAME THEN DELETE THIS LINE
results = pd.read_csv('../results/tables/model_cv_score.csv')
Markdown(results.to_markdown())
```

### Testing Best Model on Test Data

Since Logistic Regression had the best mean Cross Validation score, we selected it as our final model.
Since logistic regression had the best mean cross validation score, we selected it as our final model.

```{python}
# TODO: FORGIVE TO ADD FILE NAME, UNCOMMENT CODE BELOW, THEN DELETE THIS LINE AND THE ONE BELOW IT
test_score = 'PLACEHOLDER'
best = pickle.load(open('../results/models/LogisticRegression_classifier_pipeline.pickle', 'rb'))
X_test = pd.read_csv('../data/processed/X_test.csv')
@@ -154,12 +148,10 @@ The model's accuracy on test data was `{python} test_score`.

![Confusion matrix of the best model on test data](../results/figures/Confusion_matrix.png){#fig-confusion-matrix}


The confusion matrix (@fig-confusion-matrix) showed that while the model score is `{python} test_score`, it did very poorly at recall and quite poorly at precision.
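
As a rough illustration of how a model can score well on accuracy while doing poorly on precision and recall, the sketch below derives all three metrics from confusion-matrix counts. The counts are made-up numbers for an imbalanced test set, not the report's results, and the function name `summarize_confusion` is hypothetical.

```python
def summarize_confusion(tp, fp, fn, tn):
    """Return accuracy, precision and recall for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts (assumption): positive class = "Senior", heavily outnumbered.
metrics = summarize_confusion(tp=5, fp=5, fn=65, tn=425)
```

With these assumed counts, most of the correct predictions come from the majority "Adult" class, so accuracy stays high even though few seniors are recovered.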

![ROC curve of the best model on test data](../results/figures/ROC.png){#fig-roc}


This performance was reflected in the ROC curve above (@fig-roc). While it could differentiate the positive class "Senior" from the negative class to some extent, the model struggled to achieve both high true positive rates and low false positive rates.

## Discussion
6 changes: 3 additions & 3 deletions reports/references.bib
@@ -29,10 +29,10 @@ @article{Papazafiropoulou2024
pages = {e122--e128}
}

@misc{national_health_and_nutrition_health_survey_2013-2014_(nhanes)_age_prediction_subset_887,
author = {NA, NA},
title = {{National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset}},
@misc{NHANES2019,
author = {NHANES},
year = {2019},
title = {National Health and Nutrition Health Survey 2013-2014 (NHANES) Age Prediction Subset},
howpublished = {UCI Machine Learning Repository},
note = {{DOI}: https://doi.org/10.24432/C5BS66}
}
Binary file modified results/figures/Confusion_matrix.png
Binary file modified results/figures/ROC.png
Binary file modified results/figures/eda_histogram.png
Binary file modified results/models/LogisticRegression_classifier_pipeline.pickle
Binary file not shown.
