- What did we get and where is it?
- What did we do to it?
- How did we load it into the DB?
- What additional tables did we create?
- What indices did we create?
- Where are the files, and how do we run the ETL again?
- Biopsy data load:
python load_biopsy_data.py ../../../../misc/notes_labeled_11DEC23.csv biopsy_labels postgres://username:password@nashdb.dssg.io:5432/popsdb
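A minimal sketch of what a loader like load_biopsy_data.py might do, judging only from the argument order above (CSV path, target table, database URL); pandas and SQLAlchemy are assumptions here, not necessarily what the script actually uses:

```python
# Hypothetical sketch of a CSV -> Postgres loader; pandas and SQLAlchemy are assumed
# and may not match what load_biopsy_data.py actually does.
import sys

import pandas as pd
from sqlalchemy import create_engine


def main(csv_path: str, table_name: str, db_url: str) -> None:
    df = pd.read_csv(csv_path)  # labeled biopsy notes
    # Note: newer SQLAlchemy versions expect the "postgresql://" scheme rather than "postgres://".
    engine = create_engine(db_url)
    df.to_sql(table_name, engine, if_exists="replace", index=False)


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```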
The raw data were given to us as 9 CSV files. We followed the steps below to load and clean the data used in the analysis.
Step 1: Copy CSVs to the database as-is. We created 9 tables (one for each CSV file) in a database schema named staging and copied the contents of the CSVs into their respective tables without any changes. To preserve the data in its rawest form, we keep all column datatypes as VARCHAR/TEXT.
- We use a Python script to programmatically create the staging tables from the structure of each CSV - etl/creating_staging_tables.py (see the sketch below)
- Then, we use a SQL script (run with psql) to copy the CSV contents into the staging tables - etl/upload_csvs_to_db.sql
$ python etl/creating_staging_tables.py
$ psql -f etl/upload_csvs_to_db.sql
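A minimal, illustrative sketch of the idea behind etl/creating_staging_tables.py, assuming the column names come from each CSV header and every column is typed as TEXT (the real script may differ):

```python
# Illustrative only: build CREATE TABLE DDL for the staging schema from a CSV header,
# keeping every column as TEXT so the raw values are preserved untouched.
import csv
from pathlib import Path


def staging_ddl(csv_path: str, schema: str = "staging") -> str:
    table = Path(csv_path).stem.lower()
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    cols = ",\n    ".join(f'"{col.strip().lower()}" TEXT' for col in header)
    return f'CREATE TABLE {schema}."{table}" (\n    {cols}\n);'


if __name__ == "__main__":
    # "encounters.csv" is a made-up file name for illustration.
    print(staging_ddl("encounters.csv"))
```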
Step 2: Add appropriate data types. Once the data is uploaded to the DB, we assign appropriate datatypes to the columns of the data tables and create an intermediate schema named raw.
- We use a SQL script to create and populate the tables (from staging) with the appropriate data types - etl/raw/raw_tables.sql
$ psql -f etl/raw/raw_tables.sql
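The kind of casting that raw_tables.sql performs might look like the sketch below; the table and column names are made up for illustration, and in practice the psql command above does this work:

```python
# Illustrative only: cast TEXT columns from staging into typed columns in raw.
# Table and column names are hypothetical; the real DDL lives in etl/raw/raw_tables.sql.
import psycopg2

RAW_DDL = """
CREATE SCHEMA IF NOT EXISTS raw;
CREATE TABLE raw.encounters AS
SELECT
    patient_id,                                -- keep the alphanumeric ID as-is for now
    encounter_date::DATE AS encounter_date,    -- date strings -> DATE
    bmi::NUMERIC         AS bmi                -- numeric strings -> NUMERIC
FROM staging.encounters;
"""

if __name__ == "__main__":
    # Placeholder DSN; fill in real credentials.
    with psycopg2.connect("postgresql://username:password@nashdb.dssg.io:5432/popsdb") as conn:
        with conn.cursor() as cur:
            cur.execute(RAW_DDL)
```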
Step 3: Clean data with integer IDs and indexes. Our modeling pipeline requires patients in the cohorts to be identified by integer IDs, but the patient ID included in the raw data is an alphanumeric string. In this stage, we create a table that maps the alphanumeric IDs to integers and add that integer ID to all the data tables. Further, we index the data tables on the appropriate columns to improve the performance of our SQL queries. The cleaned and indexed tables are stored in the clean schema. This schema is used as the base for all analyses.
- SQL script for creating the integer ID mapping - etl/create_entity_id_mapping.sql (see the sketch below)
- SQL script for creating the clean schema tables with indexes and integer IDs - etl/clean/create_clean_tables.sql
$ psql -f etl/create_entity_id_mapping.sql
$ psql -f etl/clean/create_clean_tables.sql
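One common way to build such a mapping is to assign a generated integer to each distinct alphanumeric patient ID. A hypothetical sketch of what etl/create_entity_id_mapping.sql might do (all table and column names below are assumptions):

```python
# Illustrative sketch of the alphanumeric -> integer ID mapping; the real SQL lives in
# etl/create_entity_id_mapping.sql and the names below are assumptions.
import psycopg2

MAPPING_SQL = """
CREATE SCHEMA IF NOT EXISTS clean;

-- One row per distinct alphanumeric patient ID, with a generated integer entity_id.
CREATE TABLE clean.entity_id_mapping AS
SELECT
    ROW_NUMBER() OVER (ORDER BY patient_id)::INT AS entity_id,
    patient_id
FROM (SELECT DISTINCT patient_id FROM raw.encounters) AS ids;

-- Index both keys so the mapping joins cheaply onto every data table.
CREATE INDEX ON clean.entity_id_mapping (patient_id);
CREATE INDEX ON clean.entity_id_mapping (entity_id);
"""

if __name__ == "__main__":
    # Placeholder DSN; fill in real credentials.
    with psycopg2.connect("postgresql://username:password@nashdb.dssg.io:5432/popsdb") as conn:
        with conn.cursor() as cur:
            cur.execute(MAPPING_SQL)
```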
Step 4: Create helper tables for data analyses. We create a set of tables based on the data in the clean schema to support our analyses. The tables and the scripts used to create them are listed below. All the scripts are in the etl/helper/ folder.
- Exclusion dates for patients in cohorts - first_exclusion_date.sql
- BMI tables - create_bmi_tables.sql
- FIB-4 tables - create_fib4_table.sql (see the sketch after this list)
- Prescription helper table - create_prescribing_helper_table.py
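For reference, FIB-4 combines age, AST, ALT, and platelet count. A minimal sketch of the calculation behind create_fib4_table.sql; the formula and units are the standard FIB-4 definition, but how the project's tables actually store these labs is an assumption:

```python
# FIB-4 = (age [years] * AST [U/L]) / (platelets [10^9/L] * sqrt(ALT [U/L])).
# Illustrative only; the project's actual logic lives in etl/helper/create_fib4_table.sql.
import math


def fib4(age_years: float, ast_u_l: float, alt_u_l: float, platelets_10e9_l: float) -> float:
    return (age_years * ast_u_l) / (platelets_10e9_l * math.sqrt(alt_u_l))


if __name__ == "__main__":
    # Example: a 55-year-old with AST 40 U/L, ALT 35 U/L, platelets 220 x 10^9/L.
    print(round(fib4(55, 40, 35, 220), 2))  # ~1.69
```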
- queries
- notebooks
- run.py / run_pieces.py (see the sketch after this list)
python run.py -c ../triage_config_files/c3y_l3y_upd1y_asof1y_incalc_12m_all_features -f ../triage_config_files/feature_groups_12m_all_only/
python run_pieces.py -c ../triage_config_files/c3y_l3y_upd1y_asof1y_incalc_12m_all_features -f ../triage_config_files/feature_groups_12m_all_only/
- config files
- save predictions: python save_predictions.py -c predictions.yaml -d db.yaml
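The run.py / run_pieces.py commands above drive triage experiments from the -c experiment config directory and the -f feature-group directory. The sketch below shows the general shape of running a triage experiment; the config filename, project path, and the choice of SingleThreadedExperiment are assumptions about what run.py does, not a description of it:

```python
# Rough sketch of a triage experiment run; run.py's actual wiring may differ.
import yaml
from sqlalchemy import create_engine
from triage.experiments import SingleThreadedExperiment

if __name__ == "__main__":
    # Hypothetical config path for illustration.
    with open("../triage_config_files/experiment_config.yaml") as f:
        experiment_config = yaml.safe_load(f)

    experiment = SingleThreadedExperiment(
        config=experiment_config,
        db_engine=create_engine("postgresql://username:password@nashdb.dssg.io:5432/popsdb"),
        project_path="/path/to/matrices_and_models",  # where matrices and trained models go
    )
    experiment.run()
```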
Manual modeling files - model runs, results
python manual_modeling.py --train_matrix f258357f1f7b3f855d16963117c1ed8c --test_matrix 20793ed2d7e5b00be7d2f57e78efef92 -m DT2 --only_use_base_guideline_features_with_age_race;
Bias analysis for manual models
Currently lives in a notebook: notebooks/manual_modeling_predictions.ipynb.
TODO - This needs better documentation, and probably should move to a .py script
Guideline Model Creation and Results
We use the triage-built features to calculate the guideline models, via the SQL file baselines/guidelines_table_using_features_fixed.sql. Whether each patient has met the guideline conditions is stored in analysis.guideline_conditions_past_all_fixed.
The evaluations of the different guideline combination models are calculated using baselines/run_guideline_performance.py. The evaluation results are stored in analysis.guideline_condition_results_fixed.
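As an illustration of the kind of evaluation run_guideline_performance.py produces, the sketch below computes precision and recall for a single binary guideline flag against the label; the label table, the join key, and the column names are assumptions:

```python
# Illustrative evaluation of one guideline "model" (a binary flag) against the label.
# Table and column names are assumptions; the real logic is in baselines/run_guideline_performance.py.
import pandas as pd
from sqlalchemy import create_engine

QUERY = """
SELECT g.entity_id, g.meets_guideline, l.label
FROM analysis.guideline_conditions_past_all_fixed AS g
JOIN labels AS l USING (entity_id)
"""


def precision_recall(df: pd.DataFrame) -> tuple:
    flagged = df[df["meets_guideline"] == 1]
    precision = flagged["label"].mean()                  # share of flagged patients who are positive
    recall = flagged["label"].sum() / df["label"].sum()  # share of positives who are flagged
    return precision, recall


if __name__ == "__main__":
    engine = create_engine("postgresql://username:password@nashdb.dssg.io:5432/popsdb")
    print(precision_recall(pd.read_sql(QUERY, engine)))
```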
TODO - Remove the deprecated SQL and the related database table, then update the name of the _fixed file and the related DB table.
Bias analysis of guideline models
Currently the code lives in the notebook at notebooks/guideline_features_bias.ipynb. The results are saved to analysis.guideline_models_bias.
TODO - The current bias analysis is based on the old guidelines table, so we need to: (1) convert the notebook into a .py script, (2) re-run it for the 'fixed' version of the guidelines and store the results in the database.
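A minimal sketch of the group-level check such a bias analysis typically performs: recall of the guideline flag broken out by a demographic attribute. All column names and the tiny example data are made up for illustration:

```python
# Illustrative bias check: recall (TPR) of a guideline flag by a demographic attribute.
# Column names (race, label, meets_guideline) are assumptions, not the project's actual schema.
import pandas as pd


def recall_by_group(df: pd.DataFrame, group_col: str = "race") -> pd.Series:
    # For each group: among true positives (label == 1), what fraction was flagged?
    positives = df[df["label"] == 1]
    return positives.groupby(group_col)["meets_guideline"].mean()


if __name__ == "__main__":
    # Tiny synthetic example purely to show the shape of the output.
    df = pd.DataFrame({
        "race": ["A", "A", "B", "B", "B"],
        "label": [1, 1, 1, 1, 0],
        "meets_guideline": [1, 0, 1, 1, 0],
    })
    print(recall_by_group(df))  # recall per group; large gaps indicate potential disparity
```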
- notebooks
- tableau
- on drive
- in repo
- loaded data into clean schema
- created augmented columns (CCSR category), helper tables, and pre-computed "most recent"-type features.
- modeling
- all features