DEEP uses the XGBoost algorithm to identify air toxic combinations associated with health outcomes. The combinations identified using XGBoost were then adjusted for potential confounders to identify early-life multi-air toxic combinations. Our approach identified several combinations of air toxics associated with asthma in Y.C.Li et al..
Full citation:
Li YC, Hsu HL, Chun Y, Chiu PH, Arditi Z, Claudio L, Pandey G, Bunyavanich S. Machine learning-driven identification of early-life air toxic combinations associated with childhood asthma outcomes. J Clin Invest. 2021 Oct 5:e152088. doi: 10.1172/JCI152088. Epub ahead of print. PMID: 34609967.
python==3.7.4
scikit-learn==0.22
xgboost==1.2.0
numpy==1.19.5
pandas==0.25.3
scipy==1.3.1
statsmodels==0.10.1
matplotlib==3.1.0
pydotplus==2.0.2
argparse==1.1
In addition to generate the figure used in our paper, the following R packages are required:
sas7bdat==0.5
metaviz==0.3.1
Step 0: Prepare your data in csv under data
directory, the example of data format is provided in this repository.
Step 1: Applying DEEP to the desired outcome(s), the results will be generated to result_dir
(default = ./result).
python deep_main.py --filename [outcome.csv] --outcome [outcome]
Parameters of deep_main.py
-h, --help show this help message and exit
--outcome OUTCOME, -o OUTCOME
Name of outcome
--filename FILENAME, -p FILENAME
Path of csv data
--binary_outcome BINARY_OUTCOME, -b BINARY_OUTCOME
True if the outcome is labeled in binary, default=True
--result_dir RESULT_DIR, -r RESULT_DIR
desired result directory prefix, default="./result"
--num_tree_print NUM_TREE_PRINT, -ntp NUM_TREE_PRINT
Number of trees for each path printed to result
directory, default = -1 implies printing all of them
Step 2: Merging all the result files from multiple outcomes, and then performing the FDR correction
python merge_multiple_outcomes.py --result_dir [result_directory]
Printing Figure 2 & 3 (Table of False discovery rate (FDR), Odds ratios (OR) of individual pollutants and combinations which are significantly associated with the outcomes):
python profile_table.py --result_dir [result_directory]
Taking the sample data as example, the whole pipeline will be:
python deep_main.py --filename DailyControllerMedication.csv --outcome DailyControllerMedication
python deep_main.py --filename EmergencyRoomVisit.csv --outcome EmergencyRoomVisit
python deep_main.py --filename OvernightHospitalization.csv --outcome OvernightHospitalization
python merge_multiple_outcomes.py --result_dir ./result
python profile_table.py --result_dir ./result
Result:
Due to IRB constraints, we are unable to publicly share the asthma cohort dataset used in our study. Thus, as sample data to test our code, we are providing a randomized dataset based on the original data distribution in this repository. We included the results generated by the randomized data in result
directory, including the sample trees generated from XGBoost and the tables of profiles with their false discovery rates, odds ratios (figure 2 and 3 in our article).
-
In the
deep_main.py
script, it performs the DEEP extraction on a single health outcome, including:- Frequent profile extraction via 100 runs of XGBoost
- Statistical assessment of the frequent profiles with potential confounders.
-
After running DEEP on the multiple health outcomes,
merge_multiple_outcomes.py
merges the results in
1 and performs the FDR correction.
profile_table.py
filters out the individual air toxic and combination of air toxic which are positively associated with the health outcome.profile_table.py
also callsRscript\profile_table.R
to visualize the result, as the figure 2 and 3 in our article.