Enhancing functional interpretability in gene expression data analysis by prior-knowledge incorporation
We developed an integrative approach to feature selection that combines weighted LASSO feature selection and prior biological knowledge in a single step by means of a novel score of biological relevance that summarizes information extracted from popular biological knowledge bases.
We compared the performance of the standard regularized LASSO model and our proposed approach on two use cases concerning cancer subtype prediction from gene expression data: the classification of Breast Invasive Carcinoma (BRCA) patients and of Colorectal Cancer (CRC) patients into their corresponding cancer subtypes. We also performed two distinct sensitivity analyses to evaluate the impact of incorporating our proposed score of biological relevance into LASSO regularization. For these analyses, we used a controlled dataset with limited correlation among the features, based on publicly available RNA-seq profiles of Kidney Renal Clear Cell Carcinoma patients from The Cancer Genome Atlas (TCGA) project. The preprocessed dataset is available here, along with the list of features considered in the controlled dataset. All datasets analysed were preprocessed as described in the notebooks found here.
To perform the GIS-weighted LASSO in Python using the scikit-learn library, we modified the corresponding functions using the development version of scikit-learn. The modified package is available here. After cloning the repository, build a dedicated environment with:
conda create -n sklearn-env -c conda-forge python=3.9 numpy scipy cython=0.29.33
conda activate sklearn-env
Then, build the scikit-learn package with:
cd scikit-learn-lasso
pip install -v --no-use-pep517 --no-build-isolation -e .
Lastly, install the required packages listed in requirements.txt.
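For readers who want to see the idea behind a per-feature weighted LASSO penalty without building the modified package, the following is a minimal sketch using the stock scikit-learn `Lasso` estimator. It is not the implementation in the modified package: it relies on the standard equivalence between a weighted L1 penalty and an ordinary LASSO fitted on rescaled columns. The function name `weighted_lasso`, the synthetic data, and the penalty weights are all hypothetical, and the sketch assumes the weights are already given as strictly positive numbers (how they would be derived from the score of biological relevance is not shown here). It also uses a regression objective for simplicity, whereas the use cases above are classification tasks.

```python
import numpy as np
from sklearn.linear_model import Lasso


def weighted_lasso(X, y, penalty_weights, alpha=0.1):
    """Sketch: minimise (1/2n)||y - Xb||^2 + alpha * sum_j w_j * |b_j|.

    Uses the rescaling trick: dividing column j by w_j and fitting a
    standard LASSO gives coefficients c_j, and b_j = c_j / w_j.
    """
    w = np.asarray(penalty_weights, dtype=float)
    if np.any(w <= 0):
        raise ValueError("penalty weights must be strictly positive")
    X_scaled = X / w  # broadcasting divides column j by w[j]
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(X_scaled, y)
    # Map coefficients back to the original feature scale.
    return model.coef_ / w, model.intercept_


# Synthetic data (hypothetical): two informative features out of five.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Uniform weights reduce to the standard LASSO.
coef_uniform, _ = weighted_lasso(X, y, np.ones(5))

# A large weight on feature 1 penalises it heavily (assumed here to
# represent a feature with low prior relevance).
weights = np.array([1.0, 100.0, 1.0, 1.0, 1.0])
coef_weighted, _ = weighted_lasso(X, y, weights)

print("uniform weights: ", np.round(coef_uniform, 3))
print("weighted penalty:", np.round(coef_weighted, 3))
```

In this sketch a larger weight means a stronger penalty, so a heavily weighted feature is pushed out of the model even when it carries signal, while features with small weights are retained more easily.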
We computed the score of biological relevance using specific versions of the knowledge bases, which can be found here. These versions are the following:
- GO (format-version 1.2, release date: 2023-03-06)
- Reactome (version V85)
- HPO (format-version 1.2, release date: 2023-09-01)
To download and use the updated versions, run this notebook. We used the biomaRt R library (version 2.56.1) to extract all available GO annotation terms. To obtain updated GO annotation terms for each gene, run the R script here.
All the code is available here. To replicate the experiments, run the following scripts:
- experiments_BRCA.py for experiments on the BRCA dataset
- experiments_CRC.py for experiments on the CRC dataset
The code to replicate the two sensitivity analyses is available here. All the results from the experiments we performed can be found here.