3DPredictor
The paper describing 3DPredictor is out in Genome Research:
https://genome.cshlp.org/content/early/2019/11/27/gr.249367.119.abstract
We are actively upgrading our 3DPredictor tool. If you want to reproduce the results of the paper, please use commit de22c741325a44385aad184634ed6506a465ab62.
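Assuming you have already cloned this repository, you can switch to that commit with:
git checkout de22c741325a44385aad184634ed6506a465ab62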
Dependencies:
- To train/validate/use models: python3 (we used 3.5, although any version above 3.4 should work) with: numpy, pandas, dicttoxml, termcolor, sklearn/xgboost, swifter
- To calculate SCC: R >= 3.2
- To dump/export contacts in .hic format: java8 and juicer_tools.jar from https://github.com/aidenlab/juicer (also provided in this repo)
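The Python packages can typically be installed with pip (package names below are the PyPI distributions; sklearn is published as scikit-learn):
python3 -m pip install numpy pandas dicttoxml termcolor scikit-learn xgboost swifter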
Quick demo reproducing main paper results
- Run 'example_generate_data_K562_train' and '_test', specifying the chromosomes for training and the source of training/validation information (contacts or observed/expected ratios):
#Training data for chrms 1,3,5,7,9
python example_generate_data_K562_train.py 1,3,5,7,9 contacts.gz
#Validation data for chr 18
python example_generate_data_K562_test.py 18 contacts.gz
There will be a lot of technical info in the output. Pay attention to the names of the generated files, e.g.:
DataGenerator: Writing data to file output/K562/chr1_chr3_chr5_chr7_chr9training.RandOncontacts.gz.False.11.1500000.50001.1.1.cont_with_CTCF666406.25000
DataGenerator: Writing data to file output/K562/Interval_chr18_0_78000000validatingOrient.contacts.gz.False.11.1500000.50001.505429.all_cont.25000.txt
- Insert the names of the generated files into 'example_train_and_validate_K562.py':
# set all training files
training_files = [
"output/K562/chr1_chr3_chr5_chr7_chr9training.RandOncontacts.gz.False.11.1500000.50001.1.1.cont_with_CTCF666406.25000",
]
#....
# set all validation files
validation_files = [
"output/K562/Interval_chr18_0_78000000validatingOrient.contacts.gz.False.11.1500000.50001.505429.all_cont.25000.txt",
]
- Run 'example_train_and_validate_K562.py':
python example_train_and_validate_K562.py
How to use the code
0. Prepare your data
We use epigenetic data from ENCODE, but one can use any source of data to train the model. If you are using formats other than standard .bed files, you may want to write your own "reader" classes, which are responsible for data parsing (see below).
One can use the juicer_tools dump command to obtain the Hi-C contacts required for training. We also suggest normalizing contact counts between experiments (otherwise you can't compare models trained using data from different Hi-C experiments). The normalization coefficient can be obtained by running NormCoef.py on a particular contacts dataset and is used later when creating the contacts_reader object.
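For example, to dump KR-normalized contacts of chr18 at 25 kb resolution (the .hic file and output names here are placeholders):
java -jar juicer_tools.jar dump observed KR input.hic 18 18 BP 25000 chr18_contacts.txt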
The model itself contains two major modules:
1. Data Generation module
The module GenerateData_K562.py loads external files (i.e. files with known contact frequencies, ChIP-seq, E1, etc.) and builds a dataset with predictor values for each contact.
The architecture of the data generation module is as follows (see the sketch after this list):
- A reader object is responsible for parsing specific data (e.g. ChIP-seq) and storing it in pandas dataframe format.
- A predictor_generator object uses a reader object as a proxy to access the data; based on the epigenetic data provided by the reader and the coordinates of a pair of loci, it generates specific predictors (in other words, it performs parametrization).
- A DataGenerator object uses one contact_reader object and a list of predictor_generator objects (each linked to its own reader object) to generate predictors for each contact accessible through the contact_reader object.
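A minimal sketch of this architecture, with hypothetical class and method names (the actual classes in this repo differ in details):

import pandas as pd

class BedReader: # hypothetical reader: parses a .bed-like file into a dataframe
    def __init__(self, fname):
        self.data = pd.read_csv(fname, sep="\t", names=["chr", "start", "end", "sigVal"])
    def get_interval(self, chrom, start, end): # peaks overlapping [start, end)
        d = self.data
        return d[(d["chr"] == chrom) & (d["start"] < end) & (d["end"] > start)]

class SitesCountPG: # hypothetical predictor_generator: counts peaks between two loci
    def __init__(self, reader):
        self.reader = reader
    def get_predictors(self, chrom, st, en):
        return [len(self.reader.get_interval(chrom, st, en))]

def generate_data(contacts, pgs): # DataGenerator-like loop: one row per contact
    rows = []
    for c in contacts.itertuples(): # contacts columns: chr, contact_st, contact_en, contact_count
        row = [c.contact_count]
        for pg in pgs:
            row += pg.get_predictors(c.chr, c.contact_st, c.contact_en)
        rows.append(row)
    return pd.DataFrame(rows)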
Most predictor_generator classes do not support vectorized operation, which means that they accept one pair of loci and return the predictors for this pair. When you have multiple loci, it is much more computationally efficient to accept all of them at once as a list (or Series, or array) and calculate predictors using pandas and numpy vector operations. As for now, we have such implementations only for several predictors (those are available in VectPredictorGenerators.py). Contributions are welcome =)
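To illustrate the difference with a toy distance predictor (not one of the repo's actual generators):

import numpy as np
import pandas as pd

pairs = pd.DataFrame({"contact_st": [100000, 200000, 300000],
                      "contact_en": [400000, 900000, 350000]})

# per-pair style: one call per pair of loci
def distance_predictor(st, en):
    return en - st
per_pair = [distance_predictor(st, en) for st, en in zip(pairs["contact_st"], pairs["contact_en"])]

# vectorized style: all pairs at once via pandas/numpy operations
vectorized = (pairs["contact_en"] - pairs["contact_st"]).values
assert np.array_equal(per_pair, vectorized)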
Basic usage:
Set variables:
params.window_size = 25000 #region around contact to be binned for predictors
params.small_window_size = 12500 #region around contact anchors to be considered as cis
params.mindist = 50001 #minimum distance between contacting regions
params.maxdist = params.window_size #max distance between contacting regions
params.maxdist = 3000000 #this second assignment overrides the value above
params.binsize = 20000 #when binning regions with predictors, use this binsize
params.sample_size = 500000 #how many contacts to write to the file
params.conttype = "contacts"
as well as the filenames and genomic intervals of interest (see the code), and run. Data files should be located in
input_folder = "set/path/to/this/variable"
The data files currently used can be downloaded from http://genedev.bionet.nsc.ru/site/hic_out/3DPredictor/
A few sample data files can be downloaded from http://genedev.bionet.nsc.ru/hic_out/3DPredictor/
Note that predictor generation takes ~3 h for 500,000 contacts. One may change the parallelization options by tweaking the code in DataGenerator.py:
n_cpus = multiprocessing.cpu_count() # by default, use all available cores
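For example, to cap the number of worker processes (an illustrative tweak, not an existing option):
n_cpus = min(multiprocessing.cpu_count(), 4) # use at most 4 cores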
An example of generating data for K562 cells is provided in the file GenerateData_K562.py.
2. Training and validation module
Run the module train_and_validate_K562.py to train a model. Set up the following variables before running:
training_files - a list of files with data for training
validation_files - a list of files with data for validation
contact_type - contacts or OE values:
for contact_type,apply_log in zip(["contacts"],[True]):
keep - a list of the predictors to use; each entry is a predictor name, or a single string "all" (to use all available predictors). E.g. with
keep = ["CTCF_W","contact_dist"]
only CTCF sites between the contacting loci and the distance between them will be used.
learning algorithm - you can change this in the Predictor.py module:
def train(self, alg = xgboost.XGBRegressor(n_estimators=100, max_depth=9, subsample=0.7)):
Also set the output folder names for validation results in the Predictor.py module:
dump = True, out_dir = "/mnt/scratch/ws/psbelokopytova/201905031108polinaB/3DPredictor/out/models/"
out_dir = "/mnt/scratch/ws/psbelokopytova/201905031108polinaB/3DPredictor/out/pics/",
Please note that fitting a model is time-consuming, so the fitted model is saved to a file with a name representing the model parameters (predictors and algorithm). It is automatically determined whether such a file exists, and if so, the model is simply loaded without fitting again (to change this behaviour, pass rewriteModel=True when calling the trainAndValidate function).
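For example (the exact signature may differ; check how trainAndValidate is called in train_and_validate_K562.py):
# force refitting even if a saved model file already exists
trainAndValidate(training_files, validation_files, rewriteModel=True)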
Choose validators to estimate the accuracy of the algorithm. There are two main validators: the SCC metric and the plot_juicebox module for visual assessment of the prediction.
Output files:
file name | definition |
---|---|
file.xml | model definition |
featureImportances.png | histogram of feature importances |
featureImportances.txt | list of feature importance scores |
file.ssc.out | file with standard metrics and SCC for prediction |
file.scc | pre-file for SCC counting, formatted as contact_st--contact_end--real_cont_count--predicted_cont_count |
data.hic | predicted heatmap |
control.hic | heatmap with experimental data |
Rearrangements
If you want to predict contacts after a rearrangement, you should first generate the appropriate predictors. You should apply the corresponding rearrangement function to EVERY 'reader' object (see above). There are functions for duplication, deletion and inversion. For example:
params.contacts_reader.delete_region(Interval("chr22", 16064000, 16075000))
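Since the rearrangement must be applied to every reader, a convenient pattern is to loop over all of them (the other_readers list below is a placeholder for your ChIP-seq, E1, etc. readers):
deletion = Interval("chr22", 16064000, 16075000)
for reader in [params.contacts_reader] + other_readers:
    reader.delete_region(deletion)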
To predict the heatmap of contacts in a rearranged genome, use the corresponding validation file and choose the transformation option (currently it works only for deletions):
trained_predictor.validate(validation_file, show_plot = False, cell_type = cell_type,
#use the transformation option if you want to return coordinates after the rearrangement
# transformation =
# [decorate_return_coordinates_after_deletion(return_coordinates_after_deletion, interval=deletion)],
)