NEW feature: RefHiC-SR: Reference panel guided super-resolution inference of Hi-C data introduced in RefHiC v1.1.
We suggest users run RefHiC on GPU.
You can run RefHiC on CPU for TAD/loop annotations, but it is much slower than on GPU. Model training on CPU is almost impossible.
We also provide examples with data for TAD and loop annotations under example folder.
You can find scripts (i.e. .R and .ipynb scripts) and data to reproduce our analysis in the manuscript at https://zenodo.org/record/7011277 .
RefHiC is developed and tested on Linux machines and relies on the following packages:
appdirs==1.4.4 click==8.0.1 cooler==0.8.11 einops h5py importlib_resources>=5.4.0 matplotlib numpy pandas requests setuptools tqdm
RefHiC relies on several libraries including pytorch, scipy, etc. We suggest users using conda to create a virtual environment for it (It should also work without using conda, i.e. with pip). You can run the command snippets below to install RefHiC:
git clone https://github.com/BlanchetteLab/RefHiC.git cd RefHiC conda create --name refhic --file requirements.txt python=3.9 conda activate refhic
Follow https://pytorch.org/get-started/locally/ to install pytorch. It might be the following command depending on your cuda version:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
Install additional library:
pip install torchmetrics pip install -U git+https://github.com/fadel/pytorch_ema pip install tensorboard pip install -U scikit-learn pip install scipy==1.7.3
Install RefHiC:
pip install --editable .
If fail, please try python setup build
and python setup install
first.
The installation requires network access to download libraries. Usually, the installation will finish within 5 minutes. The installation time is longer if network access is slow and/or unstable.
In RefHiC, we used LARS (torchlars) for contrastive pretraining. However, we find torchlars often introduces many compatibility issues. Thus, we removed LARS from RefHiC.
If you want to use LARS for model training, please uncomment line 6 and 10 in pretrain.py and install torchlars as follows,
pip install torchlars
You can also do contrastive pretraining with Adam, but the performance may be slightly worse than with LARS. You could try to increase pre-train epochs for better accuracy.
After RefHiC installation, you need to initialize RefHiC. It loads reference panel and trained models into your local space. You can run the following command:
refhic config init
Then you will be asked to select (1) download the default reference panel or (2) load your own panel. The default one (~3GB) is for hg38 and contains 29 samples. The easiest way to run RefHiC is to load the default panel.
If you want to use the mouse reference panel (mm10), you need to dowmload it from http://repo.cs.mcgill.ca/PUB/blanchem/RefHiC/mouseRefPanel.zip and deploy it by selecting option (2) load your own panel.
Our reference panel and trained models are for data at 5kb resolution only!
-
Loop and TAD prediction:
use CPU: please set --cpu True
use GPU: please leave --cpu and --gpu as default vaules or set --gpu as the index of the GPU that you want to use.
-
Training:
use CPU: Impossible
use GPU: please leave --gpu as default vaules or set --gpu as the index of the GPU that you want to use.
It involves two steps,
- Get a list of candidate loops
refhic loop pred
It outputs candidate loops from both the target (i.e. input contact map) and decoy (i.e. permuted target) It contains tab separated fields as follows:Chr1 Start1 End1 Chr2 Start2 End2 Score IF P2LL Label
Field Detail Chr1/Chr2 chromosome names Start1/Start2 start genomic coordinates End1/End2 end genomic coordinates (i.e. End1=Start1+resol) Score RefHiC's loop score [0~1] IF Interaction frequency P2LL Peak to lower left ratio [for information only, RefHiC does not use it] Label loop candidate is from target/decoy set [will be used in pool] - slect loops from loop candidates
refhic loop pool
It outputs loops detected from the target data by FDR control. You may need to increase alpha to 0.1 or higher for very low coverage or low quality Hi-C data to select more loops. Sadlly, it will also produce more false positive annotations. Same as above but without the Label field (all loops are from the target).
TAD annotation is simply, you can run
refhic tad predFor FDR control, you can to increase alpha to 0.1 or higher for very low coverage or low quality Hi-C data to select more TAD boundaries. Sadlly, you will see more false positive annotations.
output file 1. Boundary scores:
RefHiC outputs 7 columns tab separated fields (without header) to a file (named user_specific_prefix.bed) for TAD annotation. Each row contains genomic coordinate, boundary scores and 0/1 fields to indicate boundary annotation:
Chr Start End LeftScore LeftBoundary RightScore RigihtBoundary
Field | Detail |
---|---|
Chr | chromosome name |
Start | start genomic coordinate |
End | end genomic coordinate (i.e. End=Start+resol) |
LeftScore/RightScore | boundary scores [-1~1] |
LeftBoundary/RigihtBoundary | boundary annotations [1: boundary, 0: non-boundary] |
output file 2. TAD:
a bed file contains TADs (1 TAD per line)
It is hard to evaluate TAD or loop annotations in the real experiments as we don't have ground truth. You can use the pileup option for a quick check
refhic util pileup
It will produce a pileup image for given foci (i.e. predicted loops or TAD boundaries). You can turn on p2ll option to compute P2LL for the averaged contact pairs. It is helpful for loop analysis.
Let's assume you already have a group of Hi-C contact maps in .mcool format. If your data is in other format, you can use try cooler (https://cooler.readthedocs.io/) to convert data into .mcool. Alternatively, you can use 4dn pipeline (https://github.com/4dn-dcic/docker-4dn-hic) to process your fastq data. To create your own reference panel, you need to:
- Convert all .mcool files to .bcool files (band matrix based cooler format). .bcool is our own format. We discard useless interactions (i.e. contact distance >3MB) to save space:
refhic util cool2bcool [OPTIONS] COOL BCOOL
- create a meta.csv file for your reference panel [header name: sample, file]:
sample,file prostate,HIC00001.hg38.mapq_10.1000.bcool Human embryonic kidney,HIC00002.hg38.mapq_10.1000.bcool ...
- compress all .bcool files and meta.csv into a zip file.
- Load reference panel by run,
refhic config init
select (2) load your own panel.
NB: We believe you can still use our trained models with your own reference panel as we used several approaches to train RefHiC to be robust to referenced samples. But, the performance might be worse than with the default panel. You may need to retrain a model with your own reference panel.
The core steps are get labelled data and study Hi-C contact maps.
- Labelled data: Similar to our manuscript, you have at least two options. (1) Whenever orthogonal data existed (i.e. ChIA-PET, HiCHIP, etc.), use them to get positive cases. For negative cases please refer to our paper. (2) If you don't have high quality orthogonal data, you may get labels by applying conventional tools on the most high quality contact map that you have to get labels.
- contact maps: One significant contribution of RefHiC is the introducing of data augmentation by downsampling technique. So you need to downsample your original contact map into a series of downsamplings. You can do it by applying FANC on .mcool or use shuf to shuffle contact pair list and run 4DN pipeline to create .mcool for downsamplings.
- run the following command to create training dataset, you will be asked to provide labels (i.e. foci) and comma separated downsampled .mcool files:
refhic util traindata
You wiil be aksed to select window size. We found win=21 (i.e. w=10) for loop and win=41 (i.e. w=20) for TAD work well for 5kb data. - train loop model:
refhic loop train
train TAD modelrefhic tad train
labelled data (foci) format:chr1 bin1_start bin1_end chr2 bin2_start bin2_end class_1 <class_2 ...>
You could have more than one label for each training case (i.e. TAD has two labels (left,right), loop has one label)
User We don't ask to edit config file manually, but you can still do it if you want:
refhic config edit
RefHiC will load reference panel and models from the configuration file by default. But you can still use your own models or reference panel without write them into the configuration file. There are two parameters: (1) --reference for reference panel; (2) --modelState for model. You can specific them when you run pred for loop or TAD annotation. The reference parameter takes a meta.csv as input. (NB: file field should contain absolute path to data)
If you use RefHiC in your work, please cite our paper:
Zhang, Y., Blanchette, M. Reference panel guided topological structure annotation of Hi-C data. Nat Commun 13, 7426 (2022). https://doi.org/10.1038/s41467-022-35231-3
A GitHub issue is preferable for all problems related to using RefHiC.
For other concerns, please email Yanlin Zhang or Mathieu Blanchette (yanlin.zhang2@mail.mcgill.ca, blanchem@cs.mcgill.ca).