This Python program is the implementation of the HIPPS-DIMES method [1, 2]. HIPPS-DIMES is a computational method based on the maximum entropy principle that uses an experimentally measured contact map or pairwise distances as constraints to generate a unique ensemble of 3D chromatin structures. In a nutshell, the program accepts as input either a mean spatial distance map (which can be measured in a Multiplexed FISH experiment) or a Hi-C contact map (which is converted to a distance map internally), and generates an ensemble of individual chromatin conformations that are consistent with the input. The output conformations are stored as .xyz format files, which can be used to calculate quantities of interest and can be visualized using VMD or other compatible software.
The theory and applications of this method can be found in our work published:
- Shi, Guang, and D. Thirumalai. "From Hi-C contact map to three-dimensional organization of interphase human chromosomes." Physical Review X 11.1 (2021): 011051.
- Shi, Guang, and D. Thirumalai. "A maximum-entropy model to predict 3D structural ensembles of chromatin from pairwise distances with applications to interphase chromosomes and structural variants." Nature Communications 14.1 (2023): 1150.
- Python 3.8+
First, download this repository using,
git clone https://github.com/anyuzx/HIPPS-DIMES
Next, go into the repository folder and install required packages using the command below,
cd HIPPS-DIMES
pip3 install --editable .
This command installs the required packages and installs the script as a Python module. Note that you need to install pip3 first if it is not already installed (follow the instructions in the official documentation at https://pip.pypa.io/en/stable/installing/). Once installed, you can call HippsDimes directly in the terminal to run the script. The packages installed are:
Click
Numpy
Scipy
Pandas
Tqdm
Cooler
rich
To get started, please go through the jupyter notebook walkthrough.ipynb
in this repository.
In addition, it is helpful to view the help information for each argument and option. To display the help information, use
HippsDimes --help
This script accepts input files in two formats. If the input file is a Hi-C contact map, it can be either in .cool format (see https://github.com/open2c/cooler for details of the cooler library) or in plain text format. If the input file is a mean spatial distance map, the script only accepts a plain text file. The text format for a matrix is as follows: each line of the file corresponds to one row of the matrix, and values are space-separated. The content of the file should look like this,
1 2 3
2 1 2
3 2 1
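The text layout above can be written and read back with NumPy. The following is a minimal sketch, not part of the HIPPS-DIMES code: the file name dmap_example.txt and the variable names are placeholders chosen for illustration, and the matrix is the 3x3 example shown above.

```python
import os
import tempfile

import numpy as np

# The 3x3 example matrix from the text above.
dmap = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 2.0],
    [3.0, 2.0, 1.0],
])

with tempfile.TemporaryDirectory() as tmpdir:
    # "dmap_example.txt" is a placeholder file name.
    path = os.path.join(tmpdir, "dmap_example.txt")

    # np.savetxt writes one matrix row per line, space-separated by default.
    np.savetxt(path, dmap, fmt="%g")

    with open(path) as handle:
        first_line = handle.readline().strip()

    # np.loadtxt splits on whitespace by default, so it reads the same layout.
    loaded = np.loadtxt(path)

# A quick sanity check before using a matrix as input: it should be
# square and symmetric.
is_square = loaded.shape[0] == loaded.shape[1]
is_symmetric = np.allclose(loaded, loaded.T)
```

Checking that the matrix is square and symmetric before running the program is an inexpensive way to catch formatting mistakes in hand-prepared input files.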
This script will generate several files:
- A text file for the final simulated mean distance map
- A text file for the final simulated contact map (this is the best agreement contact map to the normalized input contact map)
- A text file for the connectivity matrix
- A .xyz formatted file for the ensemble of genome structures generated (can be turned off)
- A csv formatted file for cost versus iteration data (can be turned off)
First, download a cooler-format Hi-C contact map from here (the file size is about 116 MB). This Hi-C contact map is for mitotic chromosomes in chicken cells and was originally retrieved from the GEO repository. Rename it to hic_example.cool, then execute the following command,
HippsDimes hic_example.cool test --input-type cmap --input-format cooler -s chr7:10M-15M -i 10 -e 10
This command tells the script to load the Hi-C contact map hic_example.cool and run the iterative scaling algorithm. The argument test sets the prefix of the output file names to test_. Option --input-type cmap specifies that the input file is a contact map. Option --input-format cooler specifies that the input file is a cooler file. Option -s chr7:10M-15M restricts the calculation to the region from 10 Mbp to 15 Mbp on Chromosome 7. These three options are required for this input and cannot be omitted. Options -i 10 and -e 10 set the number of iterations and the number of generated conformations to 10 each. Some options are optional and some are required; please refer to the section below and use HippsDimes --help for details.
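When several runs are scripted (for example, to try different iteration counts), it can help to assemble the command line in Python. The sketch below only uses the arguments and options documented in this README; the helper name build_command and its keyword names are hypothetical and are not part of HIPPS-DIMES.

```python
import shlex


def build_command(input_path, prefix, input_type, input_format,
                  selection=None, iterations=None, ensemble=None):
    """Assemble a HippsDimes command line as a list of strings.

    Hypothetical helper for illustration only; it maps keyword
    arguments onto the options documented in this README.
    """
    cmd = [
        "HippsDimes", input_path, prefix,
        "--input-type", input_type,
        "--input-format", input_format,
    ]
    if selection is not None:
        cmd += ["-s", selection]
    if iterations is not None:
        cmd += ["-i", str(iterations)]
    if ensemble is not None:
        cmd += ["-e", str(ensemble)]
    return cmd


cmd = build_command(
    "hic_example.cool", "test", "cmap", "cooler",
    selection="chr7:10M-15M", iterations=10, ensemble=10,
)
command_line = shlex.join(cmd)
print(command_line)

# To actually run it (requires HippsDimes to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list to subprocess.run avoids shell quoting issues with region strings and file names.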
When the program finishes, it will have generated several output files, including test.xyz, test_connectivity_matrix.txt, and test_dmap_final.txt. The file test.xyz contains 10 individual conformations stored as x, y, z coordinates and can be viewed using VMD or other compatible visualization software.
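To compute quantities from the generated conformations, the coordinates can be loaded into a NumPy array. The sketch below assumes the output follows the common multi-frame XYZ layout (an atom count line, a comment line, then one "label x y z" line per locus, repeated for each conformation). This layout is an assumption, so check the first few lines of your own file. The function name read_xyz_frames is hypothetical, and the sample text is synthetic rather than real program output.

```python
import io

import numpy as np


def read_xyz_frames(handle):
    """Read all frames from a multi-frame XYZ file object.

    Assumes the common XYZ layout: count line, comment line, then
    `count` lines of "label x y z". Returns an array of shape
    (n_frames, n_loci, 3).
    """
    frames = []
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        n_loci = int(header.split()[0])
        handle.readline()  # skip the comment line
        coords = [
            [float(value) for value in handle.readline().split()[1:4]]
            for _ in range(n_loci)
        ]
        frames.append(coords)
    return np.array(frames)


# Synthetic two-frame example, standing in for a file such as test.xyz.
example_text = (
    "2\nframe 0\nC 0.0 0.0 0.0\nC 1.0 0.0 0.0\n"
    "2\nframe 1\nC 0.0 1.0 0.0\nC 0.0 0.0 2.0\n"
)
frames = read_xyz_frames(io.StringIO(example_text))

# Distance between locus 0 and locus 1 in every conformation.
pair_distance = np.linalg.norm(frames[:, 0] - frames[:, 1], axis=1)
```

For a real file, replace io.StringIO(example_text) with an open file handle, for example open("test.xyz").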
In this example, we use the Hi-C contact map for Chromosome 14 of the HeLa cell line at the time point 12 hours after release from prometaphase. For the purpose of demonstration, you can download the Hi-C .cool file from here; it was originally retrieved from the GEO repository under accession number GSE102740. Before you download, note that the file is about 655 MB in size. Once downloaded, execute the following command,
HippsDimes GSM3909682_TB-HiC-Dpn-R2-T12_hg19.1000.multires.cool::6 test --input-type cmap --input-format cooler -s chr14:20M-107M -i 10000 -e 10
Similar to the first example, this command tells the script to load group 6 of the Hi-C cooler file GSM3909682_TB-HiC-Dpn-R2-T12_hg19.1000.multires.cool (data for different resolutions are stored in different groups) and run the HIPPS/DIMES algorithm. In this example, we change the number of iterations to 10000 by using the option -i 10000. On a machine with an AMD Ryzen 5 3600 CPU, the program takes about 3-4 minutes to finish. Once it is finished, several output files are generated.
The jupyter notebook walkthrough.ipynb
in this repository contains additional examples.
In particular, if you would like to see an example of applying HIPPS-DIMES directly to imaging data, please go through the notebook.
- INPUT: File path of the input file. The input file can be a Hi-C contact map or a mean spatial distance map as measured in a Multiplexed FISH experiment.
- OUTPUT_PREFIX: Prefix for the output files. For instance, if it is set to TEST, then the names of all output files start with TEST_.
- -k or --connectivity-matrix: Path to an existing connectivity matrix to use as the initialization. Useful when restarting from the result of a previous run.
- -e or --ensemble: Number of individual conformations to generate. The script generates an ensemble of structures consistent with the input Hi-C contact map or mean spatial distance map, and the individual conformations differ from each other. If not specified, the value is 1000.
- -a or --alpha: Value of the exponent used to convert a contact map to a distance map. If the input file is a Hi-C contact map, the method first converts the contact map to a mean spatial distance map. The conversion is $d_{ij} \sim c_{ij}^{1/\alpha}$. If not specified, $\alpha$ is 4.0, the value estimated in 10.1126/science.aaf8084.
- -s or --selection: Chromosome or region to use. This option is required, and only takes effect, when the input file is in cooler format. Its value is passed to the cooler.Cooler.matrix().fetch() method. For details, please refer to the cooler documentation.
- -m or --method: Method used for optimization. Iterative scaling (IS), gradient descent (GD), and direct inversion (DI) are currently supported. The default is iterative scaling (IS).
- -l or --lamd: Weight for L1 or L2 regularization. The default value is zero, meaning no regularization. Regularization is typically used to avoid over-fitting.
- -r or --reg: Type of regularization. L1 and L2 are supported. The default is L2.
- -i or --iteration: Number of iterations used by the optimization to find the optimal parameters. Generally, more iterations give better results, but convergence slows down as the iterations proceed. Larger contact maps or mean distance maps need more iterations to reach good convergence. If not specified, the value is 10000.
- -r or --learning-rate: Learning rate. This hyperparameter controls the speed of convergence. If its value is too small, convergence is very slow. If its value is too large, the program may never converge. Typically, the learning rate can be set to 1-30 when using the iterative scaling method. It should be very small (such as 1e-8) when using gradient descent. The default value is 10.0.
- --input-type: Type of the input file. The method works on either a contact map (cmap) or a distance map (dmap). This option is required.
- --input-format: Format of the input file. If the input is a Hi-C contact map, the script supports either a cooler format file or a plain text file in which each line corresponds to one row of the contact map. If the input is a mean distance map, the script only supports a plain text file in which each line corresponds to one row of the mean distance map. This option is required.
- --log: Write a log file containing the cost versus iteration data.
- --no-xyzs: Turn off writing the x, y, z coordinates of the genome structures to files.
- --ignore-missing-data: Ignore missing elements or infinite values in the contact map or distance map.
- --balance: Turn on matrix balancing for the contact map. Only effective when input_type == cmap and input_format == cooler.
- --not-normalize: Turn off the automatic normalization of the contact map. Only effective when input_type == cmap.
- --enforce-nonnegative-connectivity-matrix: Constrain all the "spring constants" to be nonnegative.
- In practice, a contact map or distance map larger than 5000x5000 is too large for the method to converge. If your matrix is larger than 5000x5000, I suggest either coarse-graining the original matrix to get a smaller one, or applying the model to a subregion of the contact map/distance map.
- When using the Iterative Scaling algorithm (with argument -m IS) for optimization, the learning rate can typically be set between 1 and 50. You should try different values to find the optimal learning rate. For gradient descent (with argument -m GD), the learning rate typically needs to be very small, such as 1e-7.
- If your contact map/distance map has many missing or zero entries, you can try turning on the option --ignore-missing-data. This tells the code not to consider these missing entries, which gives a less biased result.
- Whenever a contact map is supplied, the program normalizes it by dividing it by its maximum entry. If you don't want this, set the option --not-normalize, which tells the code not to normalize the contact map at all.
- Note that a contact map has no physical length scale associated with it, so we cannot assign a unit to the resulting distance matrix or structures. In this sense, the generated structures are dimensionless. However, additional information can be used to set the length scale of the problem. For instance, if you have a reasonable estimate of the average distance between two nearest loci, you can rescale the structures to be consistent with that distance.
The Python file HippsDimes.py can also be used as a package. You can import it as,
import HippsDimes as HD
Some useful functions:
- HippsDimes.a2xyz_sample: use a connectivity matrix to generate random samples of structures
- HippsDimes.a2cmap_theory: use a connectivity matrix and a distance threshold to generate a contact map
- HippsDimes.a2dmap_theory: use a connectivity matrix to generate a mean distance map
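To give a feel for what "using a connectivity matrix" can mean, the sketch below works through a generic Gaussian network model. It does not call the HippsDimes functions listed above and is not their implementation. It assumes the connectivity matrix is a graph Laplacian L built from spring constants and that each coordinate is distributed as exp(-x^T L x / 2), so conventions such as signs and factors may differ from those used in the package. All function names here are hypothetical.

```python
import numpy as np


def chain_laplacian(n_loci, k=1.0):
    """Laplacian of a linear chain with spring constant k between neighbors."""
    adjacency = np.zeros((n_loci, n_loci))
    idx = np.arange(n_loci - 1)
    adjacency[idx, idx + 1] = k
    adjacency[idx + 1, idx] = k
    return np.diag(adjacency.sum(axis=1)) - adjacency


def mean_squared_distances(laplacian):
    """Theoretical <r_ij^2> under the assumed Gaussian model.

    The pseudo-inverse of the Laplacian is the per-dimension covariance
    once the overall translation (zero mode) is removed.
    """
    cov = np.linalg.pinv(laplacian)
    diag = np.diag(cov)
    return 3.0 * (diag[:, None] + diag[None, :] - 2.0 * cov)


def sample_structures(laplacian, n_samples, seed=None):
    """Draw conformations of shape (n_samples, n_loci, 3)."""
    rng = np.random.default_rng(seed)
    evals, evecs = np.linalg.eigh(laplacian)
    keep = evals > 1e-10 * evals.max()   # drop the translational zero mode
    scale = 1.0 / np.sqrt(evals[keep])   # standard deviation of each mode
    noise = rng.standard_normal((n_samples, int(keep.sum()), 3))
    # coords[s, i, d] = sum_m evecs[i, m] * scale[m] * noise[s, m, d]
    return np.einsum("im,m,smd->sid", evecs[:, keep], scale, noise)


laplacian = chain_laplacian(5, k=2.0)
msd = mean_squared_distances(laplacian)
structures = sample_structures(laplacian, n_samples=4, seed=0)
```

For this toy chain, neighboring loci have a mean squared distance of 3/k, and the value grows linearly with separation along the chain, as expected for an ideal chain.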
If you used this program in your publication, please cite the following reference:
- Shi, Guang, and D. Thirumalai. "From Hi-C Contact Map to Three-dimensional Organization of Interphase Human Chromosomes." Physical Review X 11.1 (2021): 011051.
- Shi, Guang, and D. Thirumalai. "A maximum-entropy model to predict 3D structural ensembles of chromatin from pairwise distances with applications to interphase chromosomes and structural variants." Nature Communications 14.1 (2023): 1150.
Footnotes
1. Shi, Guang, and D. Thirumalai. "From Hi-C Contact Map to Three-dimensional Organization of Interphase Human Chromosomes." Physical Review X 11.1 (2021): 011051.
2. Shi, Guang, and D. Thirumalai. "A maximum-entropy model to predict 3D structural ensembles of chromatin from pairwise distances with applications to interphase chromosomes and structural variants." Nature Communications 14.1 (2023): 1150.