Method for generating a master list / Index of DNaseI-Hypersensitive Sites ("consensus DHSs").
All code, implementation details and design by Wouter Meuleman and Eric Rynes. Docker implementation by Jemma Nelson.
This approach is not recommended, because it will process biosamples serially, which will take an enormous amount of time for any real-life dataset.
- Build the docker image from this repo, with
docker build . --tag=masterlist
- Change to the directory that contains your input files
- Run
docker run --mount type=bind,source="$(pwd)",target=/data masterlist run_sequential.sh my_output_dir chrom_sizes.bed listOfFiles.txt
my_output_dir
will be created inside the working directory. chrom_sizes.bed
is a standard bed file containing the sizes of the chromosomes. listOfFiles.txt
is a list of peak files (which are starch
-formatted), containing one peak file per line. A relative path should be used to a file inside the working directory - absolute paths are not supported, due to the way the files are mounted in the Docker container.
From any directory, execute
[path to]ML_build_slurm.sh workdir outputID chrom_sizes.bed listOfFiles.txt partitionName memSize
where
- messages and intermediate files will be written into
workdir
(it will be created if it doesn't already exist) outputID
is an identifier to be used in the output filenames (e.g., containing the date, number of samples, organism, etc.)chrom_sizes.bed
is a 0-based 3-column BED file containing the lengths of the relevant chromosomeslistOfFiles.txt
is a plain text file containing the paths to variable-width peak files, one per biological sample, one path per linepartitionName
is the name of the cluster on which the SLURM jobs will run (--partition=partitionName
)memSize
is the amount of memory to require for each step, best if tailored to the needs of the collation of peak files (--mem=memSize
)
Output files will be written to the current directory. Messages and intermediate files will be written into workdir
.
For each biological sample to be represented in the Index, a variable-width peak file generated by hotspot2
is required.
To create an Index, a plain text file containing the locations of such peak files, one file per sample, one file per line, is required.
When using Docker to create an Index, all peak files must reside in the same directory as the file containing the list of them,
and no path information can be present in the list of files.
(Users knowledgeable about Docker mounting can use mounts to loosen these restrictions.)
When using SLURM, absolute and relative paths are both allowed in this text file;
multiple directories and symbolic links can be used in this case.
All peak files must have been generated using the same FDR threshold.
After hotspot2 has generated per-base results for a biosample up to a given FDR threshold (e.g. 5% or 100%), hotspots and peaks
at any stricter FDR threshold of interest can be extracted from the results by a pair of scripts contained in the hotspot2
package.
The per-base results reside in a file whose name ends in allcalls.starch
. To obtain variable-width peaks at a desired FDR threshold,
this file, and the hotspot2 output files containing mapped cleavages (name ending in cutcounts.starch
)
and the total number of mapped cleavages (name ending in cleavage.total
) are needed.
Suppose that each file that hotspot2 created for a certain biological sample has a name beginning with "sampXYZ,"
that the per-base results were thresholded at FDR 5%, and that you want to create an Index using peaks called at FDR 0.1%.
For each sample, you would need to run the following two hotspot2
scripts, the second after the first has completed:
hsmerge.sh -f 0.001 sampXYZ.allcalls.starch sampXYZ.hotspots.fdr0.001.starch
density-peaks.bash tmpdirXYZ varWidth_20_XYZ sampXYZ.cutcounts.starch sampXYZ.hotspots.fdr0.001.starch chromSizes.bed sampXYZ.density.starch sampXYZ.peaks.starch `cat sampXYZ.cleavage.total`
The first script, hsmerge.sh
, uses sampXYZ.allcalls.starch
to call hotspots at FDR 0.1% and write them to samp.XYZ.hotspots.fdr0.001.starch
.
The second script, density-peaks.bash
, uses the newly-created FDR 0.1% hotspot file, and pre-existing files sampXYZ.cutcounts.starch
and
sampXYZ.cleavage.total
, to create the desired FDR 0.1% peak file sampXYZ.peaks.starch
. (It also creates file sampXYZ.density.starch
,
which is not needed for Index creation.) A zero-based, 3-column BED file containing the lengths of the relevant chromosomes,
named chromSizes.bed
in the above command, is needed to create each peak file and also needed to create an Index.
tmpdirXYZ
in the second command is the name/location of a directory where temporary files will be stored.
(Important: If density-peaks.bash
is run multiple times simultaneously to process samples in parallel,
distinct temporary directories must be specified for each run/sample.) The argument varWidth_20_XYZ
consists of 3 parts, separated by underscores:
"varWidth," necessary to specify the creation of variable-width peaks; an integer specifying the minimum width in bp that a variable-width peak
must have to be reported (default = 20); and a unique identifier (ID) for the biological sample ("XYZ" in the above example).
The ID will be written into column 4 of every line of the peaks file, and used to define DHSs in the Index and to identify
which biological samples contribute to each entry in the Index; it may contain underscores if desired.
As a basis, we generate a full set of DHSs that may contain partially overlapping elements (_all_chunkIDs.bed
).
This is the version we recommend to use for downstream annotations and analyses.
However, as an alternative, we also provide two additional versions in which overlaps were either
resolved fully (_nonovl_any_chunkIDs.bed
), or partially only for the central "peak summit zones" (_nonovl_core_chunkIDs.bed
).
The columns in these files contain the following information:
- column 1: chromosome name
- column 2: start coordinate of the element (a.k.a. DHS); all coordinates are 0-based
- column 3: end coordinate
- column 4: DHS name/identifier
- column 5: sum of normalized densities at the summit of the DHS
- column 6: number of samples that contributed a peak to the DHS
- column 7: number of distinct peaks that contributed to the DHS (col7 usually equals col6, but sometimes multiple nearby peaks within one sample contribute to the delineation of the DHS)
- column 8: width of the DHS (equals col3 minus col2)
- column 9: coordinate of the summit of the DHS
- column 10: left boundary of the "core" of the DHS (summit coordinate minus dispersion; dispersion = median absolute deviation)
- column 11: right boundary of the "core" of the DHS (summit plus dispersion)
For each of these 3 files, a 12-column version (.bed12
) and bigBed version (.bb
) are also created.
The latter can be viewed in the UCSC Genome Browser; instructions for doing so are provided in the browser documentation.
A (sub)directory will filled with messages and intermediate files; it will be created if it doesn't already exist.
If any output files are incomplete or are not created, check the errorMessages
and R_Messages
subdirectories of the output directory for error messages that can help identify the underlying cause.
In subdirectory data
of this repository, we provide variable-width peak files for 3 biological samples,
and a text file containing their names (fileOf3PeakFiles.txt
). These files can be used to produce an example Index.
We also provide file chromSizes.hg38.bed3
, which contains chromosome lengths in hg38 coordinates, in this subdirectory.
To create the example Index using SLURM:
cd
intodata
- execute:
../ML_build_slurm.sh mySubdir MyResults chromSizes.hg38.bed3 fileOf3PeakFiles.txt <your computing cluster name> 100M
To create the example Index using Docker:
- (build the Docker image, as instructed above in section "How to run within Docker")
cd
intodata
- execute:
docker run --mount type=bind,source="$(pwd)",target=/data masterlist run_sequential.sh mySubdir chromSizes.hg38.bed3 fileOf3PeakFiles.txt MyResults
Either approach will create the following 3 files, each containing a "flavor" of an Index as described in section "Output," in directory data
:
masterlist_DHSs_MyResults_all_chunkIDs.bed
: an Index in which a small percentage of the DHSs overlap one anothermasterlist_DHSs_MyResults_nonovl_any_chunkIDs.bed
: same as previous, but with overlaps removedmasterlist_DHSs_MyResults_nonovl_core_chunkIDs.bed
: same as previous, but restricted to "core" DHSs (see section "Output")
For each of these 3 files, a 12-column version (.bed12
) and bigBed version (.bb
) are also created.
The latter can be viewed in the UCSC Genome Browser; instructions for doing so are provided in the browser documentation.
Subdirectory mySubdir
will also be created, and filled with subdirectories and many intermediate files.
If any "masterlist" files are incomplete or are not created, check the errorMessages
and R_Messages
subdirectories of mySubdir
for error messages that can help identify the underlying cause.
File | Purpose |
---|---|
chunk_bed.awk |
script for partitioning the peaks into genomic islands or "chunks" that can be processed in parallel |
code_ML.R |
common routines |
code_build.R |
code used for converting a genomic chunk of peak calls into tentative DHSs |
code_overlap.R |
code used to detect and resolve overlapping elements, if so desired |
code_gen_masterlist.sh |
code used to concatenate the output of all chunks and generate browser tracks |
ML_build_slurm.sh |
SLURM submission script which executes each of the above in sequence |
run_sequential.sh |
Script for executing each of the above in sequence using Docker |