On branch master

README.md added
uci-cbcl · May 31, 2014 · fd7ccf1
1 parent 052a368
commit fd7ccf1
Showing 1 changed file with 78 additions and 100 deletions.
diff --git a/README.md b/README.md
@@ -5,26 +5,15 @@ README for MixClone 1.0
 INTRODUCTION
 ============
 
-Next-generation sequencing has revolutionized the study of
-cancer genomes. However, the reads obtained from next-
-generation sequencing of tumor samples often consist of a
-mixture of normal and tumor cells, which themselves can
-be of multiple clonal types. A prominent problem in the
-analysis of cancer genome sequencing data is deconvolving
-the mixture to identify the reads associated with tumor
-cells or a particular subclone of tumor cells. Solving the
-problem is, however, challenging due to the so-called
-“identifiability problem”, where different combinations of
-tumor purity and ploidy often explain the sequencing data
-equally well. Here, we propose a new model to resolve the
-identifiability problem by integrating two types of sequencing
-information - somatic copy number alterations and loss of
-heterozygosity - within an unified probabilistic framework.
-We derive algorithms to solve our model, and implement
-them in a software package called PyLOH. We also introduce a 
-novel visualization method "BAF heat map" to to characterize 
-the cluster pattern of LOH. If you have any questions, please
-email yil8@uci.edu
+MixClone is a comprehensive software package for studying the subclonal
+structures of tumor genomes. It provides subclonal cellular prevalence
+estimation, allelic configuration estimation, absolute copy number
+estimation and a few visualization tools. It takes next-generation
+sequencing data of paired normal-tumor samples as input and integrates
+sequence information from both somatic copy number alterations and allele
+frequencies within a unified probabilistic framework. If you have any
+questions, please email yil8@uci.edu
+
 
 
 INSTALL
@@ -42,28 +31,28 @@ Prerequisites
 
 * [Pysam](https://code.google.com/p/pysam/)(>=0.7). To install Pysam, you also need to install [Cython](http://cython.org/) first. 
 
-* [matplotlib](http://matplotlib.org/)(>=1.2.0) is required to plot BAF heat map.
+* [matplotlib](http://matplotlib.org/)(>=1.2.0) is required for a few visualization tools.
 
 
-Altough not required by PyLOH, [samtools](http://samtools.sourceforge.net/) can be useful for creating bam, bam index and fasta index 
-files which are required by the pysam module of PyLOH. 
+Although not required by MixClone, [samtools](http://samtools.sourceforge.net/) can be useful for creating the BAM, BAM index and FASTA index
+files which are required by the pysam module of MixClone.
 
 Install from source
 -------------------
-Download the compressed source file PyLOH-*.tar.gz and do as follows:
+Download the compressed source file MixClone-*.tar.gz and do as follows:
 
 ```
-$ tar -xzvf PyLOH-*.tar.gz
-$ cd PyLOH-*
+$ tar -xzvf MixClone-*.tar.gz
+$ cd MixClone-*
 $ python setup.py install
 ```
 
-If you prefer to install PyLOH other than the default directory, you can also use this command:
+If you prefer to install MixClone in a directory other than the default, you can also use this command:
 ```
 $ python setup.py install --prefix /home/yili/
 ```
 
-There are also `config/` and `bin/` folders under PyLOH-*. The `config/` folder contains example priors and the `bin/` folder contains 
+There is also a `bin/` folder under MixClone-*. The `bin/` folder contains
 useful utilities, such as the R code to run [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html) and the python script to 
 convert BICseq results to BED file. You can copy these two folders somewhere easily accessible.
 
@@ -75,25 +64,38 @@ USAGE
 
 Overview
 --------
-PyLOH is composed of three modules: 
-* `preprocess`. Preprocess the reads aliments of paired normal-tumor samples in BAM format and produce the paired counts file, 
-preprocessed segments file and preprocessed BAF heat map file as output.
+MixClone is composed of three modules: 
+* `preprocess`. Preprocess the read alignments of paired normal-tumor samples in BAM format together with the tumor genome segmentation file in BED format, and produce the *.MixClone.input.pkl file as output, which is used for running the model.
 
-* `run_model`. Take the paired counts file and preprocessed segments file as input, estimate tumor purity, the copy number and the
-allele type of each segment.
+* `run_model`. Take the *.MixClone.input.pkl file as input, estimate the subclonal cellular prevalence, the absolute copy number and the allelic configuration of each segment, and produce the *.MixClone.output.pkl file as output, which is used for postprocessing. If the user does not specify the number of subclonal populations, MixClone runs the model five times, with the number of subclones ranging from 1 to 5, and recommends the most likely model.
+
+* `postprocess`. Take the *.MixClone.output.pkl file as input, and extract various output files, including the segments file with estimated parameters, the allele counts file, the subclonal analysis summary and a few plots.
 
-* `postprocess`. Take the preprocessed BAF heat map file as input and plot the BAF heat map for each segment as output.
 
-The general workflow of PyLOH is this
-![alt tag](https://github.com/uci-cbcl/PyLOH/blob/gh-pages/images/workflow.png?raw=true)
+
+Tumor genome segmentation
+-------------------------
+MixClone requires a segmentation file of the tumor genome in BED format before running the package. We used [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html) in the original paper. To run a BICseq analysis, you
+can copy the commands in `bin/BICseq.R` (Li, Y., Xie, X. 2014, Bioinformatics) and paste them into an R interactive shell. Alternatively, you can run the R script from the command line:
+```
+$ R CMD BATCH bin/BICseq.R
+```
+Note that `normal.bam` and `tumor.bam` must be in the directory where you run the command. The R script will output a segments file
+`segments.BICseq`. You can then use the other script, `bin/BICseq2bed.py` (Li, Y., Xie, X. 2014, Bioinformatics), to convert the segments file into BED format:
+```
+$ BICseq2bed.py segments.BICseq segments.bed --seg_length 1000000
+```
+
+**--seg_length** Only convert segments longer than this threshold.
+
 
 
 Preprocess
 ----------
 This part of the README is based on [JointSNVMix](https://code.google.com/p/joint-snv-mix/wiki/runningOld). To preprocess the paired
 cancer sequencing data, execute:
 ```
-$ PyLOH.py preprocess REFERENCE_GENOME.fasta NORMAL.bam TUMOUR.bam BASENAME --segments_bed SEGMENTS.bed --min_depth 20 --min_base_qual 10 --min_map_qual 10 --process_num 10
+$ MixClone.py preprocess REFERENCE_GENOME.fasta SEGMENTS.bed NORMAL.bam TUMOUR.bam INPUT_BASENAME --min_depth 20 --min_base_qual 10 --min_map_qual 10 --process_num 10
 ```
 
 **REFERENCE_GENOME.fasta** The path to the fasta file that the paired BAM files aligned to. Currently, only the
@@ -102,20 +104,16 @@ chromosome format are supported. Note that the index file should be generated fo
 
 `$ samtools faidx REFERENCE_GENOME.fasta`
 
+**SEGMENTS.bed** The BED file for the tumor genome segmentation.
+
 **NORMAL.bam** The BAM file for the normal sample. The BAM index file should be generated for this file and named NORMAL.bam.bai. This can
 be done by running
 
 `$ samtools index NORMAL.bam`
 
 **TUMOUR.bam** The bam file for the tumour sample. As for the normal this file needs to be indexed.
 
-**BASENAME** The base name of preprocessed files to be created.
-
-**--segments_bed** Use the genome segmentation stored in SEGMENTS.bed. If not provided, use 22 autosomes as the segmentaion. 
-But using automatic segmentation algorithm to generate SEGMENTS.bed is highly recommended, such as [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html).
-
-**--WES** Flag indicating whether the BAM files are whole exome sequencing(WES) or not. If not provided, the BAM files
-are assumed to be whole genome sequencing(WGS).
+**INPUT_BASENAME** The base name of the preprocessed input file to be created.
 
 **--min_depth** Minimum depth in both normal and tumor sample required to use a site in the analysis.
 
@@ -126,49 +124,42 @@ are assumed to be whole genome sequencing(WGS).
 **--process_num** Number of processes to launch for preprocessing.
 
 
+
 Run model
 ---------
-After the paired cancer sequencing data is preprocessed, we can run the probabilistic model of PyLOH by execute:
+After the preprocessed input file is created, we can run the probabilistic model of MixClone by executing:
 ```
-$ PyLOH.py run_model BASENAME --allele_number_max 2 --max_iters 100 --stop_value 1e-7
+$ MixClone.py run_model INPUT_BASENAME OUTPUT_BASENAME --max_copynumber 6 --subclone_num 2 --max_iters 30 --stop_value 1e-6
 ```
-**BASENAME** The base name of preprocessed files created in the preprocess step.
+**INPUT_BASENAME** The base name of the preprocessed input file created in the preprocess step.
+
+**OUTPUT_BASENAME** The base name of the output file to be created, which will contain the estimated model parameters.
 
-**--allelenumber_max** The maximum copy number of each allele allows to take.
+**--max_copynumber** The maximum copy number that each segment is allowed to take.
 
-**--priors** Path to the file of the prior distribution. The prior file must be consistent with the --allele_number_max. If not provided,
-use uniform prior, which is recommended.
+**--subclone_num** The number of subclones within the tumor sample. If not provided, MixClone tries every value from 1 to 5 and selects the most likely model.
 
 **--max_iters** Maximum number of iterations for training.
 
 **--stop_value** Stop value of the EM algorithm for training. If the change of log-likelihood is lower than this value, stop training.
 
 
+
 Postprocess
 -----------
-Currently, the postprocess module is only for plotting the BAF heat map of each segment:
+After the output file with the estimated model parameters is created, we can extract various result files from it by executing:
 ```
-$ PyLOH.py BAF_heatmap BASENAME
+$ MixClone.py postprocess OUTPUT_BASENAME
 ```
 
-**BASENAME** The base name of preprocessed files created in the preprocess step.
+**OUTPUT_BASENAME** The base name of the output file created in the run_model step.
+
 
 
 Output files
 ------------
-**\*.PyLOH.counts** The preprocessed paired counts file. It contains the allelic counts information of sites, which are heterozygous 
-loci in the normal genome. The definition of each column in a *.PyLOH.counts file is listed here:
-
-| Column    | Definition                                         | 
-| :-------- | :------------------------------------------------- | 
-| seg_index | Index of each segment                              |      
-| normal_A  | Count of bases match A allele in the normal sample |
-| normal_B  | Count of bases match B allele in the normal sample |
-| tumor_A   | Count of bases match A allele in the tumor sample  |
-| tumor_B   | Count of bases match B allele in the tumor sample  |
-
-**\*.PyLOH.segments** The preprocessed segments file. It contains the genomic information of each segment. The definition of each
-column in a *.PyLOH.segments file is listed here:
+**\*.MixClone.segments** The segments file. It contains the genomic and subclonal information of each segment. The definition of each
+column in a *.MixClone.segments file is listed here:
 
 | Column           | Definition                                                              | 
 | :--------------- | :---------------------------------------------------------------------- | 
@@ -177,46 +168,33 @@ column in a *.PyLOH.segments file is listed here:
 | start            | Start position of the segment                                           |
 | end              | End position of the segment                                             |
 | normal_reads_num | Count of reads mapped to the segment in the normal sample               |
-| tumor_reads_num  | Count of reads mapped to the segment in the normal sample               |
-| LOH_frec         | Fraction of LOH sites in the segment                                    |
+| tumor_reads_num  | Count of reads mapped to the segment in the tumor sample                |
+| LOH_frac         | Fraction of LOH sites in the segment                                    |
 | LOH_status       | FALSE -> no LOH; TRUE -> significant LOH; UNCERTAIN -> medium level LOH |
 | log2_ratio       | Log2 ratio between tumor_reads_num and normal_reads_num                 |
+| copy_number      | Estimated absolute copy number of the segment                           |
+| allele_type      | Estimated allelic configuration of the segment                          |
+| subclone_prev    | Estimated subclonal cellular prevalence of the segment                  |
+| subclone_cluster | Estimated subclonal cluster label of the segment                        |
 
-**\*.PyLOH.segments.extended** The extended segments file after run_model. There are two additional columns:
-
-| Column           | Definition                            | 
-| :--------------- | :-------------------------------------| 
-| copy_number      | Estimated copy number of the segment  |  
-
-**\*.PyLOH.purity** Estimated tumor purity.
-
-**\*.PyLOH.heatmap.pkl** The preprocessed BAF heat map file in Python pickle format.
-
-**\*.PyLOH.heatmap.plot** The folder of BAF heat maps plotted for each segment. A typical BAF heat map looks like this
-![alt tag](https://github.com/uci-cbcl/PyLOH/blob/gh-pages/images/BAF_heamap_sample.png?raw=true)
-
+**\*.MixClone.counts** The allele counts file. It contains the allelic count information for sites that are heterozygous
+SNP sites in the normal genome. The definition of each column in a *.MixClone.counts file is listed here:
 
+| Column    | Definition                                                |
+| :-------- | :-------------------------------------------------------- |
+| seg_name  | Name of the segment                                       |
+| normal_A  | Count of bases matching the A allele in the normal sample |
+| normal_B  | Count of bases matching the B allele in the normal sample |
+| tumor_A   | Count of bases matching the A allele in the tumor sample  |
+| tumor_B   | Count of bases matching the B allele in the tumor sample  |
+| chrom     | Chromosome of the site                                    |
+| pos       | Genomic position of the site                              |
 
-OTHER
-=====
+**\*.MixClone.summary** The summary file about the subclonal analysis, including log-likelihood, subclonal labels and the corresponding cellular prevalences.
 
-BIC-seq related utilities
--------------------------
-We highly recommend using automatic segmentation algorithm to partition the tumor genome, and thus prepare the segments file in BED format.
-For exmaple, we used [BICseq](http://compbio.med.harvard.edu/Supplements/PNAS11.html) in the original paper. To run a BICseq analysis, you
-can copy the commands in `bin/BICseq.R` and paste them in a R interative shell. Or you can also run the R script from the command line:
-```
-$ R CMD BATCH bin/BICseq.R
-```
-Note that,`normal.bam` and `tumor.bam` must be in the same directory where you run the command. The R script will output a segments file
-`segments.BICseq`. Then you can use the other script `bin/BICseq2bed.py` to convert the segments file into BED format:
-```
-$ BICseq2bed.py segments.BICseq segments.bed --seg_length 1000000
-```
+**\*.MixClone.heatmap** The folder of BAF heat maps plotted for each segment (Li, Y., Xie, X. 2014, Bioinformatics).
 
-**--seg_length** Only convert segments with length longer than the threshold.
+**\*.MixClone.segplot.png** The plot of estimated subclonal cellular prevalences and absolute copy numbers of all the segments of non-diploid allelic configuration.
 
+**\*.MixClone.model_selection** The summary of model selection, produced if `--subclone_num` is not specified. It includes the log-likelihood and related change of each model with a different number of subclones. Although MixClone selects the most likely model based on a heuristic described in the original paper, users can select the best model using their own judgement based on the log-likelihood related information.
 
-Reference
-=========
-Li, Y., Xie, X. (2014). Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity. Bioinformatics.
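
Note on the new USAGE section: the three steps this commit documents (`preprocess`, `run_model`, `postprocess`) could be chained from a small Python driver. The sketch below is not part of the commit or of MixClone. The helper names `build_commands` and `run_pipeline` are hypothetical, the flag values are copied from the README examples above, and it assumes `MixClone.py` is on the `PATH`. By default it only prints the commands.

```python
import subprocess


def build_commands(reference, segments_bed, normal_bam, tumour_bam,
                   input_basename, output_basename, subclone_num=None):
    """Return the three MixClone.py invocations as argument lists.

    Positional order and flag values follow the README examples;
    this helper itself is hypothetical and not shipped with MixClone.
    """
    preprocess = [
        "MixClone.py", "preprocess", reference, segments_bed,
        normal_bam, tumour_bam, input_basename,
        "--min_depth", "20", "--min_base_qual", "10",
        "--min_map_qual", "10", "--process_num", "10",
    ]
    run_model = [
        "MixClone.py", "run_model", input_basename, output_basename,
        "--max_copynumber", "6", "--max_iters", "30",
        "--stop_value", "1e-6",
    ]
    # Per the README, omitting --subclone_num lets MixClone try 1 to 5
    # subclones and recommend the most likely model.
    if subclone_num is not None:
        run_model += ["--subclone_num", str(subclone_num)]
    postprocess = ["MixClone.py", "postprocess", output_basename]
    return [preprocess, run_model, postprocess]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check_call raises CalledProcessError on a non-zero exit,
            # so a failed step stops the pipeline.
            subprocess.check_call(cmd)


if __name__ == "__main__":
    run_pipeline(build_commands(
        "REFERENCE_GENOME.fasta", "SEGMENTS.bed",
        "NORMAL.bam", "TUMOUR.bam",
        "INPUT_BASENAME", "OUTPUT_BASENAME",
    ))
```

Keeping the commands as argument lists avoids shell quoting issues with paths, and the dry-run default makes it easy to check the command lines against the README before running anything.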