From bb1420ef1c425e5f932ca369b4d609d390612ff4 Mon Sep 17 00:00:00 2001 From: sandyjmacdonald Date: Fri, 17 Jul 2015 17:17:24 +0100 Subject: [PATCH] Little update to README.md --- README.md | 65 ++++++++++++++++++++++++++++++------------------------- 1 file changed, 35 insertions(+), 30 deletions(-) diff --git a/README.md b/README.md index c3bc2ba..23a9ede 100644 --- a/README.md +++ b/README.md @@ -2,8 +2,8 @@ [![Build Status](https://travis-ci.org/sandyjmacdonald/dots_for_microarrays.svg?branch=master)](https://travis-ci.org/sandyjmacdonald/dots_for_microarrays) [![Coverage Status](https://coveralls.io/repos/sandyjmacdonald/dots_for_microarrays/badge.svg?branch=master&service=github)](https://coveralls.io/github/sandyjmacdonald/dots_for_microarrays?branch=master) -Dots is a Python package for working with microarray data. -Its back-end is a standalone package for reading in, normalisation, statistical +Dots is a Python package for working with microarray data. +Its back-end is a standalone package for reading in, normalisation, statistical analysis and plotting of Agilent single-colour microarray data. Its front-end isn't finished yet (more on that below). @@ -18,7 +18,7 @@ sudo pip install dots_for_microarrays **OR, ALTERNATIVELY:** Dots has a number of dependencies including NumPy and SciPy and the least painful -way of getting these is to use the +way of getting these is to use the [Anaconda Python distribution](https://store.continuum.io/cshop/anaconda/) which includes NumPy and SciPy and a couple of the other required dependencies like Pandas, Scikit-learn and Bokeh. @@ -35,7 +35,7 @@ Setuptools should take care of the dependencies but, in testing, I've found the and Scikit-learn installations to be problematic, hence my recommendation of using Anaconda to relieve those headaches. 
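Since the README above notes that the NumPy, SciPy and Scikit-learn installations can be problematic under plain pip, one workaround (a sketch only, assuming conda is already on your PATH; package names are the ones Anaconda uses) is to let conda install the heavy scientific dependencies first and then install Dots itself with pip:

```shell
# Sketch: install the scientific stack through conda first, so pip does not
# have to build NumPy/SciPy from source, then install Dots itself with pip.
conda install -y numpy scipy pandas scikit-learn bokeh
sudo pip install dots_for_microarrays
```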
-Once you have Anaconda, if you'd like to install and use Dots in a fenced-off virtual +Once you have Anaconda, if you'd like to install and use Dots in a fenced-off virtual environment that won't interfere with anything else, then you can do so as follows: ``` @@ -56,17 +56,17 @@ sudo python setup.py nosetests ## What Dots does 1. Reads in a series of Agilent single-colour array files. -**It's important that your array files are named correctly, in order for Dots to work out -to which group and replicate they belong e.g. for treated and untreated groups each with +**It's important that your array files are named correctly, in order for Dots to work out +to which group and replicate they belong, e.g. for treated and untreated groups each with three replicates, name the files `treated_1.txt, treated_2.txt, treated_3.txt, untreated_1.txt, untreated_2.txt, untreated_3.txt`. 2. Normalises the data by log2-transforming, 75th percentile-shifting and setting the baseline to the median for each gene across all samples. 3. Calculates fold changes and log fold changes for all of the pairs of groups. -4. Runs either a T-test or ANOVA (determined automagically by the number of groups) with +4. Runs either a T-test or ANOVA (determined automagically by the number of groups) with Benjamini-Hochberg p-value adjustment and a Tukey HSD post hoc test to determine significant pairs from the ANOVA. -5. Provides a number of different visualisations of the data: box and whisker plots of the +5. Provides a number of different visualisations of the data: box and whisker plots of the normalised data for each sample, a PCA plot of all of the samples, a hierarchically-clustered (by gene) heatmap for the significantly differentially expressed genes (> +/- 2-fold, p < 0.05), a plot of k-means clustered groups of genes with similar expression patterns across the samples, @@ -78,7 +78,7 @@ and volcano plots for each pair of samples. ## What Dots will do in the future -1. 
1. Read the array data into an SQLite3 database, significantly speeding up the whole workflow if +1. Read the array data into an SQLite3 database, significantly speeding up the whole workflow if you re-analyse your array data at a later date. 2. Assess the quality of the arrays. 3. Provide a web front-end to guide you through the workflow. @@ -99,17 +99,17 @@ You can run it, for example, on the sample data included here (in `dots_sample_d python dots_workflow.py dots_sample_data -o sample_data_output ``` -The `-o` is an optional argument and, if you don't include it, then they'll be put in a +The `-o` is an optional argument and, if you don't include it, then the output files will be put in a folder named `output`. ## Getting your hands dirty I've tried to comment the code as thoroughly as possible, so the best way to find out everything -that it can do is to dig into the code. Currently, it's organised in three modules that handle +that it can do is to dig into the code. Currently, it's organised in three modules that handle reading in the arrays, analysing them and plotting them. The docstrings allow you to get information about a function or class by typing, e.g. `help(run_stats)`. -The three modules - `dots_arrays`, `dots_analysis` and `dots_plotting` - are all part of the +The three modules - `dots_arrays`, `dots_analysis` and `dots_plotting` - are all part of the `dots_backend` package. As an example, you can import `dots_arrays` by typing ```python @@ -128,7 +128,7 @@ This module handles reading in individual arrays or a series of arrays as an exp has classes for each of these - the `Array` class and the `Experiment` class - that have a bunch of methods that you can run on them. -There are also two functions - `read_array` and `read_experiment` - that pretty much do what +There are also two functions - `read_array` and `read_experiment` - that pretty much do what they say on the tin. 
Both of these return `Array` and `Experiment` instances that both have Pandas data frame attributes that contain the data. Where possible, the dots modules use Pandas data frames because... Pandas. @@ -178,7 +178,7 @@ You can read in a whole experiment as follows: experiment = read_experiment(array_filenames, baseline=True) ``` -The `arrays_filenames` should be a list of filenames and the `baseline` option determines whether +The `array_filenames` should be a list of filenames and the `baseline` option determines whether the baseline is set to the median. The `Experiment` class is essentially a collection of `Array` instances with some neat methods to, @@ -203,7 +203,7 @@ if you haven't already set the baseline to median. experiment = experiment.remove_sample('treated_1') ``` -This method will be of more use once the quality control features are added, allowing you to remove +This method will be of more use once the quality control features are added, allowing you to remove samples that are of low quality before proceeding with the analysis and plotting. ### The read_annotations function @@ -222,25 +222,25 @@ You can read them in as part of your `read_experiment` call as follows: experiment = read_experiment(array_filenames, baseline=True, annotations_file='annotations.txt') ``` -Note that the `annotations_file` is an +Note that the `annotations_file` is an optional argument. ## The dots_analysis module -This is the meat of the dots_backend. +This is the meat of the dots_backend. -The `get_fold_changes` function is straightforward and just takes an experiment instance and -returns a data frame with e.g. `FC_treated_untreated` and `logFC_treated_untreated` columns for +The `get_fold_changes` function is straightforward and just takes an experiment instance and +returns a data frame with e.g. `FC_treated_untreated` and `logFC_treated_untreated` columns for each pair of groups in the experiment. 
Use it as follows: ```python fold_changes = get_fold_changes(experiment) ``` -The `run_stats` function is similarly simple. It automagically decides whether to run just a +The `run_stats` function is similarly simple. It automagically decides whether to run just a T-test (if there are two groups) or to run an ANOVA and Tukey HSD post hoc (if there are three or more groups), and also adjusts the p values with a Benjamini-Hochberg correction. It returns a data frame with `p_val` and `p_val_adj` columns. The significances from the post -hoc test are in columns in the data frame named e.g. `significant_treated_untreated`. Use it +hoc test are in columns in the data frame named e.g. `significant_treated_untreated`. Use it as follows: ```python @@ -248,7 +248,7 @@ stats = run_stats(experiment) ``` There's a simple `run_pca` function that is used by the `do_pcaplot` function in the `dots_plotting` -module. It returns a data frame with the x/y coordinates from the first two principal components. +module. It returns a data frame with the x/y coordinates from the first two principal components. Use it as follows: ```python @@ -260,11 +260,11 @@ do slightly different things. Both functions have an option to select either k-m clustering. The `find_clusters` function returns a list of cluster numbers in the same order as the rows in the -experiment data frame. If the method is hierarchical - `how='hierarchical'` - then the number of +experiment data frame. If the method is hierarchical - `how='hierarchical'` - then the number of clusters is set at the square root of (number of rows divided by two), a good approximation. If the method is k-means - `how='kmeans'` - then values of k (the number of clusters) from 3 to 10 are tested using silhouette analysis and the best value picked. An additional argument passed to the function - -`k_range=(3,51)` allows you to increase the number of values tested to, in this example, 50. 
Here's +`k_range=(3,51)` allows you to increase the number of values tested to, in this example, 50. Here's how to get a list of clusters with either hierarchical or k-means clustering: ```python @@ -277,7 +277,7 @@ filters the data frame down to only significantly differentially expressed genes things up considerably, especially for the k-means clustering, and makes the heat maps more compact). It returns the filtered data frame along with an extra column `cluster` that contains the cluster numbers. As with the `find_clusters` function, it allows hierarchical or k-means clustering to be -selected. Here's how to get a filtered data frame with clusters with either hierarchical or k-means +selected. Here's how to get a filtered data frame with clusters with either hierarchical or k-means clustering: ```python @@ -320,16 +320,21 @@ do_heatmap(experiment, show=False, image=False, html_file='heatmap.html') do_clusters_plot(experiment, show=True, image=False, html_file='clustersplot.html') ``` -As you'll see, they all take an experiment instance and have a number of other optional arguments. -The `show=False/True` argument determines whether the plot is shown in your browser after it is -generated, with the default being false. The `image=False/True` argument determines whether a PNG +As you'll see, they all take an experiment instance and have a number of other optional arguments. +The `show=False/True` argument determines whether the plot is shown in your browser after it is +generated, with the default being false. The `image=False/True` argument determines whether a PNG format image of the plot is created in addition to the HTML version. Lastly, the `html_file='boxplot.html'` allows you to specify a custom filename for your HTML plot (this is also used for the image filename). +As of version 0.2.0, the box plot outliers are limited to a total of 250,000 glyphs across all of the +samples. 
Also, the heatmaps are limited to 2,500 rows, by tuning the fold change cutoff until there are +fewer than 2,500 rows left. These limitations help to prevent strange behaviour when Bokeh deals with a +really large number of glyphs. + All of the plots, with the exception of the clusters plot, use Bokeh's nifty hover function to show you information about the points on the plots, e.g. gene name, normalised expression values, etc. -The `do_volcanoplot` function takes an additional argument (a tuple) that specifies the pair of groups +The `do_volcanoplot` function takes an additional argument (a tuple) that specifies the pair of groups to plot on the volcano plot, for example: ```python @@ -338,4 +338,4 @@ do_volcanoplot(experiment, ('treated', 'untreated'), show=False, image=False, ht **Note that the function, `render_plot_to_png`, that generates the PNG versions of the plots requires the [PhantomJS](http://phantomjs.org) JavaScript API to be installed (it's essentially a headless browser) to -work properly.** \ No newline at end of file +work properly.**
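To make the normalisation scheme described under "What Dots does" concrete, here is a minimal pandas sketch of the three steps - log2-transforming, 75th percentile-shifting and setting the baseline to the median for each gene. This is an illustration of the scheme only, not the package's actual implementation; the intensities and sample names are invented:

```python
import numpy as np
import pandas as pd

# Toy raw intensities: rows are genes, columns are samples,
# named with the group_replicate convention Dots expects.
raw = pd.DataFrame({
    'treated_1': [120.0, 4000.0, 35.0],
    'treated_2': [100.0, 3500.0, 40.0],
    'untreated_1': [60.0, 9000.0, 30.0],
    'untreated_2': [55.0, 8000.0, 25.0],
})

# Step 1: log2-transform the intensities.
log2_vals = np.log2(raw)

# Step 2: 75th percentile shift - subtract each sample's 75th percentile
# so that all samples sit on a common scale.
shifted = log2_vals - log2_vals.quantile(0.75)

# Step 3: baseline to median - subtract each gene's median across samples,
# so values become expression relative to that gene's typical level.
normalised = shifted.sub(shifted.median(axis=1), axis=0)
```

After these steps, each sample's 75th percentile in `shifted` is zero and each gene's median in `normalised` is zero, which is what makes fold changes between groups directly comparable.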