Global normalization methods such as quantile normalization have become a standard part of the analysis pipeline for high-throughput data to remove unwanted technical variation. These methods, and others that rely solely on observed data without external information (e.g. spike-ins), are based on the assumption that only a minority of genes are expected to be differentially expressed (or that an equivalent number of genes increase and decrease across biological conditions). This assumption can be interpreted in different ways, leading to different global normalization procedures. For example, one normalization procedure assumes that the mean expression level across genes should be the same across samples. In contrast, quantile normalization assumes that the only differences between the statistical distributions of the samples are due to technical variation. Normalization is achieved by forcing the observed distributions to be the same, and the average distribution, obtained by taking the average of each quantile across samples, is used as the reference.
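The quantile normalization step described above can be sketched in a few lines of base R. This is an illustrative sketch only: the function name quantile_normalize and the simulated matrix toy are made up for this example, ties are broken naively, and missing values are not handled.

```r
# Minimal quantile normalization sketch (base R only; illustrative, not a
# replacement for an established implementation).
quantile_normalize <- function(x) {
  x <- as.matrix(x)
  # Sort each column (sample) to obtain its empirical quantiles
  sorted <- unname(apply(x, 2, sort))
  # Reference distribution: average of each quantile across samples
  reference <- rowMeans(sorted)
  # Replace each value by the reference quantile of the same rank
  ranks <- apply(x, 2, rank, ties.method = "first")
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated data: 10 genes (rows) by 4 samples (columns)
set.seed(1)
toy <- matrix(rexp(40), nrow = 10, ncol = 4,
              dimnames = list(paste0("gene", 1:10), paste0("sample", 1:4)))
toy_qn <- quantile_normalize(toy)
```

After normalization every sample has exactly the same set of values (the reference quantiles), and only the ordering of genes within each sample differs.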
While these assumptions may be reasonable in certain experiments, they may not always be
appropriate. Recently, an R/Bioconductor package (quantro)
has been developed to test for global differences between groups of distributions to evaluate whether
global normalization methods such as quantile normalization should be applied. If global differences
are found between groups of distributions, these changes may be of technical or biological interest.
If these changes are technical in origin (e.g. batch effects), then global normalization methods should be applied.
If these changes are related to a biological factor (e.g. normal/tumor or two tissues), then
global normalization methods should not be applied because the methods will remove the interesting biological variation
(i.e. differentially expressed genes) and artificially induce differences between genes that were not
differentially expressed. When there are global differences between the distributions
of different biological conditions, quantile normalization is not an appropriate normalization method. In
these cases, we can consider a more relaxed assumption about the data, namely that the statistical distribution
of each sample should be the same within biological conditions or groups (compared to the more
stringent assumption of quantile normalization, which states that the statistical distribution is the same across all samples).
Here we introduce a generalization of quantile normalization, referred to as smooth quantile normalization
(qsmooth), which is a weighted average of the two types of assumptions about the data.
The qsmooth R package contains the qsmooth() function, which computes a weight at every quantile
that compares the variability between groups relative to the variability within groups. At one extreme, quantile normalization
is applied across all samples, and at the other extreme quantile normalization is applied within each biological condition.
The weight shrinks the group-level quantile normalized data towards the overall reference quantiles
if the variability between groups is sufficiently smaller than the variability within groups. The algorithm is described in the
Figure below (see vignettes/qsmooth-vignette.pdf for more details).
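To make the weighted-average idea concrete, here is a simplified base R sketch. It is not the code used by the qsmooth package. The function name smooth_qn_sketch is hypothetical, and the weight used here (one minus the ratio of the between-group to the total sum of squares at each quantile, with no smoothing across neighbouring quantiles) is an assumption chosen for illustration. Missing values are not handled.

```r
# Simplified sketch of the weighted-average idea; NOT the qsmooth package code.
smooth_qn_sketch <- function(x, group) {
  x <- as.matrix(x)
  group <- as.factor(group)
  stopifnot(length(group) == ncol(x))

  # Quantiles of each sample (sorted columns)
  sorted <- unname(apply(x, 2, sort))
  # Overall reference quantiles (ordinary quantile normalization)
  ref_all <- rowMeans(sorted)
  # Group-level reference quantiles, expanded to one column per sample
  ref_group <- sapply(levels(group), function(g) {
    rowMeans(sorted[, group == g, drop = FALSE])
  })
  ref_group <- ref_group[, as.character(group), drop = FALSE]

  # Assumed weight: 1 - (between-group SS / total SS) at each quantile
  ss_total <- rowSums((sorted - ref_all)^2)
  ss_between <- rowSums((ref_group - ref_all)^2)
  w <- ifelse(ss_total > 0, 1 - ss_between / ss_total, 1)

  # Weighted average of the two reference distributions, per quantile
  target <- w * ref_all + (1 - w) * ref_group

  # Map each observation to its sample's target quantile by rank
  out <- x
  for (j in seq_len(ncol(x))) {
    out[, j] <- target[rank(x[, j], ties.method = "first"), j]
  }
  list(data = out, weights = w)
}

# Simulated data: two groups of three samples, the second group shifted upwards
set.seed(1)
toy <- cbind(matrix(rnorm(30), 10, 3), matrix(rnorm(30, mean = 5), 10, 3))
grp <- factor(rep(c("a", "b"), each = 3))
res <- smooth_qn_sketch(toy, grp)
```

In this sketch a weight near 1 means the quantile is pulled to the overall reference (ordinary quantile normalization), while a weight near 0 keeps the group-specific reference. With a single group the weights are all 1 and the result reduces to ordinary quantile normalization.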
The latest version of the qsmooth R package can be installed from GitHub using the R package devtools:
library(devtools)
install_github("stephaniehicks/qsmooth")
It can also be installed using Bioconductor:
# install BiocManager from CRAN (if not already installed)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# install qsmooth package
BiocManager::install("qsmooth")
After installation, the package can be loaded into R.
library(qsmooth)
The main function in the qsmooth package is qsmooth(). The qsmooth() function needs two objects:
(1) a data frame or matrix with observations (e.g. probes or genes) on the rows and samples on the columns
(let's call it eset) and (2) a group-level factor, passed as the group_factor argument
(let's call it outcome).
The order of this factor variable must match the order of the columns in the eset object because it contains
information about which group each sample is from.
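As a hedged illustration of the two inputs, the following base R snippet builds a simulated eset matrix and a matching outcome factor. The annotation table pheno and its column names are hypothetical, and the data are random numbers used only to show how the factor can be aligned with the columns of eset.

```r
# Simulated stand-in inputs; the names eset and outcome follow the text above.
set.seed(42)
eset <- matrix(rexp(600), nrow = 100, ncol = 6,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:6)))

# Hypothetical sample annotation table, deliberately NOT in column order
pheno <- data.frame(sample = paste0("sample", c(4, 1, 6, 2, 5, 3)),
                    condition = c("tumor", "normal", "tumor",
                                  "normal", "tumor", "normal"),
                    stringsAsFactors = FALSE)

# Reorder the annotation so that row i describes column i of eset
pheno <- pheno[match(colnames(eset), pheno$sample), ]
outcome <- factor(pheno$condition)

# Sanity checks: one group label per sample, in the same order as the columns
stopifnot(length(outcome) == ncol(eset),
          identical(as.character(pheno$sample), colnames(eset)))
```

Checking the alignment explicitly is worthwhile because a factor of the right length but in the wrong order would run without error while assigning samples to the wrong groups.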
To run the qsmooth() function:
qs <- qsmooth(object = eset, group_factor = outcome)
Individual slots can be extracted using accessor methods:
qsmoothData(qs) # extract smoothed quantile normalized data
qsmoothWeights(qs) # extract the smoothed weights computed at each quantile
The weights can be plotted directly using the qsmoothPlotWeights() function.
qsmoothPlotWeights(qs) # plot weights
See vignettes/qsmooth-vignette.pdf for more details.
Report bugs as issues on the GitHub repository.