HighSpeedStats: Efficient Statistical Implementations

![Build Status](https://travis-ci.org/juliangehring/HighSpeedStats.svg?branch=master) ![Coverage Status](https://codecov.io/github/juliangehring/HighSpeedStats/coverage.svg?branch=master)

Motivation

The amount of data in the field of computational biology is increasing at a fast pace, and with it the computational demands of analyzing it. This setting poses new challenges to the algorithms and implementations used in the analysis, which must process large amounts of data efficiently.

The R programming language has gained a central role in the analysis workflows of biological data. While it offers a large number of relevant methods, their implementations are often not suited for large-scale analyses and can become a bottleneck.

With the HighSpeedStats package, we provide a selected set of statistical functions, optimized for a speed- and memory-efficient implementation. We plan to release the existing functionality as an open-source project and to continue the development as a community project.

Use Cases

```r
library(HighSpeedStats)
library(microbenchmark)
```

Fisher’s Exact Test

Fisher’s Exact Test is used on contingency tables, in most cases a 2x2 table. In computational biology, this has been applied for example in detecting enrichment in gene sets or identifying strand biases in variant calling.
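As a concrete illustration (using only base R's `fisher.test`, not the package functions), a strand-bias check compares the forward- and reverse-strand counts of the reference and alternative alleles; the counts here are invented for the example:

```r
# Hypothetical counts: rows = allele (ref/alt), columns = strand (fwd/rev)
tab <- matrix(c(40, 38,   # reference allele: forward, reverse
                35,  2),  # alternative allele: forward, reverse
              nrow = 2, byrow = TRUE,
              dimnames = list(c("ref", "alt"), c("fwd", "rev")))

# A small two-sided p-value indicates the alternative allele is
# unevenly distributed across strands, i.e. a strand bias.
p <- fisher.test(tab)$p.value
p
```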

We compare different methods by sampling all possible configurations in the parameter space $(a, b, c, d) \in \{0, \ldots, m\}^4$ for the contingency table

|     |     |
|-----|-----|
| a   | b   |
| c   | d   |

```r
m = 20
grid = expand.grid(a = 0:m, b = 0:m, c = 0:m, d = 0:m)
rbind(head(grid, 3), tail(grid, 3))
dim(grid)
```
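As a quick sanity check (plain arithmetic, no package code): the grid enumerates all $(m + 1)^4$ combinations, so for $m = 20$ it has $21^4 = 194481$ rows.

```r
m <- 20
grid <- expand.grid(a = 0:m, b = 0:m, c = 0:m, d = 0:m)
# each of the four table entries takes m + 1 values, giving (m + 1)^4 rows
nrow(grid)
stopifnot(nrow(grid) == (m + 1)^4)
```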

Here, we will compare the performance of three methods to compute two-sided p-values with Fisher’s Exact Test:

- `feTestR`: A reference implementation, taken from the VariantTools package and based on the base R function `fisher.test`. In the current setting, this is about 60× faster than `apply`ing over the rows of the matrix and extracting the p-value.
- `fisherExactTest`: An optimized equivalent of `feTestR`, using the boost library. Please note that due to limitations of the boost library, this implementation is only beneficial for sample sizes below ~170.
- `ultrafastfet`: A highly optimized function, implemented in C++. At the moment, this uses a different numerical stabilization than the approaches mentioned above, which can result in deviations of the computed p-values compared to the other two methods.
```r
bench = microbenchmark(
    p1 <- with(grid, feTestR(a, b, c, d)),
    p2 <- with(grid, fisherExactTest(a, b, c, d)),
    p3 <- with(grid, ultrafastfet(a, b, c, d)),
    times = 3)
print(bench)
```

| expr | min | lq | mean | median | uq | max | neval |
|------|-----|----|------|--------|----|-----|-------|
| p1 <- with(grid, feTestR(a, b, c, d)) | 4167.466194 | 4183.687855 | 4204.498782 | 4199.909477 | 4223.015076 | 4246.120674 | 3 |
| p2 <- with(grid, fisherExactTest(a, b, c, d)) | 605.507734 | 605.662936 | 606.295877 | 605.818137 | 606.689949 | 607.561761 | 3 |
| p3 <- with(grid, ultrafastfet(a, b, c, d)) | 110.728237 | 113.103290 | 114.134417 | 115.478342 | 115.837507 | 116.196671 | 3 |

The `feTestR` and `ultrafastfet` implementations yield the same two-sided p-values; minor differences are due to rounding errors.

```r
all.equal(p1, p3)
```
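Note that `all.equal` compares numeric vectors with a tolerance (about 1.5e-8 by default, the square root of the machine epsilon), so rounding-level deviations pass while larger discrepancies are reported; a small self-contained illustration:

```r
p <- c(0.05, 0.5)
q <- p + 1e-12                      # a rounding-level difference
isTRUE(all.equal(p, q))             # TRUE: within the default tolerance
isTRUE(all.equal(p, c(0.05, 0.6)))  # FALSE: a real discrepancy
```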

Extensive testing

As an exhaustive check against base R, we wrap `fisher.test` and compare its p-values with `feTestR` over the full grid:

```r
foo <- function(a, b, c, d) {
    fisher.test(matrix(c(a, b, c, d), 2))$p.value
}

bench = microbenchmark(
    p0 <- with(grid, mapply(foo, a, b, c, d)),
    p1 <- with(grid, feTestR(a, b, c, d)),
    p2 <- with(grid, fisherExactTest(a, b, c, d)),
    p3 <- with(grid, ultrafastfet(a, b, c, d)),
    times = 1)

all.equal(p0, p1)
```

Maximum Position in Matrix

If we consider for example a matrix with nucleotide counts across multiple positions, we can derive the consensus sequence by finding the nucleotide with the highest abundance at each site.

Essentially, this boils down to finding the position of the maximal value in each row of a matrix. The base R function max.col would be the starting point for this. However, we would like ties within a row to be indicated, which we cannot do directly with max.col. We have written the maxColR function that does this for us as a reference, while maxCol provides the efficient implementation.
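To see why `max.col` alone is not sufficient, here is a small base-R sketch (independent of the package's `maxColR`/`maxCol`) that picks the leftmost maximum and separately flags rows containing ties:

```r
m <- matrix(c(3, 5, 5,
              7, 2, 1), nrow = 2, byrow = TRUE)

# max.col breaks ties (randomly by default); "first" picks the leftmost maximum
idx <- max.col(m, ties.method = "first")

# flag rows in which the row maximum occurs more than once
tied <- rowSums(m == m[cbind(seq_len(nrow(m)), idx)]) > 1

idx   # 2 1
tied  # TRUE FALSE
```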

```r
m = matrix(rbinom(1e6, 50, 0.5), ncol = 4)
head(m)
```

| V1 | V2 | V3 | V4 |
|----|----|----|----|
| 20 | 26 | 28 | 28 |
| 22 | 24 | 29 | 21 |
| 19 | 27 | 21 | 21 |
| 31 | 31 | 23 | 22 |
| 22 | 21 | 29 | 24 |
| 31 | 24 | 27 | 28 |

Comparing the performance reveals a lower runtime of the maxCol implementation.

```r
bench2 = microbenchmark(
    idx_old <- maxColR(m),
    idx_new <- maxCol(m),
    times = 5)
print(bench2)
```

| expr | min | lq | mean | median | uq | max | neval |
|------|-----|----|------|--------|----|-----|-------|
| idx_old <- maxColR(m) | 85.036763 | 85.644796 | 103.497937 | 86.826066 | 114.033516 | 145.948546 | 5 |
| idx_new <- maxCol(m) | 5.432523 | 5.448471 | 17.307796 | 6.474846 | 34.354519 | 34.828619 | 5 |

Finally, we show that the results of both implementations are identical.

```r
identical(idx_old, idx_new)
```
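Returning to the consensus-sequence example: assuming the columns correspond to the nucleotides A, C, G, and T, the per-row index can be mapped back to bases (shown here with base R's `max.col` in place of the package's `maxCol`; the counts are invented):

```r
# invented nucleotide counts: rows = sites, columns = A, C, G, T
counts <- matrix(c(10,  2, 1, 0,   # site 1
                    3, 12, 0, 1,   # site 2
                    0,  1, 9, 2),  # site 3
                 ncol = 4, byrow = TRUE,
                 dimnames = list(NULL, c("A", "C", "G", "T")))

# the most abundant nucleotide per site gives the consensus sequence
consensus <- colnames(counts)[max.col(counts, ties.method = "first")]
paste(consensus, collapse = "")  # "ACG"
```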

More information can be found in the manual pages of the individual functions.
