
AllSorts Tutorial

Quarkins edited this page Jan 4, 2017 · 3 revisions

Hello and welcome to the Acute Lymphoblastic Leukemia (ALL) Classifier. The aim of this tool is simple: to help classify different sub-types of ALL by gene expression from RNA-sequenced data. The tool classifies a sample into one of the following four categories:

Phlike: Philadelphia-like, a gene expression profile similar to that of the famous Philadelphia fusion (BCR-ABL).

ERG: A fusion resulting in an expression

ETV: A fusion resulting in an expression profile similar or identical to that of ETV-RUNX1.

Other: A miscellaneous class containing a mixture of other sub-classes (e.g. MLL, High Hyperdiploidy, T-ALL, etc.)

NOTE: By definition, the classifier can ONLY assign a sample to one of the four classes it was built on. However, if the classification probability is below the threshold for all four categories, the classifier returns "Unclassified". This indicates that the gene expression profile was not similar enough to any of the classes to make a confident call; it does not mean that the sample could not, in reality, be one of those types.
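The thresholding rule described in the note can be sketched in base R. This is only an illustration and not the AllSorts implementation: the probability matrix is invented, `call_class` is a hypothetical helper, and the tie rule (pick the highest-probability class among those that pass their threshold) is an assumption.

```r
# Invented class probabilities for three samples (rows) across the four classes.
probs <- matrix(
  c(0.70, 0.10, 0.10, 0.10,
    0.10, 0.15, 0.20, 0.55,
    0.05, 0.05, 0.10, 0.80),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("ERG", "ETV", "Phlike", "Other"))
)

# Per-class minimum probabilities (the default values quoted in this tutorial).
thresholds <- c(ERG = 0.25, ETV = 0.25, Phlike = 0.5, Other = 0.75)

# Hypothetical helper: return the best class that clears its own threshold,
# or "Unclassified" when no class does (assumed tie rule, see lead-in).
call_class <- function(p, thr) {
  passing <- p >= thr[names(p)]
  if (!any(passing)) return("Unclassified")
  names(p)[passing][which.max(p[passing])]
}

calls <- apply(probs, 1, call_class, thr = thresholds)
calls
```

In this toy input, sample `s2` has its largest probability on Other (0.55), but that is below the 0.75 threshold, so under the assumed rule it would be reported as "Unclassified" rather than forced into a class.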

Right, so let's get stuck into the functionality. A standard workflow might look like this:

Read in the data and get it into the required format:

cf = system.file("data","test_data.txt",package="AllSorts") # Get path to the raw text file (a TSV)
counts = read.table(file=cf,sep='\t',stringsAsFactors = FALSE,header = TRUE) # Tab-separated, first row is the header
head(counts)

Note that the row names of counts must be gene symbols; if they are not, you will need to convert them to hg19 gene symbols.
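If the row names are some other identifier, the conversion could look roughly like the base R sketch below. Everything here is hypothetical: the identifiers, the symbols, and the `id2symbol` lookup are placeholders, and in practice the mapping would need to come from an hg19 annotation of your choice. Dropping unmapped and duplicated symbols is one possible policy, not a recommendation from AllSorts.

```r
# Hypothetical count table whose row names are NOT gene symbols.
counts_ids <- data.frame(
  sampleA = c(10, 0, 25),
  sampleB = c(12, 3, 30),
  row.names = c("ID_A", "ID_B", "ID_C")
)

# Hypothetical lookup from identifier to gene symbol (ID_C has no mapping).
id2symbol <- c(ID_A = "GENE1", ID_B = "GENE2")

# Look up a symbol for every row; unmapped identifiers become NA.
symbols <- unname(id2symbol[rownames(counts_ids)])

# Keep rows with a symbol, and only the first row for any repeated symbol,
# because data frame row names must be unique.
keep <- !is.na(symbols) & !duplicated(symbols)
counts_sym <- counts_ids[keep, , drop = FALSE]
rownames(counts_sym) <- symbols[keep]

counts_sym
```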

Now use the streamline function to produce the log FPKM matrix, restricted to the genes required for classification:

library(AllSorts)
sfpkm = streamline(counts[,c(1:6)],counts$Gene_Length)
head(sfpkm)
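For readers unfamiliar with FPKM, the quantity can be sketched by hand in base R using the standard definition, FPKM = count × 10^9 / (gene length in bp × total counts in the sample). This is not the code inside streamline: the toy counts and gene lengths are invented, and the log base and the +1 pseudocount are assumptions that may differ from what AllSorts does.

```r
# Invented raw counts: 3 genes (rows) x 2 samples (columns).
toy_counts <- matrix(
  c(100,  200,
    300,  600,
    600, 1200),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GENE1", "GENE2", "GENE3"), c("s1", "s2"))
)
gene_length <- c(GENE1 = 1000, GENE2 = 2000, GENE3 = 3000)  # in base pairs

# Library size: total counts per sample.
lib_size <- colSums(toy_counts)

# Divide each column by its library size, each row by its gene length,
# then scale by 1e9 (per kilobase, per million).
fpkm <- sweep(toy_counts, 2, lib_size, "/")
fpkm <- sweep(fpkm, 1, gene_length, "/") * 1e9

# Log transform; log2 and the +1 pseudocount are assumptions.
log_fpkm <- log2(fpkm + 1)
log_fpkm
```

Sample `s2` has exactly twice the counts of `s1` for every gene, so after dividing by library size the two columns come out identical, which is the point of the normalisation.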

Once the FPKM matrix has been constructed and subset to only the genes required by the classifier, the dataset can be passed straight to the classifier:

threshes = c(0.25,0.25,0.75,0.75)
classed<-classify(sfpkm,threshes)
classed

The above thresholds are the minimum probabilities for calling a sample ERG, ETV, Phlike or Other respectively.

Note: the classify function already has a set of predetermined thresholds for calling a sample as one of the classes, based on the random forest probabilities. However, as demonstrated above, a user can define their own thresholds based on their desired sensitivity and specificity for calling a given class. **Defaults are recommended** (ERG: 0.25, ETV: 0.25, Phlike: 0.5, Other: 0.75).

So as we can see, in this toy dataset, we classify two samples as ETV, one as ERG, and three others.

We may want to visualise our samples to see how different they are from one another and whether samples with the same classification actually cluster together nicely:

visualise(sfpkm,classed)

Here the separation is clear and gives us confidence that we are distinguishing the classes well!
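To check clustering independently of the package, a generic principal component analysis is one option. The sketch below uses base R's `prcomp` on simulated data; it is not a description of what visualise does internally, and the matrix, the class labels, and the shift applied to the first three samples are all invented for illustration.

```r
set.seed(1)

# Simulated log-expression matrix: 50 genes (rows) x 6 samples (columns).
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("s", 1:6)))

# Shift the first 10 genes in samples s1-s3 to mimic a distinct class.
expr[1:10, 1:3] <- expr[1:10, 1:3] + 5
labels <- rep(c("classA", "classB"), each = 3)  # invented class labels

# prcomp expects samples in rows, so transpose the genes x samples matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
pcs <- pca$x[, 1:2]

# Scatter plot of the first two components, coloured by class label.
plot(pcs, col = as.integer(factor(labels)), pch = 19,
     xlab = "PC1", ylab = "PC2")
text(pcs, labels = rownames(pcs), pos = 3)
```

With real data, the same pattern would apply to a log FPKM matrix such as sfpkm, assuming it is a numeric genes-by-samples matrix, with the class calls supplying the colours.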

But if we have plenty of samples, we may wish to get an overview of the probability landscape, showing which samples are more likely to be assigned to a particular class and how samples assigned to the same class compare. AllSorts provides a nice visualisation for this:

probvis(classed)