GitHub - mhk7/clixo_0.3: CliXO algorithm implementation version 0.3 (doi: 10.1093/bioinformatics/btu282)

mhk7 / clixo_0.3 Public

Notifications You must be signed in to change notification settings
Fork 2
Star 9

CliXO algorithm implementation version 0.3 (doi: 10.1093/bioinformatics/btu282)

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
boost		boost
example		example
LICENSE		LICENSE
README		README
boost.h		boost.h
clixo		clixo
clixo.cpp		clixo.cpp
clixo.o		clixo.o
clustersToDAG		clustersToDAG
clustersToDAG.cpp		clustersToDAG.cpp
clustersToDAG.o		clustersToDAG.o
corrector.h		corrector.h
dag.h		dag.h
dagConstruct.h		dagConstruct.h
dagConstruct.h~		dagConstruct.h~
extractOnt		extractOnt
graph_undirected.h		graph_undirected.h
graph_undirected_bitset.h		graph_undirected_bitset.h
makefile		makefile
nodeDistanceObject.h		nodeDistanceObject.h
ontologyTermStats		ontologyTermStats
util.h		util.h

Repository files navigation

Author: Michael Kramer (mhkramer7@gmail.com). Trey Ideker lab, UCSD.
Date: 5/20/2015
CliXO release 0.3
Please cite: Kramer M, Dutkowski J, Yu M, Bafna V, Ideker T. Inferring gene ontologies from pairwise similarity data. Bioinformatics, 30: i34-i42. 2014. doi: 10.1093/bioinformatics/btu282

This program is already compiled and should run without recompiling, but if it does not immediately work on your system then perform the following to compile.  This program requires g++ version 4.7 or newer to compile.  Simply enter the main folder and type:

make clixo

The program is run from the command line as:

./clixo SIMILARITY_NETWORK_FILE ALPHA BETA

The SIMILARITY_NETWORK_FILE is a file which contains three columns, tab separated.  Columns 1 and 2 are the two items (e.g. genes) which have some similarity measure between them.  Column 3 is the similarity value.  This similarity value must be positive.  Any missing similarity values in this file (i.e. no line is present for a pair of genes) are assumed to be zero.  We suggest using names including some text and not simply numbers to identify genes - if numbers are used, please make them consecutive and starting at 0 (otherwise internal terms may share the same name).  I recognize that may be an annoying requirement - sorry.

Please see the paper describing CliXO for a full discussion of the alpha and beta parameters.  Briefly, alpha is a noise parameter - a term T with a weight of W in the resulting ontology is created only if there is no possible parent term T_p with weight W_p > W - alpha.  Alpha can roughly be thought of as representing a significant difference in the similarity values - i.e. a difference which would not be due to noise only.  Beta is a parameter which deals with missing edges - we suggest setting beta to 0.5 (this will merge highly overlapping terms).

The CliXO algorithm will print diagnostic information to stdout (we suggest redirecting this to a file).  Diagnostic information will be preceeded with a "#" symbol for easy removal later.  Most of this information is self explanatory I hope.  As terms are found to be valid (i.e. they will be included in the final ontology), they are output with tag "Valid cluster:" at the beginning of the line, followed by a tab, all the genes in the term (comma separated), followed by another tab and then the similarity value at which this term was inferred.  This allows the user to view the output so far of CliXO even if it has not completed (oftentimes many of the smaller terms are inferred quite quickly whereas larger terms can take much longer).  At completion, CliXO will print "# Ontology is:" followed by the inferred ontology.  The inferred ontology will be in a three column, tab separated format where each line represents a relationship in the ontology.  Column 1 is the parent node (always an internal node or the root, identified by a number).  Column 2 is the child node (may be an internal node or a terminal node, i.e. a gene).  Column 3 is the distance between a parent and child node (distance = half the difference in term weight between the parent and child, where term weight is the similarity value at which the term was inferred).

Please feel free to contact the author with any questions, comments, or concerns.

Auxillary functions: 

extractOnt FILE THRESHOLD MIN_TERM_SIZE OUTFILE

This script will allow the user to "peek" at the results of an in process CliXO run. It takes as input a CliXO output file (FILE), a similarity threshold above which terms will be saved and below which terms will be ignored (THRESHOLD), a minimum term size to keep (MIN_TERM_SIZE) and the name of an output file to create (OUTFILE).

ontologyTermStats

This utility will allow users to look at the ontology created by clixo.  There are several different types of stats that it can generate, but the most useful are the options "size" and "genes".  Option "size" will return a two column file where column one is the term identifier and column two is the number of leaf nodes / genes annotated to that term (all annotations are propagated upwards in the ontology).  Option "genes" adds a third column which is a comma separated list of all the genes annotated to the term.

The folder example shows some example inputs and outputs on a very small toy ontology.  The results in this directory can be recreated by running the following commands from within this directory:

../clixo exampleInput 0 1 > exampleOutput
grep -v "#" exampleOutput > exampleOutputOnt
../ontologyTermStats exampleOutputOnt genes gene > exampleOutputOntStats
../extractOnt exampleOutput 0.5 2 examplePeek

NEW IN 0.2: Updated alpha parameter which is now size dependent, allowing for more accuracy for larger terms.  Minor bug fixes.
NEW IN 0.3: Significant speed improvements (2x - 1000x, with larger improvements seen in larger data sets)