Home
To install candis right from scratch, check out our exhaustive guides:
- A Hitchhiker's Guide to Installing candis on Mac OS X
- A Hitchhiker's Guide to Installing candis on Linux OS (In Progress)
- A Hitchhiker's Guide to Installing candis on Windows OS (Contributors Wanted)
| Year | Student | Mentor(s) |
|---|---|---|
| 2017 | Achilles Rasquinha | Dr. Akram Mohammed, Dr. Tomas Helikar |
| 2018 | Rupav Jain | Dr. Akram Mohammed, Achilles Rasquinha |
CancerDiscover is a complete, end-to-end Machine Learning pipeline (from data preprocessing to model deployment) dedicated to DNA microarray data analysis and modelling. The toolkit includes an Affymetrix CEL-file-to-ARFF converter, predefined Search and Evaluator algorithm combinations for Feature Selection, and support for the SLURM Workload Manager. While the pipeline itself is written primarily in Perl, its dependencies include R, Java and a bit of Bash. This makes the pipeline quite heavy in terms of dependencies, and therefore difficult to deploy, especially on Windows machines.
The primary requirement for this summer was a neat Graphical User Interface built on top of CancerDiscover. Unlike many GSoC projects, which involve major contributions to existing code bases, GSoC '17 opened the way for a new, neatly rewritten OSS project - candis (a portmanteau of the words cancer and discover) - as a major extension and upgrade of the existing pipeline.
5th May 2017 - 30th May 2017
During the initial meeting of the Community Bonding Phase, Dr. Akram expressed a keen requirement to have the pipeline parse more than just two feature vectors (CancerDiscover's ARFF parser was, at the time, incapable of this). This limited users' freedom to build better prediction models, and thereby to gain a better understanding of the data passed through the pipeline. candis now ensures that the pipeline can parse data sets with any number of dimensions/features (Commit 7304753).
Another concern raised during the phase was to have a Rich Internet Application (RIA) instead of the Qt-based Graphical User Interface that had been initially proposed. This would help create a better ecosystem for the to-be-built application, with great flexibility in design, server-based processing and easy deployment. We zeroed in on Python as our default language for rewriting the pipeline, with a Flask-based server (under the candis.app.server module) and React (under the candis.app.client module) as our front-end framework for the RIA. candis now has a dedicated sub-module for the RIA under the candis.app Python module.
In short, we were able to transform from this (a prototype submitted during the Application Phase)
...to this
30th May 2017 - 30th June 2017
Behind the Curtains
candis currently comprises two IO Handlers, namely:
| Object | Extension | Purpose |
|---|---|---|
| CData | .cdata | Parsing, Viewing, Pre-processing, Converting and Serializing input data sets. |
| Pipeline | .cpipe | Configuring, Initiating, Manipulating and Running the ML Pipeline. |
The primary goal of these IO handlers is to work flexibly with the RIA as well as the Command Line Interface. A candis.CData object instance acts as a wrapper around the input data set file and caters to genome data files (in this case, Affymetrix CEL files) (NOTE: extending this to other file formats is open for contribution). Currently, a CData object instance handles Quality Control checks (Background Correction, Normalization, Phenotype Microarray Correction and Summarization) in order to generate expression set values. There's not much magic here: we simply take the Bioconductor library affy as a dependency to perform pre-processing on an AffyBatch (a list of CEL files). What made this possible is rpy2, a dependency that acts as an interface between Python and R, thereby allowing candis to pre-process CEL files (Commit 4ae5621).
R down, Perl/Bash to go.
Draw the Curtains
Viewing and manipulating CDATA files on the RIA required a rich, Excel-like viewer and editor widget. I implemented four primary React components - FileEditor, DataEditor and FileViewer - wrapped around a multi-purpose Modal component. We used react-data-grid as an extension to the above components.
Check out the demos on YouTube:
Curtains, and ReST.
I love ReST, and candis's app is very much ReST-driven. Since our pipeline consists of chunked stages, it's best to have chunked routes too. candis currently exposes the following major routes:
| Route | Purpose |
|---|---|
| /api/data | Reading, Writing, Resource Discovery (Fetching Files) |
| /api/pipeline | Running Pipelines and Querying Status |
| /api/preprocess | Querying currently available Preprocessing methods |
| /api/featselect | Querying currently available Feature Selection methods |
| /api/model | Querying currently available learning algorithms |
I believe there's room for improvement here. First, the pipeline currently runs as a single-threaded process; the routes /api/preprocess, /api/featselect and /api/model must therefore incorporate suitable IO-based actions. Second, the overall overhead within /api/pipeline should be reduced (I've seen the application leave errors unhandled, leaving the client in a confused state; moreover, pipelines surely require better thread management and throughput).
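To make the improvement concrete, here is a minimal, hypothetical sketch (not candis's actual code; the names `submit`, `status` and `JOBS` are made up for illustration) of how /api/pipeline could hand a run off to a background thread and let clients poll its status instead of blocking the single process:

```python
# Hypothetical sketch: a tiny background-job registry for pipeline runs.
import threading
import uuid

JOBS = {}  # job_id -> {"status": "running"|"done"|"error", "result": ...}

def submit(pipeline_fn, *args):
    """Run pipeline_fn in a background thread; return (job_id, thread)."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "running", "result": None}

    def worker():
        try:
            JOBS[job_id]["result"] = pipeline_fn(*args)
            JOBS[job_id]["status"] = "done"
        except Exception as err:  # surface errors instead of losing them
            JOBS[job_id] = {"status": "error", "result": str(err)}

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return job_id, thread

def status(job_id):
    """What a polling route such as GET /api/pipeline/<job_id> could return."""
    return JOBS[job_id]["status"]
```

A route handler would then return the job id immediately and let the client poll until the status flips to "done" or "error".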
The problem with ReST is that there's no standardized response structure (unlike GraphQL), so I went ahead and built one. candis's Response object is a combination of the JSend and Google JSON Style Guide specifications (Commit b81fabc).
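The gist of such an envelope can be sketched in a few lines (an assumed shape for illustration, not candis's actual Response class): every reply carries a status plus either a data payload or an error message, so clients can parse every route uniformly.

```python
# A minimal JSend-style response envelope (illustrative sketch only).
def make_response(data=None, error=None):
    """Wrap a payload or an error into one uniform response shape."""
    if error is not None:
        return {"status": "error", "message": str(error), "data": None}
    return {"status": "success", "message": None, "data": data}
```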
Why ReST-driven? Primarily because Machine Learning pipelines for DNA microarrays are computationally heavy in terms of both memory and speed. A server-side, execution-based application structure makes candis easily deployable. In fact, an instance of candis.app is continuously deployed on Heroku.
Customizing candis to one's needs is easy. I like the way the addict library revamps Python's dict object with a JavaScript Object-like interface. I worked extensively on building candis's Config data structure, which comprises configuration parameters for the CLI and the RIA, and even the Pipeline (Commit df86a89). A candis.Config object works similarly to Python's dict, with the exception that it is an n-ary tree.
Each leaf node of the tree holds a configuration value. A leaf node is denoted by an uppercase attribute, whereas each internal node is denoted by a capitalized attribute. - excerpt from the Documentation
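The attribute-access, tree-like behaviour described above can be sketched in a few lines (a toy sketch inspired by addict, not candis's actual Config implementation):

```python
# Toy sketch: a dict whose keys are also attributes, and whose missing
# internal nodes are created on demand, giving an n-ary configuration tree.
class Config(dict):
    def __getattr__(self, key):
        if key not in self:
            self[key] = Config()   # internal node, created on demand
        return self[key]

    def __setattr__(self, key, value):
        self[key] = value          # leaf node: a plain configuration value

config = Config()
config.App.Server.PORT = 5000      # uppercase attribute -> leaf node
```

Because `Config` is still a `dict`, the same tree remains accessible via plain item access (`config["App"]["Server"]["PORT"]`).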
This is inspired by CancerDiscover's Configuration.txt file, which helps users customize pipelines. In the case of candis, however, you customize the entire application's state. Custom configuration comes with a cache manager, and thanks to it, candis.Cache can now customize your configuration with values present in your $HOMEPATH/.candis/config.json file (Commit 88fbf331).
The canonical way of importing candis is as follows:
>>> import candis
That's it! You've got access to pretty much everything candis has to offer.
candis is built with abstraction and modularity in mind, which means its high-level APIs put powerful ML models only a few LOCs away. For instance, converting a CData instance to ARFF is as easy as:
>>> cdata = candis.CData.load('path/to/filename.cdata')
>>> cdata.toARFF('path/to/filename.arff')
or how about re-configuring and running a pipeline:
>>> pipeline = candis.Pipeline(config = { 'preprocess': { 'background_correction': 'rma' }})
>>> pipeline.run(cdata)
The same goes for registering widgets. Almost all the tools necessary to build a pipeline can be accessed via the ToolBox component. As one can see, the toolbox consists of various instances of a Compartment object, each of which in turn consists of various Tool widgets. New compartments and tools can be registered within the compartments metadata object. You could even have them registered asynchronously! Compartment's fetcher prop will go ahead and build that for you (Commits df86a898, 939f4de, 3e392e2, d9fdae5, 4be1f06).
There is scope for improvement here: a standardized data response for tools fetched asynchronously, and a direct mapping between tools and pipeline stages. As of now, there isn't any.
30th June 2017 - 29th August 2017
Behind the Curtains
Currently, the RIA represents a Pipeline as a sequence of stages, and I believe this is open for exploration, primarily because ML pipelines can also be viewed as graphs (which had been my initial intuition and prototype). A data-flow paradigm is in fact ideal; but for now, sequences of stages it is.
CancerDiscover uses a pre-defined lookup list of Feature Selection algorithm combinations provided by WEKA. candis makes this far more robust and "tweakable" by letting you register algorithm combinations within your $HOMEPATH/.candis/config.json file. The current structure is as follows:
>>> import random
>>> random.choice(candis.CONFIG.Pipeline.FEATURE_SELECTION)
{'evaluator': {'name': 'CfsSubsetEval'},
 'search': {'name': 'BestFirst', 'options': ['-D', '1', '-N', '5']},
 'use': False}
As one can see, you pass not just the Evaluator and Search class names but also a set of desired parameters (options). This, however, isn't available on the RIA yet and is therefore open for contribution. There must be a neat way of passing parameter metadata upfront and values back.
Registering models works the same way:
>>> random.choice(candis.CONFIG.Pipeline.MODEL)
{'label': 'k-Nearest Neighbor', 'name': 'lazy.IBk', 'use': False}
Adding the options parameter would let you tweak models too.
candis uses python-weka-wrapper (which in turn uses the python-javabridge library to run and access the Java Virtual Machine) to utilize (almost) everything WEKA has to offer. A quick warning for macOS + Python 3 users: candis uses a bleeding-edge version of python-javabridge (check out python-javabridge's Issue #111).
And that's how you script in Python. Perl/Bash out (Commit da4c8d9).
Draw the Curtains
Like I said, a graph representation of a Pipeline would have been a better data structure to wrap up our "lego blocks". As of now, however, the application helps you register them as sequences of stages. For this, I built the following major React components: DocumentProcessor, DocumentPanel and PipelineEditor.
- Currently, candis parses Affymetrix CEL files alone. Dr. Akram expressed a keen desire to include more DNA microarray IO handlers in the candis.ios module.
- I'd initially worked on querying, searching and downloading data sets from the National Centre for Biotechnology Information (NCBI) repository. I'd set this aside in order to focus more on the RIA. The work (in progress) can be found under the candis.data.entrez module, which acts as an API wrapper around NCBI's extremely powerful server-side API, Entrez.
- I'd like a toDataFrame getter attached to the CData object (one that returns a pandas.DataFrame). This would open an array of opportunities (no pun intended) to perform data analysis on DNA microarrays within Python (no wrappers, no dependencies, just pure Python).
- One thing that was lacking this summer was unit testing and documentation. While I'm aware that the two go hand in hand when developing software, I didn't happen to prioritise them. I'm keen to write test cases and document the entire project in the future, and would welcome contributions.
- Adding a Continuous Integration platform (Travis CI seems to be my favourite) is a must.
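The toDataFrame idea above could be sketched as follows. This is hypothetical, not candis's code; the function name and the sample/feature/matrix inputs are assumptions made purely for illustration:

```python
# Hypothetical sketch of the proposed toDataFrame: turn the expression
# matrix a CData-like object holds into a pandas.DataFrame for analysis.
import pandas as pd

def to_data_frame(samples, features, matrix):
    """Build a DataFrame with one row per sample, one column per feature."""
    return pd.DataFrame(matrix, index=samples, columns=features)

df = to_data_frame(
    samples=["GSM1", "GSM2"],          # made-up sample identifiers
    features=["probe_a", "probe_b"],   # made-up probe names
    matrix=[[1.2, 3.4], [5.6, 7.8]],   # made-up expression values
)
```

With the data in a DataFrame, the whole scikit-learn/pandas ecosystem becomes available without any R or Java dependency.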
candis, although written by me, is a brainchild of Dr. Akram and his team. I can't thank him enough for the immense support he provided throughout the bonding and development phases. He's zealous when it comes to building neat interfaces for the Bioinformatics community (there's a dire need for them), and we hope candis meets these goals. Right from providing domain knowledge to building the end application, Dr. Akram was there to guide. Without his help and support (resources, guidance and mentorship), candis wouldn't have ended up cutting-edge. He puts a great amount of faith in his students to achieve the desired results, and I thank him immensely for having that faith in me to get this almost production-ready by the end of the programme. I'd also like to thank Dr. Tomas for giving me the green light to build a server-based application, and for his constructive inputs during the initial stages of development.
Following were the major goals to be achieved this summer:
- Include more DNA Microarray IO handlers than just AffyMetrix CEL files.
- User authentication.
- Database-driven application.
- Download CEL files from NCBI using entrez utility.
- Documentation
- Unit Testing - frontend
- Unit Testing - backend
24th April 2018 - 13th May 2018
In this phase, I got to know more about candis from the mentors in detail. Mr. Achilles and Dr. Akram mentioned they wanted candis production-ready by the end of the GSoC programme. I was required to implement user authentication, make candis a database-driven application, and get it deployed on a platform for everyone to use. candis already had a feature to convert microarray gene expression data in the form of CEL files into equivalent ARFF files. These ARFF files are what the WEKA platform works on; the platform is used for data analysis and predictive modelling. Mr. Achilles wanted me to implement a function to convert these ARFF files into a pandas data frame, which would open the option of using Python directly for the same machine learning tasks via scikit-learn/Keras/TensorFlow. This was done in PR #98. Before merging anything into the codebase, I was required to integrate a continuous integration tool, Travis, with candis. This ensures that every commit is added to the codebase only after a successful Travis build. We also needed testing tools to check that the app was working as expected. I chose pytest for Flask testing and Jest for ReactJS testing. Jest is the tool used by Facebook (creators of the React library), so it already has good support online. By the end of this phase I had successfully set up my development environment and Travis with test-coverage tools. I had also added a few test cases using Jest.
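A minimal Travis configuration for such a dual Python/JavaScript setup could look like the sketch below (a hedged illustration only; the Python version, script names and file layout are assumptions, not candis's actual .travis.yml):

```yaml
# Hypothetical .travis.yml sketch for back-end + front-end test runs.
language: python
python:
  - "3.6"
install:
  - pip install -r requirements.txt   # Flask/back-end dependencies
  - npm install                       # React/front-end dependencies
script:
  - pytest                            # back-end tests
  - npm test                          # Jest front-end tests
```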
14th May 2018 - 13th June 2018
In this phase, I started implementing a feature to download CEL files from NCBI. Mr. Achilles had already set up the entrez module for this purpose; I was required to enhance and complete the feature. I configured the get-candis script, which can install candis on a bare Linux/macOS container/machine without any trouble. I added some basic UI features <--------insert images------------>. Refer to #56, #51 and #52. I completed the Entrez feature in this phase. This was the layout/logic used in designing the entrez utility in candis...
And the UI part of this utility consists of two modal pages: a Formik form to search for data, and a ReactDataGrid data grid for the user to select and download one of the many CEL files available.
See the entrez feature in action:
In this period, there was a lot of discussion about the database-driven application. We agreed upon using PostgreSQL as the database and SQLAlchemy as the ORM.
14th June 2018 - 13th July 2018
In this phase, I implemented user authentication, added database support, and made the endpoints private. For user authentication, I used JWT tokens instead of session management. I created tables for storing user data, the candis pipelines a user creates, and the response of each endpoint (error or successful response). The advantage of user authentication is that users can have private pipelines. For this, I implemented a one-to-many relationship between the users and pipeline tables. Refer to PR #121 for more details. For making forms (without tears 😭) and for data validation in ReactJS, I used Formik and Yup respectively.
14th July 2018 - 14th August 2018
In this phase, I deployed the application on a DigitalOcean droplet. As per Mr. Achilles's suggestions, I deployed the app with gunicorn as the application server and nginx as a reverse proxy, redirecting requests from clients to the gunicorn application. For monitoring and controlling the processes, I used a systemd script. The deployment was successful on a DigitalOcean droplet.
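For reference, the systemd part of such a setup typically looks like the unit file below. This is a hedged sketch; the unit name, user, paths and WSGI entry point are assumptions for illustration, not candis's actual deployment files:

```ini
# /etc/systemd/system/candis.service - hypothetical example unit.
[Unit]
Description=candis gunicorn daemon
After=network.target

[Service]
User=www-data
WorkingDirectory=/srv/candis
ExecStart=/srv/candis/venv/bin/gunicorn --workers 3 --bind unix:candis.sock wsgi:app
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

nginx then proxies incoming HTTP requests to the `candis.sock` Unix socket, and `systemctl restart candis` restarts the workers after a deploy.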
How to tackle problems, and write quality code!
Know more about candis and my experience in the first phase of GSoC '18 on my WordPress blog. Get started with testing via my Medium story.
It was a dream come true ✨ for me when I was given the opportunity to work on candis in Google Summer of Code 2018. I am extremely thankful to my mentors for giving me this opportunity. Mr. Achilles and Dr. Akram have been great mentors, guiding and supporting me through this remarkable journey. I am indebted to Mr. Achilles for sharing his knowledge and bringing me forward in the field of web development. It was Mr. Achilles who introduced me to a very useful editor, vim. Mr. Achilles focuses on code quality, application scalability, and the use of conventional principles (like the twelve-factor app, YAGNI and KISS) when building a web application. This helped me write quality code and structure the codebase as closely as possible to what Mr. Achilles had set up in candis before I started working. The faith my mentors instilled in me gave a boost to my confidence, and I was able to tackle and solve easy to hard issues in the project. Special thanks to Dr. Akram for his invaluable inputs on improving the user interface of candis.