eltonlaw/gooserocket

gooserocket

gooserocket is a computational bioinformatics research platform for running experiments on AWS. The goal is a platform that makes it cheap to:

  1. Schedule distributed batch jobs that fetch, search, categorize, and featurize data and store the results in S3.
  2. Work with the various public databases and biological data formats through a library of primitives.
  3. Spin up Jupyter notebooks with access to preprocessed data and run ML training, prediction, and analytics.

Design goals

(Not set in stone)

  1. Keep costs low: AWS spot instances will be used, so everything will be built to tolerate interruption.
    1. No idle resources. Only selected data in S3 will be persisted.
    2. All jobs run on spot instances, with compute set up to be idempotent and interruptible.
    3. Minimize egress from S3/ECR and keep everything in the same region; HA is unnecessary. At $0.09/GB, pulling terabyte-scale data out of S3 gets expensive fast (roughly $90 per TB).
  2. Most of the scientific work will be done in a Jupyter notebook. Heavy processing will be handled by an orchestrator that takes in requests, spins up the necessary infra, and responds immediately with a response id. From the notebook, you poll that response id until the job has succeeded or failed, at which point there will be a file in S3.
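
The notebook side of that request/poll flow might look roughly like the sketch below. Nothing here is the project's actual API: `wait_for_result`, the status strings, and the `get_status` callable are all hypothetical stand-ins for whatever the orchestrator ends up exposing. The status lookup is passed in as a function so the loop can be tried without any AWS resources.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real orchestrator may name these differently.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_result(
    response_id: str,
    get_status: Callable[[str], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status(response_id) until it reports a terminal state.

    Returns the terminal state, or raises TimeoutError if the deadline
    passes while the job is still in a non-terminal state.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(response_id)
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{response_id} still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)


# Stand-in status source for illustration only: two non-terminal answers, then success.
_fake_statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_result(
    "resp-123",  # made-up response id
    lambda _id: next(_fake_statuses),
    poll_seconds=0.0,
)
print(final_state)
```

In a real notebook, `get_status` would wrap the orchestrator call, and a `SUCCEEDED` result would be followed by reading the output file from S3.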

Aspirations

  • Use AlphaFold to predict protein structure from an amino acid sequence.
  • Compare a protein sequence to other protein sequences (FASTA / PSI-BLAST for biologically significant matches)
  • Read in compounds from PubChem and run AutoDock Vina on them against some target protein in a distributed fashion
  • Run similarity analysis using RxRx3
  • Build up a database of results of genomics experiments from public data
  • Create a BLAST database in S3 and run BLAST against a reference genome to inspect alignments

To Do

  • gr-data: Lib code for getting & managing heterogeneous data
    • Fn to get an amino acid sequence from UniProt given an id, e.g. P0DTC2
    • Fn to get a protein structure from PDB given the PDB id, e.g. 6VXX
    • Fn to access GenBank's nucleic acid sequences
    • Investigate data specific compression to minimize on-disk size.
    • Import/export SMILES - spec
    • Import/export .pdb - link
    • Import/export .fasta - wiki
  • gr-bio: Algorithms
  • gr-engine: Launches experiments
    • Launch a jupyter notebook with required deps
    • Package some entrypoint into a spot instance fleet and schedule
  • gr-infra:
    • Parallelized s3:GetObject from s3
    • Parallelized s3:PutObject to s3
    • create cloudformation stack
    • update cloudformation stack
  • gr-cli: CLI for interacting with resources
    • ./gr-cli deploy <target>
    • ./gr-cli infra datasources ls
    • ./gr-cli experiments ls
    • ./gr-cli jobs ls
    • ./gr-cli system prune
    • ./gr-cli shutdown all
  • gr-tracing: utils for tracing
    • export to honeycomb.io for initial experiments monitoring
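
As an illustration of the ".fasta import/export" item under gr-data, here is a minimal sketch of a reader and writer for the basic FASTA layout (a `>` header line followed by one or more sequence lines). The function names `parse_fasta` and `format_fasta` are hypothetical and not part of gooserocket, and the sketch ignores format variants such as `;` comment lines.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', full sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before the first '>' header")
    if header is not None:
        yield header, "".join(chunks)


def format_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Render records as FASTA text, wrapping sequences at `width` characters."""
    out: List[str] = []
    for header, seq in records:
        out.append(f">{header}")
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "".join(line + "\n" for line in out)


# Toy input for illustration only; these are not real database entries.
toy_text = ">toy_1 example record\nMKTAYIAK\nQRQISF\n>toy_2\nACGT\n"
toy_records = list(parse_fasta(toy_text.splitlines()))
print(format_fasta(toy_records, width=10))
```

Working on an iterable of lines rather than a whole string keeps memory use flat for large files, which matters if records are streamed from S3 objects.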

Infra

Currently just CloudFormation stacks, deployed through AWS API calls made with the Rust SDK using these hardcoded CF templates.

NOTE: Not all required AWS resources are defined in CF yet, so the entire platform cannot be deployed cold. Missing from the templates: the security group, the SSH key pair, and the billing reports sent to a bucket.

Usage

Setting up the experimentation environment

To deploy the base infra (S3 buckets, IAM roles etc.)

./gr-cli deploy common-infra

To build the jupyter image

./gr-cli deploy jupyter-image

To create a jupyter notebook instance out of the image above

./gr-cli deploy jupyter-notebook

To stop the deployed jupyter notebook instance

./gr-cli shutdown

Nix Environment Setup

Troubleshooting

error: experimental Nix feature 'nix-command' is disabled; use '--extra-experimental-features nix-command' to override

Add to ~/.config/nix/nix.conf

experimental-features = nix-command

error: getting status of /nix/var/nix/daemon-socket/socket: Permission denied

If Nix was just installed, restart and try again.

Check if the daemon is running and start it if not

sudo systemctl status nix-daemon.service
sudo systemctl enable nix-daemon.service
sudo systemctl start nix-daemon.service

Add user to nix-users

sudo usermod -aG nix-users $(whoami)

About

Bioinformatics framework
