gooserocket

gooserocket is a computational bioinformatics research platform for running experiments on AWS. The goal is to provide a cheap way to:

  1. Schedule distributed batch jobs for grabbing/searching/categorizing/featurizing data and storing the results in S3.
  2. Provide a library of primitives for working with the various public databases and biological data formats.
  3. Spin up Jupyter notebooks with access to the preprocessed data, and run ML training/prediction/analytics there.

Design goals

(Not set in stone)

  1. Keep costs low: AWS spot instances will be used, so everything is built to tolerate interruption.
    1. No idle resources. Only select things in S3 are persisted.
    2. All jobs run on spot instances, with compute set up to be idempotent and interruptible.
    3. Minimize egress from S3/ECR and keep everything in the same region; HA is unnecessary. At $0.09/GB, pulling terabyte-scale data out of S3 gets expensive fast.
  2. Most of the scientific work is done in a Jupyter notebook. Heavy processing is handled by an orchestrator that takes in a request, spins up the necessary infra, and immediately returns a response ID. From the notebook, you poll the response ID until it has succeeded or failed, at which point the result is a file in S3 (a sketch of this flow follows below).
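
A minimal sketch of that submit-then-poll flow, written as a tiny Rust program. The Orchestrator type, its submit/status methods, and the bucket/key names are illustrative placeholders, not the actual gr-engine API:

// Sketch: submit a heavy job to the orchestrator, poll its response id,
// then pick the result up from S3 once it has succeeded.
use std::{thread, time::Duration};

enum JobStatus {
    Pending,
    Succeeded { s3_key: String },
    Failed { reason: String },
}

struct Orchestrator; // placeholder for an HTTP/SDK client

impl Orchestrator {
    /// Submit a request; the orchestrator spins up infra and returns a response id immediately.
    fn submit(&self, _request: &str) -> String {
        "resp-1234".to_string()
    }
    /// Ask how a previously submitted request is doing.
    fn status(&self, _response_id: &str) -> JobStatus {
        JobStatus::Succeeded { s3_key: "results/resp-1234.parquet".to_string() }
    }
}

fn main() {
    let orch = Orchestrator;
    let response_id = orch.submit("featurize pubchem batch 42");

    // Poll until the job has finished; the result then lives in S3.
    loop {
        match orch.status(&response_id) {
            JobStatus::Pending => thread::sleep(Duration::from_secs(30)),
            JobStatus::Succeeded { s3_key } => {
                println!("done, fetch s3://gooserocket-results/{s3_key}");
                break;
            }
            JobStatus::Failed { reason } => {
                eprintln!("job failed: {reason}");
                break;
            }
        }
    }
}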

Aspirations

  • Use AlphaFold to predict protein structure from an amino acid sequence.
  • Compare a protein sequence against other protein sequences (FASTA / PSI-BLAST for biologically significant matches).
  • Read in compounds from PubChem and run AutoDock Vina against a target protein in a distributed fashion.
  • Run similarity analysis using RxRx3.
  • Build up a database of genomics experiment results from public data.
  • Create a BLAST database in S3 and run BLAST against a reference genome to inspect alignments.

To Do

  • gr-data: Library code for getting & managing heterogeneous data
    • Fn to get an amino acid sequence from UniProt given an accession, e.g. P0DTC2 (see the sketch after this list)
    • Fn to get a protein structure from the PDB given a PDB ID, e.g. 6VXX
    • Fn to access GenBank's nucleic acid sequences
    • Investigate data-specific compression to minimize on-disk size
    • Import/export SMILES - spec
    • Import/export .pdb - link
    • Import/export .fasta - wiki
  • gr-bio: Algorithms
  • gr-engine: Launches experiments
    • Launch a Jupyter notebook with the required deps
    • Package some entrypoint into a spot instance fleet and schedule it
  • gr-infra:
    • Parallelized s3:GetObject from S3
    • Parallelized s3:PutObject to S3
    • Create CloudFormation stack
    • Update CloudFormation stack
  • gr-cli: CLI for interacting with resources
    • ./gr-cli deploy <target>
    • ./gr-cli infra datasources ls
    • ./gr-cli experiments ls
    • ./gr-cli jobs ls
    • ./gr-cli system prune
    • ./gr-cli shutdown all
  • gr-tracing: Utils for tracing
    • Export to honeycomb.io for initial experiment monitoring
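
As a sketch of the kind of helper gr-data could provide (referenced from the UniProt item above): fetch the amino acid sequence for an accession as FASTA from UniProt's public REST endpoint and strip the header. This assumes a reqwest dependency (with the blocking feature) and the rest.uniprot.org FASTA route; the function name is made up for illustration.

// Sketch: get an amino acid sequence from UniProt given an accession, e.g. P0DTC2.
use std::error::Error;

fn uniprot_sequence(accession: &str) -> Result<String, Box<dyn Error>> {
    // Public UniProt REST endpoint returning the entry as FASTA.
    let url = format!("https://rest.uniprot.org/uniprotkb/{accession}.fasta");
    let fasta = reqwest::blocking::get(url)?.error_for_status()?.text()?;

    // Drop the ">..." header line and join the remaining sequence lines.
    let sequence: String = fasta
        .lines()
        .filter(|line| !line.starts_with('>'))
        .collect();
    Ok(sequence)
}

fn main() -> Result<(), Box<dyn Error>> {
    let seq = uniprot_sequence("P0DTC2")?;
    println!("P0DTC2 is {} residues long", seq.len());
    Ok(())
}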

Infra

Currently just CloudFormation stacks deployed via AWS API calls from the Rust SDK, using these hardcoded CF templates.

NOTE: Not all required AWS resources are in CF, so you can't yet deploy the entire platform cold: the security group, SSH key pair, and billing reports delivered to a bucket are created outside of CloudFormation.
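
A rough sketch of that deploy path using the AWS Rust SDK (aws-config, aws-sdk-cloudformation, and tokio for the async runtime). The stack name and template path below are placeholders, not the repo's actual values:

// Sketch: create a CloudFormation stack from a hardcoded template via the Rust SDK.
use aws_sdk_cloudformation::types::Capability;

#[tokio::main]
async fn main() -> Result<(), aws_sdk_cloudformation::Error> {
    // Region and credentials come from the environment / ~/.aws config.
    let config = aws_config::load_from_env().await;
    let cfn = aws_sdk_cloudformation::Client::new(&config);

    // Embed the template at compile time (hypothetical path).
    let template = include_str!("templates/common-infra.yaml");

    cfn.create_stack()
        .stack_name("gooserocket-common-infra")
        .template_body(template)
        .capabilities(Capability::CapabilityNamedIam) // required when the template creates IAM roles
        .send()
        .await?;

    Ok(())
}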

Usage

Setting up the experimentation environment

To deploy the base infra (S3 buckets, IAM roles, etc.):

./gr-cli deploy common-infra

To build the Jupyter image:

./gr-cli deploy jupyter-image

To create a Jupyter notebook instance from the image above:

./gr-cli deploy jupyter-notebook

To stop the deployed Jupyter notebook instance:

./gr-cli shutdown

Nix Environment Setup

Troubleshooting

error: experimental Nix feature 'nix-command' is disabled; use '--extra-experimental-features nix-command' to override

Add to ~/.config/nix/nix.conf

experimental-features = nix-command

error: getting status of /nix/var/nix/daemon-socket/socket: Permission denied

If Nix was just installed, restart your shell (or the machine) first.

Check if the daemon is running and start it if not

sudo systemctl status nix-daemon.service
sudo systemctl enable nix-daemon.service
sudo systemctl start nix-daemon.service

Add your user to the nix-users group (log out and back in for the change to take effect)

sudo usermod -aG nix-users $(whoami)