gooserocket is a computational bioinformatics research platform for running experiments on AWS. The goal is to provide a cheap way to:
- Schedule distributed batch jobs that grab/search/categorize/featurize data and store the results in S3.
- Provide a library of primitives for working with the major public databases and biological data formats.
- Spin up Jupyter notebooks with access to preprocessed data, and run ML training/prediction/analytics.
Design principles (not set in stone):
- Keep costs low: AWS spot instances will be used, so everything must be built to tolerate interruption.
- No idle resources. Only select artifacts in S3 are persisted.
- All jobs run on spot instances, with compute set up to be idempotent and interruptible.
- Minimize egress from S3/ECR and keep everything in the same region; HA is unnecessary. At $0.09/GB, pulling terabyte-scale data out of S3 gets expensive fast.
- Most of the scientific work happens in a Jupyter notebook. Heavy processing is handled by an orchestrator that takes in requests, spins up the necessary infra, and responds immediately with a response id. From the notebook, you poll on the response id until it has succeeded/failed, at which point the result file is in S3 (sketched below).
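A minimal sketch of that polling loop, assuming the orchestrator writes results to a known S3 key derived from the response id (the bucket name and key layout here are illustrative, not the actual schema):

```rust
use std::time::Duration;

// Poll S3 until the orchestrator has written the result for `response_id`.
// Returns the key of the result object.
async fn wait_for_result(s3: &aws_sdk_s3::Client, bucket: &str, response_id: &str) -> String {
    let key = format!("{response_id}/output.json"); // assumed key layout
    loop {
        // HeadObject succeeds once the result object exists.
        if s3.head_object().bucket(bucket).key(&key).send().await.is_ok() {
            return key;
        }
        tokio::time::sleep(Duration::from_secs(30)).await;
    }
}
```

Real code should distinguish "not found yet" from permission/network errors instead of retrying everything, but the shape is the same: submit, poll, then read the file from S3.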
Example experiments to support:
- Use AlphaFold to predict protein structure given an amino acid sequence.
- Compare a protein sequence against other protein sequences (FASTA / PSI-BLAST for biologically significant matches).
- Read in compounds from PubChem and run AutoDock Vina against a target protein in a distributed fashion.
- Run similarity analysis using the RxRx3 dataset.
- Build up a database of genomics experiment results from public data.
- Create a BLAST database in S3 and run BLAST against a reference genome to see alignment.
- gr-data: Lib code for getting & managing heterogeneous data
- Fn to get an amino acid sequence from UniProt given an id, e.g. P0DTC2 (see the sketch after this list)
- Fn to get a protein structure from the PDB given the PDB id, e.g. 6VXX
- Fn to access GenBank's nucleic acid sequences
- Investigate data-specific compression to minimize on-disk size.
- Import/export SMILES (spec)
- Import/export .pdb (link)
- Import/export .fasta (wiki)
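As a sketch of the gr-data surface, here's roughly what the UniProt fn could look like; the function name is made up, but the REST endpoint is UniProt's public API:

```rust
// Fetch the FASTA record for a UniProt accession, e.g. "P0DTC2".
// `fetch_uniprot_fasta` is a hypothetical name, not an existing gr-data fn.
async fn fetch_uniprot_fasta(accession: &str) -> Result<String, reqwest::Error> {
    let url = format!("https://rest.uniprot.org/uniprotkb/{accession}.fasta");
    reqwest::get(&url).await?.error_for_status()?.text().await
}
```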
- gr-bio: Algorithms
- AutoDock Vina scoring
- Implement everything in https://rosalind.info/problems/list-view/
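For a flavor of what gr-bio collects, here's one of the Rosalind problems (HAMM, "Counting Point Mutations") as a plain Rust fn:

```rust
// Rosalind HAMM: number of positions at which two equal-length DNA strings differ.
fn hamming_distance(a: &str, b: &str) -> usize {
    assert_eq!(a.len(), b.len(), "sequences must be the same length");
    a.bytes().zip(b.bytes()).filter(|(x, y)| x != y).count()
}
```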
- gr-engine: Launches experiments
- Launch a Jupyter notebook with the required deps
- Package an entrypoint into a spot instance fleet and schedule it (sketched below)
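A rough sketch of the scheduling half, assuming spot capacity is requested via RunInstances with a spot market option (the AMI, instance type, and user data are placeholders, and real code would launch a fleet rather than a single instance):

```rust
use aws_sdk_ec2::types::{InstanceMarketOptionsRequest, InstanceType, MarketType};

// Launch a spot worker whose user data boots the packaged entrypoint.
async fn launch_spot_worker(
    ec2: &aws_sdk_ec2::Client,
    user_data_b64: &str, // base64-encoded bootstrap script (placeholder)
) -> Result<(), aws_sdk_ec2::Error> {
    ec2.run_instances()
        .image_id("ami-0123456789abcdef0") // placeholder AMI with deps baked in
        .instance_type(InstanceType::C5Large)
        .min_count(1)
        .max_count(1)
        .user_data(user_data_b64)
        .instance_market_options(
            InstanceMarketOptionsRequest::builder()
                .market_type(MarketType::Spot)
                .build(),
        )
        .send()
        .await?;
    Ok(())
}
```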
- gr-infra:
- Parallelized s3:GetObject (see the sketch after this list)
- Parallelized s3:PutObject
- Create CloudFormation stack
- Update CloudFormation stack
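The parallelized GetObject could look roughly like this: a bounded stream of concurrent requests (the concurrency cap of 32 is arbitrary, and objects are assumed to fit in memory):

```rust
use futures::stream::{self, StreamExt};

// Fetch many S3 objects concurrently, capping the number of in-flight GETs.
async fn get_objects_parallel(
    s3: &aws_sdk_s3::Client,
    bucket: &str,
    keys: Vec<String>,
) -> Vec<anyhow::Result<Vec<u8>>> {
    stream::iter(keys)
        .map(|key| async move {
            let resp = s3.get_object().bucket(bucket).key(&key).send().await?;
            let body = resp.body.collect().await?; // buffer the streaming body
            Ok(body.into_bytes().to_vec())
        })
        .buffer_unordered(32) // at most 32 concurrent requests
        .collect()
        .await
}
```

buffer_unordered is the knob here: high enough to saturate the instance's network, low enough not to blow memory.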
- gr-cli: CLI for interacting with resources
- `./gr-cli deploy <target>`
- `./gr-cli infra datasources ls`
- `./gr-cli experiments ls`
- `./gr-cli jobs ls`
- `./gr-cli system prune`
- `./gr-cli shutdown all`
- gr-tracing: utils for tracing
- export to honeycomb.io for initial experiment monitoring
Currently, the infrastructure is just CloudFormation stacks deployed via AWS API calls from the Rust SDK, using these hardcoded CF templates.
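That deploy path, sketched against aws-sdk-cloudformation (the stack name and template body come from whatever hardcoded template is being deployed):

```rust
use aws_sdk_cloudformation::types::Capability;

// Create a stack from one of the hardcoded templates.
async fn deploy_stack(
    cfn: &aws_sdk_cloudformation::Client,
    name: &str,
    template_body: &str,
) -> Result<(), aws_sdk_cloudformation::Error> {
    cfn.create_stack()
        .stack_name(name)
        .template_body(template_body)
        // common-infra creates IAM roles, which requires this capability
        .capabilities(Capability::CapabilityNamedIam)
        .send()
        .await?;
    Ok(())
}
```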
NOTE: Not all required AWS resources are in CloudFormation yet, so the entire platform cannot be deployed cold: the security group, the SSH key pair, and the billing reports sent to a bucket still need to be set up outside CF.
To deploy the base infra (S3 buckets, IAM roles, etc.):
`./gr-cli deploy common-infra`
To build the Jupyter image:
`./gr-cli deploy jupyter-image`
To create a Jupyter notebook instance from the image above:
`./gr-cli deploy jupyter-notebook`
To stop the deployed Jupyter notebook instance:
`./gr-cli shutdown`
If Nix reports:
`error: experimental Nix feature 'nix-command' is disabled; use '--extra-experimental-features nix-command' to override`
add the following to ~/.config/nix/nix.conf:
`experimental-features = nix-command`
If Nix was just installed, restart your shell.
Check whether the nix-daemon is running and start it if not:
`sudo systemctl status nix-daemon.service`
`sudo systemctl enable nix-daemon.service`
`sudo systemctl start nix-daemon.service`
Add your user to the nix-users group:
`sudo usermod -aG nix-users $(whoami)`