# Architecture: data storage
The database schema is "owned" by sm-graphql. All changes must be made to the TypeScript `model.ts` files and applied through TypeORM migrations. More info in SQL migrations.
See Dev-environment-tips#set-up-datagrip-to-connect-to-the-production-databases for an easy way to connect DataGrip to prod without needing the VPN.
If you're good with SQL, sometimes it's fastest to just SSH to a server and run commands directly, e.g. `sudo -u postgres psql sm postgres -c "SELECT * FROM graphql.dataset WHERE id = '...'"`
Be aware that the engine tests need a static copy of the DB schema. This file is generated in the sm-graphql project by running `yarn run gen-sql-schema`. It should be checked in to git so that the CI can run engine tests without needing to set up an environment to run sm-graphql.
The decision to use TypeORM was made after evaluating all popular Python and TypeScript ORMs at the time, and finding that only TypeORM had satisfactory performance, automated migration generation and a strong connection between the database schema and the language's static type checking. Additionally, as sm-graphql handles significantly more SQL operations than engine, it made sense to prioritize convenience in sm-graphql.
There are currently two SQL namespaces (called "schemas" in SQL terminology, but they shouldn't be confused with the schema definition): `public` and `graphql`. `graphql` was introduced in an attempt to keep the engine and sm-graphql applications independent - a decision that was soon found to slow development with no benefit. After a group discussion, we decided all new tables should be in the `public` namespace unless there's a good organizational reason not to, such as a new many-to-many table between two existing `graphql` tables.
This is an awkward separation of responsibility - engine manages creation/deletion of `public.dataset` and sm-graphql manages creation/modification/deletion of `graphql.dataset`; however, both applications can read both tables. They are not connected by a foreign key, and occasionally they go out of sync. This was not an intentional design decision - it was just too difficult to refactor and migrate into a clean state at the time. engine's lack of an ORM makes it tedious and error-prone to add new columns to `public.dataset`.
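Because the two tables aren't linked by a foreign key, drift has to be detected manually. Assuming both tables are keyed by the same dataset id (an assumption - the real schemas may differ), a consistency check might look like:

```sql
-- Hypothetical check: datasets present in engine's table but missing
-- from sm-graphql's table (joining on a shared id is an assumption)
SELECT p.id
FROM public.dataset p
LEFT JOIN graphql.dataset g ON g.id = p.id
WHERE g.id IS NULL;
```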
Where to place a column depends on which application is responsible for saving it to the database. Most new columns should go in `graphql.dataset` because sm-graphql handles most basic CRUD operations. Columns should only be added to `public.dataset` if the value is calculated/modified from engine code.
We're still on an old version of Postgres - no particular reason for this, we just haven't invested the time to upgrade yet. The main drawbacks of the old version:

- It has a limited set of JSON functions
- It doesn't cache evaluation of JSON operations, which can make JSON access in the `annotation` table very slow. E.g. if you SELECT or use a WHERE filter on `dataset.config->'isotope_generation'` in a query that joins to `annotation`, it will JSON-decode `dataset.config` for EVERY output row, even if there's only 1 dataset and 10000 annotations. To get around this, use `WITH` CTEs to make subqueries that filter/extract the `dataset` JSON fields separately from the query that joins to `annotation`.
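As a sketch of this workaround (the table and JSON field names follow the example above, but the join column `a.dataset_id` and selected columns are illustrative assumptions):

```sql
-- Slow: dataset.config is JSON-decoded once per output row of the join
SELECT a.id, d.config->'isotope_generation' AS isotope_generation
FROM annotation a
JOIN dataset d ON d.id = a.dataset_id;

-- Faster: decode the JSON once per dataset in a CTE, then join.
-- In older Postgres versions, CTEs act as optimization fences,
-- so the JSON extraction isn't pushed into the big join.
WITH ds AS (
  SELECT id, config->'isotope_generation' AS isotope_generation
  FROM dataset
)
SELECT a.id, ds.isotope_generation
FROM annotation a
JOIN ds ON ds.id = a.dataset_id;
```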
ElasticSearch is used for searching, retrieving and applying authorization rules for datasets and annotations. It's possible to 100% recreate the ElasticSearch index based on data stored in Postgres using `scripts/update_es_index.py`, however: 1) this process takes days to run, and 2) some of the data (m/z values, isomers/isobars) is computed on-the-fly and can only be computed in Python code.
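The overall reindexing pattern looks roughly like this (a hypothetical Python sketch, not the actual `update_es_index.py` code - the function names, fields and `es_client` interface are illustrative; the `index(index=..., id=..., body=...)` call mirrors the elasticsearch-py client):

```python
def reindex_datasets(es_client, fetch_datasets, compute_derived_fields, index='dataset'):
    """Rebuild an ES index from Postgres rows.

    fetch_datasets: callable yielding row dicts from Postgres (hypothetical).
    compute_derived_fields: Python-side computation of fields that are not
    stored in Postgres (e.g. m/z values, isomers/isobars), which is why the
    index can't be rebuilt from SQL alone.
    """
    count = 0
    for row in fetch_datasets():
        doc = dict(row)
        doc.update(compute_derived_fields(row))  # on-the-fly fields
        es_client.index(index=index, id=row['id'], body=doc)
        count += 1
    return count
```

Injecting the client and the fetch/compute callables keeps the loop testable with a stub instead of a live cluster.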
It's pretty messy to work with, so we try to minimize the surface area interacting with ElasticSearch. All schema definitions and write operations should be kept in `es_export.py`, and all queries should be kept in `esConnector.ts`.
I really can't give any other development tips - every time I've worked with ElasticSearch I've needed to read the docs.
- `/var/www/mol-images` - Molecular database molecule images (served through nginx)
- `/var/www/webapp` - Compiled webapp (served through nginx)
- `/var/www/webapp/old_ver` - Previous compiled version of webapp (also served through nginx, allowing the website to still work even if people don't refresh for a while after a deployment)
- `/opt/dev/metaspace/metaspace/graphql/health.json` - When there is a maintenance message or read-only mode is active, graphql will use this file to persist the status even if the process is restarted. This file is usually removed automatically during Ansible deployment.
- `/opt/data/metaspace/public/raw_optical_images/` - Contains raw user-uploaded optical images. After these are submitted, sm-api also makes several lower-resolution copies that are stored in S3.
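The `health.json` mechanism is just "persist state to a file so it survives a restart". As an illustration of the pattern (a hypothetical Python sketch - the real graphql code is TypeScript, and the actual file format and field names aren't documented here):

```python
import json
import os

# Path from the list above
HEALTH_PATH = '/opt/dev/metaspace/metaspace/graphql/health.json'

def save_health(path, message=None, read_only=False):
    """Persist maintenance status so it survives a process restart.
    Field names are illustrative, not the real schema."""
    with open(path, 'w') as f:
        json.dump({'message': message, 'read_only': read_only}, f)

def load_health(path):
    """Return the persisted status, or None if the file was removed
    (e.g. cleaned up by an Ansible deployment)."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```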
The AWS CLI is much easier to use than the web interface. Once set up, you can do e.g. `aws s3 ls s3://sm-engine-upload | less` to view a bucket, or just `aws s3 ls` to list buckets.