Document store requirements gathering

Jonathan A Rees edited this page Sep 11, 2013 · 11 revisions

The term "study-object" refers abstractly (i.e. without commitment to representation) to what we currently represent in two ways: (1) as a combination of rows in various tables in the Phylografter MySQL database, and (2) as single NexSON files.

Software (any replicate) potential requirements

  1. Must be able to host at least 100,000 mutable study-objects. [JAR said 20,000, Duke said 100,000, currently about 2600, we know of about 8000 studies out there.]

  2. Must support study-object sizes up to 50MB. [JAR said 100MB, Duke 50MB. Current largest is ~24MB, but sizes could grow considerably with annotations. Automatic annotation could cause very rapid file size increase on large trees. Not sure what this implies as far as an upper limit, but something to think about.]

  3. (Not clear whether other kinds of objects need to be supported. Say nothing about this right now)

  4. Study-object format should be documented and extensible, and should retain backward compatibility

  5. Must have basic version control features: past versions of each object (indefinitely or limited to a maximum age? JAR: no maximum age, but perhaps frequency limited e.g. quarterly); commits (change sets) and commit messages; committer identification and timestamps; commit tags.

  6. Three modes of access:

    1. 'raw' mode, which allows access to the backend datastore via the Git protocol [Duke added 'git protocol'. JAR disagrees since in this context git is not a foregone conclusion.]
    2. 'visitor' mode allows people to view/download the raw data via HTTP/HTTPS
    3. 'developer' mode : a web API at api.opentreeoflife.org which speaks JSON
  7. Deploying api.opentreeoflife.org (what's that?) should be straightforward - decoupled from other services

  8. Should be attractive to developers

    1. Familiar technologies
    2. Automated test suites
    3. Low-overhead testing
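
To make requirement 5 more concrete, here is a minimal sketch of the version-control features it lists (past versions, commit messages, committer identification, timestamps, tags). It is illustrative only and assumes nothing about the eventual backend; the names `Commit`, `StudyHistory`, and the example study id and committer names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class Commit:
    """One immutable change set for a study-object (hypothetical model)."""
    content: str         # serialized study-object, e.g. NexSON text
    message: str         # commit message
    committer: str       # committer identification
    timestamp: datetime  # when the commit was made


@dataclass
class StudyHistory:
    """Append-only history of one study-object, with named tags."""
    study_id: str
    commits: List[Commit] = field(default_factory=list)
    tags: Dict[str, int] = field(default_factory=dict)

    def commit(self, content: str, message: str, committer: str,
               timestamp: Optional[datetime] = None) -> int:
        """Record a new version and return its version number."""
        when = timestamp if timestamp is not None else datetime.now(timezone.utc)
        self.commits.append(Commit(content, message, committer, when))
        return len(self.commits) - 1

    def tag(self, name: str, version: int) -> None:
        """Attach a tag name to an existing version."""
        if not 0 <= version < len(self.commits):
            raise IndexError(f"no version {version} for study {self.study_id}")
        self.tags[name] = version

    def get(self, ref: Union[int, str, None] = None) -> Commit:
        """Fetch a version by number or tag name; None means the latest."""
        if not self.commits:
            raise LookupError(f"study {self.study_id} has no versions")
        if ref is None:
            return self.commits[-1]
        version = self.tags[ref] if isinstance(ref, str) else ref
        if not 0 <= version < len(self.commits):
            raise IndexError(f"no version {version} for study {self.study_id}")
        return self.commits[version]
```

A caller would create a `StudyHistory`, call `commit(...)` for each edit, and use `tag(...)` to mark versions worth keeping by name. A retention policy such as the quarterly frequency limit JAR mentions would be a separate pruning step and is not modeled here.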

Hosting (production instance) potential requirements

  1. Survivability: Objects must be accessible in 'raw' mode even when there are no opentree-managed servers running. (That is, everyone on the opentree project can disappear or become delinquent or broke, and the wider world will still be able to access or fork the content, and also change the content (as hosted) if there is a cooperating party with sufficient privileges.) People will always be able to access the raw JSON data via some server. [Duke said 'raw.github.com' but we aren't assuming github at this point.]
  2. 'Raw' mode should have high availability, high bandwidth, low latency
  3. Plan needs to be in place for contingency of destitution (e.g., 'raw' service could be no-cost)
  4. Plan needs to be in place in case of problem with hosting service: if it ends operation, changes incompatibly, changes price so that service becomes unaffordable, or quality degrades to an unacceptable level
  • Authentication - how will users/curators identify themselves? Or does the contingency plan mean read-only world access?
    • Will we always allow anonymous read access? Anonymous full-database scraping is effectively a DDoS, so we will need policies to throttle the bandwidth and API calls of greedy clients. (Perhaps the hosting service will take care of this.)
    • For write access, how will we manage the creation, storage and expiration of API tokens (if at all)?
  • How does the datastore integrate with the open tree backup strategy? (Added by Duke; JAR does not understand)
  • How will updates to the production system be deployed?
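
As one possible answer to the throttling question above, here is a minimal token-bucket sketch for limiting API calls per client. This is an assumption about how throttling might be done, not a decided policy; the class name `TokenBucket` and its parameters are hypothetical, and a hosting service may well provide this itself.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self._clock = clock              # injectable clock, useful in tests
        self._tokens = self.capacity     # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self._clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

In use, the server would keep one bucket per client (keyed by API token, or by IP address for anonymous readers) and reject or delay a request whenever `allow()` returns False. Bandwidth throttling could reuse the same class by passing the response size as `cost`.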