Skip to content

Hub Technical Details

moulliet edited this page Sep 30, 2016 · 8 revisions

overview

The Hub is designed as a fault tolerant, highly available service for data distribution and storage.

It uses a in-process short term cache, called Spoke, to provide read after write consistency and low latency performance.

For the sake of simplicity, all Spoke operations assume "all data is everywhere". This allows us to ignore the complexity of hash rings for sharding data. While we recommend a minimum cluster of 3 servers for fault tolerance, the hub can run with just a single server.

Long term storage is provided by S3. The modular design allows the addition of new datastores.

time

A key for an item inserted into the hub is the combination of the current system time in UTC and a hash to break millisecond ties, {year}/{month}/{day}/{hour}/{minute}/{second}/{hash} To provide consistent ordering for sequential writes, the system time between the instances needs to be coordinated.

writes

Items written to the hub are immediately compressed and then assigned a key. Each item is concurrently written to all Spoke instances. A successful Spoke write is defined as at least 50% of the nodes.

After a successful Spoke write, the item is put on an internal bounded queue to write to S3. A background process is used to verify that all items actually exist in S3.

reads

Read requests attempt to read from Spoke first, if the item is within the Spoke TTL (defaults to 60 minutes). The hub attempts each Spoke nodes in random order until the item is found. If not found or outside the Spoke TTL, S3 is queried.

querying

Query requests are latest, earliest, next, previous and time ranges. If a query is determined to be entirely within Spoke's TTL, all Spoke nodes are queried concurrently. Otherwise, S3 is also queried concurrently. The union of all items found are then filtered based on the request criteria.

callbacks

Webhook, Websocket, Events, Replication and Global Replication are built on the query interface, and presume that items are written within 'app.stable_seconds' (default 5 seconds) of the key being assigned. All of our deployments to EC2 have been able to meet this timing requirement easily.

Monitoring

http://hub/health will show if a specific hub host is up, and what version it is running.

The hub uses New Relic for API level monitoring. Since all hub and spoke interactions are http, the transactions automatically show up in New Relic.

The hub also publishes detailed metrics data to Hosted Graphite and DataDog Both are more useful for detailed metrics per channel and the health of the system.