
Sharded Telemetry Server #349

Merged: 139 commits, Aug 12, 2021

Conversation

@jsdw (Collaborator) commented Jun 21, 2021

An async hyper+soketto+tokio (-actix) telemetry server, split into:

  • a telemetry_core binary, which the UI connects to for the information it displays, and which shards connect to in order to relay node information;
  • a telemetry_shard binary, which accepts JSON feeds from nodes and forwards them to the telemetry core process to be aggregated and sent on to UI feeds.

The telemetry core process works by spinning up a single aggregator loop which receives messages from shards and feeds, and updates the node state or subscribes/unsubscribes feeds as necessary.
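The single-aggregator-loop pattern described above can be sketched with std channels (a simplification; the real PR uses async tokio machinery, and the message and state types here are hypothetical):

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Hypothetical message type: the real PR has much richer shard/feed messages.
enum Msg {
    NodeUpdate { node_id: u64, best_block: u64 },
    Subscribe { feed_id: u64 },
    Unsubscribe { feed_id: u64 },
    Shutdown,
}

// Apply one message to the aggregator's state; returns false on shutdown.
// Because a single loop owns all state, no locks are needed.
fn apply(msg: Msg, nodes: &mut HashMap<u64, u64>, feeds: &mut Vec<u64>) -> bool {
    match msg {
        Msg::NodeUpdate { node_id, best_block } => {
            nodes.insert(node_id, best_block);
        }
        Msg::Subscribe { feed_id } => feeds.push(feed_id),
        Msg::Unsubscribe { feed_id } => feeds.retain(|&f| f != feed_id),
        Msg::Shutdown => return false,
    }
    true
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // The single aggregator loop: sole owner of node/feed state.
    let handle = thread::spawn(move || {
        let mut nodes = HashMap::new();
        let mut feeds = Vec::new();
        for msg in rx {
            if !apply(msg, &mut nodes, &mut feeds) {
                break;
            }
        }
        (nodes.len(), feeds.len())
    });

    tx.send(Msg::NodeUpdate { node_id: 1, best_block: 42 }).unwrap();
    tx.send(Msg::Subscribe { feed_id: 7 }).unwrap();
    tx.send(Msg::Shutdown).unwrap();
    assert_eq!(handle.join().unwrap(), (1, 1));
    println!("ok");
}
```

Shards and feeds both funnel into the one channel, which serialises all state updates without shared-memory locking.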

At some point:

  • Allow shards to connect to multiple Telemetry Cores, not just one. This would let us scale the core service too, spinning up more core instances as more feeds connect.
  • And/or allow a telemetry core to spin up multiple aggregator loops and distribute connected feeds across them.
  • Allow each aggregator to handle messages only for certain genesis hashes (chains), splitting the shard side of the work across aggregators rather than broadcasting every message to all of them.
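One plausible way to realise the last point (purely illustrative, not code from this PR) is to pick an aggregator deterministically from the genesis hash, so each chain's messages always land on the same loop:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical routing function: map a chain's genesis hash to one of
// `num_aggregators` aggregator loops, so shard traffic for a chain is
// handled by exactly one aggregator instead of being broadcast to all.
fn aggregator_for(genesis_hash: &str, num_aggregators: usize) -> usize {
    let mut h = DefaultHasher::new();
    genesis_hash.hash(&mut h);
    (h.finish() as usize) % num_aggregators
}

fn main() {
    // "0xdeadbeef" is a made-up placeholder genesis hash.
    let idx = aggregator_for("0xdeadbeef", 4);
    assert!(idx < 4);
    // The same chain always routes to the same aggregator:
    assert_eq!(idx, aggregator_for("0xdeadbeef", 4));
    println!("ok");
}
```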

I'll leave master alone for now (to avoid any automated deployment bits), and merge this into a long-lived sharded branch which can accommodate other changes (benchmark/e2e test tooling and any deployment config) before we merge it to master.

Manual testing

To try this thing out locally, you can do the following:

In the substrate repository:

Start up an "alice" node in one terminal and a "bob" node in another (both pointed at where shard 1 will listen).

cargo run -- \
  --tmp \
  --chain local \
  --alice \
  --port 30333 \
  --ws-port 9944 \
  --rpc-port 9933 \
  --node-key 0000000000000000000000000000000000000000000000000000000000000001 \
  --validator \
  --telemetry-url 'ws://localhost:8001/submit 1'
cargo run -- \
  --tmp \
  --chain local \
  --bob \
  --port 30334 \
  --ws-port 9945 \
  --rpc-port 9934 \
  --validator \
  --bootnodes /ip4/127.0.0.1/tcp/30333/p2p/12D3KooWEyoppNCUx8Yx66oV9fJnriXwCcXwDDUA2kj6vnc6iDEp \
  --telemetry-url 'ws://localhost:8001/submit 1'

Now, let's start two more nodes on a different chain and pointed to a different telemetry shard:

cargo run -- \
  --tmp \
  --chain dev \
  --alice \
  --port 30335 \
  --ws-port 9946 \
  --rpc-port 9935 \
  --node-key 0000000000000000000000000000000000000000000000000000000000000001 \
  --validator \
  --telemetry-url 'ws://localhost:8002/submit 1' \
  --name AliceDev
cargo run -- \
  --tmp \
  --chain dev \
  --bob \
  --port 30336 \
  --ws-port 9947 \
  --rpc-port 9936 \
  --validator \
  --bootnodes /ip4/127.0.0.1/tcp/30335/p2p/12D3KooWEyoppNCUx8Yx66oV9fJnriXwCcXwDDUA2kj6vnc6iDEp \
  --telemetry-url 'ws://localhost:8002/submit 1' \
  --name BobDev

In this repository+branch

On another three terminals (sorry about all the terminals...), run the following from within the backend folder of this repo:

The main telemetry process that UIs will connect to:

cargo run --bin telemetry

A couple of shards to receive telemetry and forward it on:

cargo run --bin shard -- -l 127.0.0.1:8001
cargo run --bin shard -- -l 127.0.0.1:8002

And finally, on yet another terminal and in the frontend folder of this repo, we'll start the UI up so that we can see what's going on:

yarn start

Now, visit http://localhost:3000 and watch the data start coming in as nodes connect to the started shards.

Have a go at killing shards, or the telemetry core, or nodes and see what happens.

backend/common/src/util/null.rs (outdated review thread, resolved)
backend/shard/src/aggregator.rs (outdated review thread, resolved)
adding frontend configmaps and envVars

optimizing docker-compose and Dockerfile
@niklasad1 (Member) left a comment:

Overall, the PR is very clean and looks good; maybe in the future we could "resultify" the APIs instead of doing so much unwrapping.

I didn't review the logic super carefully because it's already battle-tested; mostly style stuff ^

@dvdplm (Contributor) left a comment:

Overall LGTM. Good job here; I especially like the attention paid to docs & comments.

backend/common/src/id_type.rs (outdated review thread, resolved)
backend/common/src/rolling_total.rs (outdated review thread, resolved)
    self.0 = time;
}

pub fn increment_by(&mut self, duration: Duration) {
    self.0 += duration;
Contributor:

Is addition to Instant saturating?

jsdw (Collaborator, author):

I wouldn't assume so. This source is only actually used in tests, so if we overflow the time I think I'd be happy to chalk it up to programmer error and let it panic.
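For reference (not code from this PR): `Instant + Duration` is indeed not saturating; it panics if the result is unrepresentable, while `Instant::checked_add` surfaces the same condition as `None`:

```rust
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();
    // `+` on Instant panics on overflow; `checked_add` returns None instead.
    assert!(now.checked_add(Duration::from_secs(60)).is_some());
    // Duration::MAX overflows Instant's internal representation on
    // typical platforms, so checked_add reports the overflow:
    assert!(now.checked_add(Duration::MAX).is_none());
    println!("ok");
}
```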

backend/common/src/node_types.rs (outdated review thread, resolved)
set.insert("Polkadot");
set.insert("Kusama");
set.insert("Westend");
set.insert("Rococo");
Contributor:

Should we add the common-good parachains here as well? Statemine/Statemint?

jsdw (Collaborator, author):

I don't know enough to have an opinion on that, but I'd be happy to add them! I can't see any nodes from those chains on the current telemetry server yet; is that expected (I guess they aren't live just yet)?

backend/telemetry_core/src/state/chain.rs (outdated review thread, resolved)
/// Check if the chain is stale (has not received a new best block in a while).
/// If so, find a new best block, ignoring any stale nodes and marking them as such.
fn update_stale_nodes(&mut self, now: u64, feed: &mut FeedMessageSerializer) {
    let threshold = now - STALE_TIMEOUT;
Contributor:

Saturating sub here might be best?

jsdw (Collaborator, author):

If we ever hit an underflow error here we've done something very wrong, so I'd be happy to let it panic, personally.
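For comparison, this is what the suggested `saturating_sub` would do versus the panic-on-underflow the author prefers (the values here are made up for illustration):

```rust
fn main() {
    const STALE_TIMEOUT: u64 = 100;

    // Normal case: `now` is comfortably past the timeout, so plain
    // subtraction and saturating_sub agree.
    let now: u64 = 1_000;
    assert_eq!(now.saturating_sub(STALE_TIMEOUT), 900);

    // Pathological case: `early - STALE_TIMEOUT` would panic in debug
    // builds (and wrap in release); saturating_sub clamps to 0 instead.
    let early: u64 = 10;
    assert_eq!(early.saturating_sub(STALE_TIMEOUT), 0);
    println!("ok");
}
```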

@dvdplm (Contributor) commented Aug 11, 2021

There were a couple of warnings when generating the docs, trivial stuff.

README.md (outdated review thread, resolved)
@jsdw jsdw merged commit 705d57a into master Aug 12, 2021
@jsdw jsdw deleted the jsdw-sharding branch August 12, 2021 09:40