
Sharded Telemetry Server #349

Merged: 139 commits, Aug 12, 2021

Conversation

@jsdw (Collaborator) commented Jun 21, 2021

An async hyper+soketto+tokio (-actix) telemetry server, split into:

  • a telemetry_core binary, which the UI connects to for the information it displays, and which shards connect to in order to relay node information;
  • a telemetry_shard binary, which accepts JSON feeds from nodes and forwards them to the telemetry core process to be aggregated and sent on to UI feeds.

The telemetry core process works by spinning up a single aggregator loop which receives messages from shards and feeds, and updates the node state or subscribes/unsubscribes feeds as necessary.
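The single-aggregator-loop pattern described above can be sketched with std channels (a simplification; the real PR uses async tokio machinery, and the message and state types here are hypothetical):

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Hypothetical message type: the real PR has much richer shard/feed messages.
enum Msg {
    NodeUpdate { node_id: u64, best_block: u64 },
    Subscribe { feed_id: u64 },
    Unsubscribe { feed_id: u64 },
    Shutdown,
}

// Apply one message to the aggregator's state; returns false on shutdown.
// Because a single loop owns all state, no locks are needed.
fn apply(msg: Msg, nodes: &mut HashMap<u64, u64>, feeds: &mut Vec<u64>) -> bool {
    match msg {
        Msg::NodeUpdate { node_id, best_block } => {
            nodes.insert(node_id, best_block);
        }
        Msg::Subscribe { feed_id } => feeds.push(feed_id),
        Msg::Unsubscribe { feed_id } => feeds.retain(|&f| f != feed_id),
        Msg::Shutdown => return false,
    }
    true
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // The single aggregator loop: sole owner of node/feed state.
    let handle = thread::spawn(move || {
        let mut nodes = HashMap::new();
        let mut feeds = Vec::new();
        for msg in rx {
            if !apply(msg, &mut nodes, &mut feeds) {
                break;
            }
        }
        (nodes.len(), feeds.len())
    });

    tx.send(Msg::NodeUpdate { node_id: 1, best_block: 42 }).unwrap();
    tx.send(Msg::Subscribe { feed_id: 7 }).unwrap();
    tx.send(Msg::Shutdown).unwrap();
    assert_eq!(handle.join().unwrap(), (1, 1));
    println!("ok");
}
```

Shards and feeds both funnel into the one channel, which serialises all state updates without shared-memory locking.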

At some point:

  • Allow shards to connect to multiple Telemetry Cores, not just one. This would let us scale the core service too, spinning up more core instances as more feeds connect.
  • And/or allow a telemetry core to spin up multiple aggregator loops and distribute connected feeds across them.
  • Allow each aggregator to handle messages only for certain genesis hashes (chains), splitting the shard side of the work across aggregators rather than broadcasting every message to all of them.
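One plausible way to realise the last point (purely illustrative, not code from this PR) is to pick an aggregator deterministically from the genesis hash, so each chain's messages always land on the same loop:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical routing function: map a chain's genesis hash to one of
// `num_aggregators` aggregator loops, so shard traffic for a chain is
// handled by exactly one aggregator instead of being broadcast to all.
fn aggregator_for(genesis_hash: &str, num_aggregators: usize) -> usize {
    let mut h = DefaultHasher::new();
    genesis_hash.hash(&mut h);
    (h.finish() as usize) % num_aggregators
}

fn main() {
    // "0xdeadbeef" is a made-up placeholder genesis hash.
    let idx = aggregator_for("0xdeadbeef", 4);
    assert!(idx < 4);
    // The same chain always routes to the same aggregator:
    assert_eq!(idx, aggregator_for("0xdeadbeef", 4));
    println!("ok");
}
```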

I'll leave master alone for now (to avoid any automated deployment bits), and merge this into a long-lived sharded branch which can accommodate other changes (benchmark/e2e test tooling and any deployment config) before we merge it to master.

Manual testing

To try this thing out locally, you can do the following:

In the substrate repository:

Start up an "alice" node in one terminal and a "bob" node in another (both pointed at where shard 1 will listen).

cargo run -- \
  --tmp \
  --chain local \
  --alice \
  --port 30333 \
  --ws-port 9944 \
  --rpc-port 9933 \
  --node-key 0000000000000000000000000000000000000000000000000000000000000001 \
  --validator \
  --telemetry-url 'ws://localhost:8001/submit 1'
cargo run -- \
  --tmp \
  --chain local \
  --bob \
  --port 30334 \
  --ws-port 9945 \
  --rpc-port 9934 \
  --validator \
  --bootnodes /ip4/127.0.0.1/tcp/30333/p2p/12D3KooWEyoppNCUx8Yx66oV9fJnriXwCcXwDDUA2kj6vnc6iDEp \
  --telemetry-url 'ws://localhost:8001/submit 1'

Now, let's start two more nodes on a different chain and pointed to a different telemetry shard:

cargo run -- \
  --tmp \
  --chain dev \
  --alice \
  --port 30335 \
  --ws-port 9946 \
  --rpc-port 9935 \
  --node-key 0000000000000000000000000000000000000000000000000000000000000001 \
  --validator \
  --telemetry-url 'ws://localhost:8002/submit 1' \
  --name AliceDev
cargo run -- \
  --tmp \
  --chain dev \
  --bob \
  --port 30336 \
  --ws-port 9947 \
  --rpc-port 9936 \
  --validator \
  --bootnodes /ip4/127.0.0.1/tcp/30335/p2p/12D3KooWEyoppNCUx8Yx66oV9fJnriXwCcXwDDUA2kj6vnc6iDEp \
  --telemetry-url 'ws://localhost:8002/submit 1' \
  --name BobDev

In this repository+branch

On another three terminals (sorry about all the terminals...), run the following from within the backend folder of this repo:

The main telemetry process that UIs will connect to:

cargo run --bin telemetry

A couple of shards to receive telemetry and forward it on:

cargo run --bin shard -- -l 127.0.0.1:8001
cargo run --bin shard -- -l 127.0.0.1:8002

And finally, on yet another terminal and in the frontend folder of this repo, we'll start the UI up so that we can see what's going on:

yarn start

Now, visit http://localhost:3000 and watch the data start coming in as nodes connect to the started shards.

Have a go at killing shards, or the telemetry core, or nodes and see what happens.

backend/common/src/util/null.rs (outdated review thread, resolved)
backend/shard/src/aggregator.rs (outdated review thread, resolved)
adding frontend configmaps and envVars

optimizing docker-compose and Dockerfile
@niklasad1 (Member) left a comment:

Overall, the PR is very clean and looks good; maybe in the future we could "resultify" the APIs instead of doing so much unwrapping.

I didn't review the logic super carefully because it's already battle-tested; mostly style stuff ^

@dvdplm (Contributor) left a comment:

Overall LGTM. Good job here; I especially like the attention paid to docs & comments.

backend/common/src/id_type.rs (outdated review thread, resolved)
backend/common/src/rolling_total.rs (outdated review thread, resolved)
    self.0 = time;
}

pub fn increment_by(&mut self, duration: Duration) {
    self.0 += duration;
Contributor:

Is addition to Instant saturating?

jsdw (Collaborator, author):

I wouldn't assume so. This source is only actually used in tests, so if we overflow the time I think I'd be happy to chalk it up to programmer error and let it panic.
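For reference (not code from this PR): `Instant + Duration` is indeed not saturating; it panics if the result is unrepresentable, while `Instant::checked_add` surfaces the same condition as `None`:

```rust
use std::time::{Duration, Instant};

fn main() {
    let now = Instant::now();
    // `+` on Instant panics on overflow; `checked_add` returns None instead.
    assert!(now.checked_add(Duration::from_secs(60)).is_some());
    // Duration::MAX overflows Instant's internal representation on
    // typical platforms, so checked_add reports the overflow:
    assert!(now.checked_add(Duration::MAX).is_none());
    println!("ok");
}
```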

backend/common/src/node_types.rs (outdated review thread, resolved)
set.insert("Polkadot");
set.insert("Kusama");
set.insert("Westend");
set.insert("Rococo");
Contributor:

Should we add the common-good parachains here as well? Statemine/Statemint?

jsdw (Collaborator, author):

I don't know enough to have an opinion on that, but I'd be happy to add them! I can't see any nodes from those chains on the current telemetry server yet; is that expected (I guess they aren't live just yet)?

backend/telemetry_core/src/state/chain.rs (outdated review thread, resolved)
/// Check if the chain is stale (has not received a new best block in a while).
/// If so, find a new best block, ignoring any stale nodes and marking them as such.
fn update_stale_nodes(&mut self, now: u64, feed: &mut FeedMessageSerializer) {
    let threshold = now - STALE_TIMEOUT;
Contributor:

Saturating sub here might be best?

jsdw (Collaborator, author):

If we ever hit an underflow error here we've done something very wrong, so I'd be happy to let it panic, personally.
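For comparison, this is what the suggested `saturating_sub` would do versus the panic-on-underflow the author prefers (the values here are made up for illustration):

```rust
fn main() {
    const STALE_TIMEOUT: u64 = 100;

    // Normal case: `now` is comfortably past the timeout, so plain
    // subtraction and saturating_sub agree.
    let now: u64 = 1_000;
    assert_eq!(now.saturating_sub(STALE_TIMEOUT), 900);

    // Pathological case: `early - STALE_TIMEOUT` would panic in debug
    // builds (and wrap in release); saturating_sub clamps to 0 instead.
    let early: u64 = 10;
    assert_eq!(early.saturating_sub(STALE_TIMEOUT), 0);
    println!("ok");
}
```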

@dvdplm (Contributor) commented Aug 11, 2021

There were a couple of warnings when generating the docs, trivial stuff.

README.md (outdated review thread, resolved)
@jsdw jsdw merged commit 705d57a into master Aug 12, 2021
@jsdw jsdw deleted the jsdw-sharding branch August 12, 2021 09:40