Update 4: Tracking / Stats Tooling #4

Closed
gavinmcdermott opened this issue Sep 23, 2016 · 5 comments

Comments

gavinmcdermott (Owner) commented Sep 23, 2016

Thanks to @ReidWilliams for the conversations and initial writeup.

Understanding testnet performance

The testnet needs a way to benchmark the performance of its messaging implementations across a variety of factors (comparing similar algos over time, comparing different algos, etc.).

The libp2p-datalogger

Overview

Assumptions about how the tracking/logging will work:

  • Testing framework will be configured to instantiate a given number of IPFS nodes. Currently they are all within a single Node.js process. In the future they will likely be separate Node.js processes possibly running inside containers on separate machines.
  • Testing framework has a configurable node topology (partial mesh, ring, minimum spanning tree, etc.). Each node is instantiated with an experimental libp2p build that implements a routing strategy.
  • Some kind of messaging / load test will be run.
  • Performance data will be collected and analyzed. This will likely include simple numerical stats and an evolving visual interface.
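As a rough illustration of those assumptions wired together (createTestNode and node.dial below are placeholders, not existing APIs):

```js
'use strict'

// Illustrative only: instantiate N nodes in a single Node.js process
// and wire them into a ring topology. createTestNode() and node.dial()
// stand in for whatever the framework actually exposes.
function buildRingTestnet (createTestNode, numNodes) {
  const nodes = []
  for (let i = 0; i < numNodes; i++) {
    nodes.push(createTestNode({ index: i }))
  }

  // Ring topology: each node dials its successor.
  nodes.forEach((node, i) => {
    node.dial(nodes[(i + 1) % nodes.length])
  })

  return nodes
}
```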

Features

Being updated to work with this initial implementation

libp2p-datalogger will be a simple module that makes it easy to log arbitrary data during a testnet experiment. It should be useful for logging (and potentially viewing in a browser) benchmarking data, including status messages, event messages, and performance data.

  • Data should be loggable on a per-node and global network basis.
  • Each libp2p/pubsub node in a test network can create and use its own datalogger instance.
  • Log messages and performance data are arbitrary JSON recorded for each node.
  • Logging automatically timestamps the data.
  • Logged data is kept in memory and periodically published via something like a log.publish() command. Publishing the log creates a set of data objects for the published log. At the end of an experiment, this makes it easy to view individual and aggregate node performance data.
  • The datalogger does not define what data should be logged or provide statistical methods on the data. It just logs messages and publishes them (to some other tool—eventually using IPFS?). Statistical methods (E.g.: computing a max, min, and average queue length) should be done in a processing step after the experiment.
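To make the feature list concrete, here is a minimal in-memory sketch of what such a logger could look like; every name below is illustrative, not the actual module API:

```js
'use strict'

// Minimal in-memory sketch of the datalogger described above.
function createDatalogger (nodeId) {
  const entries = []
  return {
    // Record arbitrary JSON; the logger timestamps it automatically.
    log (data) {
      entries.push({ nodeId: nodeId, ts: Date.now(), data: data })
    },
    // Publish the buffered entries as a set of data objects (to a file,
    // a collector, or eventually IPFS) and clear the buffer.
    publish () {
      const batch = entries.splice(0, entries.length)
      return { nodeId: nodeId, publishedAt: Date.now(), entries: batch }
    }
  }
}

// Per-node usage:
const log = createDatalogger('QmNodeA')
log.log({ type: 'receive', topic: 'benchmark', msgId: 'abc123' })
const batch = log.publish()
console.log(batch.entries.length) // => 1
```

Statistical processing (max/min/average queue length, etc.) would then run over the published batches after the experiment, as noted above.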

Questions

Specifically @diasdavid...

  • Specific initial stats: What are the most important things you want to know/learn about message propagation in these initial implementations?

  • libp2p versions: Because you'll be swapping versions of libp2p that have a specific messaging strategy built in, we'll need the ability to know which lib version and algo you're using. Update: Since looking at the floodsub implementation, I'll use it for reference.

  • Secio and speed: libp2p's move to secio means we should use pre-generated keys; I wrote up a quick keygen tool. It is now working and I'm dropping in the pregen keys today. Something to note: creating the network is now significantly slower as a result of using full keys... we need to think about how to reduce this (see the key pre-generation sketch after this list).

  • Entry points and Methods: I was thinking of:

    • Aside from using the EventEmitters and exposed properties on each pubsub node (e.g.: this), are there any other suggestions for places/libs we might want to hook into (anything in bitswap, for example)? Beyond the obvious ones like these, I recall you mentioning something about timestamps on incoming messages.

  • Information at Scales: It’ll be useful to collect per node statistics. For tests with a large number of nodes (10k, 100k, 1M more?) it probably won’t be necessary to analyze data from all nodes, but it may be useful to analyze data from a random subset. Thoughts?
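Regarding the pre-generated keys mentioned in the secio bullet, a rough sketch of the keygen idea (a callback-style PeerId API is assumed here; the exact signature varies across peer-id versions):

```js
'use strict'

// Sketch: pre-generate peer identities once and write them to disk so
// network creation doesn't pay the key-generation cost on every run.
// Assumes a callback-style PeerId.create; details differ by version.
const fs = require('fs')
const PeerId = require('peer-id')

const NUM_KEYS = 50

function pregenerate (done) {
  const ids = []
  let pending = NUM_KEYS
  for (let i = 0; i < NUM_KEYS; i++) {
    PeerId.create({ bits: 2048 }, (err, id) => {
      if (err) return done(err)
      ids.push(id.toJSON()) // { id, privKey, pubKey }
      if (--pending === 0) {
        fs.writeFileSync('./fixtures/peer-ids.json', JSON.stringify(ids))
        done(null, ids)
      }
    })
  }
}

// At testnet boot, load the fixtures instead of generating keys, e.g.:
// PeerId.createFromJSON(ids[n], (err, id) => { /* hand id to the libp2p node */ })
```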


Update

Tues. Sept. 27, 2016

gavinmcdermott (Owner, Author) commented Sep 26, 2016

Hey @nicola, I wanted to ask if I could get your feedback on things as well. I'm adding benchmark/logging tools this week to get it to "Developer Ready" ™ and I'd love a few more eyes on it.

Based on this libp2p/pubsub issue's comments, I thought you might have some interest. Feel free to take a look at things or add suggestions/features/inspiration for the benchmark tool that aren't mentioned above.

Metrics:

  • What are some metrics / pieces of info you'd care about when testing implementations?

Faults (getting to these after the initial benchmark tools):

  • Based on a network's node and topology properties, is there an interface you'd expect to see with regard to triggering specific types of faults?
    • E.g.: consider the developer experience of telling a network to add a delay to x% of nodes vs. looping over a percentage of nodes and triggering some delay mechanism on each (see the sketch below).
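To make the comparison concrete, here is one hypothetical shape for the two approaches; nothing below exists yet, it only illustrates the developer experience being weighed:

```js
'use strict'

// Option A (hypothetical, declarative): the network handles selection itself.
//   network.faults.delay({ percent: 10, ms: 500 })

// Option B (hypothetical, imperative): the caller picks nodes and applies the fault.
function delayRandomSubset (network, percent, ms) {
  const count = Math.ceil(network.nodes.length * (percent / 100))

  // Crude shuffle-and-slice to pick a random subset of nodes.
  const victims = network.nodes
    .slice()
    .sort(() => Math.random() - 0.5)
    .slice(0, count)

  victims.forEach((node) => {
    node.addDelay(ms) // addDelay is an assumed per-node fault hook
  })
  return victims
}
```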

Thanks for any input!

nicola commented Sep 26, 2016

Perfect! I will have a read very soon (from mobile now)

daviddias commented

Hi @gavinmcdermott, as always, awesome stuff! Here's feedback on the things you asked about:

Specific initial stats: What are the most important things you want to know/learn about message propagation in these initial implementations?

It will be very important to guarantee correctness; that is, at least in a small network (~50 nodes), we should be able to observe the message propagation clearly and make sure the code is sound.

After that, it is about load: it is important to understand how the nodes behave when they get slammed with messages (memory usage + routed messages).

Then, it would be great to know how many times a node sees a repeated message. Since our first implementation is floodsub, it is important that we are efficient and do not slam the nodes with the same content over and over (they have a timecache of 30 seconds, which should be enough to avoid getting repeated messages).

Entry points and Methods: I was thinking of:

Aside from using the EventEmitters and exposed properties on each pubsub node (e.g.: this), are there any other suggestions for places/libs we might want to hook into (anything in bitswap, for example)? Beyond the obvious ones like these, I recall you mentioning something about timestamps on incoming messages.

It actually might be good if the monitoring system gets all of its information through logs. We use debug, which lets us say something like DEBUG=*monitor to enable all the monitoring logs. This way we will be able to use this service for other language implementations and parse the logs asynchronously.
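A minimal sketch of that pattern with the debug module (the namespace and filename below are illustrative only):

```js
'use strict'

// Emit monitor-friendly logs via the `debug` module.
const debug = require('debug')
const monitor = debug('pstn:floodsub:monitor')

// Somewhere in the pubsub code path:
monitor('received msg=%s from=%s at=%d', 'abc123', 'QmPeerB', Date.now())

// Enable only the monitoring namespaces when running an experiment, e.g.:
//   DEBUG=*monitor node experiment.js
// A monitoring service can then parse these lines asynchronously, which
// also works for logs produced by other language implementations.
```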

Information at Scales: It’ll be useful to collect per node statistics. For tests with a large number of nodes (10k, 100k, 1M more?) it probably won’t be necessary to analyze data from all nodes, but it may be useful to analyze data from a random subset. Thoughts?

To start, just being able to see that subscriptions are being propagated and that edge nodes get the messages would be ideal. Then, checking the time from publish -> routing -> subscription would enable us to get proper metrics to improve speed and network formation.
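One way a post-processing step could derive that publish -> subscription timing from timestamped log entries (the entry shape here matches the hypothetical datalogger sketch earlier in this thread, not any existing format):

```js
'use strict'

// Sketch: compute per-message propagation latencies from timestamped
// log entries collected across nodes. Assumes entries shaped like
// { ts, data: { type: 'publish' | 'receive', msgId } }.
function propagationLatencies (entries) {
  const published = new Map() // msgId -> publish timestamp
  const latencies = new Map() // msgId -> [ms until each receive]

  entries.forEach((e) => {
    if (e.data.type === 'publish') {
      published.set(e.data.msgId, e.ts)
    }
  })

  entries.forEach((e) => {
    if (e.data.type === 'receive' && published.has(e.data.msgId)) {
      const deltas = latencies.get(e.data.msgId) || []
      deltas.push(e.ts - published.get(e.data.msgId))
      latencies.set(e.data.msgId, deltas)
    }
  })

  return latencies
}
```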

gavinmcdermott (Owner, Author) commented Sep 28, 2016

Thanks @diasdavid, I appreciate the notes! Will check out and apply. I'm cranking on this for the next few days to get it into your hands asap! Will keep you posted...

gavinmcdermott (Owner, Author) commented Oct 2, 2016

Good news @diasdavid: this week I set to work on creating a few lightweight modules that now suit our purposes better! I've developed a much cleaner view of things since I started building against your initial FloodSub implementation. I'll likely have some fun updates for Monday's call. I'll also have a lot more pushed tomorrow, but in the meantime, feel free to check out the updates below...

In a nutshell, we now have the concept of a modular libp2p-pstn-* ecology (pstn stands in for pubsub_testnet, which felt a bit long).

Updates

Modules are coming together as follows:

  • libp2p-pstn-node: for creating testnet node instances. (Repo link)
  • libp2p-pstn-logger: for adding testnet logging functionality to each pubsub instance. It is a decorator/proxy that currently works with the FloodSub mentioned above. It'd be great to firm up the pubsub interface with you this week so that others can rely on this ecosystem for their implementations. It'll also allow others to create and test new topologies and strategies much faster! (Repo link)
  • libp2p-pstn-stats: for benchmarking pubsub implementations in the testnet. Works with pstn-node, pstn-logger, and the floodsub. (Repo link)
  • libp2p-pstn-topo-*: for creating testnet node topologies (distinctly different from pubsub topologies). initial repos up tomorrow
  • libp2p-pstn: the actual testnet instance. It will use pregenerated keys to speed things up. initial repos up tomorrow

Benchmarking Priorities

Regarding priorities, the above will allow us to benchmark the basics of message propagation through the test network. But after digging into the code, I find that some of the deeper metrics, like message repeats (and others), cannot be determined unless we open up an interface.

For example: I fooled around with poking into a pubsub node's peerSet stream, piping it into a sink, and decoding the protobuf... but that was hacky and not what we ultimately want. I think we can figure out what's needed for deeper metrics this week once we're solid on the basics, but we can talk about that on Monday.
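If such an interface eventually exposes every message a node receives (including ones the timecache would de-duplicate), counting repeats becomes straightforward; the per-message hook below is the assumed part:

```js
'use strict'

// Sketch: count how many times a node sees the same message id.
// Assumes some (not yet existing) hook that fires for every received
// message, duplicates included.
function makeRepeatCounter () {
  const seen = new Map() // msgId -> times seen

  return {
    record (msgId) {
      seen.set(msgId, (seen.get(msgId) || 0) + 1)
    },
    repeats () {
      let total = 0
      seen.forEach((count) => { if (count > 1) total += count - 1 })
      return total
    }
  }
}
```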

Feedback

Look forward to hearing any feedback—regardless I'll be running to get some wonderful sh*t up for Monday's libp2p sync!
