
Add stat collection to IPTB #65

Closed
wants to merge 9 commits into from
Conversation

davinci26

Hey y'all,

Thanks for the project it helped me a lot!

As discussed in IPFS issue board IPFS Performance #5226, I made some changes to the IPTB framework to measure the performance of IPFS and generate performance graphs. In detail, I added the following functions to the framework:

  • iptb make-topology: This creates a connection graph between the nodes (e.g. star topology, barbell topology). In the topology files, empty lines and lines starting with # are disregarded. Each non-empty line has the syntax origin:connection 1, connection 2, ..., where the origin and connections are specified by their node IDs.
  • iptb dist -hash: The simulation here distributes a single file from node 0 to every other node in the network. Then it calculates the average time required to download the file, the standard deviation of the time, the maximum time, the minimum time, and the number of duplicate blocks. The results are saved in a generated file called results.json.
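To illustrate the syntax described above, a topology file for a five-node star might look like this (node IDs and the comment are illustrative):

```
# star topology: node 0 connects to every other node
0:1,2,3,4
```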

I also added a Python3 script to plot the results, which adds an optional dependency to the project, Matplotlib.

Finally, I created a readme file, simulation.md, that explains the logic of the simulation. I also added there the response of @whyrusleeping to the issue Simulate bad network #50 so people know that there is support for simulating bad networks.

I would appreciate your feedback and any improvement suggestions :)

@whyrusleeping
Member

Hey @davinci26 Thanks a bunch for this! I hope you don't mind us pulling this apart a bit. We're working on refactoring iptb to be less ipfs centric (giving it the ability to really test any libp2p based protocol). And to complete that work, it may require some cutting and pasting :)

I'll leave it to @travisperson and @frrist to take it from here and see how integrating this work could happen.

@davinci26
Author

No problem, feel free to make any changes you want!

I spent quite some time playing with this repo so if you want any help in the refactoring or in the integration, give me a shout :)

@travisperson
Member

This is really cool and highlights what iptb can do nicely. 🏆

You can check out the current work we are doing in PR #61, and there is an issue with some discussion in #62 if you want to see what changes we are making. My comments below will be with regard to this work and how we can incorporate the changes you have here.

I think we can probably incorporate the idea behind these changes and improve iptb.

It looks like make-topology is a nice configuration based way to run a bunch of iptb connect calls which I can see having value when running iptb from an interactive shell. It would be interesting to explore ways to make some of it more generic, though I don't think that would be at all required for a first step.

I could see us adding a new flag to connect, which would specify a topology file. If the flag is specified, then the arguments to connect would no longer be required, and the file would be used to run the connect calls.

iptb connect --topology /tmp/barbell.txt

I really like the idea of baking stats into iptb, as it seems like a perfect use case. I think we could make it more generic as well, enabling it to be used to record a lot of different commands, for example, targeting the dht specifically.

We could make the run command able to record basic statistics about command execution by passing a --stats flag, which would specify a file to write stats out to; with no argument it would dump the stats to stdout.

The dist command as written now could be run with something like iptb run --stats results.json -- ipfs cat <hash>.

Note: iptb run -- <cmd> runs <cmd> on all nodes; it replaces for-each.

(If we wanted to run it against a subset of nodes, we could do iptb run --stats results.json [1,3,5-9] -- ipfs cat $HASH).

The only stat that does not carry over outside of ipfs very well is the duplicate block stat. That particular stat, though, I think is a perfect candidate for the new metric part of the API. We could make the stats take a list of metrics to collect alongside the standard execution-time stuff.

iptb run [--stats <outfile>] [--collect <metric> ...] -- <cmd>

@davinci26
Author

Thanks for the review! I totally agree with your suggestions.

From my experience, the problem I had with the for-each CLI command was that it was synchronous. This prevented me from simulating multiple nodes trying to do the same thing.

Maybe this could also be added as an optional flag in the run command:
iptb run --async --stats results.json -- ipfs cat <hash>

I can add the changes as you described them after you merge PR #61 to master, or you can add them directly to PR #61 :).

@davinci26
Author

Hey y'all,
I saw that PR #61 was merged into master.

Can I start working on the topology feature and the statistics metrics based on the suggestions and comments of @travisperson?

@travisperson
Member

Hey @davinci26 that sounds great! Let me know if you have any questions. You can find me (tperson) on our #ipfs-dev IRC channel (freenode).

the problem I had with the for-each CLI command was that it was synchronous.

It's asynchronous now (as of #61)! 🎉

@davinci26
Author

Hey again,

I made the majority of the changes, so feel free to review the code. I also added unit tests on the Stats just to be sure there are no silly miscalculations.

The stats produced are the following:

type Stats struct {
	Mean      float64
	Std       float64
	Max       float64
	Quantile3 float64
	Median    float64
	Quantile1 float64
	Min       float64
}
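For reference, the struct above could be populated along these lines. This is a sketch only, assuming a population standard deviation and a median-split quantile method, which may differ from the PR's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"sort"
)

// Stats mirrors the struct from the PR.
type Stats struct {
	Mean      float64
	Std       float64
	Max       float64
	Quantile3 float64
	Median    float64
	Quantile1 float64
	Min       float64
}

// median assumes a non-empty, sorted slice.
func median(sorted []float64) float64 {
	n := len(sorted)
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}

// BuildStats computes summary statistics in a single pass over the sorted
// copy; it returns an error for an empty slice, matching the documented
// behaviour in the PR.
func BuildStats(values []float64) (Stats, error) {
	if len(values) == 0 {
		return Stats{}, errors.New("cannot build stats from an empty slice")
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)

	sum, sumSquared := 0.0, 0.0
	for _, v := range sorted {
		sum += v
		sumSquared += v * v
	}
	n := float64(len(sorted))
	mean := sum / n
	std := math.Sqrt(sumSquared/n - mean*mean) // population standard deviation

	// Quantiles via median of the lower/upper half (an assumption).
	half := len(sorted) / 2
	q1, q3 := sorted[0], sorted[len(sorted)-1]
	if half > 0 {
		q1 = median(sorted[:half])
		q3 = median(sorted[len(sorted)-half:])
	}

	return Stats{
		Mean:      mean,
		Std:       std,
		Max:       sorted[len(sorted)-1],
		Quantile3: q3,
		Median:    median(sorted),
		Quantile1: q1,
		Min:       sorted[0],
	}, nil
}

func main() {
	s, _ := BuildStats([]float64{1, 2, 3, 4})
	fmt.Printf("%+v\n", s)
}
```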

Stats Implementation details:

  1. In the function mapWithOutput I calculate the time elapsed for the command and add it to the Results struct.
  2. I added a flag to the Build Report to calculate the stats if the user requests them. Currently, the stats are just printed to stdout. If you are happy with the design, I can modify it to also print to a file if one is specified by the user. I print the Stats in JSON format for easier parsing by other programs.
  3. Also in the Build Report I pass an additional string, which is the command executed. This feels more correct for the case where the output is automatically parsed by another program to produce statistics (e.g. to calculate bottlenecks in pipelines).
  4. Right now the majority of commands produce the stats. This could easily be disabled if you don't think it's useful.
  5. I assume it would be useful to add stats besides time to the various plugins. However, I am not 100% sure what kind of metrics the plugins you plan to add will produce. If you have any reading material on that, I can take a look into it.

Topology Implementation details:

  1. Added one function that parses a topology file with the same format as above and then transforms it into the range format used internally.

Looking forward to your comments 😃

@travisperson travisperson changed the title IPTB Pull request Add stat collection to IPTB Aug 28, 2018
Member

@travisperson travisperson left a comment

Thanks for updating!

Right now the majority of commands produce the stats. This could easily be disabled if you dont think its useful.

I don't think having stats around init and start will be very useful. Mainly because I don't think iptb is the best way to measure that behavior. The measurements would tell you more about iptb and the plugin than the actual thing behind it.

The connect command also has the possibility for a lot of overhead. The command first has to determine how to connect one node to another (request swarm addresses, and possibly a peer id) before it tries to run a connect call.

I think it's best to only have the stats collecting on the run command for now.

I assume it would be useful to add stats besides time on the various plugins. However, I am not 100% sure what kind of metrics the plugins you plan to add produce. If you have any reading material on that then I can take a look into it.

I think we would just want to provide a way for the user to specify different metrics to collect from each node the command was run against. This would just leverage the Metrics interface.

With regard to the Stats struct, should we also be reporting the individual times for each node?

@@ -1,2 +1,5 @@
*.so
iptb
.vscode
topologies/*
deb.sh
Member

The .vscode I think is fine, but I don't think we need the other two.

return nil, errors.New("Line " + strconv.Itoa(lineNumber) + " does not follow the correct format")
}
destinations = strings.Split(lineTokenized[1], ",")
}
Member

nit: The else of this if statement is needless, it can just be part of the regular flow when the statement is false.

@@ -14,6 +14,7 @@ import (
)

func loadPlugins(dir string) error {

Member

Any reason for these changes?

Author

The code formatting tool decided to add an empty line there. I have manually checked the files and they are the same.

for _, connectionRow := range topologyGraph {
from := connectionRow[0]
to := connectionRow[1:]
err = connectNodes(tb, []int{from}, to, timeout, flagStats)
Member

Instead of passing the flagStats through, it might be better (and probably should have been done the first time) to return a ([]results, error) from connectNodes and then call buildReport at the end.

c1 = (length-1)/2 + 1
c2 = c1 - 1
// Print individual results to JSON format
func printIndivudualResults(nodeTime []float64, nodeID []int, metric string) {
Member

Think that's a typo and should be printIndividualResults ;)

the strings are not castable to float
the function returns an error
*/
func substractArrays(lhs []string, rhs []string) ([]float64, error) {
Member

One more typo, subtractArrays*

Author

I will fix all of them :) Thanks a lot for the review, I totally agree that details matter a lot!

Member

No worries! And one of the comments I made got marked as outdated for some reason, but printIndividualResults is typo'd as printIndivudualResults


// Calculate the min, max mean, std with a single pass of the array
sum := 0.
sumSquarred := 0.
Member

sumSquared (just one r)

}

//BuildStats generates a stats struct from an input float64 slice
//It returns an error if the slice is empty
Member

These two lines are missing a space after the // to be consistent with the others (sorry, details stand out to me haha)

@@ -26,7 +26,7 @@ $ iptb shell 4
$ ipfs cat QmNqugRcYjwh9pEQUK7MLuxvLjxDNZL1DH8PJJgWtQXxuF
hey!
```

Available plugins now: Local IPFS node (plugin_name: localipfs), Docker IPFS node (plugin_name: dockeripfs)
Member

Would also be nice to be able to list available plugins via the CLI. Doesn't have to be in this PR though, just mentioning.

There are three variants of the command. It can accept no arugments,
a single argument, or two arguments. The no argument and single argument
expands out to the two argument usage.
There are four variants of the command. It can accept no arugments,
Member

arguments*

expands out to the two argument usage.
There are four variants of the command. It can accept no arugments,
a single argument, two arguments or a file. The no argument and single argument
expands out to the two argument usage. The file argument, parses the file and exapnds it
Member

expands* (the second one)

commands/run.go Outdated
nodeMetricValue[i] = value
}
return nodeMetricValue, nil
}
Author

Here I opted to collect the metric before and after the command is executed in run.go. I think with this approach the code is cleaner and makes more sense.
I could add it in the mapWithOutput function, but since that function uses locks, the more operations I do inside them, the greater the overhead.

I would love to hear your thoughts on which design you think is more useful for the metrics you plan to implement.

the function returns an error
TODO: This could be more generic
*/
func subtractArrays(lhs []string, rhs []string) ([]float64, error) {
Author

As implemented now, the metrics are collected as follows:
Let m1 be a metric to collect; we sample m1_before before the execution of the run command and m1_after after it finishes. Then we calculate the statistics based on Δm1 = m1_after − m1_before.
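The quoted subtractArrays helper could plausibly look like this. This is a sketch only: the signature matches the quoted diff, but the error handling details are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// subtractArrays parses two equal-length slices of numeric strings (metric
// samples) and returns the element-wise difference lhs[i] - rhs[i], i.e.
// the "after" samples minus the "before" samples. It returns an error on a
// length mismatch or a string that is not castable to float.
func subtractArrays(lhs []string, rhs []string) ([]float64, error) {
	if len(lhs) != len(rhs) {
		return nil, fmt.Errorf("length mismatch: %d vs %d", len(lhs), len(rhs))
	}
	deltas := make([]float64, len(lhs))
	for i := range lhs {
		after, err := strconv.ParseFloat(lhs[i], 64)
		if err != nil {
			return nil, err
		}
		before, err := strconv.ParseFloat(rhs[i], 64)
		if err != nil {
			return nil, err
		}
		deltas[i] = after - before
	}
	return deltas, nil
}

func main() {
	d, err := subtractArrays([]string{"10.5", "7"}, []string{"4.5", "2"})
	fmt.Println(d, err)
}
```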

@travisperson
Member

travisperson commented Sep 5, 2018

I was giving this some thought over the weekend and I think we should try and only add what is minimally required to expose the functionality required to do these stat calculations. While iptb could actually do all the calculations, I don't know if there is a strong reason why it needs to do them. It might be better to just report the raw stats and allow the user to handle how to calculate values from them. So with this we would just be adding an elapsed time to the results (as you already have).

Collecting metrics before and after isn't a lot different than running iptb metric <node> <metric> manually before or after (the command probably should support ranges), so I don't know if we should add anything like that at the moment.

Per our discussion on IRC, you also pointed out that for metrics it would be better to actually sample over the course of the command's execution. I think this would probably be the correct thing to do as well for any real-time stats (CPU / memory / etc.). This does get a little bit complicated, though, and I think we really need to start a discussion in another issue and work through exactly the requirements needed to find a good approach if we want to go down this route.

For this PR though, I think we can keep it pretty simple and just always report the elapsed time for all commands run that use the standard buildReport. We will no longer need the --stats flag right now.

iptb run -- ipfs id --format="<id>"
node[0] exit 0; elapsed 10ns

QmXschyVzLmS4JqPN1kuhCXTjau2oQkVuzjvTbQFTGm3w3
node[1] exit 0; elapsed 10ns

Qme3h8WwfpBiPHPfdEs9GuegijVhaBX9xYPXTTDAS6uciR
node[2] exit 0; elapsed 10ns

QmbKpZTyRodMmjkBPdyngkha1ic6bu7YVXPzisRdFNi5Yq

To summarize: in this PR we should have the connect topology and, in addition, add an elapsed time to all results / buildReport.

I think we can easily move forward with that, and then start to discuss further how we can approach the other topics.
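Mechanically, always reporting the elapsed time as described above amounts to wrapping each command execution with a timer. A generic sketch, not iptb's actual code (runTimed is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runTimed executes a command, returning its combined output along with the
// wall-clock duration it took to run.
func runTimed(name string, args ...string) (string, time.Duration, error) {
	start := time.Now()
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), time.Since(start), err
}

func main() {
	out, elapsed, err := runTimed("echo", "hello")
	if err != nil {
		panic(err)
	}
	// time.Duration's String method gives a human-readable form like "1.2ms".
	fmt.Printf("exit ok; elapsed %s\n%s", elapsed, out)
}
```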

@travisperson
Member

travisperson commented Sep 5, 2018

On an additional note I'm actually going to push back on the topology feature as well right now.

Sorry for this being a bit messy. I should have requested an issue from the start so that we could have a discussion outside of the code.

My concern at the moment with regard to the topology is that it can be accomplished with a shell script fairly easily, and I don't see the exact benefit of the topology file right now, due to the restrictiveness of the format, as it works with exactly N nodes.

Let's take a step back from this and open two issues to see how we can address the requests:

  1. Connection topology: #76
  2. Collecting statistics: #77

I'll open these issues and bring some context in, we can then discuss them from there.

@davinci26
Author

The last commit reflects the discussion in #77 and #76. The user can run
iptb run 0 --encoding JSON <args> or iptb run 0 --encoding text <args> and get the output in the chosen encoding.
The JSON encoding transforms the output (including the plugin stderr and stdout) into JSON. This way the user can specify whether they prefer a human-readable or a machine-readable output.
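Assuming field names like Node, ExitCode, Errors, PluginStdOut, PluginStdErr, and Elapsed (these appear in the review diffs in this thread; all values here are illustrative), a single JSON-encoded result might look roughly like:

```json
{"Node":0,"ExitCode":0,"Errors":null,"PluginStdOut":"<command output>\n","PluginStdErr":"","Elapsed":0.0123}
```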

commands/run.go Outdated
Name: "encoding",
Usage: "Specify the output format, current options JSON and text",
Value: "text",
},
Member

@travisperson travisperson Sep 11, 2018

We should probably add this as a global flag under the app in main.go so that it works for other commands. The format error checking below can take place in the Before block as well.

Node int
Output testbedi.Output
Error error
Elapsed float64
Member

Keeping this as a time.Duration rather than converting it to seconds in a float64 would probably be good for this internal structure. It will also make it easier to provide a more human-readable format, as Duration#String handles that. The JSON output can probably stay a float64 returned in seconds, though, seconds being the SI unit and all.

if rs.Output.Error() != nil {
fmt.Printf("%s", rs.Output.Error())
if encoding == "text" {
fmt.Printf("node[%d] exit %d Elapsed %.4fs \n", rs.Node, rs.Output.ExitCode(), rs.Elapsed)
Member

nit: Lowercase the E here, and remove the extra space at the end.

Also with rs.Elapsed being a duration, we can use %s to get a human readable format out.

ExitCode int
Errors error
PluginStdOut string
PluginStdErr string
Member

Errors => Error as it's a single error.

nit: To keep the casing similar to other usages through the code and go in general I think we should use Stdout and Stderr here.

}
if b, err := ioutil.ReadAll(rs.Output.Stderr()); err == nil {
pluginErr = string(b)
}
Member

We need to handle the other side of this error, when the error is not nil.

PluginStdOut: pluginOut,
PluginStdErr: pluginErr,
Elapsed: rs.Elapsed,
})
Member

Need to also handle this error.

Member

@travisperson travisperson left a comment

Have a few comments around the implementation. I think once these are addressed we can move forward with this as we will have a pretty settled interface that I think will work.

Member

@travisperson travisperson left a comment

Looking good. Just need to handle an error, and fix a newline in a few places.

}
pluginErr := string(pluginErrB)
// JSONify the plugin output
rsJSON, _ := json.Marshal(output{
Member

missing this error


io.Copy(os.Stdout, rs.Output.Stdout())
io.Copy(os.Stdout, rs.Output.Stderr())

fmt.Println()
Member

We should move this up inside of the text encoding section, and then also make sure we have a newline outside of the loop (it can be printed for all cases that are not text).

This will make it so that all output for a single command will be ld-json, and between each execution will be a newline. This will help break up the output if appending to the same file.

The output should look like this if it were being appended to the same file:

{"Node": 0, ...}
{"Node": 1, ...}
{"Node": 2, ...}

{"Node": 0, ...}
{"Node": 1, ...}
{"Node": 2, ...}

@davinci26
Author

I handled the error cases mentioned, and the formatting is as specified when appending to a file.

Member

@frrist frrist left a comment

Hey @davinci26 thank you for all the work you have done here!

It looks like this PR is trying to tackle more than the title suggests. It appears that an output encoding option has been added for both text and JSON, in addition to stats for the elapsed time of the start command. Typically, changes of this nature require prior design discussion (we apologize for not making this clearer in our README). Until we have more design discussion around these features, we will be unable to accept the changes here, as we cannot yet determine whether they align with the long-term strategy for IPTB.

Some discussion around this has already been started in #77, and we have opened #88 to start a discussion around a similar feature set. It would be great if we could get your help hammering out the design details of these things.

If our pace here is limiting the work you are trying to do, please feel free to fork this repo -- that could even serve as a means to prototype some of your ideas so that we can better incorporate them into the design goals of IPTB.

@davinci26
Author

Hey @frrist,

I was thinking about this as well; this PR is getting too big, and for no good reason, as there was no design discussion prior to the beginning of the work. It was also an omission on my side not to open an issue prior to the PR. Feel free to decline the PR.

I have written my ideas in #77 under general thoughts and I am more than happy to help in designing meaningful event logs.

@whyrusleeping
Member

I'm really excited that iptb is getting so much attention. I love a lot of the direction here, let's get things designed nicely and march onwards :) Thanks @davinci26 for pushing!

One thing I always try to keep in mind in the design of tools is the Unix philosophy, primarily "do one thing well". What that means here specifically, I'm not sure, but I'm thinking that we may end up adding a companion tool to iptb for metrics collection, and just ensuring that iptb doesn't prevent us from getting the information we need. This would allow us to more easily tune the stats we collect for each use case. (A similar design philosophy went into the plugin architecture.)

@davinci26 davinci26 closed this Aug 11, 2020