
Add stat collection to IPTB #65

Closed
wants to merge 9 commits into from
Conversation

davinci26

Hey y'all,

Thanks for the project it helped me a lot!

As discussed in IPFS issue board IPFS Performance #5226, I made some changes to the IPTB framework to measure the performance of IPFS and generate performance graphs. In detail, I added the following functions to the framework:

  • iptb make-topology: This creates a connection graph between the nodes (e.g. star topology, barbell topology). In the topology files, empty lines and lines starting with # are disregarded. Each non-empty line has the syntax origin:connection 1, connection 2, ..., where the origin and connections are specified by their node IDs.
  • iptb dist -hash: The simulation here distributes a single file from node 0 to every other node in the network. Then it calculates the average time required to download the file, the standard deviation of the time, the maximum time, the minimum time, and the number of duplicate blocks. The results are saved in a generated file called results.json.
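To illustrate the syntax described above, a topology file for a five-node star might look like this (node IDs and the comment are illustrative):

```
# star topology: node 0 connects to every other node
0:1,2,3,4
```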

I also added a Python3 script to plot the results, which adds an optional dependency to the project, Matplotlib.

Finally, I created a readme file, simulation.md, that explains the logic of the simulation. I also added there the response of @whyrusleeping to the issue Simulate bad network #50 so people know that there is support for simulating bad networks.

I would appreciate your feedback and any improvement suggestions :)

@whyrusleeping
Member

Hey @davinci26 Thanks a bunch for this! I hope you don't mind us pulling this apart a bit. We're working on refactoring iptb to be less ipfs centric (giving it the ability to really test any libp2p based protocol). And to complete that work, it may require some cutting and pasting :)

I'll leave it to @travisperson and @frrist to take it from here and see how integrating this work could happen.

@davinci26
Author

No problem, feel free to make any changes you want!

I spent quite some time playing with this repo so if you want any help in the refactoring or in the integration, give me a shout :)

@travisperson
Member

This is really cool and highlights what iptb can do nicely. 🏆

You can check out the current work we are doing in PR #61, and there is an issue with some discussion in #62 if you want to see what changes we are making. My comments below will be with regard to this work and how we can incorporate the changes you have here.

I think we can probably incorporate the idea behind these changes and improve iptb.

It looks like make-topology is a nice configuration based way to run a bunch of iptb connect calls which I can see having value when running iptb from an interactive shell. It would be interesting to explore ways to make some of it more generic, though I don't think that would be at all required for a first step.

I could see us adding a new flag to connect, which would specify a topology file. If the flag is specified, then the arguments to connect would no longer be required, and the file would be used to run the connect calls.

iptb connect --topology /tmp/barbell.txt

I really like the idea of baking stats into iptb, as it seems like a perfect use case. I think we could make it more generic as well, enabling it to be used to record a lot of different commands, for example, targeting the dht specifically.

We could make the run command able to record basic statistics about command execution by passing a --stats flag, which would specify a file to write stats out to; with no argument it would dump the stats to stdout.

The dist command as written now could be run with something like iptb run --stats results.json -- ipfs cat <hash>.

Note: iptb run -- <cmd> runs <cmd> on all nodes; it replaces for-each.

(If we wanted to run it against a subset of nodes, we could do iptb run --stats results.json [1,3,5-9] -- ipfs cat $HASH).

The only stat that does not carry over outside of ipfs very well is the duplicate block stat. That particular stat, though, I think is a perfect candidate for the new metric part of the API. We could make the stats take a list of metrics to collect alongside the standard execution-time stuff.

iptb run [--stats <outfile>] [--collect <metric> ...] -- <cmd>

@davinci26
Author

Thanks for the review! I totally agree with your suggestions.

From my experience, the problem I had with the for-each CLI command was that it was synchronous. This prevented me from simulating multiple nodes trying to do the same thing.

Maybe this could also be added as an optional flag in the run command:
iptb run --async --stats results.json -- ipfs cat <hash>

I can add the changes as you described them after you merge PR #61 to master, or you can add them directly to PR #61 :).

@davinci26
Author

Hey y'all,
I saw that PR #61 was merged into master.

Can I start working on the topology feature and the statistics metrics based on the suggestions and comments of @travisperson?

@travisperson
Member

Hey @davinci26 that sounds great! Let me know if you have any questions. You can find me (tperson) on our #ipfs-dev IRC channel (freenode).

the problem I had with the for-each CLI command was that it was synchronous.

It's asynchronous now (as of #61)! 🎉

@davinci26
Author

Hey again,

I made the majority of the changes, so feel free to review the code. I also added unit tests on the Stats just to be sure there are no silly miscalculations.

The stats produced are the following:

type Stats struct {
	Mean      float64
	Std       float64
	Max       float64
	Quantile3 float64
	Median    float64
	Quantile1 float64
	Min       float64
}
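For reference, the struct above could be populated along these lines. This is a sketch only, assuming a population standard deviation and a median-split quantile method, which may differ from the PR's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"sort"
)

// Stats mirrors the struct from the PR.
type Stats struct {
	Mean      float64
	Std       float64
	Max       float64
	Quantile3 float64
	Median    float64
	Quantile1 float64
	Min       float64
}

// median assumes a non-empty, sorted slice.
func median(sorted []float64) float64 {
	n := len(sorted)
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}

// BuildStats computes summary statistics in a single pass over the sorted
// copy; it returns an error for an empty slice, matching the documented
// behaviour in the PR.
func BuildStats(values []float64) (Stats, error) {
	if len(values) == 0 {
		return Stats{}, errors.New("cannot build stats from an empty slice")
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)

	sum, sumSquared := 0.0, 0.0
	for _, v := range sorted {
		sum += v
		sumSquared += v * v
	}
	n := float64(len(sorted))
	mean := sum / n
	std := math.Sqrt(sumSquared/n - mean*mean) // population standard deviation

	// Quantiles via median of the lower/upper half (an assumption).
	half := len(sorted) / 2
	q1, q3 := sorted[0], sorted[len(sorted)-1]
	if half > 0 {
		q1 = median(sorted[:half])
		q3 = median(sorted[len(sorted)-half:])
	}

	return Stats{
		Mean:      mean,
		Std:       std,
		Max:       sorted[len(sorted)-1],
		Quantile3: q3,
		Median:    median(sorted),
		Quantile1: q1,
		Min:       sorted[0],
	}, nil
}

func main() {
	s, _ := BuildStats([]float64{1, 2, 3, 4})
	fmt.Printf("%+v\n", s)
}
```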

Stats Implementation details:

  1. In the function mapWithOutput I calculate the time elapsed for the command and add it to the Results struct.
  2. I added a flag to the Build Report to calculate the stats if the user requests them. Currently, the stats are just printed to stdout. If you are happy with the design, I can modify it to also print to a file if one is specified by the user. I print the Stats in JSON format for easier parsing by other programs.
  3. Also in the Build Report I pass an additional string, which is the command executed. This feels more correct for the case where the output is automatically parsed by another program to produce statistics (e.g. to calculate bottlenecks in pipelines).
  4. Right now the majority of commands produce the stats. This could easily be disabled if you don't think it's useful.
  5. I assume it would be useful to add stats besides time to the various plugins. However, I am not 100% sure what kind of metrics the plugins you plan to add will produce. If you have any reading material on that, I can take a look into it.

Topology Implementation details:

  1. Added one function that parses a topology file with the same format as above and then transforms it into the range format used internally.

Looking forward to your comments 😃

@travisperson travisperson changed the title IPTB Pull request Add stat collection to IPTB Aug 28, 2018
Member

@travisperson travisperson left a comment

Thanks for updating!

Right now the majority of commands produce the stats. This could easily be disabled if you dont think its useful.

I don't think having stats around init and start will be very useful. Mainly because I don't think iptb is the best way to measure that behavior. The measurements would tell you more about iptb and the plugin than the actual thing behind it.

The connect command also has the possibility for a lot of overhead. The command first has to determine how to connect one node to another (request swarm addresses, and possibly a peer id) before it tries to run a connect call.

I think it's best to only have the stats collecting on the run command for now.

I assume it would be useful to add stats besides time on the various plugins. However, I am not 100% sure what kind of metrics the plugins you plan to add produce. If you have any reading material on that then I can take a look into it.

I think we would just want to provide a way for the user to specify different metrics to collect from each node the command was run against. This would just leverage the Metrics interface.

With regard to the Stats struct, should we also be reporting the individual times for each node?

@@ -1,2 +1,5 @@
*.so
iptb
.vscode
topologies/*
deb.sh
Member

The .vscode I think is fine, but I don't think we need the other two.

return nil, errors.New("Line " + strconv.Itoa(lineNumber) + " does not follow the correct format")
}
destinations = strings.Split(lineTokenized[1], ",")
}
Member

nit: The else of this if statement is needless, it can just be part of the regular flow when the statement is false.

@@ -14,6 +14,7 @@ import (
)

func loadPlugins(dir string) error {

Member

Any reason for these changes?

Author

The code formatting tool decided to add an empty line there. I have manually checked the files and they are the same.

for _, connectionRow := range topologyGraph {
from := connectionRow[0]
to := connectionRow[1:]
err = connectNodes(tb, []int{from}, to, timeout, flagStats)
Member

Instead of passing the flagStats through, it might be better (and probably should have been done the first time) to return a ([]results, error) from connectNodes and then call buildReport at the end.

c1 = (length-1)/2 + 1
c2 = c1 - 1
// Print individual results to JSON format
func printIndivudualResults(nodeTime []float64, nodeID []int, metric string) {
Member

Think that's a typo and should be printIndividualResults ;)

the strings are not castable to float
the function returns an error
*/
func substractArrays(lhs []string, rhs []string) ([]float64, error) {
Member

One more typo, subtractArrays*

Author

I will fix all of them :) Thanks a lot for the review, I totally agree that details matter a lot!

Member

No worries! And one of the comments I made got marked as outdated for some reason, but printIndividualResults is typo'd as printIndivudualResults


// Calculate the min, max mean, std with a single pass of the array
sum := 0.
sumSquarred := 0.
Member

sumSquared (just one r)

}

//BuildStats generates a stats struct from an input float64 slice
//It returns an error if the slice is empty
Member

These two lines are missing a space after the // to be consistent with the others (sorry, details stand out to me haha)

@@ -26,7 +26,7 @@ $ iptb shell 4
$ ipfs cat QmNqugRcYjwh9pEQUK7MLuxvLjxDNZL1DH8PJJgWtQXxuF
hey!
```

Available plugins now: Local IPFS node (plugin_name: localipfs), Docker IPFS node (plugin_name: dockeripfs)
Member

Would also be nice to be able to list available plugins via the CLI. Doesn't have to be in this PR though, just mentioning.

There are three variants of the command. It can accept no arugments,
a single argument, or two arguments. The no argument and single argument
expands out to the two argument usage.
There are four variants of the command. It can accept no arugments,
Member

arguments*

expands out to the two argument usage.
There are four variants of the command. It can accept no arugments,
a single argument, two arguments or a file. The no argument and single argument
expands out to the two argument usage. The file argument, parses the file and exapnds it
Member

expands* (the second one)

commands/run.go Outdated
nodeMetricValue[i] = value
}
return nodeMetricValue, nil
}
Author

Here I opted to collect the metric before and after the command is executed in run.go. I think with this approach the code is cleaner and makes more sense.
I could add it in the mapWithOutput function, but since that function uses locks, the more operations I do inside them, the greater the overhead.

I would love to hear your thoughts on which design you think is more useful for the metrics you plan to implement.

the function returns an error
TODO: This could be more generic
*/
func subtractArrays(lhs []string, rhs []string) ([]float64, error) {
Author

As implemented now, the metrics are collected as follows:
Let m1 be a metric to collect; we sample m1_before before the execution of the run command and m1_after after it finishes. Then we calculate the statistics based on Δm1 = m1_after − m1_before.
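The quoted subtractArrays helper could plausibly look like this. This is a sketch only: the signature matches the quoted diff, but the error handling details are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
)

// subtractArrays parses two equal-length slices of numeric strings (metric
// samples) and returns the element-wise difference lhs[i] - rhs[i], i.e.
// the "after" samples minus the "before" samples. It returns an error on a
// length mismatch or a string that is not castable to float.
func subtractArrays(lhs []string, rhs []string) ([]float64, error) {
	if len(lhs) != len(rhs) {
		return nil, fmt.Errorf("length mismatch: %d vs %d", len(lhs), len(rhs))
	}
	deltas := make([]float64, len(lhs))
	for i := range lhs {
		after, err := strconv.ParseFloat(lhs[i], 64)
		if err != nil {
			return nil, err
		}
		before, err := strconv.ParseFloat(rhs[i], 64)
		if err != nil {
			return nil, err
		}
		deltas[i] = after - before
	}
	return deltas, nil
}

func main() {
	d, err := subtractArrays([]string{"10.5", "7"}, []string{"4.5", "2"})
	fmt.Println(d, err)
}
```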

@travisperson
Member

travisperson commented Sep 5, 2018

I was giving this some thought over the weekend and I think we should try and only add what is minimally required to expose the functionality required to do these stat calculations. While iptb could actually do all the calculations, I don't know if there is a strong reason why it needs to do them. It might be better to just report the raw stats and allow the user to handle how to calculate values from them. So with this we would just be adding an elapsed time to the results (as you already have).

Collecting metrics before and after isn't a lot different than running iptb metric <node> <metric> manually before or after (the command probably should support ranges), so I don't know if we should add anything like that at the moment.

Per our discussion on IRC, you also pointed out that for metrics it would be better to actually sample over the course of the command's execution. I think this would probably be the correct thing to do as well for any real-time stats (CPU / memory / etc.). This does get a little bit complicated, though, and I think we really need to start a discussion in another issue and work through exactly the requirements needed to find a good approach if we want to go down this route.

For this PR though, I think we can keep it pretty simple and just always report the elapsed time for all commands run that use the standard buildReport. We will no longer need the --stats flag right now.

iptb run -- ipfs id --format="<id>"
node[0] exit 0; elapsed 10ns

QmXschyVzLmS4JqPN1kuhCXTjau2oQkVuzjvTbQFTGm3w3
node[1] exit 0; elapsed 10ns

Qme3h8WwfpBiPHPfdEs9GuegijVhaBX9xYPXTTDAS6uciR
node[2] exit 0; elapsed 10ns

QmbKpZTyRodMmjkBPdyngkha1ic6bu7YVXPzisRdFNi5Yq

To summarize: in this PR we should have the connect topology and, in addition, add an elapsed time to all results / buildReport.

I think we can easily move forward with that, and then start to discuss further how we can approach the other topics.
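Mechanically, always reporting the elapsed time as described above amounts to wrapping each command execution with a timer. A generic sketch, not iptb's actual code (runTimed is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runTimed executes a command, returning its combined output along with the
// wall-clock duration it took to run.
func runTimed(name string, args ...string) (string, time.Duration, error) {
	start := time.Now()
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), time.Since(start), err
}

func main() {
	out, elapsed, err := runTimed("echo", "hello")
	if err != nil {
		panic(err)
	}
	// time.Duration's String method gives a human-readable form like "1.2ms".
	fmt.Printf("exit ok; elapsed %s\n%s", elapsed, out)
}
```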

@travisperson
Member

travisperson commented Sep 5, 2018

On an additional note I'm actually going to push back on the topology feature as well right now.

Sorry for this being a bit messy. I should have requested an issue from the start so that we could have a discussion outside of the code.

My concern at the moment with regard to the topology is that it can be accomplished with a shell script fairly easily, and I don't see the exact benefit of the topology file right now, due to the restrictiveness of the format, as it works with exactly N nodes.

Let's take a step back from this and open two issues to see how we can address the requests:

  1. Connection topology: #76
  2. Collecting statistics: #77

I'll open these issues and bring some context in, we can then discuss them from there.

@davinci26
Author

The last commit reflects the discussion in #77 and #76. The user can run
iptb run 0 --encoding JSON <args> or iptb run 0 --encoding text <args> and get the output in the chosen encoding.
The JSON encoding transforms the output (including the plugin stderr and stdout) into JSON. This way the user can specify whether they prefer a human-readable or a machine-readable output.
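Assuming field names like Node, ExitCode, Errors, PluginStdOut, PluginStdErr, and Elapsed (these appear in the review diffs in this thread; all values here are illustrative), a single JSON-encoded result might look roughly like:

```json
{"Node":0,"ExitCode":0,"Errors":null,"PluginStdOut":"<command output>\n","PluginStdErr":"","Elapsed":0.0123}
```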

commands/run.go Outdated
Name: "encoding",
Usage: "Specify the output format, current options JSON and text",
Value: "text",
},
Member

@travisperson travisperson Sep 11, 2018

We should probably add this as a global flag under the app in main.go so that it works for other commands. The format error checking below can take place in the Before block as well.

Node int
Output testbedi.Output
Error error
Elapsed float64
Member

Keeping this as a time.Duration rather than converting it to seconds in a float64 would probably be good for this internal structure. It will also make it easier to provide a more human-readable format, as Duration#String handles that. The JSON output can probably stay a float64 returned in seconds, though, seconds being the SI unit and all.

if rs.Output.Error() != nil {
fmt.Printf("%s", rs.Output.Error())
if encoding == "text" {
fmt.Printf("node[%d] exit %d Elapsed %.4fs \n", rs.Node, rs.Output.ExitCode(), rs.Elapsed)
Member

nit: Lowercase the E here, and remove the extra space at the end.

Also with rs.Elapsed being a duration, we can use %s to get a human readable format out.

ExitCode int
Errors error
PluginStdOut string
PluginStdErr string
Member

Errors => Error as it's a single error.

nit: To keep the casing similar to other usages through the code and go in general I think we should use Stdout and Stderr here.

}
if b, err := ioutil.ReadAll(rs.Output.Stderr()); err == nil {
pluginErr = string(b)
}
Member

We need to handle the other side of this error, when the error is not nil.

PluginStdOut: pluginOut,
PluginStdErr: pluginErr,
Elapsed: rs.Elapsed,
})
Member

Need to also handle this error.

Member

@travisperson travisperson left a comment

Have a few comments around the implementation. I think once these are addressed we can move forward with this as we will have a pretty settled interface that I think will work.

Member

@travisperson travisperson left a comment

Looking good. Just need to handle an error, and fix a newline in a few places.

}
pluginErr := string(pluginErrB)
// JSONify the plugin output
rsJSON, _ := json.Marshal(output{
Member

missing this error


io.Copy(os.Stdout, rs.Output.Stdout())
io.Copy(os.Stdout, rs.Output.Stderr())

fmt.Println()
Member

We should move this up inside of the text encoding section, and then also make sure we have a newline outside of the loop (it can be printed for all cases that are not text).

This will make it so that all output for a single command will be ld-json, and between each execution will be a newline. This will help break up the output if appending to the same file.

The output should look like this if it were being appended to the same file:

{"Node": 0, ...}
{"Node": 1, ...}
{"Node": 2, ...}

{"Node": 0, ...}
{"Node": 1, ...}
{"Node": 2, ...}

@davinci26
Author

I handled the error cases mentioned, and the formatting is as specified when appending to a file.

Member

@frrist frrist left a comment

Hey @davinci26 thank you for all the work you have done here!

It looks like this PR is trying to tackle more than the title suggests. It appears that an output encoding option has been added for both text and JSON, in addition to stats for the elapsed time of the start command. Typically, changes of this nature require prior design discussion (we apologize for not making this clearer in our README). Until we have more design discussion around these features, we will be unable to accept the changes here, as we cannot yet determine whether they align with the long-term strategy for IPTB.

Some discussion around this has already been started in #77, and we have opened #88 to start a discussion around a similar feature set. It would be great if we could get your help hammering out the design details of these things.

If our pace here is limiting the work you are trying to do, please feel free to fork this repo -- that could even serve as a means to prototype some of your ideas so that we can better incorporate them into the design goals of IPTB.

@davinci26
Author

Hey @frrist,

I was thinking about this as well; this PR is getting too big, and for no good reason, as there was no design discussion prior to the beginning of the work. It was also an omission on my side not to open an issue prior to the PR. Feel free to decline the PR.

I have written my ideas in #77 under general thoughts and I am more than happy to help in designing meaningful event logs.

@whyrusleeping
Member

I'm really excited that iptb is getting so much attention. I love a lot of the direction here, let's get things designed nicely and march onwards :) Thanks @davinci26 for pushing!

One thing I always try to keep in mind in the design of tools is the Unix philosophy, primarily "do one thing well". What that means here specifically, I'm not sure, but I'm thinking that we may end up adding a companion tool to iptb for metrics collection, and just ensuring that iptb doesn't prevent us from getting the information we need. This would allow us to more easily tune the stats we collect for each use case. (A similar design philosophy went into the plugin architecture.)

@davinci26 davinci26 closed this Aug 11, 2020