This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Story: Test Suite for 1MB -> 100TB Payloads #102

Open · 4 tasks
flyingzumwalt opened this issue Jan 16, 2017 · 12 comments

Comments

@flyingzumwalt
Contributor

flyingzumwalt commented Jan 16, 2017

We don't have good metrics, graphs, or reports about performance as we increase data sizes and loads -- where, when, and how performance dips under particular circumstances. We need to know more than "does it scale?"; we need to know "how does it scale?" so we can identify the problem domains, etc.

Acceptance Scenario

This story will be done when IPFS Maintainers (or, ideally, anyone) can run a suite of scripts that test IPFS at each order of magnitude of total data, from 1MB -> 100TB (up to 500TB or 1PB).

For each magnitude the tests should cover a variety of payloads. At the very least, there should be

  • a giant file payload
  • a payload with lots of little files
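A minimal sketch of how these two payload shapes could be generated with plain shell tools; the sizes, file counts, and paths below are placeholders for illustration, not agreed targets:

# Hypothetical payload generation -- all sizes and counts are placeholders.
# Giant-file payload: one 10GB file of random data.
mkdir -p payload-giant
dd if=/dev/urandom of=payload-giant/giant.bin bs=1M count=10240

# Many-small-files payload: 100,000 files of 4KB each.
mkdir -p payload-small
for i in $(seq 1 100000); do
  dd if=/dev/urandom of=payload-small/file-$i.bin bs=4K count=1 status=none
done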

Tasks

  • Identify or Generate Test Data for each of the Workloads
  • Create the tests. Add them to https://github.com/ipfs/fs-stress-test (or another appropriate location)
  • Provide documentation on how to run the tests
  • Have at least one collaborator run the tests based on the provided documentation
@flyingzumwalt
Contributor Author

@jbenet @whyrusleeping What can we add to make the acceptance scenario more precise? What are the tests checking for?

  • Are they reporting on performance?
  • Are they watching memory load?
  • Are they watching blockstore performance?
  • Are they running checks to confirm that data were replicated properly?

@ghost

ghost commented Jan 16, 2017

Another note: getting hold of 100TB of hardware is non-trivial and expensive.

@flyingzumwalt
Contributor Author

Yeah, but there are definitely orgs out there who do have easy access to that kind of storage and want to test IPFS at those volumes. It might make sense to write these tests with the assumption that they will be run by 3rd parties who then hand back reports after running them.

@flyingzumwalt
Contributor Author

@Kubuxu this might be a good place for you to start tomorrow before we get to do a proper sprint planning call. Also watch the issues in the "Ready" column in https://waffle.io/ipfs/archives. At the moment most of them are for @jbenet to review stuff but you can review them too!

@flyingzumwalt flyingzumwalt changed the title Story: Test Suite for 1M -> 100TB Payloads Story: Test Suite for 1MB -> 100TB Payloads Jan 16, 2017
@Kubuxu
Contributor

Kubuxu commented Jan 17, 2017

The major thing we need to know is how blockstores scale with the number of items and the size of those items. Those are the two variables that will probably characterize performance.

Setup for these tests is expensive and slow (you need to write those GiBs or TiBs of data to disk). We can do the tests incrementally, but we need a setup dedicated to it.

Are they running checks to confirm that data were replicated properly?

In IPFS we have the HashOnRead option, which for archives IMO should be on by default; disk corruption happens even in RAID setups, and without it we have no way of noticing it. We have to check that it screams loudly about corruption, and that we have a way to recover (linking the corrupted block back to the original file and re-reading it to restore the block).
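A rough sketch of what that check could look like from the CLI; Datastore.HashOnRead and ipfs repo verify are existing go-ipfs options, but the workflow below is an assumption, not an existing test:

# Assumed workflow: enable hash verification on read, then re-read the data so a
# flipped bit in the blockstore is reported instead of silently returned.
ipfs config --json Datastore.HashOnRead true
ipfs cat <root-hash> > /dev/null   # <root-hash> is a placeholder for the added file
# 'ipfs repo verify' additionally walks the local blockstore and reports corrupt blocks.
ipfs repo verify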

@whyrusleeping
Contributor

Tests we want:

  • 1 node, adding dataset
  • 1 node adding dataset, same node cat'ing dataset
  • 2 nodes, add on one, cat on the other
  • 3 nodes, add on 1, cat on other two at the same time
  • 3 nodes, add on 1, cat on other two, one after the other
  • 10 nodes, add on 1, cat on others concurrently
  • 10 nodes, add on 1, cat on others serially
  • 100 nodes, add on 1, cat on others concurrently
  • 100 nodes, add on 1, cat on others in sets of 10 (first ten nodes concurrently, next ten, etc)
  • 100 nodes, add on 1, cat on others serially

Each of these tests will be run on each dataset, where the datasets vary by the following variables:

  • number of files
  • size of files (min/max)
  • nesting depth of directories

Each test should also be run for the following different node configurations:

  • routing = { normal, dhtclient, none}
  • NoSync = {true, false}
  • --raw-leaves on add
  • --chunker={normal, rabin}

during these tests, we want to gather the metrics described in this issue: ipfs/kubo#3607
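A hedged sketch of the simplest case above (1 node adding the dataset, then reading it back), using the per-add flags from the configuration list; the dataset path, output path, and use of GNU time for measurement are assumptions, not an agreed harness:

# Assumed one-node smoke test: add a dataset, then fetch it back on the same node.
export IPFS_PATH=$(mktemp -d)
ipfs init

# Add with the flag variants under test (--raw-leaves, --chunker=rabin); -Q prints only the root hash.
ROOT=$(/usr/bin/time -v ipfs add -r -Q --raw-leaves --chunker=rabin ./dataset)

# "cat" the dataset back: for a directory root, ipfs get reads every block under it.
/usr/bin/time -v ipfs get "$ROOT" -o "$(mktemp -d)/roundtrip"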

@whyrusleeping
Contributor

Ideally, we can start any of these tests with a UX of something like:

> iptest 10n-concur-cat --datasets=datadef.json --nodecfg=ncfg.json
Running... [ elapsed time: 1m23s ]
Test Complete!
Results available at: /ipfs/QmFooBarBaz

@jbenet
Contributor

jbenet commented Jan 23, 2017

something like

> iptest --routing=<value> --repo-sync=<value> --raw-leaves=<value> --chunker=<value> --num-files=<value> --file-size-min=<val>  --file-size-max=<val>
Running... [elapsed time: 1m23s]
Test complete!
results available at: /ipfs/QmFooBarBaz

@whyrusleeping
Contributor

@jbenet made a rudimentary tool to do this here: https://github.com/whyrusleeping/iptest

@jbenet jbenet self-assigned this Jan 25, 2017
@flyingzumwalt
Contributor Author

@jbenet reminder: For #122 we need you to clarify which tests we're aiming for. That mainly involves rearranging this list (#102 (comment)) and calling out which parts are important for #122 to enable.

@jbenet
Contributor

jbenet commented Jan 25, 2017

Clarifying the tests we need from #102 (comment):

We

  • MUST have P0
  • SHOULD have P1 and P2

where:

  • P0 (easiest, get these working first)
    • 1 node, adding dataset
    • 1 node adding dataset, same node cat'ing dataset
  • P1 (should not be much more, we should aim to get these)
    • 2 nodes, add on one, cat on the other
    • 3 nodes, add on 1, cat on other two at the same time (concurrent)
    • 3 nodes, add on 1, cat on other two, one after the other (serially)
  • P2 (10 -> 100 should not be much work, can be just one line change)
    • 10 nodes, add on 1, cat on others concurrently
    • 10 nodes, add on 1, cat on others serially
    • 100 nodes, add on 1, cat on others concurrently
    • 100 nodes, add on 1, cat on others serially
    • 100 nodes, add on 1, cat on others in sets of 10 (first ten nodes concurrently, next ten, etc)

As described in #102 (comment), each of these tests will be run on each dataset, varying these variables:

  • number of files
  • size of files (min/max)
  • nesting depth of directories

We can adjust these variables above to play with go-random-files better, or modify go-random-files' flags to take these variables better.

With tuples like:

num files, min file size, max file size, directory nesting depth
1, 1KB, 1KB, 1
1000, 10KB, 1MB, 10
10000, 500B, 1KB, 100
100000, 500B, 1KB, 1000
1000000, 500B, 1KB, 1000
10000000, 500B, 1KB, 1000
1, 1MB, 1MB, 1
10, 1MB, 10MB, 1
100, 1MB, 10MB, 5
1, 100MB, 100MB, 1
10, 20MB, 100MB, 3
100, 20MB, 100MB, 5
1000, 20MB, 100MB, 10
1, 1GB, 1GB, 1
10, 500MB, 1GB, 1
100, 100MB, 1GB, 5
1000, 100MB, 1GB, 10
1, 1TB, 1TB, 1
10, 500GB, 1TB, 1

We can auto-generate the tuples or pick a few we think are interesting. I think auto-generating may produce a lot more than we need, so making a standard list will probably be useful. We could auto-generate that list and then prune it, or something.
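A small sketch of the auto-generate-then-prune idea, crossing a few hand-picked axes; the specific values are illustrative only:

# Assumed tuple generator: cross file counts, size ranges, and depths, then prune by hand.
for count in 1 1000 100000 1000000; do
  for range in "500B 1KB" "1MB 10MB" "20MB 100MB" "100MB 1GB"; do
    for depth in 1 10 100; do
      echo "$count, ${range% *}, ${range#* }, $depth"
    done
  done
done > tuples.txt
# edit tuples.txt by hand to drop combinations that are uninteresting or too large to run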

Also, this makes me think that we should improve go-random-files to sample from other (non-uniform) distributions in the future, or support things like a total max size with file sizes randomly distributed, to generate more realistic workloads.

If any of the tuples above are too big to hit now, that's fine. Let's aim to get a bunch of them working first.

Each test should also be run for the following different node configurations:

  • routing = { normal, dhtclient, none}
  • NoSync = {true, false}
  • --raw-leaves on add
  • --chunker={normal, rabin}
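For reference, a sketch of how these four axes could map onto go-ipfs settings; Routing.Type and Datastore.NoSync are real config keys, but treating them as the complete set of knobs for these runs is an assumption:

# Assumed per-run node configuration for the axes above:
ipfs config Routing.Type dhtclient         # routing: dht (normal) / dhtclient / none
ipfs config --json Datastore.NoSync true   # NoSync: true / false
# the other two axes are per-add flags rather than config:
ipfs add -r --raw-leaves --chunker=rabin ./dataset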

during these tests, we want to gather the metrics described in this issue: ipfs/kubo#3607

@whyrusleeping
Contributor

whyrusleeping commented Jan 25, 2017

Generating these datasets from scratch on each test run is impractical. We should have a way to generate them deterministically, and be able to reuse previously generated datasets across test runs.
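One way to sketch that: key each generated dataset by its parameter tuple plus a fixed seed, and only generate it when the directory does not exist yet. The random-files invocation and its flags below are assumptions about how a seeded generator like go-random-files could be driven, not its current interface:

# Assumed deterministic, cached dataset generation.
SEED=42
TUPLE="1000-20MB-100MB-10"                    # num files, min size, max size, depth
DIR="datasets/${TUPLE}-seed${SEED}"

if [ ! -d "$DIR" ]; then
  # hypothetical invocation of a seeded generator in the spirit of go-random-files
  random-files --seed "$SEED" --files 1000 --depth 10 "$DIR"
fi
# reuse "$DIR" across test runs; the same tuple and seed should yield identical content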

@jbenet jbenet removed their assignment Jan 25, 2017