Identify the "big bugs" and "big optimizations" relevant for data.gov #105

flyingzumwalt · 2017-01-16T19:52:36Z

Make a short list of the "big bugs" and "big optimizations" relevant for this sprint. We'll want a good list to have in mind.

e.g. file attrs, bitswap supporting paths (kills so many RTTs)

Kubuxu · 2017-01-17T17:19:02Z

Storage is the one that everyone notices, but in my opinion it isn't the limiting factor in this case.
300 TiB of data is such a huge amount that just adding it to go-ipfs would currently take about 6 months (300*2^40 / (20*2^20) / 60/60/24/30; about 20 MiB/s is the add speed I got from go-ipfs on a beefy PC with SSD caching and config optimizations). That can surely be improved.
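The estimate above can be reproduced with a few lines of Python. This is just the arithmetic from the comment written out, using the same 30-day month; the variable names are illustrative.

```python
# Back-of-the-envelope estimate of how long adding the dataset would take.
TIB = 2**40
MIB = 2**20

dataset_bytes = 300 * TIB   # dataset size quoted in the comment
add_rate = 20 * MIB         # measured add speed, in bytes per second

seconds = dataset_bytes / add_rate
days = seconds / (60 * 60 * 24)
months = days / 30          # 30-day months, as in the original formula

print(f"{seconds:.0f} s = {days:.1f} days = {months:.2f} months")
```

This comes out at roughly 182 days, i.e. about 6 months of continuous adding at that rate.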

In my opinion we should focus on getting the performance for those big datasets into a range where it is possible to use them at all. This includes adding, fetching, and rechecking.

Also I have no idea if our GC will run at all with such a number of keys (it might run out of memory).

The DHT is another major problem that might make go-ipfs choke even if you've got the disk space.

Filestore would be nice space-wise, but it could probably be a sprint on its own (or not, depending on how cleanly we want to do it), and even if we got it there might be other barriers to deploying this data into IPFS.

whyrusleeping · 2017-01-17T18:55:24Z

Some notes I wrote down the other day:

UX
- Adds are slow
  - fetches are bursty
    - this is likely due to the bitswap concurrency factor per peer being too high
  - small files are slow
    - this is likely due to us having poor batching code, doing one batch per tiny file
- managing things you've added is hard
  - Automatically add entries to mfs, maybe an equivalent of the 'Downloads' directory on mfs
  - Should make a survey for what UX people want here

Scaling Performance
- Does flatfs degrade? (need metrics)
- Look at alternate datastores
  - SQL
  - Bolt
  - "RoundFS"
- Content Routing is slow
  - Provide selectors could help, harder to do
  - Trackers are a fairly easy thing to do
  - Larger block size could help scale the problem down by a constant factor
    - Need 'importer parameters' on objects to validate things properly
  - Need bitswap without providing
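
The "one batch per tiny file" note under UX can be illustrated with a minimal sketch. This is not go-ipfs code: `Batcher`, `max_bytes`, and the `flush` callback are hypothetical names, and the list of blocks stands in for a real datastore batch. The idea is to keep one batch open across files and commit only when a size threshold is reached.

```python
from typing import Callable, List, Tuple

Block = Tuple[str, bytes]  # (key, data); hypothetical block representation


class Batcher:
    """Accumulates blocks across many files and flushes by size, not per file."""

    def __init__(self, flush: Callable[[List[Block]], None], max_bytes: int = 1 << 20):
        self._flush = flush            # hypothetical "write this batch" callback
        self._max_bytes = max_bytes    # flush threshold in bytes
        self._pending: List[Block] = []
        self._pending_bytes = 0

    def put(self, key: str, data: bytes) -> None:
        self._pending.append((key, data))
        self._pending_bytes += len(data)
        if self._pending_bytes >= self._max_bytes:
            self.commit()

    def commit(self) -> None:
        # Write out whatever is pending and start a fresh batch.
        if self._pending:
            self._flush(self._pending)
            self._pending = []
            self._pending_bytes = 0


flushes: List[int] = []  # records how many blocks each flush contained
batcher = Batcher(lambda blocks: flushes.append(len(blocks)), max_bytes=4096)
for i in range(100):                      # 100 tiny "files" of 100 bytes each
    batcher.put(f"block-{i}", b"x" * 100)
batcher.commit()                          # flush whatever is left at the end
print(flushes)
```

With these illustrative numbers the 100 tiny files end up in three flushes rather than 100, which is the effect the note is asking for.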

jbenet · 2017-01-17T18:55:35Z

@whyrusleeping and I added some notes here ipfs/notes#216 -- copied here

data.gov

@whyrusleeping has a diagram (post it here maybe?)
improving add perf
improving UX
went over possible on-disk datastore changes
- single mmapped file, btree index of offsets, unordered blocks after
went over ipfs-pack, manifest, verify
- how it combines very well with filestore
- importer string
other-repo-datastore
accumulators wish list
to discuss still:
- s3-datastore
- filestore implementation details

flyingzumwalt · 2017-01-17T20:34:02Z

Relevant notes from the sprint planning call:

Big Bugs & Optimizations

TODO: dig up diagram @whyrusleeping created

Adding is still very slow
- adding large files is faster than adding lots of small files
- need a way to test these things. See Story: Test Suite for 1MB -> 100TB Payloads #102
- @lgierth recently added ~3.2 TB for CCC. It took about a day to add. Performance dropped as the repo grew. Would have taken half a day if performance had stayed constant.
- @Kubuxu ran some tests (see the comment above in this issue, #105)
- Path forward: design good tests. See Story: Test Suite for 1MB -> 100TB Payloads #102
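
The figures in the @lgierth bullet above imply rough throughput numbers, worked out below. Assumptions not in the notes: TB is taken as 10^12 bytes, "about a day" as exactly 24 hours, and "half a day" as 12 hours.

```python
TB = 10**12                  # assumption: decimal terabytes
added = 3.2 * TB             # ~3.2 TB added for CCC
day = 24 * 60 * 60           # assumption: "about a day" = 24 h, in seconds

observed = added / day           # average rate over the whole add
constant = added / (day / 2)     # rate implied by finishing in half a day

print(f"observed average: {observed / 1e6:.1f} MB/s")
print(f"if performance had stayed constant: {constant / 1e6:.1f} MB/s")
```

Under those assumptions the average was about 37 MB/s, against roughly 74 MB/s had the initial speed held, so the slowdown as the repo grew cost about half the throughput.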
Fetching from the network is very slow (won't be able to fill the pipes)
- tests should also address this
DHT with huge datasets might get oversaturated
- DHT is not going to scale in time for this sprint -- means we need to find a way to do the routing. See Figure out content routing without DHT #120
- {@whyrusleeping mentioned something I didn't hear..}
Garbage Collection might not work with huge datasets
- leaving GC out of scope for this sprint.
Bitswap hasn't really been tested yet
- @lgierth & @whyrusleeping ran some tests on this but didn't get clear info. It took over a week to ???
- See Make sure Bitswap works in all cases #121
Private Networks -- do we need it in order to do Provide Instructions for setting up data.gov Collaborators' Testbed Network #116 and Figure out content routing without DHT #120?

flyingzumwalt · 2017-01-17T20:35:12Z

Marking the issue "Done" because we've identified the list, but we will still be using it as a reference.

flyingzumwalt added the ready label Jan 16, 2017

flyingzumwalt added this to the Data.gov (aka 300 TB Challenge) milestone Jan 16, 2017

flyingzumwalt assigned jbenet, whyrusleeping and Kubuxu Jan 16, 2017

flyingzumwalt mentioned this issue Jan 16, 2017

Sprint: Data.gov (aka 300 TB Challenge) #87

Open

flyingzumwalt closed this as completed Jan 17, 2017

flyingzumwalt removed the ready label Jan 17, 2017
