Slowly handling large number of files #3528

Open
jdgcs opened this issue Dec 21, 2016 · 27 comments
Labels
topic/badger · topic/datastore · topic/repo

Comments

@jdgcs commented Dec 21, 2016

Version information:

% ./ipfs version --all
go-ipfs version: 0.4.4-
Repo version: 4
System version: amd64/freebsd
Golang version: go1.7

Type:

./ipfs add becomes very slow when handling a large number of files.

Priority: P1

Description:

./ipfs add became very slow when handling about 45K files (~300GB); it took about 3+ seconds of additional waiting after the progress bar finished.

As a workaround, we can run several IPFS instances on the same machine.

About the machine:
CPU: E3-1230V2, RAM: 16 GB, storage: 8 TB with a 240 GB SSD cache on ZFS

Thanks for the amazing project!

@jdgcs (Author) commented Dec 30, 2016

% ./ipfs repo stat
NumObjects 9996740
RepoSize 381559053765
RepoPath /home/liu/.ipfs
Version fs-repo@4

@whyrusleeping added the topic/repo label Sep 2, 2017
@FortisFortuna

I encounter this too.

@schomatis (Contributor)

Hey @FortisFortuna, yes, this is a common issue with the default flatfs datastore (it basically stores each 256K chunk of every added file as a separate file in the repository, which ends up overwhelming the filesystem). Could you try the badgerds datastore and see if it helps? (Initialize the repository with the --profile=badgerds option.)
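
As a quick illustration of why flatfs struggles (assuming the default ~/.ipfs repo path; this command is not from the original comment), you can count the per-chunk block files directly:

find ~/.ipfs/blocks -type f | wc -l   # roughly one file on disk per 256K chunk ever added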

@schomatis added the topic/datastore and topic/badger labels Jun 28, 2018
@FortisFortuna

thank you
ipfs config profile apply badgerds
ipfs-ds-convert convert

I have about 18 GB and 500K files (Everipedia) on the default flatfs. Do these commands convert the blocks from flatfs to badgerds so I don't have to do everything over again?

@Stebalien (Member)

Yes. However, it may be faster to just re-add everything, as the conversion still has to extract the blocks from flatfs and move them into badgerds.

@schomatis (Contributor)

Yes, but keep in mind the conversion tool will temporarily require free space equal to twice the size of the repo being converted.

@Stebalien (Member)

Also, I'd be interested in how big your datastore gets with badgerds.

@FortisFortuna

I am unable to build the conversion tool. It stalls for me on the make inside ipfs-ds-convert at [0 / 22].

@Stebalien (Member)

Looks like you're having trouble fetching the dependencies. Try building ipfs-inactive/ipfs-ds-convert#11.

@FortisFortuna

OK, thanks! The pull request you made let me build it. I will follow the instructions in this thread and #5013 now and try to convert the db (I backed up the flatfs version just in case). Thanks for the quick reply.

@FortisFortuna

Works, but still the same slow speed

@schomatis (Contributor)

@FortisFortuna That's strange, I would definitely expect a speedup when using Badger instead of the flat datastore for adding files. I can't say it would be fast, but it should be noticeably faster than your previous setup.

Could you, as a test, initialize a new repo with the --profile=badgerds option and add a small sample of your data set (say 30GB) to check whether you see different write speeds than with flatfs? (Badger's performance may degrade with bigger data sets, but not to the point of being comparable with flatfs, so this test should be representative enough to confirm that everything is set up properly on your end; if it is, we should investigate further on our side, or Badger's.)
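
A rough way to run that comparison (the repo path and sample directory below are placeholders, not from this thread):

IPFS_PATH=/tmp/ipfs-badger-test ipfs init --profile=badgerds
time IPFS_PATH=/tmp/ipfs-badger-test ipfs add -r -q ~/sample-30GB > /dev/null   # badger-backed repo
time ipfs add -r -q ~/sample-30GB > /dev/null                                   # existing flatfs repo, for comparison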

@Stebalien (Member)

Hm. Actually, this may be pins. Are you adding one large directory or a bunch of individual files? Our pin logic is really unoptimized at the moment, so if you add all the files individually, you'll end up with many pins and performance will be terrible.
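
(A quick, purely illustrative way to see how many recursive pins have accumulated:)

ipfs pin ls --type=recursive | wc -l   # one recursive pin per individually added file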

@FortisFortuna commented Jul 3, 2018

Everipedia has around 6 million pages, and I have IPFS'd about 710K of them in the past week on a 32 core 252G RAM machine. Something is bottlenecking because I am only getting about 5-10 hashes a second. I know for a fact the bottleneck is the ipfs add in the code. The machine isn't even running near full capacity.
I am using this:
https://github.com/ipfs/py-ipfs-api

import ipfsapi
api = ipfsapi.connect('127.0.0.1', 5001)
res = api.add('test.txt')

Specifically, a gzipped html file of average size ~15 kB is being added each loop.

@Stebalien (Member)

Ah. Yeah, that'd do it. We're trying to redesign how we do pins but that's currently under discussion.

So, the best way to deal with this is to just add the files all at once with ipfs add -r. Alternatively, you can disable garbage collection (don't run the daemon with the --enable-gc flag) and just add the files without pinning them (use pin=False).
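
For the py-ipfs-api snippet above, that would look roughly like this (a sketch only; the pin option is forwarded by the client as a request parameter, and the exact shape of the return value may differ between client versions):

import ipfsapi

api = ipfsapi.connect('127.0.0.1', 5001)
res = api.add('test.txt', pin=False)  # add without pinning; keep the daemon's GC disabled
print(res['Hash'])                    # the response still includes the file's hash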

@FortisFortuna commented Jul 3, 2018

I will try pin=False. I need to keep track of which files get which hashes though, so I don't think I can simply pre-generate the html files and then add them, unless you know a way.
If I skip the pinning, will I still be able to ipfs cat them?

@Stebalien (Member)

Once you've added a directory, you can get the hashes of the files in the directory by running either:

  • ipfs files stat --hash /ipfs/DIR_HASH/path/to/file to get the hash of an individual file.
  • ipfs ls /ipfs/DIR_HASH to list the hashes/names of all the files in a directory.

Note: If you're adding a massive directory, you'll need to enable [directory sharding](https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#directory-sharding--hamt) (which is an experimental feature).
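
Concretely, something like this (DIR_HASH and the ./pages directory are placeholders), with sharding turned on before adding the big directory:

ipfs config --json Experimental.ShardingEnabled true   # enable experimental HAMT directory sharding
ipfs add -r ./pages                                     # prints the root DIR_HASH at the end
ipfs ls /ipfs/DIR_HASH                                  # hash and name of every entry
ipfs files stat --hash /ipfs/DIR_HASH/path/to/file      # hash of one specific file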

@FortisFortuna

Thanks

@FortisFortuna

So to clarify: if I set pin=False, I can still retrieve / cat the files, right, as long as I keep garbage collection off? I noticed a gradual degradation in file addition speed as more files were added.

@FortisFortuna commented Jul 3, 2018

You are a god among men @Stebalien. Setting pin=False in the Python script did it! To summarize:

  1. Using badgerds
  2. Using --offline
  3. ipfs config Reprovider.Strategy roots
  4. ipfs config profile apply server
  5. Setting pin=False when ipfs add-ing in my Python script

Getting around 25 hashes a second now vs 3-5 before. ipfs cat works too.
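
For anyone landing here later, steps 1-4 map roughly to these commands (the exact invocation is illustrative; adjust to your setup), with step 5 handled in the Python client as above:

ipfs init --profile=badgerds            # 1: Badger datastore (on a fresh repo)
ipfs config profile apply server        # 4: server profile
ipfs config Reprovider.Strategy roots   # 3: only reprovide root hashes
ipfs daemon --offline                   # 2: run the daemon offline while bulk-adding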

@FortisFortuna

Hey guys. The IPFS server was working fine with the above helper options, until I needed to restart. When I did, the daemon tried to initialize but froze. I am attempting to update from 0.4.15 to 0.4.17 to see if that helps, but now it stalls on "applying 6-to-7 repo migration".
I have over 1 million IPFS hashes (everipedia.org).
Is there anything I am doing wrong?

@FortisFortuna

I see this in the process list:
"/tmp/ipfs-update-migrate590452412/fs-repo-migrations -to 7 -y"
Could it be I/O limitations?

@FortisFortuna commented Oct 30, 2018

Ok, so the migration did eventually finish, but it took a while (~ 1 hr). Once the update went through, the daemon started fast. It is working now.

@Stebalien (Member)

So, that migration should have been blazing fast. It may have been that "daemon freeze". That is, the migration literally uses the 0.4.15 repo code to load the repo for the migration.

It may also have been the initial repo size computation. We've switched to memoizing the repo size, as it's expensive to compute for large repos, but we still have to compute it up-front, so that might have delayed your startup.

@hoogw commented Feb 27, 2019

I ran ipfs add -r on 289 GB (average file size < 10 MB). After adding 70 GB, the speed slowed down noticeably; it took 2 days to reach 200 GB.

Do you mean that the way to speed this up (on go-ipfs v0.4.18) is
ipfs add pin=false -r xxxxxxxxx ?

Is this right?

@Stebalien (Member)

@hoogw please report a new issue.

@hoogw commented Feb 27, 2019

ipfs add pins by default (--pin=true).

To turn off pinning and speed things up:

D:\test>ipfs add --pin=false IMG_1427.jpg
 4.18 MiB / 4.18 MiB [========================================================================================] 100.00%
added QmekTFtiQqrhiqms8FXZqPD1TfMc9kQUoNF8WVUNBGJF8h IMG_1427.jpg

D:\test>
