Slowly handling large number of files #3528

Open
jdgcs opened this issue Dec 21, 2016 · 27 comments
Labels
topic/badger · topic/datastore · topic/repo

Comments

@jdgcs commented Dec 21, 2016

Version information:

% ./ipfs version --all
go-ipfs version: 0.4.4-
Repo version: 4
System version: amd64/freebsd
Golang version: go1.7

Type:

./ipfs add becomes very slow when handling a large number of files.

Priority: P1

Description:

./ipfs add became very slow when handling about 45K files (~300GB); it took about 3+ seconds of additional waiting after the progress bar finished.

As a workaround, we can run several IPFS instances on the same machine.

About the machine:
CPU: E3-1230V2, RAM: 16 GB, storage: 8 TB with a 240 GB SSD cache on ZFS

Thanks for the amazing project!

@jdgcs (Author) commented Dec 30, 2016

% ./ipfs repo stat
NumObjects 9996740
RepoSize 381559053765
RepoPath /home/liu/.ipfs
Version fs-repo@4

@whyrusleeping added the topic/repo label Sep 2, 2017
@FortisFortuna

I encounter this too.

@schomatis (Contributor)

Hey @FortisFortuna, yes, this is a common issue with the default flatfs datastore (it basically stores each 256K chunk of every added file as a separate file in the repository, which ends up overwhelming the filesystem). Could you try the badgerds datastore and see if it helps? (Initialize the repository with the --profile=badgerds option.)
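
As a quick illustration of why flatfs struggles (assuming the default ~/.ipfs repo path; this command is not from the original comment), you can count the per-chunk block files directly:

find ~/.ipfs/blocks -type f | wc -l   # roughly one file on disk per 256K chunk ever added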

@schomatis added the topic/datastore and topic/badger labels Jun 28, 2018
@FortisFortuna

thank you
ipfs config profile apply badgerds
ipfs-ds-convert convert

I have about 18 GB and 500K files (Everipedia) on the default flatfs. Do these commands convert the blocks from flatfs to badgerds so I don't have to do everything over again?

@Stebalien (Member)

Yes. However, it may be faster to just re-add everything, as the conversion still has to extract the blocks from flatfs and move them into badgerds.

@schomatis (Contributor)

Yes, but keep in mind the conversion tool will temporarily require free space equal to twice the size of the repo being converted.

@Stebalien (Member)

Also, I'd be interested in how big your datastore gets with badgerds.

@FortisFortuna

I am unable to build the conversion tool. It stalls for me on the make inside ipfs-ds-convert at [0 / 22].

@Stebalien (Member)

Looks like you're having trouble fetching the dependencies. Try building ipfs-inactive/ipfs-ds-convert#11.

@FortisFortuna

OK, thanks! The pull request you made let me build it. I will follow the instructions in this thread and #5013 now and try to convert the db (I backed up the flatfs version just in case). Thanks for the quick reply.

@FortisFortuna

Works, but still the same slow speed

@schomatis (Contributor)

@FortisFortuna That's strange, I would definitely expect a speedup when using Badger instead of the flat datastore for adding files. I can't say it would be fast, but it should be noticeably faster than your previous setup.

Could you, as a test, initialize a new repo with the --profile=badgerds option and add a small sample of your data set (say 30GB) to check whether you see different write speeds than with flatfs? (Badger's performance may degrade with bigger data sets, but not to the point of being comparable with flatfs, so this test should be representative enough to confirm that everything is set up properly on your end; if it is, we should investigate further on our side, or Badger's.)
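
A rough way to run that comparison (the repo path and sample directory below are placeholders, not from this thread):

IPFS_PATH=/tmp/ipfs-badger-test ipfs init --profile=badgerds
time IPFS_PATH=/tmp/ipfs-badger-test ipfs add -r -q ~/sample-30GB > /dev/null   # badger-backed repo
time ipfs add -r -q ~/sample-30GB > /dev/null                                   # existing flatfs repo, for comparison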

@Stebalien (Member)

Hm. Actually, this may be pins. Are you adding one large directory or a bunch of individual files? Our pin logic is really unoptimized at the moment, so if you add all the files individually, you'll end up with many pins and performance will be terrible.
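
(A quick, purely illustrative way to see how many recursive pins have accumulated:)

ipfs pin ls --type=recursive | wc -l   # one recursive pin per individually added file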

@FortisFortuna commented Jul 3, 2018

Everipedia has around 6 million pages, and I have IPFS'd about 710K of them in the past week on a 32 core 252G RAM machine. Something is bottlenecking because I am only getting about 5-10 hashes a second. I know for a fact the bottleneck is the ipfs add in the code. The machine isn't even running near full capacity.
I am using this:
https://github.com/ipfs/py-ipfs-api

import ipfsapi
api = ipfsapi.connect('127.0.0.1', 5001)
res = api.add('test.txt')

Specifically, a gzipped html file of average size ~15 kB is being added each loop.

@Stebalien (Member)

Ah. Yeah, that'd do it. We're trying to redesign how we do pins but that's currently under discussion.

So, the best way to deal with this is to just add the files all at once with ipfs add -r. Alternatively, you can disable garbage collection (don't run the daemon with the --enable-gc flag) and just add the files without pinning them (use pin=False).
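
For the py-ipfs-api snippet above, that would look roughly like this (a sketch only; the pin option is forwarded by the client as a request parameter, and the exact shape of the return value may differ between client versions):

import ipfsapi

api = ipfsapi.connect('127.0.0.1', 5001)
res = api.add('test.txt', pin=False)  # add without pinning; keep the daemon's GC disabled
print(res['Hash'])                    # the response still includes the file's hash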

@FortisFortuna commented Jul 3, 2018

I will try pin=False. I need to keep track of which files get which hashes though, so I don't think I can simply pre-generate the html files and then add them, unless you know a way.
If I skip the pinning, will I still be able to ipfs cat them?

@Stebalien (Member)

Once you've added a directory, you can get the hashes of the files in the directory by running either:

  • ipfs files stat --hash /ipfs/DIR_HASH/path/to/file to get the hash of an individual file.
  • ipfs ls /ipfs/DIR_HASH to list the hashes/names of all the files in a directory.

Note: If you're adding a massive directory, you'll need to enable [directory sharding](https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#directory-sharding--hamt) (which is an experimental feature).
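
Concretely, something like this (DIR_HASH and the ./pages directory are placeholders), with sharding turned on before adding the big directory:

ipfs config --json Experimental.ShardingEnabled true   # enable experimental HAMT directory sharding
ipfs add -r ./pages                                     # prints the root DIR_HASH at the end
ipfs ls /ipfs/DIR_HASH                                  # hash and name of every entry
ipfs files stat --hash /ipfs/DIR_HASH/path/to/file      # hash of one specific file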

@FortisFortuna

Thanks

@FortisFortuna

So to clarify: if I set pin=False, I can still retrieve / cat the files, right, as long as I keep garbage collection off? I noticed a gradual degradation in file addition speed as more files were added.

@FortisFortuna commented Jul 3, 2018

You are a god among men @Stebalien. Setting pin=False in the Python script did it! To summarize:

  1. Using badgerds
  2. Using --offline
  3. ipfs config Reprovider.Strategy roots
  4. ipfs config profile apply server
  5. Setting pin=False when ipfs add-ing in my Python script

Getting around 25 hashes a second now vs 3-5 before. ipfs cat works too.
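
For anyone landing here later, steps 1-4 map roughly to these commands (the exact invocation is illustrative; adjust to your setup), with step 5 handled in the Python client as above:

ipfs init --profile=badgerds            # 1: Badger datastore (on a fresh repo)
ipfs config profile apply server        # 4: server profile
ipfs config Reprovider.Strategy roots   # 3: only reprovide root hashes
ipfs daemon --offline                   # 2: run the daemon offline while bulk-adding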

@FortisFortuna

Hey guys. The IPFS server was working fine with the above helper options, until I needed to restart. When I did, the daemon tried to initialize but froze. I am attempting to update from 0.4.15 to 0.4.17 to see if that helps, but now it stalls on "applying 6-to-7 repo migration".
I have over 1 million IPFS hashes (everipedia.org).
Is there anything I am doing wrong?

@FortisFortuna

I see this in the process list:
"/tmp/ipfs-update-migrate590452412/fs-repo-migrations -to 7 -y"
Could it be I/O limitations?

@FortisFortuna commented Oct 30, 2018

Ok, so the migration did eventually finish, but it took a while (~ 1 hr). Once the update went through, the daemon started fast. It is working now.

@Stebalien (Member)

So, that migration should have been blazing fast. It may have been that "daemon freeze". That is, the migration literally uses the 0.4.15 repo code to load the repo for the migration.

It may also have been the initial repo size computation. We've switched to memoizing the repo size, as it's expensive to compute for large repos, but we still have to compute it up-front, so that might have delayed your startup.

@hoogw commented Feb 27, 2019

I ran ipfs add -r on 289 GB (average file size < 10 MB). After adding 70 GB, the speed slowed down noticeably; it took 2 days to reach 200 GB.

Do you mean that the way to speed this up (on go-ipfs v0.4.18) is
ipfs add pin=false -r xxxxxxxxx ?

Is this right?

@Stebalien (Member)

@hoogw please report a new issue.

@hoogw commented Feb 27, 2019

ipfs add pins by default (--pin=true).

To turn off pinning and speed things up:

D:\test>ipfs add --pin=false IMG_1427.jpg
 4.18 MiB / 4.18 MiB [========================================================================================] 100.00%
added QmekTFtiQqrhiqms8FXZqPD1TfMc9kQUoNF8WVUNBGJF8h IMG_1427.jpg

D:\test>
