
Index is apparently limited to 4 GB #351

Open · moio opened this issue Oct 1, 2020 · 7 comments

Comments

@moio commented Oct 1, 2020

👋 Hound developers!

I am trying to index a pretty large repo (144 GB, all current sources of openSUSE), and unsurprisingly the index turns out to be larger than 4 GB, so I hit this fatal message:

log.Fatalf("index is larger than 4GB")

Would it be possible / how hard would it be to support larger indexes?

I only had a brief look at read.go, and it seems to me that 32-bit offsets are part of the index file format, so changing that would require re-indexing, converting, or supporting two file formats. Is that correct?

Thanks for all your efforts on Hound!
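For background, a minimal sketch of why fixed-width 32-bit offsets cap the index at 4 GB. The byte values here are illustrative, assuming big-endian encoding as in google/codesearch; this is not Hound's actual code.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// A 4-byte big-endian offset field tops out at 2^32-1 bytes (~4 GB).
	raw32 := []byte{0xff, 0xff, 0xff, 0xff}
	fmt.Println(binary.BigEndian.Uint32(raw32)) // 4294967295

	// Widening the on-disk field to 8 bytes lifts the cap, but it
	// changes the file layout, so existing indexes would need rebuilding.
	raw64 := []byte{0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00}
	fmt.Println(binary.BigEndian.Uint64(raw64)) // 4294967296 (> 4 GB)
}
```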

@salemhilal (Contributor)

Oh yikes, I'm sorry you're running into that. It does look like this would involve supporting or moving to a 64-bit index file format. That's not work we have slated, but I think a PR would be appreciated. We're actively running Hound on what I thought was a large repository, but it looks like that repo is only about 8 GB.

@rfan-debug (Contributor)

This seems to be an important issue.

@salemhilal (Contributor)

@rfan-debug it's likely something we'll have to do at some point. Are you interested in tackling it?

@rfan-debug (Contributor)

I think the fix itself is not difficult. However, I am not sure how to test it reliably if I change any code.

It seems that we don't have sufficient integration tests.

@salemhilal (Contributor)

I think that's part of what makes this issue tricky. If you're willing to write unit or integration tests, I'd definitely welcome that as well.

@rfan-debug (Contributor)

I think the unit tests are sufficient for the current codesearch, but we lack real integration tests.

I skimmed over codesearch and found that the root cause of the 4 GB limit is the uint32 data type used everywhere. I need some time to check all the places where uint32 is used for indexing and replace it with uint64. We would also need to add some bit-operation helpers for 64-bit data types.
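As a rough illustration of the kind of mechanical change involved (the accessor names here are hypothetical, not Hound's actual internals): every helper that takes or returns a uint32 offset has to be widened, and the surrounding arithmetic checked for truncation.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// uint32At is the 32-bit style accessor: both the offset into the
// index data and the value read back are capped at 4 GB.
func uint32At(data []byte, off uint32) uint32 {
	return binary.BigEndian.Uint32(data[off : off+4])
}

// uint64At is the widened variant: 64-bit offsets everywhere,
// and 8-byte fields on disk.
func uint64At(data []byte, off int64) uint64 {
	return binary.BigEndian.Uint64(data[off : off+8])
}

func main() {
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:], 42)
	binary.BigEndian.PutUint64(buf[8:], 42)
	fmt.Println(uint32At(buf, 0), uint64At(buf, 8)) // 42 42
}
```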

Now I think a good way to build up the integration test set is (see the sketch after the list):

  • Use the current 32-bit code to build a code search system on a codebase (e.g. hound itself)
  • Add 100 example queries and record their results
  • Migrate the data type in codesearch from 32-bit to 64-bit
  • Verify the 100 example queries' results on the new system
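A minimal golden-file version of steps 2 and 4 could look like the sketch below; the search function, the JSON file path, and its schema are all placeholders, not existing Hound APIs.

```go
package index_test

import (
	"encoding/json"
	"os"
	"reflect"
	"testing"
)

// search is a stand-in for whatever entry point the real test would
// call into the index; it is not an existing Hound function.
func search(query string) []string {
	return nil // placeholder
}

// TestGoldenQueries replays recorded queries against the current
// build and compares with results captured from the 32-bit build.
func TestGoldenQueries(t *testing.T) {
	raw, err := os.ReadFile("testdata/golden_queries.json")
	if err != nil {
		t.Skip("golden file not generated yet")
	}
	var golden map[string][]string // query -> expected results
	if err := json.Unmarshal(raw, &golden); err != nil {
		t.Fatal(err)
	}
	for query, want := range golden {
		if got := search(query); !reflect.DeepEqual(got, want) {
			t.Errorf("query %q: got %v, want %v", query, got, want)
		}
	}
}
```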

@Urmeli0815 commented Jan 1, 2021

I gave it a shot because I also thought that it would be straightforward, but it's more difficult than expected.

The biggest hurdle is that the index size is tightly bound to the maximum size of an array/slice. A 64-bit-sized index couldn't directly be mapped to a []byte, because the maximum size of an array/slice is MaxInt32. And with that, the quite complex operations on slices would need a migration.
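One possible way around the single-[]byte limit (a sketch, not Hound code) is to keep the index file behind an io.ReaderAt, which takes int64 offsets, and read fixed-width fields on demand instead of mapping the whole file into one slice.

```go
package main

import (
	"encoding/binary"
	"io"
	"log"
	"os"
)

// readUint64At fetches an 8-byte big-endian field at an int64 offset,
// so the index can grow past any slice-length limit.
func readUint64At(r io.ReaderAt, off int64) (uint64, error) {
	var buf [8]byte
	if _, err := r.ReadAt(buf[:], off); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint64(buf[:]), nil
}

func main() {
	f, err := os.Open("some.idx") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	v, err := readUint64At(f, 0)
	if err != nil {
		log.Fatal(err)
	}
	log.Println(v)
}
```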

I think a better approach would be to support different backend implementations for the Index type. E.g. I could imagine that an implementation with an SQLite or bbolt backend would be quite easy and would automatically support very large index files.
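A rough sketch of that pluggable-backend idea, with the interface and method names invented for illustration (Hound has no such interface today):

```go
package index

// IndexBackend is a hypothetical abstraction over index storage.
type IndexBackend interface {
	// PostingList returns the file IDs containing the given trigram.
	PostingList(trigram uint32) ([]uint64, error)
	// Name returns the path of the file with the given ID.
	Name(fileID uint64) (string, error)
	Close() error
}

// A flat-file implementation would wrap the existing format; a bbolt
// or SQLite implementation would map trigrams to posting lists as
// key-value pairs and sidestep slice-size limits entirely.
```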
