
Bulk indexing #62

Closed
mattonomics opened this issue Aug 6, 2014 · 5 comments

@mattonomics (Contributor)

Rather than submit posts for indexing one at a time, it would be better to use the Elasticsearch bulk index functionality.

We will have to consider error handling when using the bulk API, because it's possible that some posts get indexed while others don't. Elasticsearch reports what happened for each item, but we will have to resubmit the failed posts for indexing.

Additionally, we'll need to be aware of the maximum POST size on the server and the maximum amount of data Elasticsearch/Java can receive.
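To make the partial-failure concern concrete, here is a minimal Python sketch (illustrative only; the field names of the example posts are made up). It builds the NDJSON body the `_bulk` endpoint expects and picks out the failed items from a bulk response, using the response shape Elasticsearch actually returns: a top-level `errors` flag plus a per-item `status`.

```python
import json

def build_bulk_body(posts):
    """Build an NDJSON body for the Elasticsearch /_bulk endpoint.

    Each document gets an action line followed by its source line.
    """
    lines = []
    for post in posts:
        lines.append(json.dumps({"index": {"_id": post["id"]}}))
        lines.append(json.dumps(post))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline

def failed_ids(bulk_response):
    """Return the _id of every item that failed in a bulk response.

    The bulk API returns one entry per action; an item failed if its
    status is not 2xx. These are the posts we'd resubmit.
    """
    if not bulk_response.get("errors"):
        return []
    ids = []
    for item in bulk_response.get("items", []):
        result = item.get("index", {})
        if not 200 <= result.get("status", 0) < 300:
            ids.append(result.get("_id"))
    return ids

# Simulated partial-failure response: post "2" was rejected.
response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception"}}},
    ],
}
print(failed_ids(response))  # → ['2']
```

The key point: a 200 response from `/_bulk` does not mean every document was indexed, so the caller must walk `items` and retry the failures.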

@AaronHolbrook (Contributor)

I'd like us to do some digging into exactly what Elasticsearch tells us on partial indexing. I haven't run into a case of partial index failure.

Going to split the max POST size/max data problem out into a separate issue. I think we can focus initially on building a bulk index with a fairly conservative, customizable initial batch size, allowing smaller chunks if syncing fails.

@tlovett1 (Member)

Yea, I'd love to know how much of a performance difference we are talking about, since the max POST size problem introduces some serious complexities.

@AaronHolbrook (Contributor)

Let's focus this issue on simply implementing a conservative default bulk index.

#66 can focus on implementing a more intelligent bulk index size.

@AaronHolbrook (Contributor)

@tlovett1 was mentioning the default bulk index size over in #69; bringing that conversation over here.

Unless someone has another idea, how about 200 as the default chunk size?
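Splitting the sync queue into chunks of that size is straightforward; a minimal sketch (the `size=200` default is just the conservative number proposed here, and the knob for customizing it is hypothetical):

```python
def chunk_posts(posts, size=200):
    """Split a list of posts into batches of at most `size`.

    Each batch would become one bulk request; a smaller `size` could be
    passed in when syncing fails, as discussed above.
    """
    for start in range(0, len(posts), size):
        yield posts[start:start + size]

# 500 posts with the default chunk size of 200:
batches = list(chunk_posts(list(range(500))))
print([len(b) for b in batches])  # → [200, 200, 100]
```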

@tlovett1 (Member)

200 sounds like a pretty good conservative number to me. Still, some benchmarking information would be useful.


3 participants