Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cursor-based ItemReader for MongoDB #4323

Closed
wants to merge 1 commit into from

Conversation

juchanei
Copy link
Contributor

The MongoItemReader currently uses pagination with MongoDB's skip operation, which can cause unnecessary document scans when requesting next pages.

The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
https://www.mongodb.com/docs/manual/reference/method/cursor.skip/#using-skip--

Therefore, I have implemented a new cursor-based MongoDB ItemReader that accesses the next documents without skipping.

@MinJunKweon
Copy link
Contributor

related #3824

@fmbenhassine
Copy link
Contributor

Awesome! Thank you for contributing this cursor-based implementation! I will plan this feature for Spring Batch 5.1.

Just curious, have you run a benchmark to compare the performance of this reader with the paging one?

@juchanei
Copy link
Contributor Author

juchanei commented Apr 3, 2023

Just curious, have you run a benchmark to compare the performance of this reader with the paging one?

@fmbenhassine

As you know, pagination-based item readers perform poorly in terms of performance due to the amount of documents that need to be skipped. This amount is related to the total number of documents and the batch size, making it difficult to quantify the performance improvement of cursor-based item readers in numbers.

In my case, when benchmarking with a batch size of 10k on 0-3.5 million documents, the elapsed time was reduced by about 20-40%. While different results may occur depending on the test conditions, I am attaching the scatter chart results from my test.

image

* @param batchSize size the batch size to apply to the cursor
*/
public void setBatchSize(Integer batchSize) {
this.batchSize = batchSize;
Copy link

@hmhuan hmhuan Apr 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking: Should we take care of batchSize when it is 0 or negative?
As I checked the spec of spring-data-mongo

  • Use {@literal 0 (zero)} for no limit. A negative limit closes the cursor after returning a single
  • batch indicating to the server that the client will not ask for a subsequent one.

ref: https://github.com/spring-projects/spring-data-mongodb/blob/main/spring-data-mongodb/src/main/java/org/springframework/data/mongodb/core/query/Query.java#L539-L546

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hmhuan I'm sorry for the delay in responding.
Users would expect that this code is implemented using spring-data-mongodb, given the explicit use of MongoOperations as an argument in MongoCursorItemReader. Therefore, if setBatchSize handles 0 or negative values, I think it could reduce predictability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of the parameter should be the same as expected by the Query API. The setter should add the javadoc tag @see Query#cursorBatchSize(int) so the user knows that the value will be set on that field, and therefore knows the (default) values to use. I will take care of this change when merging the PR.

fmbenhassine pushed a commit that referenced this pull request Aug 22, 2023
fmbenhassine added a commit that referenced this pull request Aug 22, 2023
- Update parameter types
- Update tests
- Update Javadocs
- Fix code formatting
@fmbenhassine
Copy link
Contributor

Excellent! Thank you for sharing those benchmarks. Great to see up to 40% improvement!

Rebased and merged as b103900. Refined in d84057c.

Thank you for your contribution!

@fmbenhassine fmbenhassine modified the milestones: 5.1.0, 5.1.0-M2 Aug 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants