Optimize date_bin (2x faster) #10215

Merged Apr 25, 2024 (2 commits)

@simonvandel simonvandel commented Apr 24, 2024

Which issue does this PR close?

Closes #10228

Rationale for this change

date_bin could be faster.

What changes are included in this PR?

As mentioned in the docs for `PrimitiveArray::unary`, it is faster to apply an infallible operation across all values, valid and invalid (null) alike, than to branch on validity at every value.

  1. Make stride function infallible
  2. Use unary method

This gives the following speedup on my machine:
Before: 22.345 µs
After: 10.558 µs

So around 2x faster
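
To illustrate the idea, here is a minimal, std-only sketch. It is not the code from this PR and does not use arrow-rs: `NullableI64`, `bin_floor`, `bin_per_row`, `bin_unary` and `to_options` are hypothetical names, and the validity mask is a plain `Vec<bool>` standing in for Arrow's null buffer. It only contrasts branching on validity per row with mapping an infallible function over every slot and reusing the mask.

```rust
/// Simplified stand-in for an Arrow primitive array (assumption, not arrow-rs):
/// a dense value buffer plus a validity mask (true = valid, false = null).
#[derive(Debug, Clone, PartialEq)]
pub struct NullableI64 {
    pub values: Vec<i64>,
    pub validity: Vec<bool>,
}

/// Hypothetical infallible stride function: floors `ts` to the start of its
/// bin of width `stride`, with bins aligned to `origin`.
/// Assumes `stride > 0`. Wrapping arithmetic keeps it panic-free even for the
/// arbitrary values that may sit in null slots.
pub fn bin_floor(stride: i64, ts: i64, origin: i64) -> i64 {
    let diff = ts.wrapping_sub(origin);
    ts.wrapping_sub(diff.rem_euclid(stride))
}

/// Per-row style: checks validity for every element before computing.
pub fn bin_per_row(input: &NullableI64, stride: i64, origin: i64) -> NullableI64 {
    let mut values = Vec::with_capacity(input.values.len());
    let mut validity = Vec::with_capacity(input.values.len());
    for (v, valid) in input.values.iter().zip(&input.validity) {
        if *valid {
            values.push(bin_floor(stride, *v, origin));
            validity.push(true);
        } else {
            values.push(0); // placeholder value for a null slot
            validity.push(false);
        }
    }
    NullableI64 { values, validity }
}

/// `unary` style: applies the function to every slot without looking at
/// validity, then copies the mask. Null slots hold meaningless results,
/// but they stay masked out.
pub fn bin_unary(input: &NullableI64, stride: i64, origin: i64) -> NullableI64 {
    NullableI64 {
        values: input
            .values
            .iter()
            .map(|&v| bin_floor(stride, v, origin))
            .collect(),
        validity: input.validity.clone(),
    }
}

/// Logical view of the array, so results can be compared ignoring whatever
/// happens to be stored in null slots.
pub fn to_options(a: &NullableI64) -> Vec<Option<i64>> {
    a.values
        .iter()
        .zip(&a.validity)
        .map(|(&v, &ok)| if ok { Some(v) } else { None })
        .collect()
}
```

Both functions produce the same logical result when compared through `to_options`. The second has no data-dependent branch in its inner loop, which is the property the `unary` docs point to. It only works because the function cannot fail on any input, which is why the stride function had to be made infallible first.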

Are these changes tested?

Yes, existing tests.

Are there any user-facing changes?

The date_bin function runs faster.

@alamb alamb left a comment

This is beautiful @simonvandel -- thank you very much 😍 -- it makes a lot of sense to check the option once per batch rather than once per row.

I filed #10228 to track this improvement.

really nice 🥇

@Dandandan Dandandan merged commit 169701e into apache:main Apr 25, 2024
26 checks passed
@Dandandan Dandandan commented

Nice work, thank you @simonvandel

ccciudatu pushed a commit to hstack/arrow-datafusion that referenced this pull request Apr 26, 2024
* add date_bin benchmark

* optimize date_bin
