Fix all-empty input column for strings split APIs #16466

Merged
9 commits merged into rapidsai:branch-24.10 from fix-split-record
Aug 13, 2024

Conversation

@davidwendt davidwendt commented Aug 1, 2024

Description

Fixes the specialized handling of an all-empty input column in the strings split APIs.
Behavior was verified against Pandas str.split(pat, expand, regex), where each parameter maps to a group of APIs:
pat=None -- whitespace variants
expand=False -- record APIs
regex=True -- re APIs

  • split
  • split - whitespace
  • rsplit
  • rsplit - whitespace
  • split_record
  • split_record - whitespace
  • rsplit_record
  • rsplit_record - whitespace
  • split_re
  • rsplit_re
  • split_record_re
  • rsplit_record_re
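
For reference, the per-row semantics being compared against can be sketched in plain Python. This is only a model, assuming pandas follows Python's built-in split behavior for each row when expand=False. The helper names below are hypothetical, chosen to mirror the list above; they are not the libcudf or cuDF API.

```python
import re

# Hypothetical helpers: a plain-Python model of record-style splitting,
# producing one list of tokens per input row. Not the libcudf API.

def split_record(column, pat=None):
    # pat=None splits on runs of whitespace, as Python's str.split() does.
    return [s.split(pat) for s in column]

def rsplit_record(column, pat=None):
    # Same as split_record, but scanning from the end of each string.
    return [s.rsplit(pat) for s in column]

def split_record_re(column, pattern):
    # Regex variant: the delimiter is a regular expression.
    return [re.split(pattern, s) for s in column]

# A column in which every row is the empty string.
all_empty = ["", "", ""]

print(split_record(all_empty, ","))         # [[''], [''], ['']]
print(rsplit_record(all_empty, ","))        # [[''], [''], ['']]
print(split_record(all_empty))              # [[], [], []]
print(split_record_re(all_empty, r"[,;]"))  # [[''], [''], ['']]
```

In this model the explicit-delimiter and regex cases keep one empty token per row, while the whitespace case yields an empty list per row. That difference is why the whitespace variants are listed separately.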

Closes #16453

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Aug 1, 2024
@davidwendt davidwendt self-assigned this Aug 1, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 7, 2024
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 9, 2024
@davidwendt davidwendt changed the title [WIP] Fix all-empty input column for strings split APIs Fix all-empty input column for strings split APIs Aug 9, 2024
@davidwendt davidwendt marked this pull request as ready for review August 9, 2024 17:21
@davidwendt davidwendt requested review from a team as code owners August 9, 2024 17:21
@bdice bdice left a comment

Just to verify — this aligns with pandas behavior, but what about Spark? Code looks fine otherwise.

@davidwendt

Just to verify — this aligns with pandas behavior, but what about Spark? Code looks fine otherwise.

Good point. I forgot to link the original issue, which came from Spark: #16453
@ttnghia Can you verify this will be OK for Spark?

@ttnghia

ttnghia commented Aug 12, 2024

Spark integration tests (after reverting the workaround for empty strings added for issue #16453) all passed with this change 👍

@davidwendt

/merge

@rapids-bot rapids-bot bot merged commit 419fb99 into rapidsai:branch-24.10 Aug 13, 2024
82 checks passed
@davidwendt davidwendt deleted the fix-split-record branch August 13, 2024 12:52
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] split_record output empty list for empty input string
5 participants