Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use improved CUDF JSON validation #11464

Merged
merged 5 commits into from
Sep 17, 2024

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented Sep 11, 2024

This depends on rapidsai/cudf#15968 that was merged in recently

This fixes #10457
This fixes #10479
This fixes #10534
This fixes #10911

It also fixes some issue with JSON scan where if all of the input is invalid/empty lines before it would error out, but now our work around fixes that. It is not a final solution though.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@sameerz sameerz added the test Only impacts tests label Sep 14, 2024
@revans2 revans2 self-assigned this Sep 17, 2024
@revans2
Copy link
Collaborator Author

revans2 commented Sep 17, 2024

build

@revans2
Copy link
Collaborator Author

revans2 commented Sep 17, 2024

build

Copy link
Collaborator

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I had one minor comment.

@revans2
Copy link
Collaborator Author

revans2 commented Sep 17, 2024

build

@revans2
Copy link
Collaborator Author

revans2 commented Sep 17, 2024

@abellina please take another look

abellina
abellina previously approved these changes Sep 17, 2024
@abellina
Copy link
Collaborator

build

ttnghia
ttnghia previously approved these changes Sep 17, 2024
@revans2 revans2 dismissed stale reviews from ttnghia and abellina via 787535f September 17, 2024 18:23
@revans2
Copy link
Collaborator Author

revans2 commented Sep 17, 2024

build

@revans2
Copy link
Collaborator Author

revans2 commented Sep 17, 2024

@abellina please take another look. My changes to the text buffering accidentally filtered out too much for hive text files and our tests caught it. I made the change so it is now specific to JSON and added a test file to verify that the case when all of the lines are bogus or empty works as expected.

override def createBufferer(estimatedSize: Long,
lineSeparatorInRead: Array[Byte]): HostLineBufferer =
new HostLineBufferer(estimatedSize, lineSeparatorInRead)
new HostLineBufferer(estimatedSize, lineSeparatorInRead, true)
}

/**
* Buffer the lines in a single HostMemoryBuffer with the separator inserted inbetween each of
* the lines.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, could you add a comment that HostLineBufferer is used for JSON and one for HostStringColBufferer that should be used for other formats?

Copy link
Collaborator

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit.

@revans2 revans2 merged commit 2589976 into NVIDIA:branch-24.10 Sep 17, 2024
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
4 participants