Set the null count in output columns in the CSV reader #13221

Merged
merged 4 commits into from
Apr 26, 2023


@vuule vuule commented Apr 25, 2023

Description

Previously, the CSV reader left the null count of its output columns as UNKNOWN_NULL_COUNT. This has performance implications for later use of these columns, since an unknown count has to be computed from the null mask when it is first requested.
This PR adds a validity counter to the convert_csv_to_cudf kernel. The counter is then used to set the correct null count on the output columns.
It also adds a test that explicitly checks the null counts, since the existing CSV reader tests never checked them explicitly.
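
The idea of counting valid rows while converting can be sketched on the host in plain C++. This is only an illustration, not the libcudf kernel: ColumnBuffer, convert_fields, and null_count are hypothetical names, an empty field is assumed to mean null, and the loop is sequential where a GPU kernel would presumably need an atomic increment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical output buffer for one integer column: parsed values, a
// validity bitmask (bit set = row is valid), and a running count of valid rows.
struct ColumnBuffer {
  std::vector<int> data;
  std::vector<std::uint32_t> validity;
  std::size_t valid_count = 0;

  explicit ColumnBuffer(std::size_t rows)
    : data(rows, 0), validity((rows + 31) / 32, 0u) {}
};

// Stand-in for the conversion step. Assumption: an empty field is a null,
// and every non-empty field is a well-formed integer.
inline void convert_fields(const std::vector<std::string>& fields, ColumnBuffer& out)
{
  assert(fields.size() == out.data.size());
  for (std::size_t row = 0; row < fields.size(); ++row) {
    if (fields[row].empty()) { continue; }  // null: leave the validity bit unset
    out.data[row] = std::stoi(fields[row]);
    out.validity[row / 32] |= (1u << (row % 32));
    ++out.valid_count;  // counted during conversion, so no second pass is needed
  }
}

// The null count follows directly from the counter.
inline std::size_t null_count(const ColumnBuffer& col)
{
  return col.data.size() - col.valid_count;
}
```

With the counter available, the reader can hand back a column whose null count is already known, rather than leaving it to be derived from the bitmask later.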

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Apr 25, 2023
@vuule vuule added the cuIO (cuIO issue), Performance (Performance related issue), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels and removed the libcudf (Affects libcudf (C++/CUDA) code) label Apr 25, 2023
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Apr 25, 2023
@@ -859,7 +867,7 @@ table_with_metadata read_csv(cudf::io::datasource* source,
 const std::string dblquotechar(2, parse_opts.quotechar);
 std::unique_ptr<column> col = cudf::make_strings_column(*out_buffers[i]._strings, stream);
 out_columns.emplace_back(
-  cudf::strings::replace(col->view(), dblquotechar, quotechar, -1, mr));
+  cudf::strings::detail::replace(col->view(), dblquotechar, quotechar, -1, stream, mr));
Contributor Author

Unrelated change: I just noticed that the detail API needs to be used here to ensure that the right stream is used.

Comment on lines +172 to +173
CUDF_EXPECTS(col_lhs.null_count() == 0 and col_rhs.null_count() == 0,
"All elements should be valid");
Contributor Author

This is one test that was (kind of) checking the null count. With this additional check, it fails when the null count is incorrect.
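
Why an explicit check is needed can be illustrated with a small host-side sketch. It is not the libcudf test code: Column, count_nulls_from_mask, and null_count_is_correct are hypothetical names, and the stored count stands in for whatever the reader reported. Comparing values alone would pass for both columns below, while comparing the stored count against the bitmask catches the stale one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical column: a row count, a validity bitmask (bit set = valid),
// and the null count that the reader stored alongside it.
struct Column {
  std::size_t size;
  std::vector<std::uint32_t> validity;
  std::size_t null_count;
};

// Recompute the null count by walking the bitmask.
inline std::size_t count_nulls_from_mask(const Column& col)
{
  assert(col.validity.size() * 32 >= col.size);
  std::size_t nulls = 0;
  for (std::size_t row = 0; row < col.size; ++row) {
    const bool valid = ((col.validity[row / 32] >> (row % 32)) & 1u) != 0;
    if (!valid) { ++nulls; }
  }
  return nulls;
}

// The explicit check: the stored count must agree with the bitmask.
inline bool null_count_is_correct(const Column& col)
{
  return col.null_count == count_nulls_from_mask(col);
}
```

For example, a five-row column with mask 0x15 (rows 0, 2, and 4 valid) passes the check with a stored count of 2 and fails it with a stored count of 0.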

@vuule vuule marked this pull request as ready for review April 26, 2023 17:58
@vuule vuule requested a review from a team as a code owner April 26, 2023 17:58
@vuule vuule assigned vuule and unassigned vyasr Apr 26, 2023
@vuule vuule requested a review from vyasr April 26, 2023 17:58

vuule commented Apr 26, 2023

/merge

@rapids-bot rapids-bot bot merged commit 8b59663 into rapidsai:branch-23.06 Apr 26, 2023
@vuule vuule deleted the null-count-read_csv branch April 26, 2023 22:28
Labels
cuIO (cuIO issue), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Performance (Performance related issue)
4 participants