Add low memory JSON reader for cudf.pandas
#16204
Conversation
python/cudf/cudf/_lib/json.pyx
Outdated
if len(final_columns) == 0:
    final_columns = new_chunk
else:
    for col_idx in range(len(meta_names)):
Just confirming that the concatenation technique here is generally the same as done in the parquet reader?
yup
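For context, a minimal, self-contained sketch of the per-column concatenation pattern in question; plain Python lists stand in for device columns, and read_chunks is a hypothetical placeholder (the actual reader concatenates libcudf columns):

meta_names = ["a", "b"]

def read_chunks():
    # Placeholder generator: each chunk yields one list per column in meta_names.
    yield [[1, 2], [10.0, 20.0]]
    yield [[3], [30.0]]

final_columns = []
for new_chunk in read_chunks():
    if len(final_columns) == 0:
        # First chunk: adopt its columns directly.
        final_columns = new_chunk
    else:
        # Later chunks: append each column onto the matching accumulated column,
        # mirroring the approach used by the chunked parquet reader.
        for col_idx in range(len(meta_names)):
            final_columns[col_idx] = final_columns[col_idx] + new_chunk[col_idx]

print(final_columns)  # [[1, 2, 3], [10.0, 20.0, 30.0]]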
Just one non-blocking question
python/cudf/cudf/_lib/json.pyx
Outdated
try:
    with nogil:
        c_result = move(libcudf_read_json(opts))
except (ValueError, OverflowError):
Just curious: could we use some logic such as _has_next() in the PQ chunked reader to break this loop instead of relying on this exception?
I don't like that we're catching the exception from a datasource here. The memory mapping is very much an implementation detail.
How expensive would it be to get the total source(s) size? Then we can loop until all of it is read.
It is expensive; we basically have to seek to the end of each file to get the size of all data sources. For remote data sources it also gets complicated to perform the seek properly.
Just curious if we could use some logic such as _has_next() in the PQ chunked reader to break this loop instead of this exception?

We basically call into the libcudf layer for that; is there any such provision for the JSON reader in libcudf?
FWIW, we already need the file size(s) to read JSON input(s). With the current implementation of the low memory JSON reader, we get the size of each input file for each byte range, so getting the sizes one more time to have a cleaner loop would not add much.
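To make that alternative concrete, a hedged sketch of a size-based loop along the lines suggested above (total size and chunk_size handling are illustrative; the real reader would request each byte range from libcudf rather than slice bytes in Python):

def read_in_byte_ranges(source: bytes, chunk_size: int):
    # With files, total_size would come from a stat()/seek-to-end per source.
    total_size = len(source)
    offset = 0
    chunks = []
    # Loop until every byte of the source has been covered, instead of
    # relying on an exception from the datasource to terminate.
    while offset < total_size:
        chunks.append(source[offset:offset + chunk_size])
        offset += chunk_size
    return chunks

data = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
print(read_in_byte_ranges(data, chunk_size=10))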
@mroeschke @mhaseeb123 I just moved the chunked reader to pylibcudf to resolve the conflicts. Would you be able to take a look at the changes again?
Thank you for working on this! I have just one comment based on the byte range reading behaviour we have seen in this issue.
Co-authored-by: Shruti Shivakumar <shruti.shivakumar@gmail.com>
/merge
Description
Fixes: #16122
This PR introduces low-memory JSON reading for cudf.pandas read_json.
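For illustration, a hedged sketch of how this path is typically exercised (the file name is a placeholder, and it assumes the low-memory path is picked up internally when cudf.pandas is active):

# cudf.pandas.install() is the documented way to enable the accelerator;
# it must run before pandas is imported.
import cudf.pandas
cudf.pandas.install()

import pandas as pd

# JSON Lines input; with the accelerator active this dispatches to cudf's
# JSON reader, which this PR teaches to read in byte-range chunks.
df = pd.read_json("data.jsonl", lines=True)
print(df.head())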
Checklist