[FEA] JSON: Basic Reader for String type #4135

Closed
revans2 opened this issue Nov 17, 2021 · 2 comments
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
task: Work required that improves the product but is not user facing

Comments

revans2 (Collaborator) commented Nov 17, 2021

This is probably one of the largest tasks. Once we have a cuDF API (#4133), we need to start using it. I think the first type we would want to support is just strings. Even then, it will likely need to be off by default for compatibility reasons.

This will need to:

  1. Set up a basic JSON reader, similar to the CSV reader, that uses the Hadoop Line Reader and possibly normalizes the line separators as it produces batches of data. Ideally, as much of this batching as possible should be generalized into a common class/trait that can be shared by both the CSV and JSON code.
  2. Check the JSON config, similar to the CSV config, to verify that all of the settings are things we can support. If a default setting turns out to be something we cannot really support, we will need to evaluate whether it is an incompatibility that we just document or something we have to fix or keep off by default. Note that I initially looked at these here, but this was from reading the code. All of these should be verified once we actually have Java APIs we can play with.
  3. Add an extensive set of tests to verify that the string reading works and that we fall back to the CPU in the various cases covered by the configs. This might be a little hard because Spark itself does not have a great set of tests.
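
To make steps 1 and 2 a bit more concrete, here is a rough, language-neutral sketch in Python. It is not the plugin code and not the cuDF API: `normalize_line_endings`, `batch_lines`, `fallback_reasons`, and the `ASSUMED_SUPPORTED` table are all hypothetical names made up for illustration. The option keys are Spark JSON reader options, but which values can really be supported is an assumption here and still has to be verified against the Java APIs.

```python
from typing import Dict, Iterable, Iterator, List

# Assumption for illustration only: option values treated as supported.
# The real list must be verified once the cuDF Java API is available.
ASSUMED_SUPPORTED = {
    "multiLine": {"false"},
    "allowComments": {"false"},
    "mode": {"PERMISSIVE"},
}


def normalize_line_endings(text: str) -> str:
    """Rewrite CRLF and bare CR separators as a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def batch_lines(lines: Iterable[str], max_bytes: int) -> Iterator[str]:
    """Group lines into LF-terminated buffers of at most max_bytes.

    A single line larger than max_bytes still gets a batch of its own.
    Nothing here is JSON specific, so CSV could share the same helper.
    """
    batch: List[str] = []
    size = 0
    for line in lines:
        record = line + "\n"
        record_size = len(record.encode("utf-8"))
        if batch and size + record_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(record)
        size += record_size
    if batch:
        yield "".join(batch)


def fallback_reasons(options: Dict[str, str]) -> List[str]:
    """Return reasons to fall back to the CPU; an empty list means none."""
    reasons = []
    for key, value in options.items():
        allowed = ASSUMED_SUPPORTED.get(key)
        if allowed is None:
            reasons.append(f"option {key} has not been verified")
        elif value not in allowed:
            reasons.append(f"{key}={value} is not supported")
    return reasons
```

The sketch treats any option that has not been explicitly verified as a reason to fall back, which matches the "off by default" stance above. It compares option names and values exactly, so a real check would also need to handle Spark's case-insensitive options.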

revans2 added the "? - Needs Triage" (Need team to review and classify) and "task" labels Nov 17, 2021
sameerz removed the "? - Needs Triage" label Nov 23, 2021
nartal1 self-assigned this Jan 7, 2022
nartal1 added the "cudf_dependency" label Jan 7, 2022

nartal1 (Collaborator) commented Jan 7, 2022

blocked by rapidsai/cudf#9990

wbo4958 (Collaborator) commented Jan 24, 2022

For now, the experimental JSON reader works; #4135 is merged. Closing this issue.

wbo4958 closed this as completed Jan 24, 2022

4 participants