Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read CSV format text from stdin or memory #54

Merged
merged 1 commit into from
Apr 26, 2021
Merged

Conversation

heymind
Copy link
Contributor

@heymind heymind commented Apr 25, 2021

Migrate from apache/arrow#10066.

Which issue does this PR close?

Close Jira issue ARROW-12306. Closes #198

Rationale for this change

Let CsvFile and CsvExec support reading from a reader (not only files)

What changes are included in this PR?

This pr adds the following new pub functions:

  • CsvFile::try_new_from_reader
  • CsvFile::try_new_from_reader_infer_schema
  • CsvExec::try_new_from_reader
  • CsvStream::try_new_from_reader

Are there any user-facing changes?

No.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is good -- the only thing that is wonky to me is that if you Clone one of these readers it won't work. However, in this case the error will be clear so I think this PR can be merged and if anyone cares about the Clone behavior we can fix it in a follow on PR.

Thanks again @heymind

What do you think @Dandandan / @andygrove ?

@codecov-commenter
Copy link

Codecov Report

Merging #54 (10f8c3b) into master (9ba214a) will decrease coverage by 0.04%.
The diff coverage is 73.29%.

❗ Current head 10f8c3b differs from pull request most recent head c938270. Consider uploading reports for the commit c938270 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
- Coverage   76.24%   76.20%   -0.05%     
==========================================
  Files         134      134              
  Lines       23051    23199     +148     
==========================================
+ Hits        17576    17679     +103     
- Misses       5475     5520      +45     
Impacted Files Coverage Δ
datafusion/src/datasource/csv.rs 73.33% <69.86%> (-9.28%) ⬇️
datafusion/src/physical_plan/csv.rs 79.09% <75.72%> (-4.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ba214a...c938270. Read the comment docs.

Path(String),

/// Read CSV data from a reader
Reader(Mutex<Option<Box<dyn Read + Send + Sync + 'static>>>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this look if we would support multiple readers/partitions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I am wondering, if we do it like this, we would we need to do the same for json/xml/etc sources that should support essentially the same to avoid reimplementing it for each format.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we file a follow on ticket for this work (supporting Reader input for JSON / XML sources). As you say I don't think it needs to be part of this PR.

@Dandandan
Copy link
Contributor

Looks good to me, thanks @heymind .
Seems like a cool addition to integrate with much more sources. I have some small questions about partitioning and re-use across formats, but that shouldn't be a blocker.

@alamb alamb merged commit 3855473 into apache:master Apr 26, 2021
@houqp houqp added api change Changes the API exposed to users of the crate ballista datafusion Changes in the datafusion crate enhancement New feature or request labels Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read CSV format text from stdin or memory
5 participants