
feat: data source from JSON array data #2306

Closed
wants to merge 12 commits

Conversation

@Lilit0x (Contributor) commented Dec 26, 2023

Closes: #2218

How to Test

SELECT * FROM read_arjson('path/to/file/or/urls', max_size, <creds>)

@tychoish (Contributor) left a comment

I'm super excited about this!

I wonder if implementing the object store's file type ends up having you write more boilerplate than is needed.

I had great luck recently using the StreamingTable infrastructure in the bson functionality, though I don't know if it makes a lot of sense here (also, your code supports gzipped inputs and the BSON code does not).

@universalmind303 self-requested a review on December 27, 2023
@Lilit0x (Contributor, Author) commented Dec 27, 2023

I wonder if implementing the object store's file type ends up having you write more boilerplate than is needed.

You mean the FileFormat? Yes, it did. Especially in the create_physical_plan function, because of the execution. I tried to look for a way around that, but I couldn't, since the default JsonFormat is newline-delimited, which lends more credibility to @universalmind303's suggestion of making the change upstream. For now, though, I think it is okay like this, albeit I'll still have to write more for create_writer_physical_plan.
The other option I had was to create a different type similar to FileType and then implement all the traits that datafusion's FileType did, but that would have been far longer and probably unnecessary.
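For context, the distinction driving this discussion is framing: DataFusion's default JsonFormat expects newline-delimited JSON (one object per line), while this PR's read_arjson targets a single top-level JSON array. A minimal sketch of the two framings (illustrative only, in Python rather than the project's Rust; the data is hypothetical):

```python
import io
import json

# Newline-delimited JSON (NDJSON): each line is an independent JSON object,
# so the file can be split and parsed line by line.
ndjson = '{"a": 1}\n{"a": 2}\n'
nd_rows = [json.loads(line) for line in ndjson.splitlines()]

# JSON array data: the whole document is one JSON value, so line-based
# splitting breaks mid-value and the array must be decoded as a unit.
array_json = '[\n  {"a": 1},\n  {"a": 2}\n]'
arr_rows = json.load(io.StringIO(array_json))

# Both framings carry the same logical rows.
assert nd_rows == arr_rows == [{"a": 1}, {"a": 2}]
```

This is why the newline-delimited assumption baked into the default format can't simply be reused for array input: the record boundaries are no longer line boundaries.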

I had great luck recently using the StreamingTable infrastructure in the bson functionality, though I don't know if it makes a lot of sense here (also, your code supports gzipped inputs and the BSON code does not).

Hmm, I looked through it, but I can't see where or how it would fit this use case. Maybe I am missing something.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ Lilit0x
❌ Ubuntu


Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@tychoish (Contributor) commented

I think given that df419aa has landed, we should let this one drop.

I think there's a lot of merit to this approach, and I'm super thankful that you did this work for us to learn from, and sorry that we didn't use the code directly.

@tychoish closed this Jan 30, 2024
Successfully merging this pull request may close: data source from JSON array data.
4 participants