Add SQL Support for ADBC Drivers #53869
The conda packages should be available nowadays, so I think you can move it to the normal packages list
Not yet for Windows, so I think we can leave it to a follow-up to do this after apache/arrow-adbc#1149
They raise an error now, I think
Doesn't necessarily need to happen for this initial PR, but since we are generating the SQL query string below, it should be relatively straightforward to support selecting a subset of columns instead of selecting `*`?
Yeah, that's a good point. I can't remember why I didn't do this in the first place. Should be straightforward to add here or in a follow-up though, especially since ADBC should handle sanitizing.
Again not for this PR, but something to note for future improvements: I think it should be possible to support `chunksize`? Because we can get a RecordBatchReader from ADBC, and then read from that iterator and convert to pandas in chunks?
It's a good question. I'm not sure I see anything in the ADBC specification around batch/chunk handling. Might be overlooking the general approach to that. @lidavidm always knows best
Not necessarily in ADBC itself, but you should be able to get a RecordBatchReader instead of a materialized Table. And then pyarrow provides APIs to consume that reader chunk by chunk. It might not exactly support the user-specified `chunksize`, but it does give you a similar result: a generator of pandas DataFrames.
(A RecordBatchReader lets you call `read_next_batch()` one batch at a time; this returns a RecordBatch, which could be further split if necessary based on `chunksize`, and those chunks can then be converted to pandas.DataFrame. The main issue is that if the native batch size you get from the database is much larger than the specified `chunksize`, you get larger memory usage than expected.)
Oops, I missed this.
Drivers have various parameters to request a batch size but perhaps we should standardize on one.
apache/arrow-adbc#1024
with `from pandas._config import using_pyarrow_string_dtype`

For another PR, but we should probably factor out the above into a helper that gives you the mapping based on the `dtype_backend` keyword + options.
(I mentioned it before, but I don't fully understand how this PR is working, because if I try that locally with an ADBC dbapi connection, I get "TypeError: 'Cursor' object is not callable")
Ah, I think that's because you are basically only testing `to_sql`. All the read tests seem to rely on additional fixtures that have a connection + iris table set up, and those are not added for ADBC. With a quick hacky patch I can get the tests to fail with the "'Cursor' object is not callable" error:
Ah cool - nice find
`index` is now supported as well?