
Feature request: to_pandas()/to_arrow() #443

Open
Hoeze opened this issue Aug 25, 2020 · 4 comments

Hoeze commented Aug 25, 2020

Hi, is there an easy+performant way to convert the result of a query into a Pandas or PyArrow dataframe while keeping the column names and data types?
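
For context, the closest thing available today seems to be fetching rows through the DB-API cursor and rebuilding the frame by hand; a minimal sketch, assuming mysqlclient/MySQLdb with placeholder connection details:

```python
# Sketch of the manual route: column names can be recovered from
# cursor.description, but the MySQL type information is effectively lost,
# since pandas infers dtypes from the returned Python objects.
import MySQLdb
import pandas as pd

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="test")
cur = conn.cursor()
cur.execute("SELECT id, name, created_at FROM example_table")

columns = [desc[0] for desc in cur.description]
df = pd.DataFrame(cur.fetchall(), columns=columns)

cur.close()
conn.close()
```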

methane (Member) commented Aug 25, 2020

I don't use pandas or arrow.
If this is a feature request, please write a pull request, or at least be specific.
If you are just asking a question, please ask elsewhere. The issue tracker is not a forum, and OSS maintainers are not free tech support.

Hoeze (Author) commented Aug 25, 2020

You are saying that this feature does not exist, so please treat this as a feature request.
The goal is to retrieve query results as Pandas dataframes.
Usually this is done with pd.read_sql(), but that method does not know the column data types and is very slow.
The SQL library knows those data types and could convert directly between the underlying C library and e.g. Arrow tables.
Turbodbc, for example, takes this approach.
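
For comparison, a rough sketch of both routes; connection details are placeholders, and the Turbodbc part uses its Arrow fetch API (fetchallarrow()) as I understand it:

```python
# Route 1: pd.read_sql over a plain DB-API connection (mysqlclient).
# Column names survive, but dtypes are inferred from Python objects,
# so DECIMAL/DATETIME columns often come back as object dtype.
import MySQLdb
import pandas as pd

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="test")
df = pd.read_sql("SELECT * FROM example_table", conn)

# Route 2: the Turbodbc-style approach, where the driver returns an Arrow
# table directly and the types come from the database, not from guessing.
import turbodbc

tconn = turbodbc.connect(dsn="my_mysql_dsn")
cursor = tconn.cursor()
cursor.execute("SELECT * FROM example_table")
arrow_table = cursor.fetchallarrow()
df2 = arrow_table.to_pandas()
```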

On my side, I lack the knowledge of this library and the time to implement it myself.
I'm surprised that you do not use Pandas, but I understand that you therefore have no interest in supporting it.

Maybe someone else is interested in Pandas support and knows how to implement it.


hpca01 commented Nov 18, 2020

@Hoeze

Pandas has a read_sql function with a chunksize option. As far as the implementation is concerned, it piggybacks on the DB connection to run the query, so the column names and data types should be available to Pandas. Some dtypes do differ between the Pandas and MySQL implementations (usually dates/floats), which is where you tell Pandas which columns to convert by listing them in parse_dates. There is also the coerce_float option for mapping MySQL's decimal/float values to Pandas floats.

I believe it is fast already, but if you'd like you can process the results in chunks via chunksize. Other factors are involved too, such as the network connection, or the query itself.
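
For what it's worth, a minimal sketch of those options together (table, column names, and connection details are placeholders):

```python
import MySQLdb
import pandas as pd

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="test")

# chunksize streams the result in pieces, parse_dates forces datetime64
# for the listed columns, and coerce_float maps DECIMAL values to float64.
chunks = pd.read_sql(
    "SELECT id, amount, created_at FROM example_table",
    conn,
    parse_dates=["created_at"],
    coerce_float=True,
    chunksize=50_000,
)
df = pd.concat(chunks, ignore_index=True)
```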

Another approach I have used before is to store the results as an in-memory file and read that in.

Have you run an ODBC trace and looked at server throughput on the MySQL side?

Hoeze (Author) commented Jun 7, 2021

Hi @hpca01, thanks for your answer and sorry for my late reply.

The reasons why I was asking for PyArrow support are the following:

  1. Fixed column schema. Pandas read_sql has no idea about column types unless I tell it the type.
    (At least, specifying the dtype is possible now: Improve type handling in read_sql and read_sql_table pandas-dev/pandas#13049)
  2. Zero-copy and interoperability. With Apache Arrow, the whole deserialization inside Pandas becomes obsolete.
    Also, I can reuse the same memory chunk in other languages, e.g. PySpark.
    See also ENH: Pluggable SQL performance pandas-dev/pandas#36893
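
To make the request a bit more concrete, here is a rough, hypothetical sketch of what a cursor-level Arrow conversion could look like if done in user code on top of the current DB-API interface; mysqlclient has no such API today, and the hand-written schema stands in for what a native to_arrow() could derive from the server's column metadata:

```python
import MySQLdb
import pyarrow as pa

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="test")
cur = conn.cursor()
cur.execute("SELECT id, amount, created_at FROM example_table")

# Hypothetical: written by hand here, but a native implementation would
# build this schema from the MySQL column types instead.
schema = pa.schema([
    ("id", pa.int64()),
    ("amount", pa.float64()),        # assumes the column is DOUBLE/FLOAT
    ("created_at", pa.timestamp("us")),
])

rows = cur.fetchall()
columns = list(zip(*rows)) if rows else [[] for _ in schema]
table = pa.Table.from_arrays(
    [pa.array(col, type=field.type) for col, field in zip(columns, schema)],
    schema=schema,
)
df = table.to_pandas()  # close to zero-copy for the numeric columns
```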
