ENH: add gzip/bz2 compression to relevant read_* methods #15644

Closed
gfairchild opened this issue Mar 10, 2017 · 6 comments · Fixed by #17798
Labels
IO Data (IO issues that don't fit into a more specific label); IO JSON (read_json, to_json, json_normalize); IO SAS (SAS: read_sas); IO Stata (read_stata, to_stata)

gfairchild commented Mar 10, 2017

This issue is a branch off of #11666, which implemented compression support for read_pickle. There are still a few other read_* methods that could benefit from compression support. Looking at the I/O API reference, these jump out at me:

  • read_json - This can definitely benefit from compression. I've stored very large gzipped JSON files before. As a general rule, any read_* method that supports any kind of plaintext format should support compression.
  • read_stata - I don't use Stata, but it looks like a .dta file is not a plaintext file. Are .dta files naturally compressed, or can they be compressed significantly like pickles?
  • read_sas - I've also never used SAS, and like Stata's .dta files, it looks like .xpt and .sas7bdat files are both binary formats. Can they be compressed well?
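
To illustrate the point about plaintext formats, here is a minimal sketch using only the Python standard library, not pandas itself. The sample records are made up for illustration; real-world ratios depend entirely on the data. It compresses a JSON payload with gzip and bz2, compares sizes, and checks that the gzipped bytes round-trip back to the same records.

```python
import bz2
import gzip
import json

# Hypothetical sample data: repetitive records, similar in shape to a large JSON dump.
records = [{"id": i, "label": "example", "value": i % 10} for i in range(1000)]
raw = json.dumps(records).encode("utf-8")

# Compress the same plaintext payload with both codecs.
gz = gzip.compress(raw)
bz = bz2.compress(raw)

print(f"raw:  {len(raw)} bytes")
print(f"gzip: {len(gz)} bytes")
print(f"bz2:  {len(bz)} bytes")

# Round trip: decompress the gzipped bytes and parse the JSON again.
restored = json.loads(gzip.decompress(gz).decode("utf-8"))
```

Until read_json accepts a compression argument, a similar workaround should apply on the pandas side: open the file with gzip.open(path, "rt") and pass the resulting handle to read_json, which accepts file-like objects.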
@jreback added the IO Data, Difficulty Intermediate, IO JSON, IO SAS, and IO Stata labels Mar 10, 2017
@jreback added this to the Next Major Release milestone Mar 10, 2017
@jreback
Contributor

jreback commented Mar 10, 2017

most important is json

@bashtage
Contributor

Stata is not compressed; it is just a fairly plain binary file format. That said, I don't think there is much of a reason to add compression methods, since the output file wouldn't be usable in Stata (presumably the reason to output in this format) without manual decompression.

@jreback
Contributor

jreback commented Mar 10, 2017

IIRC read_sas has internal compression as well? or is it a different file extension?

@gfairchild
Author

In this case, it looks like read_json may be the only method that needs compression support added.

@jreback
Contributor

jreback commented Mar 21, 2017

@gfairchild want to take a stab at this? should be fairly straightforward as you can pretty much reuse the infrastructure (mainly just passing the compression arg thru). This is really just a couple of tests as well.
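
As a rough sketch of what "passing the compression arg thru" could look like, the snippet below threads a compression argument from a reader function down to the code that opens the file. Everything here is hypothetical and uses only the standard library: open_text, read_json_records, and the extension table are illustrative names, not pandas internals, and the "infer" default is an assumption about how such an argument might behave.

```python
import bz2
import gzip
import json
import os
import tempfile

# Hypothetical mapping from file extension to compression name.
_EXTENSIONS = {".gz": "gzip", ".bz2": "bz2"}


def open_text(path, compression="infer"):
    """Open path as text, decompressing on the fly when requested."""
    if compression == "infer":
        # Guess from the extension; unknown extensions mean no compression.
        compression = _EXTENSIONS.get(os.path.splitext(path)[1])
    if compression == "gzip":
        return gzip.open(path, "rt", encoding="utf-8")
    if compression == "bz2":
        return bz2.open(path, "rt", encoding="utf-8")
    if compression is None:
        return open(path, "r", encoding="utf-8")
    raise ValueError(f"Unrecognized compression type: {compression}")


def read_json_records(path, compression="infer"):
    """Hypothetical reader: it only forwards the compression argument."""
    with open_text(path, compression) as handle:
        return json.load(handle)


# Usage: write a small gzipped JSON file, then read it back via inference.
data = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(data, handle)
    loaded = read_json_records(path)
```

The reader itself stays unchanged apart from forwarding one argument, which is why the remaining work is mostly tests for each compression type.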

@gfairchild
Author

I'd be happy to. Just got to find the time. Maybe I can do it this weekend.
