
Compressed resources #290

Closed · rufuspollock opened this issue Sep 11, 2016 · 21 comments

@rufuspollock (Contributor)

Idea: describe resource compression. This would allow compression to be "ignored" when describing format and, for example, allow a tabular data package to include compressed CSV, not just plain CSV.

Why? In much data management, especially with larger datasets, compression is important for economy of storage and transmission. At the moment, data file compression is not explicitly supported by the specification. This also means that profiles like Tabular Data Package or Geo Data Package require that resources be uncompressed.

Proposal

Introduce a compression property on resources:

...
path: mydata.csv.gz
format: csv
compression: gz  # or: bz2 | lzo | zip
...
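
As a rough illustration of what this could enable, here is a minimal sketch in Python of how an implementation might honour the property when opening a resource. The descriptor dict and the open_resource helper are hypothetical, not part of any spec or library:

import bz2
import gzip

# Hypothetical descriptor following the proposal above.
descriptor = {
    "path": "mydata.csv.gz",
    "format": "csv",
    "compression": "gz",
}

# Map proposed compression values to stdlib openers.
OPENERS = {
    "gz": gzip.open,
    "bz2": bz2.open,
}

def open_resource(descriptor):
    """Open a resource, decompressing according to its descriptor."""
    opener = OPENERS.get(descriptor.get("compression"), open)
    return opener(descriptor["path"], mode="rt")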

Question: what compression formats do we support?

  • Do we support zip? It is very common; however, quite a bit of tooling (e.g. AWS Redshift) does not support it.
  • What about tar + gzip?

Research:

  • Standard Linux: zip, gzip, bzip2
  • Standard Mac install: zip, gzip
    • lzop available via brew
  • Python: stdlib: zip, gzip, bz2; external: lzop
  • Node: stdlib: gzip; external: zip, bz2, lzop
  • AWS Redshift supports: gzip, bz2, lzop
  • Google BigQuery indicates support for at least gzip
rufuspollock added this to the Current milestone on Sep 27, 2016
@pwalsh (Member) commented Nov 10, 2016

@rgrp

I'm +1 on this. I think we can support gzip and bz2, and then only consider other formats if the community of implementors requests them.

I don't think this needs a lot of discussion to get a PR for v1: it does not affect backwards compatibility, and compressed data is an extremely common use case.

@pwalsh (Member) commented Nov 10, 2016

Any comments on this from the @frictionlessdata/specs-working-group?

@jpmckinney

Looks good. Let's use gzip, not gz, since the media type is application/gzip; similarly bzip2, not bz2.

@pwalsh (Member) commented Dec 21, 2016

@rufuspollock will you work on some wording for this? Maybe better to do it in here until I finish #337.

@rufuspollock (Contributor, Author)

@pwalsh yes - and this should come after you complete #337. This is definitely a small, discrete issue.

@ezwelty (Contributor) commented Jul 14, 2017

Since the format property is optional and thus needs to be parsed from the path (or mediatype) when missing anyway, implementations will also look to path to guess the compression. (What happens if they don't match?)

Call me crazy, but I would advocate for no compression property (nor format or mediatype, frankly), and instead enforce standard file extensions to tell the story:

data.csv
data.json
data.csv.gz
data.csv.bz2

That is no doubt the easiest to implement.

Otherwise, it would be helpful to know what the behavior should be if compression and path aren't consistent (and likewise, if format, mediatype, and path aren't consistent).
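
A minimal sketch of the extension-driven inference described above; the suffix tables and helper name are illustrative, not normative:

from pathlib import Path

# Illustrative suffix tables; a spec would enumerate these normatively.
COMPRESSION_SUFFIXES = {".gz": "gz", ".bz2": "bz2", ".zip": "zip"}
FORMAT_SUFFIXES = {".csv": "csv", ".json": "json"}

def infer_from_path(path):
    """Infer (format, compression) from standard file extensions."""
    suffixes = Path(path).suffixes  # e.g. ['.csv', '.gz']
    compression = COMPRESSION_SUFFIXES.get(suffixes[-1]) if suffixes else None
    if compression:
        suffixes = suffixes[:-1]  # drop the compression extension
    fmt = FORMAT_SUFFIXES.get(suffixes[-1]) if suffixes else None
    return fmt, compression

infer_from_path("data.csv.gz")  # -> ('csv', 'gz')
infer_from_path("data.json")    # -> ('json', None)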

@rufuspollock (Contributor, Author)

@ezwelty very good suggestions and I like the parsimonious attitude. I agree that you can often infer these, and note that compression and format are not required -- in fact I do a lot of this in libraries I work on, e.g.

parsePath('data.csv.zip') => {
  name: 'data',
  format: 'csv',
  compression: 'zip'
}

In this sense you might want to think of these properties as the full descriptive structure for the files.

In terms of behaviour (a small sketch follows the list):

  • if format is present: it is used (whatever path says)
  • if compression is present: it is used (whatever path says)
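
A sketch of that precedence rule, reusing the hypothetical infer_from_path helper from the earlier sketch (explicit properties win; path inference is the fallback):

def resolve(descriptor):
    """Explicit format/compression win; otherwise infer from path."""
    inferred_format, inferred_compression = infer_from_path(descriptor["path"])
    return {
        "format": descriptor.get("format", inferred_format),
        "compression": descriptor.get("compression", inferred_compression),
    }

# path says zip, but the explicit property wins:
resolve({"path": "data.csv.zip", "compression": "gz"})
# -> {'format': 'csv', 'compression': 'gz'}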

@ezwelty (Contributor) commented Jul 14, 2017

Thanks @rufuspollock, sounds good. Then what about mediatype? I suppose it should supersede both format and path since it is more rigorous?

@rufuspollock (Contributor, Author)

@ezwelty - yes, it would replace format if provided. However, many publishers, especially those "hand publishing", may be able to supply a format but not a media type. (Plus some formats like geojson don't have their own mediatype.)

I'm not sure I understand about replacing path - you need path to locate the file.

@ezwelty (Contributor) commented Jul 17, 2017

@rufuspollock (By path, I mean the format parsed from the path). OK, got it. So by the rules you laid out, e.g., path = "data.csv" with mediatype = "application/geo+json" (see the full IANA list) should be read as a GeoJSON file.

@jpmckinney

@ezwelty Yes, that makes sense to me. It's fairly common for file extensions to be incorrect or missing (like if an application URL returns CSV data without ending in .csv), and a file's extension may not be in the control of the person authoring the data package.

@rufuspollock (Contributor, Author)

@ezwelty yes and 👍 to what @jpmckinney said.

@michaelamadi (Contributor)

Let's get the ball rolling on this once more...

I'm in favour of adding the 'compression' property and having 'gz' and 'bz2' to begin with, since they are both in broad use and compressed CSV files can be read transparently and effortlessly using tools like pandas and R.
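
For instance, pandas already infers compression from the file extension by default (its read_csv compression parameter defaults to "infer"), so a gzipped CSV reads exactly like a plain one (file name illustrative):

import pandas as pd

# compression="infer" is the default, keyed off the .gz extension.
df = pd.read_csv("data.csv.gz")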

In the absence of the 'compression' property, I agree that inferring the compression from the file extension also makes sense, although I think the spec should state that all compressed files MUST have a file extension that allows the compression property to be correctly inferred as a supported compression type (e.g. data.csv.gz or data.csv.bz2) OR the 'compression' property MUST be included and specify a supported compression type. Doing this should get around the issue of 'compression ambiguity'.
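
A sketch of how a validator might enforce that rule, reusing the hypothetical infer_from_path helper from above (the supported set is illustrative):

SUPPORTED_COMPRESSIONS = {"gz", "bz2"}  # illustrative starting set

def check_compression(descriptor):
    """Compression must be declared, or inferable from the extension."""
    declared = descriptor.get("compression")
    _, inferred = infer_from_path(descriptor["path"])
    compression = declared or inferred
    if compression is not None and compression not in SUPPORTED_COMPRESSIONS:
        raise ValueError("unsupported compression: %s" % compression)
    return compression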

@rufuspollock What would be the next steps to move this forward?

@roll (Member) commented Jun 25, 2018

@michaelamadi

"I think the spec should state that all compressed files MUST have a file extension that allows the compression property to be correctly inferred as a supported compression type (e.g. data.csv.gz or data.csv.bz2) OR the 'compression' property MUST be included and specify a supported compression type."

I think this is the way to go. At least on the implementation level, it's a common pattern (e.g. for format etc.). This approach is already implemented in the Python libs - https://github.com/frictionlessdata/tabulator-py#compression (though resource.compression is not yet exposed in datapackage-py, which is waiting on standardization).
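
A usage sketch against tabulator-py, based on the linked README; treat the exact behaviour as an assumption drawn from that document rather than a verified API description:

from tabulator import Stream

# Compression is detected from the .gz extension; per the linked
# README it can also be passed explicitly, e.g. compression='gzip'.
with Stream("data.csv.gz", headers=1) as stream:
    for row in stream.iter():
        print(row)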

@rufuspollock (Contributor, Author)

@michaelamadi next step is a pull request to add this to the patterns page https://github.com/frictionlessdata/specs/blob/master/specs/patterns.md

@zaneselvans

This would be a great feature, especially in the context of publishing data on a potentially paid service like datahub.io -- we have lots of data that compresses by a factor of 10-20. With compressed resources we could host our entire current collection of US utility data within the 50GB account tier.

@rufuspollock (Contributor, Author)

@zaneselvans got you - and in the case of datahub.io we could probably provide a bigger pro bono account if you wanted to host data there in the meantime.

@michaelamadi (Contributor)

@rufuspollock @roll I've added a pattern for this. Here's the PR: #629

@mitar commented Dec 13, 2019

Could this be closed now?
