
Compressed resources #290

Closed · rufuspollock opened this issue Sep 11, 2016 · 21 comments

@rufuspollock (Contributor)

Idea: describe resource compression. This would allow compression to be "ignored" when describing format and, for example, allow a tabular data package to include compressed CSV, not just plain CSV.

Why? In much data management, especially with larger datasets, compression is important for economy of storage and transmission. At the moment, data file compression is not explicitly supported by the specification. This also means that profiles like Tabular Data Package or Geo Data Package require that resources be uncompressed.

Proposal

Introduce a compression property on resources:

...
path: mydata.csv.gz
format: csv
compression: gz  # or: bz2 | lzo | zip
...
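
As a rough illustration of what this could enable, here is a minimal sketch in Python of how an implementation might honour the property when opening a resource. The descriptor dict and the open_resource helper are hypothetical, not part of any spec or library:

import bz2
import gzip

# Hypothetical descriptor following the proposal above.
descriptor = {
    "path": "mydata.csv.gz",
    "format": "csv",
    "compression": "gz",
}

# Map proposed compression values to stdlib openers.
OPENERS = {
    "gz": gzip.open,
    "bz2": bz2.open,
}

def open_resource(descriptor):
    """Open a resource, decompressing according to its descriptor."""
    opener = OPENERS.get(descriptor.get("compression"), open)
    return opener(descriptor["path"], mode="rt")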

Question: what compression formats do we support?

  • Do we support zip? It is very common; however, quite a bit of tooling (e.g. AWS Redshift) does not support it.
  • What about tar + gzip?

Research:

  • Standard Linux: zip, gzip, bzip2
  • Standard Mac install: zip, gzip
    • lzop available via brew
  • Python: stdlib: zip, gzip, bz2; external: lzop
  • Node: stdlib: gzip; external: zip, bz2, lzop
  • AWS Redshift supports: gzip, bz2, lzop
  • Google BigQuery indicates support for at least gzip
rufuspollock added this to the Current milestone on Sep 27, 2016
@pwalsh (Member) commented Nov 10, 2016

@rgrp

I'm +1 on this. I think we can support gzip and bz2, and then only consider other formats if the community of implementors requests them.

I don't think this needs a lot of discussion to get a PR for v1: it does not affect backwards compatibility, and compressed data is an extremely common use case.

@pwalsh (Member) commented Nov 10, 2016

Any comments on this from the @frictionlessdata/specs-working-group?

@jpmckinney

Looks good. Let's use gzip, not gz, since the media type is application/gzip; similarly bzip2, not bz2.

@pwalsh (Member) commented Dec 21, 2016

@rufuspollock will you work on some wording for this? Maybe better to do it in here until I finish #337.

@rufuspollock (Contributor, Author)

@pwalsh yes - and this should come after you complete #337. This is definitely a small, discrete issue.

@ezwelty (Contributor) commented Jul 14, 2017

Since the format property is optional and thus needs to be parsed from the path (or mediatype) when missing anyway, implementations will also look to path to guess the compression. (What happens if they don't match?)

Call me crazy, but I would advocate for no compression property (nor format or mediatype, frankly), and instead enforce standard file extensions to tell the story:

data.csv
data.json
data.csv.gz
data.csv.bz2

That is no doubt the easiest to implement.

Otherwise, it would be helpful to know what the behavior should be if compression and path aren't consistent (and likewise, if format, mediatype, and path aren't consistent).
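
A minimal sketch of the extension-driven inference described above; the suffix tables and helper name are illustrative, not normative:

from pathlib import Path

# Illustrative suffix tables; a spec would enumerate these normatively.
COMPRESSION_SUFFIXES = {".gz": "gz", ".bz2": "bz2", ".zip": "zip"}
FORMAT_SUFFIXES = {".csv": "csv", ".json": "json"}

def infer_from_path(path):
    """Infer (format, compression) from standard file extensions."""
    suffixes = Path(path).suffixes  # e.g. ['.csv', '.gz']
    compression = COMPRESSION_SUFFIXES.get(suffixes[-1]) if suffixes else None
    if compression:
        suffixes = suffixes[:-1]  # drop the compression extension
    fmt = FORMAT_SUFFIXES.get(suffixes[-1]) if suffixes else None
    return fmt, compression

infer_from_path("data.csv.gz")  # -> ('csv', 'gz')
infer_from_path("data.json")    # -> ('json', None)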

@rufuspollock (Contributor, Author)

@ezwelty very good suggestions and I like the parsimonious attitude. I agree that you can often infer these, and note that compression and format are not required -- in fact I do a lot of this in libraries I work on, e.g.

parsePath('data.csv.zip') => {
  name: 'data',
  format: 'csv',
  compression: 'zip'
}

In this sense you might want to think of these properties as the full descriptive structure for the files.

In terms of behaviour (a small sketch follows the list):

  • if format is present: it is used (whatever path says)
  • if compression is present: it is used (whatever path says)
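
A sketch of that precedence rule, reusing the hypothetical infer_from_path helper from the earlier sketch (explicit properties win; path inference is the fallback):

def resolve(descriptor):
    """Explicit format/compression win; otherwise infer from path."""
    inferred_format, inferred_compression = infer_from_path(descriptor["path"])
    return {
        "format": descriptor.get("format", inferred_format),
        "compression": descriptor.get("compression", inferred_compression),
    }

# path says zip, but the explicit property wins:
resolve({"path": "data.csv.zip", "compression": "gz"})
# -> {'format': 'csv', 'compression': 'gz'}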

@ezwelty (Contributor) commented Jul 14, 2017

Thanks @rufuspollock, sounds good. Then what about mediatype? I suppose it should supersede both format and path since it is more rigorous?

@rufuspollock (Contributor, Author)

@ezwelty - yes, it would replace format if provided. However, many publishers, especially those "hand publishing", may be able to supply a format but not a media type. (Plus some formats like geojson don't have their own mediatype.)

I'm not sure I understand about replacing path - you need path to locate the file.

@ezwelty (Contributor) commented Jul 17, 2017

@rufuspollock (By path, I mean the format parsed from the path). OK, got it. So by the rules you laid out, e.g., path = "data.csv" with mediatype = "application/geo+json" (see the full IANA list) should be read as a GeoJSON file.

@jpmckinney

@ezwelty Yes, that makes sense to me. It's fairly common for file extensions to be incorrect or missing (like if an application URL returns CSV data without ending in .csv), and a file's extension may not be in the control of the person authoring the data package.

@rufuspollock (Contributor, Author)

@ezwelty yes and 👍 to what @jpmckinney said.

@michaelamadi (Contributor)

Let's get the ball rolling on this once more...

I'm in favour of adding the 'compression' property and having 'gz' and 'bz2' to begin with, since they are both in broad use and compressed CSV files can be read transparently and effortlessly using tools like pandas and R.
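
For instance, pandas already infers compression from the file extension by default (its read_csv compression parameter defaults to "infer"), so a gzipped CSV reads exactly like a plain one (file name illustrative):

import pandas as pd

# compression="infer" is the default, keyed off the .gz extension.
df = pd.read_csv("data.csv.gz")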

In the absence of the 'compression' property, I agree that inferring the compression from the file extension also makes sense, although I think the spec should state that all compressed files MUST have a file extension that allows the compression property to be correctly inferred as a supported compression type (e.g. data.csv.gz or data.csv.bz2) OR the 'compression' property MUST be included and specify a supported compression type. Doing this should get around the issue of 'compression ambiguity'.
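
A sketch of how a validator might enforce that rule, reusing the hypothetical infer_from_path helper from above (the supported set is illustrative):

SUPPORTED_COMPRESSIONS = {"gz", "bz2"}  # illustrative starting set

def check_compression(descriptor):
    """Compression must be declared, or inferable from the extension."""
    declared = descriptor.get("compression")
    _, inferred = infer_from_path(descriptor["path"])
    compression = declared or inferred
    if compression is not None and compression not in SUPPORTED_COMPRESSIONS:
        raise ValueError("unsupported compression: %s" % compression)
    return compression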

@rufuspollock What would be the next steps to move this forward?

@roll (Member) commented Jun 25, 2018

@michaelamadi

"I think the spec should state that all compressed files MUST have a file extension that allows the compression property to be correctly inferred as a supported compression type (e.g. data.csv.gz or data.csv.bz2) OR the 'compression' property MUST be included and specify a supported compression type."

I think this is the way to go. At least on the implementation level, it's a common pattern (e.g. for format etc.). This approach is already implemented in the Python libs - https://github.com/frictionlessdata/tabulator-py#compression (though resource.compression is not yet exposed in datapackage-py, which is waiting on standardization).
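
A usage sketch against tabulator-py, based on the linked README; treat the exact behaviour as an assumption drawn from that document rather than a verified API description:

from tabulator import Stream

# Compression is detected from the .gz extension; per the linked
# README it can also be passed explicitly, e.g. compression='gzip'.
with Stream("data.csv.gz", headers=1) as stream:
    for row in stream.iter():
        print(row)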

@rufuspollock (Contributor, Author)

@michaelamadi next step is a pull request to add this to the patterns page https://github.com/frictionlessdata/specs/blob/master/specs/patterns.md

@zaneselvans

This would be a great feature, especially in the context of publishing data on a potentially paid service like datahub.io -- we have lots of data that compresses by a factor of 10-20. With compressed resources we could host our entire current collection of US utility data within the 50GB account tier.

@rufuspollock (Contributor, Author)

@zaneselvans got you - and in the case of datahub.io we could probably provide a bigger pro bono account if you wanted to host data there in the meantime.

@michaelamadi (Contributor)

@rufuspollock @roll I've added a pattern for this. Here's the PR: #629

@mitar commented Dec 13, 2019

Could this be closed now?
