Compressed resources #290
Comments
Perhaps @GregoryRhysEvans has a thought here? https://gist.github.com/GregoryRhysEvans/7c07337abbefd64ac06562d4e5a54243 |
I'm +1 on this. I think we can support I don't think this needs a lot of discussion to get a PR for v1: it does not impact any backwards compatibility, and compressed data is an extremely common use case. |
any comments on this from the @frictionlessdata/specs-working-group ? |
Looks good. Let's use |
@rufuspollock will you work on some wording for this? Maybe better in here until I finish on #337 |
Since the Call me crazy, but I would advocate for no
That is no doubt the easiest to implement. Otherwise, it would be helpful to know what the behavior should be if |
@ezwelty very good suggestions, and I like the parsimonious attitude. I agree that you can often infer these, and note that compression and format are not required -- in fact I do a lot of this in libraries I work on e.g.
In this sense you might want to think of these properties as the full descriptive structure for the files. In terms of behaviour:
|
Thanks @rufuspollock, sounds good. Then what about |
@ezwelty - yes, it would replace format if provided. However, many publishers, especially those "hand publishing", may be able to supply a format but not a media type. (Plus some formats like geojson don't have their own media type.) I'm not sure I understand about replacing |
@rufuspollock (By |
@ezwelty Yes, that makes sense to me. It's fairly common for file extensions to be incorrect or missing (like if an application URL returns CSV data without ending in |
@ezwelty yes and 👍 to what @jpmckinney said. |
Let's get the ball rolling on this once more... I'm in favour of adding the 'compression' property and having 'gz' and 'bz2' to begin with, since they are both in broad use and can be transparently/effortlessly read as .csv files using tools like Pandas and R. In the absence of the 'compression' property, I agree that inferring the compression from the file extension also makes sense, although I think the spec should state that all compressed files MUST have a file extension that allows the compression property to be correctly inferred as a supported compression type (e.g. data.csv.gz or data.csv.bz2) OR the 'compression' property MUST be included and specify a supported compression type. Doing this should get around the issue of 'compression ambiguity'. @rufuspollock What would be the next steps to move this forward? |
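A minimal sketch of the rule proposed in the comment above, assuming a resource descriptor is a plain dict with a "path" key and an optional "compression" key, and that only 'gz' and 'bz2' are supported to begin with. The helper names resolve_compression and open_resource are hypothetical and not part of any existing library or of the spec; this illustrates the explicit-property-then-extension fallback and is not a reference implementation.

```python
import bz2
import gzip
import os

# Assumption: only the two compression types suggested in the comment above.
SUPPORTED = {"gz": gzip.open, "bz2": bz2.open}


def resolve_compression(resource):
    """Return 'gz', 'bz2', or None for a resource descriptor (a dict).

    An explicit "compression" property wins; otherwise the type is
    inferred from the last file extension of "path".
    """
    explicit = resource.get("compression")
    if explicit is not None:
        if explicit not in SUPPORTED:
            raise ValueError(f"unsupported compression: {explicit!r}")
        return explicit
    # "data.csv.gz" -> ".gz" -> "gz"
    ext = os.path.splitext(resource["path"])[1].lstrip(".").lower()
    return ext if ext in SUPPORTED else None


def open_resource(resource, mode="rt", encoding="utf-8"):
    """Open a local resource as text, decompressing it when needed."""
    compression = resolve_compression(resource)
    opener = SUPPORTED.get(compression, open)  # plain open() if uncompressed
    return opener(resource["path"], mode, encoding=encoding)
```

With this fallback, a file named data.csv.gz needs no extra property, while a compressed file without a telling extension has to declare "compression" explicitly, which is the ambiguity the MUST wording is meant to rule out.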
I think it should be the way to go. At least on the implementation level, it's a common pattern (e.g. for |
@michaelamadi next step is a pull request to add this to the patterns page https://github.com/frictionlessdata/specs/blob/master/specs/patterns.md |
This would be a great feature, especially in the context of publishing data on a potentially paid service like datahub.io -- we have lots of data that zips down by a factor of 10-20x. With compressed resources we could host our entire current collection of US utility data within the 50GB account tier. |
@zaneselvans got you - and in the case of datahub.io we could probably provide a bigger pro bono account if you wanted to host data there in the meantime. |
@rufuspollock @roll I've added a pattern for this. Here's the PR: #629 |
Could this be closed now? |
Idea: describe resource compression. This would allow compression to be "ignored" in describing format and e.g. allow a tabular data package to include compressed CSV not just plain CSV.
Why? In much data management, especially with larger datasets, compression is important for economies of storage and transmission. At the moment, data file compression is not explicitly supported by the specification. This also means that profiles like tabular data package or geo data package require that resources are uncompressed.
Proposal
Introduce a compression property on resource.
Question: what compression formats do we support?
zip? It is very common; however, quite a bit of tooling, e.g. AWS Redshift, does not support it.
Research: