
Support saving/loading for zip file #93

Closed
roll opened this issue Sep 12, 2017 · 15 comments · Fixed by #97
@roll
Member

roll commented Sep 12, 2017

Overview

It's part of specs implementation requirements:

  • zip a Data Package descriptor and its co-located references (more generically: "zip a data package")
  • read a zip file that "claims" to be a data package
  • save a zipped Data Package to disk

http://specs.frictionlessdata.io/implementation/
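The three requirements above can be sketched with only the Python standard library. This is purely illustrative: the function names `save_zip`/`load_zip` and their signatures are hypothetical, not the actual datapackage-py API.

```python
import json
import zipfile

def save_zip(descriptor, resources, path):
    """Zip a Data Package descriptor and its co-located resource files."""
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("datapackage.json", json.dumps(descriptor))
        for name, content in resources.items():
            zf.writestr(name, content)

def load_zip(path):
    """Read a zip file that "claims" to be a data package."""
    with zipfile.ZipFile(path) as zf:
        # A data package zip must contain a descriptor at its root.
        if "datapackage.json" not in zf.namelist():
            raise ValueError("not a data package: missing datapackage.json")
        return json.loads(zf.read("datapackage.json"))
```

The key design point is that the descriptor and its co-located resources travel together in one archive, so a round trip (save, then load) should return the same descriptor.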

@roll roll changed the title Support saving to a zip file Support saving/loading from a zip file Sep 12, 2017
@roll roll changed the title Support saving/loading from a zip file Support saving/loading for zip files Sep 12, 2017
@roll roll changed the title Support saving/loading for zip files Support saving/loading for zip file Sep 12, 2017
@rufuspollock
Contributor

My 2c: this sort of stuff should not go into this core lib - we'll get really bloated. Instead we should have helper libraries to do the writing / serializing of data packages to specific forms.

Also I didn't realize this was part of the implementation guide. We don't yet have an agreed spec for compressing or bundling - frictionlessdata/datapackage#132 + frictionlessdata/datapackage#290. I think it best to get those completed before people start implementing.

/cc @pwalsh as czar of implementations

@rufuspollock
Contributor

@roll what happened re my comment on this? /cc @vitorbaptista

@roll
Member Author

roll commented Dec 29, 2017

@rufuspollock
It's a part of the specs implementation requirements - https://frictionlessdata.io/specs/implementation/. At the moment, datapackage implementations in 8 languages support saving/loading a data package to zip. So this implementation just follows the general approach.

Also it's a key feature for the project we are now working on with our partners.

I would say bloating is a problem only for client-side JavaScript. Server-side libraries aren't really sensitive to things like this. So if datapackage-py can load/save zip, there is no reason to remove this capability. It's good for users and doesn't really affect installation speed/size, etc.

And for the client-side JavaScript lib we could just do bundle-size optimization/splitting. I've done a lot of profiling, and for now I would say the most problematic thing size-wise is shipping profiles, not deps like this.

@rufuspollock
Contributor

@roll my point was that there was no response to the comment left on 13 Sep until you shipped the feature now. In my original comment I pointed out that I did not understand why this was part of the implementation specs when it is not part of the spec itself.

But the real point is about how we factor this library - something i've mentioned a lot. /cc @vitorbaptista

@roll
Member Author

roll commented Dec 30, 2017

@rufuspollock

In my original comment I pointed out that I did not understand why this was part of the implementation specs when it is not part of the spec itself.

That's a question for @pwalsh. As mentioned, it's a part of the implementation specs. It was also a part of every tool fund implementation contract. I could agree with having it (and I do) or disagree, but I work at the implementations level. High-level requirements like this are decided at the specs level by you/@pwalsh.

@pwalsh
Member

pwalsh commented Dec 30, 2017

The implementation document was designed to make it easier for new implementors (people not working on the Python or JavaScript libraries) - in particular, those implementing for the tool fund.

I personally would not add zip support to the JavaScript library unless we already have a good story in place for removing certain code paths in a browser distribution of the lib (I can't imagine reading or writing zips in the browser is very useful). On the other hand, I don't think we should keep talking about bloat without looking at the facts (distribution size of the lib, etc.).

@Stephen-Gates

I’ll be exploring the zip feature for use in an electron app.

@pwalsh
Member

pwalsh commented Dec 30, 2017

I’ll be exploring the zip feature for use in an electron app.

Great! Will this be handled via a Node.js process, or the client-side Javascript?

@roll
Member Author

roll commented Dec 30, 2017

@Stephen-Gates
I'll be finishing this work around Jan 7 (when I'm back from the break), and I'd also like to sync with you/your team on some questions, like the one mentioned by Paul.

@rufuspollock
Contributor

@roll before you complete this, could we consider whether stuff like this should go out of the package as a separate mini-library? /cc @pwalsh @vitorbaptista

See the data library analysis and recommendation document here:

https://hackmd.io/CwZgnOCMDs0LQEMAckBMdSsnMAjaIcADKgKYBmArGOSACbRJhA==?view

@pwalsh the bloat I'm talking about is less about package zip and more about conceptual complexity and library structure.

@vitorbaptista

@rufuspollock If we think just at the library API level, loading/saving a datapackage ZIP adds:

  • Loading a datapackage accepts a ZIP file
  • A new datapackage.to_zip(path) method

Ignore the method names, as they might be different. The risk I see is if we add support for other packaging methods in the future (tar.bz2, lzma, ...), which would require adding new methods (or renaming the ones we have). Other than that, I don't see how this adds much conceptual complexity to the libraries' interfaces (it does add some to their implementation).

On a technical level, it would be great if we could avoid having a datapackage_with_zip_support.js and datapackage_without_zip_support.js. Instead, we could dynamically detect if a ZIP library is available. Something like:

function to_zip(path) {
    // Fail fast with a clear message when the optional ZIP library is missing.
    if (!has_zip_library()) {
        throw new Error('ZIP support needs library X');
    }
    // ...
}

Considering our experience with datapackage-py, ZIP support adds just a few lines of code, so the file size increase is negligible.
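The dynamic-detection pattern above can also be sketched in Python, since datapackage-py is the reference point here. This is a hedged sketch, not actual datapackage-py code: `to_archive` and the idea of a pluggable `backend` module are hypothetical.

```python
import importlib
import importlib.util

def to_archive(path, backend="zipfile"):
    """Save to an archive, failing fast if the optional backend is absent.

    `backend` names a module to import at call time; this avoids shipping
    separate with/without builds of the library.
    """
    if importlib.util.find_spec(backend) is None:
        raise RuntimeError(f"archive support needs library {backend!r}")
    module = importlib.import_module(backend)
    # Real code would go on to write the archive at `path` using `module`.
    return module
```

The same check in JavaScript would typically be a `try { require('...') } catch` around the optional dependency; either way, the library's public API stays the same whether or not the extra dependency is installed.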

@Stephen-Gates

@roll we have our kick off for Data Curator v2 on Wed 10 Jan. Perhaps we can catch up after that?

@rufuspollock
Contributor

@vitorbaptista the issue is not the size of the code bundle but the overall approach and conceptual complexity. All software is made up of features that are small in themselves but which in aggregate are complex. The point here is not about zip per se but about the general principle of where "loaders" and "dumpers" are located - as per our discussion back in November.

@vitorbaptista

@rufuspollock I understand, I just don't think this adds much complexity to the library interface. I mean the API of the package won't change much, as conceptually reading from a ZIP is basically the same as reading from the local filesystem or over HTTP. It's just a different protocol, but the intent (reading a file) is the same.

It does add some complexity to the implementation of the library, but that's OK IMHO. Also, implementing this before it's part of the spec is a risk. However, as this pattern has already been working in datapackage-py for a while, I don't see it becoming much different whenever it enters the spec. We can tweak the code as needed.

On where "loaders" and "dumpers" should be located, which discussion are you referring to? The data.js one? Are you referring to the dump_to_zip(datapackage, path) vs datapackage.to_zip(path) different styles ("thin" vs "fat" objects)?

@roll roll closed this as completed in #97 Jan 10, 2018
@roll roll removed the {review} label Jan 10, 2018
@roll
Member Author

roll commented Jan 10, 2018

I think https://github.com/frictionlessdata/implementations/issues is a better place to discuss high-level architecture.

My 2c on this:

I don't think that fat objects vs thin objects or something like that is the key question. TBH, each of these tableschema/datapackage libs is relatively simple.

But there are 9(!) implementations of the tableschema + datapackage libraries. And 3 of the implementations are almost exact ports of each other - Python/JavaScript/Ruby. For example, the zip support here is just copied from datapackage-py.

I think the key questions are maintainability and mass adoption of these libraries. At the Frictionless Data project level, I don't see fat vs. thin as the key difference. What really matters in my eyes is how these libraries can be maintained with such limited resources and how they will be adopted.

So I just want to note that when considering any global changes to the libs, we should see the whole picture of the implementation stack. For example, if we dramatically change the datapackage-js lib, how does it affect the whole project:

  • one more completely different implementation to maintain (for now we just port between Python/JavaScript/Ruby, which is 3x faster)
  • one more completely different set of guides/tutorials (for now we can write tutorials for Python and they will work for the majority of other languages - users can literally change only the language syntax)

PS.
If I were starting a datapackage lib from scratch now for only one environment (e.g. JavaScript), I would probably go with inversion of control, etc. It's smart and great for many use cases (esp. for JavaScript). But what I have always tried to emphasize is that the Frictionless Data implementation stack is much more than just 1 lib.
