partial resolve-fetch #20
Comments
Seems reasonable to me. Unfortunately, while adding this functionality to the API would be pretty easy, in the CLI it is going to be a little bit trickier. Some issues I can think of off the top of my head:
I can see the utility here; it just needs to be specced out some more.
If we're following IPFS practices, the only way to get the thing you want is to call it by its hash. That way you are sure you are getting the thing you want. But with a CLI, that would require some cutting and pasting, which is a chore in some shells. Calling by line number is easiest but also problematic. For my specific use case, it would be most convenient to call by file extension, or by the same wildcards you can use with the mv and cp commands. I'm looking at this tool for ingest into an internal preservation system and not for public sharing (so minids with a public lookup are a conundrum). It would be great for my use case if a somewhat empty bag could be thrown into the system as a .tgz and the system would then fetch all the non-massive files. So .json yes, but .mp4 no.
Being able to specify wildcard patterns sounds like a good compromise between utility and scope creep. One relatively simple way to implement this would be using Python's Alternatively, it could be implemented using regular expression matching. That would ultimately be more flexible/powerful but introduces more complexity. However, in this case an input array of filters is not really necessary since any disjunctive conditions could be expressed in the regex. This approach is probably more future-proof; i.e., I could see things getting to a point where |
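To make the trade-off between the two approaches concrete, here is a minimal sketch in plain Python. It is not bdbag code: the payload paths and the helper names `select_by_wildcards` and `select_by_regex` are made up for illustration. It shows a list of shell-style wildcard patterns next to a single regular expression that expresses the same disjunction.

```python
import fnmatch
import re

# Hypothetical payload paths, standing in for the entries of a bag's fetch.txt.
paths = [
    "data/metadata.json",
    "data/README.txt",
    "data/video/talk.mp4",
    "data/notes/change.log",
]

def select_by_wildcards(paths, patterns):
    """Keep paths that match ANY of the shell-style patterns (case-sensitive)."""
    return [
        p for p in paths
        if any(fnmatch.fnmatchcase(p, pat) for pat in patterns)
    ]

def select_by_regex(paths, pattern):
    """Keep paths where the regular expression matches anywhere in the path."""
    rx = re.compile(pattern)
    return [p for p in paths if rx.search(p)]

# Wildcards need one pattern per alternative ...
wildcard_hits = select_by_wildcards(paths, ["*.json", "*.txt"])

# ... while a single regex can carry the alternation itself.
regex_hits = select_by_regex(paths, r"\.(json|txt)$")
```

Both calls select the `.json` and `.txt` paths and skip the `.mp4` and `.log` ones. Note that in `fnmatch` the `*` wildcard also matches `/`, so `*.json` reaches into subdirectories.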
re is certainly more powerful than fnmatch or glob, but the syntax is not quite as easy. I haven't quite figured out your code enough to contribute cleanly, so it is your choice. I could go back and add docstrings to your classes and functions though if you'd like. I just noticed your other question.
Throwing a file together can be a lot more powerful than a regular expression in some cases. For instance, wget takes a URL or a text file full of URLs separated by line breaks. If you set it up right, you can wget an awful lot of disparate stuff from one text file.
We have a release pending this week, so it is not likely I will be able to do anything with this for the upcoming release. However, after we get the release cut, I'll create a branch for this and prototype something and we can take it from there. Regarding docstrings, point noted. That should definitely be done at some point. Feel free to file an issue about that if you like, but a PR is probably not necessary. In the meantime, there is API documentation here which can be used as a reference.
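A file-based selection in the spirit of wget's input file could look roughly like the sketch below. It assumes the fetch entries are available as `(url, length, filename)` tuples, which is the column order BagIt uses in fetch.txt. The helper names and sample URLs are hypothetical and not part of bdbag.

```python
import io

def read_url_list(handle):
    """Return the set of non-empty lines from a newline-delimited URL list."""
    return {line.strip() for line in handle if line.strip()}

def select_entries(fetch_entries, wanted_urls):
    """Keep only the fetch entries whose URL appears in the wanted set."""
    return [entry for entry in fetch_entries if entry[0] in wanted_urls]

# Hypothetical fetch entries: (url, length, filename).
fetch_entries = [
    ("https://example.org/a/metadata.json", "512", "data/metadata.json"),
    ("https://example.org/a/talk.mp4", "987654321", "data/talk.mp4"),
    ("https://example.org/a/README.txt", "128", "data/README.txt"),
]

# io.StringIO stands in for an open text file of URLs, one per line.
url_file = io.StringIO(
    "https://example.org/a/metadata.json\n"
    "\n"
    "https://example.org/a/README.txt\n"
)

wanted = read_url_list(url_file)
selected = [filename for _, _, filename in select_entries(fetch_entries, wanted)]
```

Here `selected` contains the two small files and leaves out the large `.mp4`, because its URL is not in the list.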
So, for another part of the code ( It's pretty flexible while at the same time tries to keep things simple. There is a new argument
With this mechanism you can do various string-based pattern matching on
The above commands will get all files ending with ".txt", all files beginning with "README", the exact file "data/change.log", and all urls containing "/requirements/" in the url path. You can also use
Would this general filter mechanism satisfy your use case? Your feedback is welcome. Also, I am not opposed to providing a file-based solution to this as well, but I think something like that should be much simpler by design, e.g., a simple newline-delimited list of URLs to cross-reference against URLs in |
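For readers following along, the four kinds of match described above (ends with, begins with, exact, contains) can be sketched in plain Python as follows. This only illustrates the idea and is not the bdbag implementation or its CLI syntax: the operator names, the `(column, operator, value)` tuple shape, and the sample entries are all assumptions, as is the choice to combine several filters with OR.

```python
# Hypothetical operator table: name -> predicate over (field value, filter value).
OPS = {
    "endswith": lambda field, value: field.endswith(value),
    "startswith": lambda field, value: field.startswith(value),
    "equals": lambda field, value: field == value,
    "contains": lambda field, value: value in field,
}

def matches(entry, flt):
    """Apply one (column, operator, value) filter to a single entry."""
    column, op, value = flt
    return OPS[op](entry[column], value)

def select(entries, filters):
    """Keep entries that satisfy at least one filter (OR is an assumption)."""
    return [e for e in entries if any(matches(e, f) for f in filters)]

# Hypothetical entries; real fetch entries also carry a length column.
entries = [
    {"url": "https://example.org/files/notes.txt", "filename": "notes.txt"},
    {"url": "https://example.org/files/README.md", "filename": "README.md"},
    {"url": "https://example.org/files/change.log", "filename": "data/change.log"},
    {"url": "https://example.org/requirements/base.cfg", "filename": "base.cfg"},
    {"url": "https://example.org/files/talk.mp4", "filename": "talk.mp4"},
]

filters = [
    ("filename", "endswith", ".txt"),
    ("filename", "startswith", "README"),
    ("filename", "equals", "data/change.log"),
    ("url", "contains", "/requirements/"),
]

selected = [e["filename"] for e in select(entries, filters)]
```

With these inputs every entry except `talk.mp4` is selected, one per filter.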
I am not quite to the part of my project where I would be implementing a partial fetch. Your solution sounds good. I'm not sure exactly what "missing" means though. I'm also pretty unclear on where you store 'remote-file-manifests' in a ro-bag. Is remote-files.json a working file unique to bdbags? JSON is certainly cleaner than the somewhat ugly text files LoC invented, but I feel weird adding arbitrary metadata files that aren't clearly specified somewhere.
For The The |
I've included this functionality in the latest release. I'd like to close the issue if there are no more comments.
If the bdbag is multiple TBs in size and I just want to pull in select files, shouldn't I be able to --resolve-fetch selectively? 'missing' and 'all' can both be fire hoses.