
partial resolve-fetch #20

Closed
ddierkes opened this issue May 21, 2018 · 9 comments

@ddierkes
Contributor

ddierkes commented May 21, 2018

If the bdbag is multiple TBs in size and I just want to pull in select files, shouldn't I be able to --resolve-fetch selectively? 'missing' and 'all' can both be fire hoses.

@mikedarcy
Collaborator

Seems reasonable to me. Unfortunately, while adding this functionality to the API would be pretty easy, in the CLI it is going to be a little bit trickier.

Some issues I can think of off the top of my head:

  • How should the input set be passed to the CLI? As a delimited array of file paths that match file paths found in fetch.txt? As a file? Either?
  • There are potential filename encoding/quoting issues to deal with here as well when processing the input set of file paths.
  • The arguments to this function would not be compatible with the way the CLI currently handles the --resolve-fetch argument, so either that argument would have to be refactored (breaking backward compatibility), or we'd have to add a new argument, e.g., --select-fetch or some such.

I can see the utility here, it just needs to be spec'ed out some more.

@ddierkes
Contributor Author

If we're following IPFS practices, the only way to get the thing you want is to call it by its hash. That way you are sure you are getting the thing you want. But with a CLI, that would require some cutting and pasting, which is a chore in some shells. Calling by line number is easiest but also problematic. For my specific use case, it would be most convenient to call by file extension, or by the same wildcards you can use with the mv and cp commands.

I'm looking at this tool for ingest into an internal preservation system and not for public sharing (so minids with a public lookup are a conundrum). It would be great for my use case for a somewhat empty bag to be thrown into the system as a .tgz and for the system to then fetch all the non-massive files. So .json yes, but mp4 no.

@mikedarcy
Collaborator

Being able to specify wildcard patterns sounds like a good compromise between utility and scope creep.

One relatively simple way to implement this would be using Python's fnmatch library, which would give you the same syntax as Unix shell commands like mv and cp. Supporting a disjunctive array of filters would also be nice so that you could include multiple file types via extension, e.g., [*.txt, *.json], etc. Unfortunately, expressing anything more complex (like negation) is a bit of a pain.

Alternatively, it could be implemented using regular expression matching. That would ultimately be more flexible/powerful but introduces more complexity. However, in this case an input array of filters is not really necessary since any disjunctive conditions could be expressed in the regex. This approach is probably more future-proof; i.e., I could see things getting to a point where fnmatch wasn't good enough anymore and needed to be switched out for re, while the reverse is less likely.
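To make the trade-off concrete, here is a minimal sketch (not bdbag code, just an illustration) comparing the two approaches on a hypothetical list of fetch.txt filenames. The fnmatch version needs one pattern per file type, while a single regex can express the same disjunction:

```python
import fnmatch
import re

# Hypothetical fetch.txt filenames used for illustration only.
filenames = ["data/readme.txt", "data/results.json", "media/talk.mp4"]

# fnmatch: shell-style wildcards, one pattern per call, so a disjunctive
# array of filters is needed to match multiple extensions.
txt_or_json = [f for f in filenames
               if any(fnmatch.fnmatch(f, p) for p in ("*.txt", "*.json"))]

# re: a single regex can express the same disjunction (and much more).
pattern = re.compile(r".*\.(txt|json)$")
txt_or_json_re = [f for f in filenames if pattern.match(f)]

assert txt_or_json == txt_or_json_re  # both select the .txt and .json entries
```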

@ddierkes
Contributor Author

ddierkes commented May 23, 2018

re is certainly more powerful than fnmatch or glob, but the syntax is not quite as easy. I haven't quite figured out your code enough to contribute cleanly, so it's your choice. I could go back and add docstrings to your classes and functions, though, if you'd like.

I just noticed your other question.

As a file? Either?

Throwing a file together can be a lot more powerful than a regular expression in some cases. For instance, wget takes a url or a text file full of urls separated by line breaks. If you set it up right, you can wget an awful lot of disparate stuff from one text file.

@mikedarcy
Collaborator

We have a release pending this week, so it's not likely I'll be able to get to this before then. However, after we cut the release, I'll create a branch for this, prototype something, and we can take it from there.

Regarding docstrings, point noted. That should definitely be done at some point. Feel free to file an issue about that if you like, but a PR is probably not necessary. In the meantime, there is API documentation here which can be used as a reference.

@mikedarcy
Collaborator

mikedarcy commented Jun 10, 2018

So, for another part of the code (bdbag-utils), I recently needed to implement a simple filter expression mechanism for generating remote-file-manifests from various sources. It occurred to me that the same mechanism could be used to implement partial-fetch, so I have done so in an experimental branch: https://github.com/fair-research/bdbag/tree/partial-fetch.

It's pretty flexible while at the same time tries to keep things simple. There is a new argument --fetch-filter that takes a string of the form: <column><operator><value> where:

  • <column> is one of the following literal values corresponding to the field names in fetch.txt: url, length, or filename

  • <operator> is one of the following predefined tokens:

    Operator  Description
    ==        equal
    !=        not equal
    =*        wildcard substring equal
    !*        wildcard substring not equal
    ^*        wildcard starts with
    $*        wildcard ends with
    >         greater than
    >=        greater than or equal to
    <         less than
    <=        less than or equal to
  • <value> is a string or an integer

With this mechanism you can do various string-based pattern matching on filename and url. Using missing as the mode for --resolve-fetch, you can invoke the command multiple times with a different filter each time to perform an effective disjunction. For example:

  • bdbag --resolve-fetch missing --fetch-filter filename$*.txt ./my-bag
  • bdbag --resolve-fetch missing --fetch-filter filename^*README ./my-bag
  • bdbag --resolve-fetch missing --fetch-filter filename==data/change.log ./my-bag
  • bdbag --resolve-fetch missing --fetch-filter url=*/requirements/ ./my-bag

The above commands will get all files ending with ".txt", all files beginning with "README", the exact file "data/change.log", and all urls containing "/requirements/" in the url path.
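For anyone curious how such an expression might be evaluated, here is a hypothetical sketch (not the actual bdbag implementation) that parses a <column><operator><value> string and tests it against one fetch.txt entry. The entry dict, field names, and operator table mirror the description above:

```python
import fnmatch
import operator

# Map each operator token to a predicate. Wildcard tokens are built on
# fnmatch so that shell-style patterns work inside the value.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "=*": lambda field, val: fnmatch.fnmatch(field, "*%s*" % val),
    "!*": lambda field, val: not fnmatch.fnmatch(field, "*%s*" % val),
    "^*": lambda field, val: fnmatch.fnmatch(field, "%s*" % val),
    "$*": lambda field, val: fnmatch.fnmatch(field, "*%s" % val),
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

def matches(entry, expression):
    """Evaluate a <column><operator><value> filter against one entry,
    where entry is a dict with 'url', 'length', and 'filename' keys."""
    # Try two-character tokens before one-character ones so that
    # e.g. ">=" is not mistaken for ">".
    for token in ("==", "!=", "=*", "!*", "^*", "$*", ">=", "<=", ">", "<"):
        if token in expression:
            column, value = expression.split(token, 1)
            field = entry[column]
            if column == "length":
                field, value = int(field), int(value)
            return OPERATORS[token](field, value)
    raise ValueError("no operator found in filter: %s" % expression)

# Hypothetical fetch.txt entry for illustration.
entry = {"url": "http://example.org/data/change.log",
         "length": "512", "filename": "data/change.log"}

assert matches(entry, "filename$*.log")      # ends with ".log"
assert matches(entry, "length<=1000000")     # small enough to fetch
assert not matches(entry, "filename^*README")
```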

You can also use length and the integer relational operators to easily limit the size of the files retrieved, for example:

  • bdbag --resolve-fetch all --fetch-filter length<=1000000 ./my-bag

Would this general filter mechanism satisfy your use case? Your feedback is welcome.

Also, I am not opposed to providing a file-based solution as well, but I think something like that should be much simpler by design, e.g., a simple newline-delimited list of URLs to cross-reference against the URLs in fetch.txt, downloading only the intersection.
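The file-based idea can be sketched in a few lines; this is an illustrative helper (the function name and entry shape are assumptions, not bdbag API) that keeps only the fetch.txt entries whose URL appears in the list file:

```python
import os
import tempfile

def select_by_url_list(fetch_entries, url_list_path):
    """Return only the fetch.txt entries whose URL appears in the
    newline-delimited list file (the intersection)."""
    with open(url_list_path) as f:
        wanted = {line.strip() for line in f if line.strip()}
    return [e for e in fetch_entries if e["url"] in wanted]

# Hypothetical fetch.txt entries for illustration.
entries = [{"url": "http://example.org/a.json", "filename": "data/a.json"},
           {"url": "http://example.org/b.mp4", "filename": "data/b.mp4"}]

# Write a one-URL list file and cross-reference it.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("http://example.org/a.json\n")
    path = tmp.name
try:
    selected = select_by_url_list(entries, path)
finally:
    os.unlink(path)

assert [e["filename"] for e in selected] == ["data/a.json"]
```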

@ddierkes
Contributor Author

I am not quite to the part of my project where I would be implementing a partial fetch. Your solution sounds good. I'm not sure exactly what "missing" means, though.

I'm still unclear, though, on where you store 'remote-file-manifests' in an ro-bag. Is remote-files.json a working file unique to bdbags? JSON is certainly cleaner than the somewhat ugly text files LoC invented, but I feel weird adding arbitrary metadata files that aren't clearly specified somewhere.

@mikedarcy
Collaborator

For --resolve-fetch, the all and missing keywords just specify how to handle files that may have already been fetched. With all, everything is downloaded again, regardless of whether it already exists in the bag payload, whereas missing only fetches files that are not already in the payload. In the context of a multi-command fetch with filters, using missing ensures that files are not re-downloaded when overlapping filters include the same content more than once.
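The distinction boils down to a simple existence check; here is a hypothetical sketch (not the actual bdbag code, and the function name is invented) of how the two modes decide whether to download an entry:

```python
import os
import tempfile

def should_fetch(entry, bag_dir, mode):
    """Decide whether a fetch.txt entry should be downloaded under the
    'all' or 'missing' resolve-fetch modes (illustrative only)."""
    if mode == "all":
        return True  # always re-download, even if the payload file exists
    # 'missing': skip entries whose payload file is already on disk
    return not os.path.exists(os.path.join(bag_dir, entry["filename"]))

# Simulate a bag payload that already contains one file.
bag_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(bag_dir, "data"))
open(os.path.join(bag_dir, "data", "present.txt"), "w").close()

assert should_fetch({"filename": "data/present.txt"}, bag_dir, "all")
assert not should_fetch({"filename": "data/present.txt"}, bag_dir, "missing")
assert should_fetch({"filename": "data/absent.txt"}, bag_dir, "missing")
```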

The remote-file-manifest is just a bdbag-specific format (or working file, as you write) for generating bags with remote and/or local payloads. It does not get added to the bag in any way; it is simply a metadata "driver" file for the creation of bags.

The bagit-ro support is completely optional (and doesn't have anything to do with partial-fetch), and you do not need to use it if you feel it is unnecessary for your use case. If you are looking at the entire diff of this branch against master you are seeing other changes unrelated to partial-fetch, but intended to be included in the next software release.

@mikedarcy
Collaborator

I've included this functionality in the latest release. I'd like to close the issue if there are no more comments.
