feature request: ArchiveRecord.archiveFile #164
Comments
Thanks as always (we're getting a bit of a backlog of issues as we're stretched in many different directions, so don't take lack of work as lack of interest!). Just so I am clear, the idea would be that you'd have a command like:
Where |
Hi, I am not sure I understand your question. Here is an example:
OK, thanks for this. We are quite swamped right now, but if you have a cycle we always enthusiastically look for pull requests too. 😄
I posted the feature request here, but I am not sure whether it is useful for other people.
For the previous example, this would create the files:
which is actually even better for my needs.
I am querying the CommonCrawl archive, which is divided into hundreds of warc.gz files. I use RecordLoader.loadArchives to read all the warc files at once. Sometimes the log contains an Exception raised while processing a page, and I need to find out which of the individual warc.gz files the page comes from (so that I can re-run the program on that file only).
Would it be possible for the ArchiveRecord class to also have a field with the input archive name? (With that, I could catch exceptions and show not only the URL but also the input archive file.)
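To make the request concrete, here is a minimal, library-independent sketch in Python of the behaviour being asked for. It does not use the toolkit's actual API: `TaggedRecord`, `tag_records`, `process_all`, the `archive_file` field name, and the example file names are all hypothetical and exist only to illustrate the idea of carrying the source archive name alongside each record so that failures can be traced back to one file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class TaggedRecord:
    """Hypothetical stand-in for a record that also knows its source archive."""
    url: str
    content: str
    archive_file: str  # name of the warc.gz file the record was read from


def tag_records(
    archives: Dict[str, Iterable[Tuple[str, str]]]
) -> Iterator[TaggedRecord]:
    """Yield every (url, content) pair together with its source archive name."""
    for archive_file, records in archives.items():
        for url, content in records:
            yield TaggedRecord(url, content, archive_file)


def process_all(
    records: Iterable[TaggedRecord],
    process: Callable[[TaggedRecord], None],
) -> List[Tuple[str, str, str]]:
    """Run `process` on each record and collect (archive_file, url, error)
    for every record that raises, instead of losing the file name."""
    failures = []
    for record in records:
        try:
            process(record)
        except Exception as exc:
            failures.append((record.archive_file, record.url, str(exc)))
    return failures


# In-memory example data; the file names and URLs are made up.
archives = {
    "part-00001.warc.gz": [("http://a.example/", "ok")],
    "part-00002.warc.gz": [("http://b.example/", "")],
}


def fail_on_empty(record: TaggedRecord) -> None:
    """Toy processing step that fails on empty pages."""
    if not record.content:
        raise ValueError("empty page")


failures = process_all(tag_records(archives), fail_on_empty)
for archive_file, url, error in failures:
    print(f"{archive_file}: {url}: {error}")
```

With the archive name attached to each record, the failure report identifies the single warc.gz file that needs to be re-run, rather than only the URL.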