IPFS as a backend to a web archiving #28
Comments
This would also entail support for the memento protocol relating to ipfs-inactive/faq#35 |
@ikreymer thanks for initiating this, I'd really love to get this working, as it would be very helpful for ipfs/archives :)
SGTM :)
Looking at this, this would require storing a TimeMap in IPFS for each URI-R? If we assume each TimeGate has its own TimeMap, then this could easily be achieved by pushing TimeMap updates to an IPNS address (ipfs/kubo#1716). However, ideally we'd also like to be able to aggregate/federate TimeMaps across all TimeGates storing mementos for a given resource, cf ipfs/notes#40 . @ikreymer Does the memento protocol support something like this?
IPFS doesn't yet support this, but it is planned (cc @jbenet @whyrusleeping) |
We can probably build a data structure and custom importer for WARC files so that we can traverse into the WARCs with ipfs. |
Hmm, well, the TimeMap does not need to exist as a discrete file, it's basically a query for 'all mementos (archives) of a given url', which can be the result of a query, etc.. The TimeGate is basically a query for 'closest memento (archives) of a given url to a given date (and maybe next, prev dates available)' I was reading https://ipfs.io/ipfs/QmdPtC3T7Kcu9iJg6hYzLBWR5XCDcYMY7HV685E3kH3EcS/2015/09/15/hosting-a-website-on-ipfs/ -- it seems that perhaps the best solution is just through the file system itself. Is there support for nested directories? Also, how does the One idea is to just use a url/date method of A url,
Yes, I think this may be close to what I was thinking too. A WARC file consists of concatenated gzipped records, and so I'm proposing to store each record as above. The record contains WARC headers, including the url, date, hash digest of the payload, and a unique id (amongst other fields), followed by HTTP headers + HTTP payload. Since there can be multiple WARC records per timestamp and url, and it's often useful to further filter by
If this is possible, then searching to see if the archive has Searching for all records by date (eg. the TimeMap) would just be:
Serving the HTTP response from 201509250000000 would just be reading the first file from:
(Actually url is usually canonicalized into a reverse order form, eg. I'm not at all sure if this would work and/or be efficient.
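As a rough illustration of the two queries described above (TimeMap: all captures of a url; TimeGate: the capture closest to a given date), here is a minimal Python sketch over an in-memory index. The index layout, the `timemap` and `timegate` function names, and the sample entries are all assumptions made for illustration; nothing here is part of IPFS, pywb, or the Memento protocol itself.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical index: sorted list of (url, 14-digit timestamp, record hash).
INDEX = sorted([
    ("http://example.com/", "20150901120000", "hash-a"),
    ("http://example.com/", "20150925000000", "hash-b"),
    ("http://example.com/", "20151010093000", "hash-c"),
    ("http://example.org/", "20150925000000", "hash-d"),
])

def parse_ts(ts):
    """Turn a YYYYMMDDhhmmss string into a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

def timemap(url):
    """All captures of `url`, oldest first (the TimeMap query)."""
    start = bisect_left(INDEX, (url,))  # first entry whose url is >= `url`
    captures = []
    for entry in INDEX[start:]:
        if entry[0] != url:
            break
        captures.append(entry)
    return captures

def timegate(url, timestamp):
    """Capture of `url` closest in time to `timestamp` (the TimeGate query)."""
    captures = timemap(url)
    if not captures:
        return None
    target = parse_ts(timestamp)
    return min(captures, key=lambda e: abs(parse_ts(e[1]) - target))
```

Because the index is sorted by url and then timestamp, both queries reduce to a range scan, which is the same property a sorted directory listing or a sorted key-value index would give.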
Yes, there is also a concept of a Memento Aggregator! It is mentioned here and described at: http://mementoweb.org/depot/ and they host one which aggregates across multiple web archives. And here is a new one, being written: https://github.com/oduwsdl/memgator Based on the above idea, this would just query multiple hashes:
Let me know if any of this makes sense. |
Also wanted to add here: A key distinction between 'WARC records' and plain static files is that the records are the raw HTTP request and response data (not files), including headers, encoding, etc... The HTTP headers are often important for accurate 'replay' of web content. I can add some examples of WARC files to make it more clear. |
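To make the distinction concrete, here is a minimal sketch of how one uncompressed WARC response record splits into WARC headers, HTTP headers, and HTTP payload. The sample record is hand-written for illustration (its values are made up and it omits fields a real record would carry, such as the record id and payload digest), and `split_record` is a hypothetical helper, not part of pywb or any WARC library.

```python
# Hand-written, simplified HTTP response (illustrative values only).
HTTP_BLOCK = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Set-Cookie: session=abc\r\n"
    b"\r\n"
    b"<html>hello</html>"
)

# The same response wrapped in simplified WARC headers.
RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2015-09-25T00:00:00Z\r\n"
    b"Content-Length: " + str(len(HTTP_BLOCK)).encode() + b"\r\n\r\n"
    + HTTP_BLOCK + b"\r\n\r\n"
)

def parse_headers(block):
    """Return (first line, dict of 'Name: value' lines).

    Simplification: repeated header names overwrite each other.
    """
    lines = block.decode("utf-8").split("\r\n")
    return lines[0], dict(line.split(": ", 1) for line in lines[1:])

def split_record(record):
    """Split one record into (WARC headers, (status line, HTTP headers), payload)."""
    warc_head, rest = record.split(b"\r\n\r\n", 1)
    _, warc_headers = parse_headers(warc_head)
    # The WARC Content-Length is the size of the HTTP block that follows.
    http_block = rest[: int(warc_headers["Content-Length"])]
    http_head, payload = http_block.split(b"\r\n\r\n", 1)
    status_line, http_headers = parse_headers(http_head)
    return warc_headers, (status_line, http_headers), payload

warc_headers, (status_line, http_headers), payload = split_record(RECORD)
```

A plain static file would keep only `payload`; the `Set-Cookie` and `Content-Type` values in `http_headers` are the kind of information replay depends on.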
Ah, fair enough.
Most definitely
The API will give you back a JSON object (example, you'll need to copy and paste the link to avoid a CORS error) which you could then process.
That's definitely a possibility. I'd even suggest leaving in the slashes (like
If we assume that most sites don't have identically named subdomains and subdirectories with differing content, then there should be minimal ambiguity (which could be resolved by double-checking the URL in the WARC file). Alternatively, we could separate the domain and path parts like
Yeah, that would work. I'm also interested in being able to merge everything into a single global tree so the client only has to query a single hash (see #8), but that's still a fair while off.
I'm roughly familiar with WARC from Common Crawl Also, the homepage of webrecorder says:
How does verification work? |
Well, taking a step back, I realized that the directory structure is immutable, so perhaps this directory structure idea isn't as useful as I had thought.. I had assumed it could be used as an updatable index, but of course that's not the case.. Hmm.. I think perhaps a key-value store or simple database ipfs/ipfs#82 would be useful for querying and updating the index.. The directory structure isn't really needed, as it would just indicate a particular set of files in a WARC, or the order in which something was recorded, which is arbitrary.. and not the total archive. Can the URL and datetime be embedded as file system metadata stored with the file data? Is that possible? What is needed is some sort of updatable index... Still trying to understand how IPFS works, sorry :)
Oh I should probably update that.. It just signs the WARC using https://github.com/ikreymer/warcsigner |
IPNS provides mutable files and directories, in the same way git does (commits are immutable, but HEAD is not).
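A toy model of that idea, for illustration only: objects are stored under the hash of their content and never change, while a separate name table holds a pointer that can be moved to a newer object. The `put`, `publish`, and `resolve` names are hypothetical, and SHA-256 hex digests stand in for real IPFS hashes; this is not how IPNS is actually implemented.

```python
import hashlib
import json

objects = {}  # content hash -> bytes (immutable, like git objects)
names = {}    # name -> content hash (mutable, like git's HEAD)

def put(data):
    """Store bytes under the hash of their content and return that hash."""
    key = hashlib.sha256(data).hexdigest()
    objects[key] = data
    return key

def publish(name, key):
    """Point a mutable name at an (immutable) object."""
    names[name] = key

def resolve(name):
    """Follow the name to whatever object it currently points at."""
    return objects[names[name]]

# First version of a hypothetical index, then an updated one.
index_v1 = put(json.dumps({"http://example.com/": ["20150925000000"]}).encode())
publish("my-archive", index_v1)

index_v2 = put(json.dumps(
    {"http://example.com/": ["20150925000000", "20151010093000"]}
).encode())
publish("my-archive", index_v2)
```

Both versions of the index stay retrievable by hash; only the `my-archive` pointer changed, which is what makes an updatable index possible on top of immutable data.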
Ah, fair enough (also note that IPNS provides this natively). I tried looking into whether TLS could be (ab)used to provide server-signed content, but apparently not (the closest thing I found was https://tlsnotary.org/ which isn't really helpful). |
Hm. I see, how are simultaneous updates handled? Is there an equivalent of a merge operation? |
not built yet, but yes it's doable. highly app dependent, so we're still playing with designs |
You mean different difftool depending on the data format type? |
Thanks for the quick responses everyone. I guess the next step is for me to try and build a quick prototype of writing WARC records into IPFS and playing back archived content from IPFS, in the simplest way possible. Hopefully will get a chance to try that soon. My tools are all python based, so I think I should be able to use https://github.com/ipfs/python-ipfs-api |
@ikreymer SGTM, looking forward to it :) |
Cc: @amstocker |
@ikreymer I would be glad to help you out, so please let me know if you have questions. The python API client is pretty much stable but if you have any issues also definitely let me know. |
We now have a place to coordinate porting/building apps on top of IPFS, so I've opened an issue there to discuss the details of integrating WebRecorder.io specifically with IPFS (ipfs/apps#3). I'll leave this issue open to discuss archiving/recording web pages onto IPFS more generally, including discussions about how to store such data on IPFS such as directory naming conventions, etc. |
relevant href https://news.ycombinator.com/item?id=3946856 |
Thanks everyone for your help. Sorry for the delay, I have been busy releasing a new version of webrecorder.io, now out.. Also, happy to announce that webrecorder is now fully open source, at https://github.com/webrecorder/webrecorder There's still a lot to be done before integration would be possible, but I am planning to work on a separate prototype for writing and replaying WARCs for now.. Also, look forward to stopping by the Tuesday meetup in SF and meeting folks in person.. |
@ikreymer Great news, looking forward to when integration becomes possible :) PS: I just noticed you seem to be involved with hypothes.is? I've just started a discussion about IPFS integration you might be interested in :) |
@davidar @amstocker Here is a very very rough prototype, that allows users to browse and "record" into IPFS as they browse, and replay back from IPFS. Each (gzipped) WARC record is stored individually under a url-encoded name of the URL. https://github.com/ikreymer/pywb-ipfs/ After running the app, visit Redis is used to update a sorted index of URLs in real-time, although a copy is then pushed into IPFS every few seconds.. I'm not sure if this is the right approach, but just a start. |
This is a really cool recorder. would be awesome to get it into a chrome extension. (can py be put into chrome? might have to be js) |
Hopefully this should become easier when |
Well, I am actually trying to avoid browser plugins, as I think that is limiting to a specific browser, requires user installation, and is harder to maintain. I think this makes sense as a server-side service, which in my experience is more robust, and can support any modern browser, including mobile. I do have some questions about how to structure the data.. right now, each HTTP response is its own WARC record, and there is only an index that points to each one. It could be interesting to use the DAG nature of IPFS to create links between urls, but not sure what the best approach would be.. Since the recording is open-ended, there is no definite end.. for example, the recorder could "commit" a linked page structure (based on referrer) on page load, then if the user interacts with the page, or scrolls down, additional content could be recorded, so the recorder could "commit" again, and then if the user navigates to another page and that loads another page, that could also be committed. Then there is the issue of merging multiple recordings.. currently, I'm just updating the IPNS name with a cumulative index, but of course the ideal is to merge multiple indices.. And it was great to present at the SF meetup. Perhaps a discussion on IRC is better, if so, let me know, and I can jump in. |
@ikreymer indeed, thanks for coming.
I suspect we can do some clever importing (transform) of a WARC file into IPFS dag nodes. (like what
Can use a commit chain, like with git. we'll have those soon. ipfs/notes#23 |
Is there more information on how this should work? Perhaps we should discuss at some point? I'm thinking it would be good to figure out how to properly deal with WARC files in a general sense, as this may also affect ArchiveTeam #36 and #39 as these will have a lot of WARC files (though of course not only WARC files) |
Feel free to open other notes anywhere. I'd love for discussions to happen in our archives or notes repo, but whatever works! |
@davidar Can't do now unfortunately, but let's pick a time that works for everyone.. |
0700-1200 UTC usually works for me |
sorry my avail will be sparse before thu this week. meet without me i'd say, and i can look over a proposed WARC design? |
@travisfw this thread may interest you... |
To restart the discussion, I thought it would probably make sense to delve into the structure of a WARC record. A WARC record basically consists of WARC (MIME-style) headers, followed by an HTTP response (HTTP headers + HTTP payload). It is designed to be easily appended to a previous entry (for example, by a crawler)
Thus far, I have been storing this entire block into IPFS, but this may not be the optimum way. By design, each WARC record will be different, as it contains a unique id and a unique timestamp. The Perhaps then, to store a WARC record, it makes sense to then serialize the HTTP payload separately from WARC headers + HTTP headers? (Just for reference, here are some slides I have about the structure of the WARC format: When storing duplicate content, only a new WARC headers + HTTP headers entry can be added, and the HTTP payload would be matched by an existing hash. The WARC spec already supports this exact use case for deduplication. The larger goal here is to be able to accurately ingest existing web archives (WARC records) into IPFS, and create new archives compatible with existing web archiving software. This ensures, for instance, that HTTP headers are preserved as well, which are often needed for accurate replay in some cases (cookies, custom headers, etc..) |
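A minimal sketch of that split, assuming an in-memory dict as a stand-in for IPFS and SHA-256 hex digests as a stand-in for IPFS hashes. The `store_capture` helper and the entry layout are hypothetical, chosen only to show why separating the payload lets duplicates collapse.

```python
import hashlib

payloads = {}  # payload digest -> bytes, stored once no matter how often captured
entries = []   # one small entry per capture: headers plus a payload reference

def store_capture(warc_headers, http_headers, payload):
    """Store one capture; return (payload digest, whether the payload was already stored)."""
    digest = hashlib.sha256(payload).hexdigest()
    is_duplicate = digest in payloads
    payloads.setdefault(digest, payload)  # only the first capture stores the bytes
    entries.append({"warc": warc_headers, "http": http_headers, "payload": digest})
    return digest, is_duplicate

body = b"<html>unchanged page</html>"

# Two captures of the same url at different times with identical content.
first = store_capture(
    {"WARC-Target-URI": "http://example.com/", "WARC-Date": "2015-09-25T00:00:00Z"},
    {"Content-Type": "text/html"},
    body,
)
second = store_capture(
    {"WARC-Target-URI": "http://example.com/", "WARC-Date": "2015-10-10T09:30:00Z"},
    {"Content-Type": "text/html"},
    body,
)
```

The whole-record approach would hash differently each time because of the unique id and timestamp; here the second capture adds only a small headers entry and reuses the existing payload.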
👍 IPLD might also allow us to store the headers as proper key-value mappings (metadata). Cc: @mildred |
IPLD would allow you to do the same thing that you can do with the current format, except that IPLD already provides you with a structured data model compatible with JSON. What you probably want to do is store the headers and then a link to the payload. |
I'm having trouble finding the IPLD spec. I found this: https://github.com/candeira/specs/blob/52f2a673df33b06e4408100fc468eea78d0f2cae/merkledag/ipld.md and I found two implementations: https://github.com/ipfs/go-ipld and https://github.com/diasdavid/js-ipld (edit: Ah, maybe this is the definitive PR? ipfs/specs#37 ) |
IPLD is not yet ready (that's why it's still a pull request) and I don't think the implementations are ready yet (at least go-ipld isn't). But basically, it replaces the current protocol buffer implementation in go-ipfs/merkledag with a JSON-compatible data structure. This data structure is free for application implementors (you, for example, wanting to store some specific data structure) to use. If you think of your data structure in JSON, you can be sure to be able to store it in IPLD. IPLD adds a link mechanism to allow linking IPLD documents together. A link in IPLD is represented by a JSON object like this one:
This object can contain other properties you might want to store for the link. |
@jbenet and I chatted Various ways to do it: either split HTTP payload, HTTP headers, WARC headers as separate objects linked together, or add all headers as part of IPLD structure. Probably separate objects make sense so that WARC digest entries can just be IPFS hashes. I will look at existing spec and offer more specific thoughts. |
@ikreymer SGTM :) |
For reference @ibnesayeed and I hacked together InterPlanetary Wayback for the Archives Unleashed Hackathon in early March to get our hands dirty and experiment with WARC+IPFS interfacing. The approach we initially took was similar to the first way @ikreymer described: we chopped up WARC files into WARC headers, HTTP headers, and HTTP payload; extracted relevant values from the WARC headers, discarding the rest; then added the temp files created from the extracted parts via a local IPFS daemon instance. It sorta worked but we hope to develop it further in a less hacky way. |
I am building a new on-demand web archiving system, called webrecorder.io, which allows for on-demand archiving of any web site (by acting as a rewriting + recording proxy).
This version (actually beta.webrecorder.io) will soon be open-sourced and will be available for users to deploy on their own.
The system allows a user to create a recording of any web site, including dynamic content, by browsing it through the recorder, eg. https://webrecorder.io/record/example.com/ and replay by browsing through replay, https://webrecorder.io/replay/example.com/
The recording is a WARC file, a standard used by Internet Archive and other archiving orgs. The file can be broken down into records (basically contents of HTTP response + request and extra metadata), and each of these records could be put individually into IPFS.
I suppose this sort of relates to #7 but perhaps in a more sophisticated way.
Most obvious mode of operation: Store each WARC record in IPFS individually.
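As a rough sketch of the first step, assuming the compressed WARC is a series of concatenated gzip members (one per record), the file can be cut into individually storable records with the standard library alone. `iter_gzip_members` is a hypothetical helper, and the sample data below is a stand-in for real WARC records; actually adding each member to IPFS is left out.

```python
import gzip
import zlib

def iter_gzip_members(data):
    """Yield (compressed_member, decompressed_bytes) for each gzip member in `data`."""
    while data:
        # 16 + MAX_WBITS tells zlib to expect gzip framing.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        content = decomp.decompress(data) + decomp.flush()
        # Bytes after the end of this member land in unused_data.
        consumed = len(data) - len(decomp.unused_data)
        yield data[:consumed], content
        data = decomp.unused_data

# Stand-in for a .warc.gz: three separately gzipped "records" concatenated.
records = [b"record one", b"record two", b"record three"]
warc_gz = b"".join(gzip.compress(r) for r in records)

members = list(iter_gzip_members(warc_gz))
```

Each `compressed_member` is itself a valid gzip stream, so it could be stored as-is and later concatenated back into a file that existing WARC tools can read.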
Some unknowns (to me):
For more reference:
The system is built using these tools: https://github.com/ikreymer/pywb , https://github.com/ikreymer/warcprox
An older simplified version of the "webrecorder" concept: https://github.com/ikreymer/pywb-webrecorder.