Mirroring Web to IPFS #94

lidel · 2018-07-24T13:25:23Z

This is a meta-issue tracking related work and discussions (moved from ipfs/ipfs-companion#96).

Feasible

Image Rehosting via HTTP API (ipfs-companion/#599)
Creating simplified website snapshot:
- ipfs-companion/#91
- The 2read extension is a great PoC!

More Design Work Required

Saving reproducible snapshot of entire page load

This includes all JS/CSS/XHR and other assets that were loaded by the page, respecting Origin and other constraints that could impact page load.
The only standard in web archiving at the moment is the ISO WARC file format:
- https://www.iso.org/standard/68004.html, https://en.wikipedia.org/wiki/Web_ARChive
- It specifies the raw data captured from the web. However, WARC files often lack any context or metadata about how this data was captured.
- WABAC.js: a proof-of-concept web archive replay system implemented entirely via Service Workers
  - https://github.com/webrecorder/wabac.js hosted at https://wab.ac/
  - supports replay of WARC and HAR files
  - could probably be extended to support signed exchanges (below)
We don't have the means to do it yet, but Bundles from WebPackage (see Signed/Bundled HTTP Exchanges and WebPackage #121) could also unlock the archival use case:
- https://tools.ietf.org/html/draft-yasskin-webpackage-use-cases-01#section-2.2.10
👉 👉 Update 2021: High fidelity solution exists! And can be used with IPFS 🥳
- https://replayweb.page/ (https://github.com/webrecorder/replayweb.page) which is a full browser-based web archive replay system ('wayback machine'), using service workers
- https://replayweb.page/docs/
- Relevant demo/status update of @ikreymer's work: https://www.youtube.com/watch?v=evcSETnTBf0

Automatic mirroring of standard websites to IPFS as you browse them (ipfs/ipfs-companion#535)

IMMUTABLE assets: very limited feasibility. So far only two types of immutable resources exist on the web:
- JS, CSS, etc. marked with an SRI hash (Subresource Integrity) (mapping SRI→CID) (see the discussion from 2016-03-26 below, and Using CID in HTML SRI (Subresource Integrity attributes) #214 for future work)
- URLs for things explicitly marked as immutable via Cache-Control: public, (..) immutable (mapping URL→CID)
MUTABLE assets: what if we add every page to IPFS and store a mapping between URL and CID, so that if a page disappears we can fall back to the IPFS version?
- a can of worms: a safe version would be like web.archive.org, but limited to a local machine. Sharing the cache with other people would require a centralized mapping service (a single point of failure and a vector for privacy leaks)
- So what is needed to make it "right"?
  - keep it simple but robust: no http, no centralization, no single point of failure
  - Ideally, URL2IPFS lookups would not rely on centralized index.
    - rough idea (Automatic mirroring of HTTP websites to IPFS as you browse them ipfs-companion#535 (comment)): what if we create a pubsub-based room per URL? For example:
      - When you open a website, you subscribe to a pubsub room unique to that URL
      - If the pubsub room has entries under the "keepalive" threshold, grab the latest one
      - If the room is empty or the keepalive timeout is hit, fall back to HTTP, but in the background add the HTTP page to IPFS and announce the updated hash on pubsub (with a new timestamp) for the next visitor
      - There are still pubsub performance and privacy problems to solve (e.g. publishing banking pages), but at least we don't rely on the HTTP server anymore.
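
The pubsub-room steps above can be sketched as a small decision function. This is only an illustration of the lookup logic, not an existing ipfs-companion feature: the `Announcement` record, the `url2ipfs/` topic naming and the function names are all hypothetical, and no real pubsub transport is involved.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Announcement:
    """One entry seen in a per-URL room (hypothetical record shape)."""
    cid: str          # hash announced by some earlier visitor
    timestamp: float  # seconds since the epoch, as claimed by that visitor


def room_topic(url: str) -> str:
    """Derive a room name unique to the URL (the naming scheme is an assumption)."""
    return "url2ipfs/" + hashlib.sha256(url.encode("utf-8")).hexdigest()


def pick_fresh(entries: Iterable[Announcement], now: float,
               keepalive: float) -> Optional[Announcement]:
    """Return the latest entry younger than the keepalive threshold, if any."""
    # Entries that claim a timestamp in the future are ignored as suspicious.
    fresh = [e for e in entries if 0 <= now - e.timestamp < keepalive]
    return max(fresh, key=lambda e: e.timestamp, default=None)


def resolve(url: str, entries: Iterable[Announcement], now: float,
            keepalive: float) -> Tuple[str, str]:
    """Decide where to load from: a fresh IPFS copy, or plain HTTP as fallback."""
    chosen = pick_fresh(entries, now, keepalive)
    if chosen is not None:
        return ("ipfs", chosen.cid)
    # Empty room or only stale entries: the caller would fetch over HTTP,
    # then add the page to IPFS and announce a new entry for the next visitor.
    return ("http", url)
```

The sketch trusts announced timestamps and CIDs as-is, which is exactly where the open performance and privacy questions mentioned above would have to be addressed.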
        
        Automatic mirroring of HTTP websites to IPFS as you browse them ipfs-companion#535 (comment):
        
        I feel the safe way to do it is to just follow the semantics of Cache-Control and max-age (if present).
        This header is already respected by browsers and website owners and could be parsed as an indicator of whether a specific asset can be cached in IPFS. AFAIK all browsers (well, at least Chrome and Firefox) cache HTTPS content by default for some arbitrary time (if Cache-Control is missing), unless explicitly told not to cache via the Cache-Control header.
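
As a rough illustration of "follow the semantics of Cache-Control", the sketch below parses the header and decides whether, and for how long, a response might be kept in a shared mirror. The function names are hypothetical. The policy is also deliberately stricter than a browser cache: a missing header, or `private`, `no-store` or `no-cache`, means "do not mirror". That strictness is an assumption made here because a shared mirror is riskier than a local cache.

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}.

    Naive parser: it splits on commas, so quoted arguments that themselves
    contain commas are not handled.
    """
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        name, sep, value = part.strip().partition("=")
        if not name:
            continue
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


def mirror_lifetime(header: Optional[str]) -> Optional[int]:
    """Return how many seconds a response may be mirrored, or None to skip it."""
    if header is None:
        return None  # no explicit permission from the site: skip (conservative)
    directives = parse_cache_control(header)
    if any(d in directives for d in ("no-store", "private", "no-cache")):
        return None  # not meant for shared or unvalidated reuse
    max_age = directives.get("max-age")
    if max_age is None or not (max_age.isascii() and max_age.isdigit()):
        return None  # no usable lifetime given
    return int(max_age)


def is_immutable(header: Optional[str]) -> bool:
    """True when the response is explicitly marked immutable and may be mirrored."""
    if mirror_lifetime(header) is None:
        return False
    return "immutable" in parse_cache_control(header or "")
```

For example, `mirror_lifetime("public, max-age=31536000, immutable")` yields a one-year lifetime, while `mirror_lifetime("private, max-age=600")` yields `None`.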
Other notes
- The "webpackage" standard proposal surfaced recently. Among other things, it aims to address the website snapshotting use case in a safe and reproducible manner:
  - webpackage: Save and share a web page (Use Case)
  - Sounds super relevant to what we want as the endgame here
Prior art: existing browser extensions
- Arweave: https://chrome.google.com/webstore/detail/arweave/iplppiggblloelhoglpmkmbinggcaaoc?hl=en-GB
- Archiveror: https://chrome.google.com/webstore/detail/archiveror/cpjdnekhgjdecpmjglkcegchhiijadpb
- https://github.com/inkandswitch/xcrpt - PoC browser extension that produces a page snapshot with a note at the top of the page and publishes it to IPFS
  - Demo: https://www.bonappetit.com/recipe/kimchi-jjigae gets saved as https://ipfs.io/ipfs/QmS1pj7nUBvCSTaMjSrtH1EYfhWWpr4sZFyfdi7zfAm5Wc/

Related Discussions

2016-03-26

IRC log about mirroring SRI2IPFS

165958           geir_ │ lgierth: The web sites would have to link to ipfs content for this plugin to work. What i propose is a proxy that works like a transparent proxy and puts content into ipfs if it's not already there
170124            ed_t │ anyone know anything about ipfs-boards
170141            ed_t │ it keeps telling me I am in limited mode
170202            ed_t │ a full ipfs 0.40-rc3 node is running on localhost:5001
170217            ed_t │ but it does not seem to see it using the demo link
170228        +lgierth │ geir_: ah got what you wanna do -- i'm not sure you can easily just rewrite anything
170253        +lgierth │ for completely static pages, yes, but for slightly more dynamic stuff?
170303        +lgierth │ i'll be back in a bit, getting some coffee
170422           geir_ │ lgierth: I mean only for the static stuff like images, libs and so on. Should be pretty strait forward to implement. And a big bandwidth save for big networks
171542           lidel │ geir_, we are planning to add "host to ipfs" feature to the addon
171614           lidel │ when that is done, it should be easy to add option to automatically add every visited page
171634           lidel │ not sure how addon would do lookups tho
171734           lidel │ (meaning, how do i know the multihash of the page, how do we handle ipfs-cache expiration when page gets updated, etc)
171831           geir_ │ lidel: I see, thanks for the info. I still like the idea of a transparent proxy so every user/device on the network will use the "cdn" automatically
171852           lidel │ perhaps we could start with mirroring static assets that have SRI hash (https://www.srihash.org/)
171920           lidel │ and come up with a way for doing SRI2IPFS lookups
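
One possible shape for the SRI2IPFS lookup mentioned at the end of the log is a direct conversion from an SRI value to a CID. The sketch below handles only a single `sha256-...` value and wraps the digest as a CIDv1 with the `raw` codec. The function names are hypothetical. An important caveat: the resulting CID can only match content that was added to IPFS as a single raw block, because larger files are addressed by the hash of a DAG root rather than by the hash of the file bytes.

```python
import base64
import hashlib

CID_VERSION = 0x01   # CIDv1
RAW_CODEC = 0x55     # multicodec code for raw bytes
SHA2_256 = 0x12      # multihash code for sha2-256


def sri_sha256(data: bytes) -> str:
    """Build an SRI value for some bytes (helper for trying the sketch out)."""
    return "sha256-" + base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def sri_to_cid(integrity: str) -> str:
    """Convert a single 'sha256-<base64>' SRI value into a base32 CIDv1 string."""
    algo, sep, b64 = integrity.strip().partition("-")
    if algo != "sha256" or not sep:
        raise ValueError("this sketch only handles a single sha256 SRI value")
    digest = base64.b64decode(b64, validate=True)
    if len(digest) != 32:
        raise ValueError("unexpected digest length for sha256")
    # multihash = <hash function code><digest length><digest>
    multihash = bytes([SHA2_256, len(digest)]) + digest
    # CIDv1 = <version><content codec><multihash>
    cid_bytes = bytes([CID_VERSION, RAW_CODEC]) + multihash
    # Multibase: prefix 'b' means lowercase base32 without padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded
```

Other SRI algorithms and multi-valued `integrity` attributes are left out of the sketch.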

2015+

IPFS as a backend to a web archiving (ipfs-inactive/archives#28)

2018-01-14

https://discuss.ipfs.io/t/web-browser-with-integrated-ipfs-node-support-for-browser-cache/1799/5

2018-03-08

[Suggestion] : IPFS browser extension as lite-node? https://github.com/ipfs/ipfs/issues/310

2018-07-09

https://discuss.ipfs.io/t/mirroring-standard-websites-to-ipfs-as-you-browse-them/3355

2018-07-23

http->ipfs translator proposal: Automatic mirroring of HTTP websites to IPFS as you browse them (ipfs-companion#535)
webpackage standard draft
- https://github.com/WICG/webpackage/blob/master/explainer.md#save-and-share-a-web-page
- https://wicg.github.io/webpackage/draft-yasskin-webpackage-use-cases.html#snapshot

ghost · 2019-02-21T10:54:28Z

It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.

mitra42 · 2019-02-21T20:39:28Z

Sure, I'd be happy to talk. dweb.archive.org doesn't do it for web pages (yet) but does mirror some of the content accessed through dweb-gateway to the IPFS HTTP API. (Not all of it, because of the combination of IPFS losing data and no error result/fallback when it can't find something.)

Note that we also use urlstore as our primary mirroring mechanism, because we have the opposite concern to you: we can't replicate 50 petabytes, so we just push the reference so that the most-used items get mirrored by IPFS. An upcoming version will also pull items via IPFS as an alternative to a direct fetch from the archive.

I also wrote dweb.mirror which is a crawler, specialized to crawl archive.org items (not wayback machine yet) and that mirrors everything to IPFS.

jimpick · 2019-05-03T20:28:31Z

I'll be going to csv,conf next week. It will be another chance to talk more with @ikreymer, who is giving a talk on WARC files: https://csvconf.com/speakers/#ilya-kreymer

RubenKelevra · 2020-01-21T20:51:15Z

Quoting ghost: "It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar."

How about asking archive.org whether we could help them by cooperating? I'm sure they have issues with crawling capacity.

Archive.org could provide data in IPFS when a given URL has already been captured. If that capture is some days old, we could ask the user whether they would like to capture the URL (since they might be logged in, or personal information might currently be entered in a form or similar). If they agree, we share the snapshot in IPFS (somehow; I have no idea how this would technically work to make it locatable by URL and timestamp). archive.org could then pin it or download it for display on their website.

ikreymer · 2020-06-13T00:23:07Z

Hi, I've just recently launched https://replayweb.page/ (https://github.com/webrecorder/replayweb.page) which is a full browser-based web archive replay system ('wayback machine'), using service workers. The system can load web archives from a variety of locations, and could be expanded to support IPFS.

In fact, it can trivially work using an IPFS gateway already:
https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/?source=https%3A%2F%2Fgateway.pinata.cloud%2Fipfs%2FQmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX%2Fexamples%2Fnetpreserve-twitter.warc#view=replay&url=https%3A%2F%2Ftwitter.com%2Fnetpreserve&ts=20190603053135

It should be possible to extend it to support ipfs:// URLs, or perhaps using the gateway could work as well (though Cloudflare specifically does not allow service workers).
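
For reference, the gateway link above can be assembled programmatically. The sketch below simply mirrors the shape of that link (a `source` query parameter, then `view`, `url` and `ts` in the fragment). It is inferred from the example URL only, not from ReplayWeb.page documentation, and `replay_url` is a hypothetical helper name.

```python
from urllib.parse import quote


def replay_url(app_base: str, archive_url: str, page_url: str, timestamp: str) -> str:
    """Build a replay link shaped like the gateway example above.

    app_base    -- where the replay app itself is served from
    archive_url -- location of the WARC file to load
    page_url    -- original URL of the archived page
    timestamp   -- capture timestamp, e.g. '20190603053135'
    """
    # Percent-encode everything, including ':' and '/', as in the example link.
    source = quote(archive_url, safe="")
    target = quote(page_url, safe="")
    return f"{app_base}?source={source}#view=replay&url={target}&ts={timestamp}"


GATEWAY_ROOT = "https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/"

example = replay_url(
    GATEWAY_ROOT,
    GATEWAY_ROOT + "examples/netpreserve-twitter.warc",
    "https://twitter.com/netpreserve",
    "20190603053135",
)
```

Swapping `GATEWAY_ROOT` for another gateway is the only change needed to try a different host, subject to the service worker restriction noted above.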

ReplayWeb.page is the latest tool from Webrecorder. Here's also a blog post announcing it:
https://webrecorder.net/2020/06/11/webrecorder-conifer-and-replayweb-page.html#introducing-replaywebpage

lidel · 2021-02-26T14:24:18Z

Relevant demo/status update of @ikreymer's work: https://www.youtube.com/watch?v=evcSETnTBf0

RubenKelevra · 2021-03-12T11:40:40Z

This proposal touches this topic:

https://discuss.ipfs.io/t/ipfs-records-for-urn-uri-resolving-via-a-dht/10456/4

lidel mentioned this issue Jul 24, 2018

Mirroring Web to IPFS ipfs/ipfs-companion#96

Closed

3 tasks

lidel added web extension help wanted needs clarification ux status/ready Ready to be worked labels Jul 24, 2018

chpio mentioned this issue Jul 24, 2018

Save entire Web page to IPFS ipfs/ipfs-companion#91

Open

lidel mentioned this issue Aug 6, 2018

Collecting P2P/DWeb Use Cases arewedistributedyet/arewedistributedyet#22

Open

lidel mentioned this issue Jan 31, 2019

Intercept URLs for JS libs at public CDNs ipfs/ipfs-companion#674

Open

autonome mentioned this issue Apr 1, 2019

How can we support the package managers effort from web browsers? #145

Open

lidel mentioned this issue May 3, 2019

Signed/Bundled HTTP Exchanges and WebPackage #121

Open

galargh added this to IPFS-GUI (PL EngRes) Sep 22, 2022

galargh moved this to To do in IPFS-GUI (PL EngRes) Sep 22, 2022

SgtPooki removed this from IPFS-GUI (PL EngRes) Jan 16, 2023
