Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gateway renders pretty 404 pages if available #4233

Merged
merged 1 commit into from
May 19, 2020

Conversation

jphastings
Copy link
Contributor

@jphastings jphastings commented Sep 16, 2017

I noticed ipfs/ipfs#167 and thought this might be an elegant solution for allowing the gateway to show helpful information when requested files aren't found.

When a requested file isn't found ipfs-404.html is looked for in the same directory, rolling up through any of its parents, and displayed (without immutable cache headers) if present.

Run this patch locally and query the following for a demonstration:

@whyrusleeping
Copy link
Member

Cool feature. cc @lgierth @diasdavid

@whyrusleeping
Copy link
Member

As for tests, you probably want to add something in core/corehttp/gateway_test.go, some of the tests in there construct directories, in particular TestIPNSHostnameRedirect creates a directory with an index.html to test that working properly. I imagine a test for this 404 feature would be quite similar.

@jphastings
Copy link
Contributor Author

Thanks for the advice @whyrusleeping, hopefully this covers the new cases. Shout if there's anything else I can do to help!

@whyrusleeping
Copy link
Member

Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.

@dignifiedquire @victorbjelkholm I think you guys might be interested in this PR

@Kubuxu
Copy link
Member

Kubuxu commented Sep 18, 2017

Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.

Choosing ETag for the cache is not simple in this case as the 404.html is taken for reverse recursive walk but I think using CID of the resolved 404.html should work but I am not 100% sure.

We would also have to resolve the 404.html when client caches it to confirm the ETag.

@jphastings
Copy link
Contributor Author

@whyrusleeping I'm thinking of the case where I request a non-existant file from the gateway today; I don't want my browser's cache to be prevented from trying again, say, tomorrow when that DNSlink entry/peerID points to a new hash where that file now exists.

Right now this PR ensures the gateway sets the ETag of a requested but missing file to be the Cid of the pretty 404 page that was found (as @Kubuxu suggests). Smart browsers can then include it in the If-None-Match header of their next request to the gateway for this file which will avoid sending unnecessary data. I reckon this is all we need to do here to allow smart clients to request transfer of the bare minimum to show the most recent data.

Does this address your concern?

@daviddias
Copy link
Member

Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.

What if the gateway fails to find the file, timeouts and returns a 404 with cache-control: unlimited? An existing hash->file can be cached forever and the corollary is that a non-existing hash->file will someday become an existing hash->file.

@victorb
Copy link
Member

victorb commented Sep 19, 2017

If you load ipfs.io/ipfs/$HASH/$FILE and $FILE isn't found, a redirect to a 404 page under $HASH would always be a 404.

As both Etag and Cache-Control uses the URL, it has to be a redirect and if it redirects to $HASH/404.html, that redirect would always happen in the future.

Introducing IPNS would make things work differently though, and probably in that case it's best to dump the Cache-Control header. However, if IPFS isn't being used with IPNS, the 404 should be permanent.

What if the gateway fails to find the file, timeouts and returns a 404 with cache-control: unlimited?

In the case of ipfs.io/ipfs/$HASH/$FILE, don't we already see all the links of $HASH and if we cannot see the match for $FILE, we can safely show the 404 page?

@ivan386
Copy link
Contributor

ivan386 commented Sep 19, 2017

As both Etag and Cache-Control uses the URL, it has to be a redirect and if it redirects to $HASH/404.html, that redirect would always happen in the future.

No redirect for 404 page please. I need to know original url where 404 happen. Then i can redirect to right location by Javascript.

I think about .htaccess for ipfs.

@jphastings
Copy link
Contributor Author

jphastings commented Sep 20, 2017

There's no redirecting happening at the moment. I believe @victorbjelkholm was talking about internally, ie. the data is being delivered when a missing file is requested.

In the case of ipfs.io/ipfs/$HASH/$FILE, don't we already see all the links of $HASH and if we cannot see the match for $FILE, we can safely show the 404 page?

I agree — in this case we know that $FILE definitely doesn't exist under the $HASH and we can safely deliver the data from an appropriate 404.html page from that URL provided we also set the cache headers to be non-permanent and the ETag to be of the data delivered (ie. the 404.html page) so that if $FILE becomes present (in the case where we're addressing via an IPNS link) the new data can be retrieved.

Further to this, in the case of ipfs.io/ipfs/$HASH/$FILE (ie. not IPNS), we could also deliver the data with an immutable cache header, as the contents of the $HASH will never change. If we like the idea of this I can change this conditional so that it allows setting the immutable header for /ipfs paths with 404s. In fact, we could set the immutable cache header for any /ipfs path, including for directories (which is not currently the case on HEAD). The only argument against this is that we are also declaring this business logic to be immutable (if we ever decided to move from 404.html to 404_error.html as the page too look for, what a customer sees would be dependent on which version of IPFS they first viewed that URL with).

Because of this I'm up for leaving the 404.html page without the immutable cache header, but leave the ETag in place, as it stands. It's extremely unlikely we'd ever need to change the 404.html to something else, but better to not force immutability on the business logic at this stage I feel!

@mitra42
Copy link

mitra42 commented Sep 20, 2017

I think doing this at the Gateway is the wrong place, because if I've understood this correctly, its indicating to the caller (which might be an application, not a browser) that its retrieved the page. Shouldnt this be part of the logic in whatever is calling the Gateway (browser) to decide what to do when the page isn't found.

For example ... I can see situations where if the file can't be delivered by the Gateway, the app decides to go to somewhere else (such as an Archive) to try a different way to retrieve the same content.

@Stebalien
Copy link
Member

@mitra42 The gateway will still return a 404 status so the application can decide what to do with it.

However, I believe this should be checking the client's Accept header. If the client doesn't want an HTML page, we shouldn't be returning an HTML 404 page.

@jphastings
Copy link
Contributor Author

jphastings commented Sep 20, 2017

That's an excellent idea @Stebalien; only if the request header matches text/html should this functionality be present - I'll add that to this PR now!

@mitra42 while I agree with your point in pure terms, my understanding of the Gateway's purpose is that it should bring some of the benefits of IPFS to browsers which don't have native IPFS support, which leads me to two counter arguments:

  • If a browser had native support for IPFS then this could be implemented in that plugin, but there isn't one of those (yet!) and the gateway is the bit of software that is the first step in that direction.
  • If browsers were mostly used for displaying non-dynamically generated content then this is a proposal which could be put into effect there (so that it would benefit non-IPFS sourced sites too) but, because HTTP responses are most frequently generated dynamically, this is more easily implemented "on the [centralised] server" in the current world of HTTP (a concept which doesn't exist in IPFS land!) In fact the closest analogue to "on the server" in the world of an IPFS-HTTP bridge is the gateway.

I hope that makes sense!

@jphastings jphastings force-pushed the feat/gateway/pretty-404 branch 3 times, most recently from d7846d7 to 81bfa5e Compare September 20, 2017 23:54
@mitra42
Copy link

mitra42 commented Sep 21, 2017

Agreed,

@jphastings
Copy link
Contributor Author

jphastings commented Sep 21, 2017

It might be worth noting that if a person wishing to use this feature wants their pretty 404 page's assets to render correctly without specifying absolute paths (eg. if they want the 404s to work with accessing the file from an IPFS or IPNS URL like gateway.server/ipfs/HASH/missing_resource as well as from a DNS link like example.com/missing_resource) then some logic would need to be applied to give the browser enough information to know where to get assets from (as neither absolute nor relative paths will work on their own). I've found this snippet works well, though it will need to be tailored to how assets are set up:

<head>
  <!-- things that don't refernce assets -->
  <script>
    const ipfsPath = /^\/ip[fn]s\/(.+?)\//.exec(window.location.pathname);
    const basePath = (ipfsPath == null) ? '/' : ipfsPath[0];
    document.write('<base href="' + basePath + '"/>');
  </script>
  <!-- things that reference assets, eg. CSS and JS -->
</head>

@jphastings
Copy link
Contributor Author

@whyrusleeping is there anything further you'd like me to alter on this PR?

@whyrusleeping
Copy link
Member

This LGTM on a technical level. Now we need to get a signoff from the others on whether or not this is something we really want.

cc @diasdavid @dignifiedquire

Copy link
Member

@Kubuxu Kubuxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from those 2 style quirks, SGTM.

core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
@jphastings jphastings requested a review from Stebalien May 8, 2020 23:32
@jphastings
Copy link
Contributor Author

jphastings commented May 8, 2020

This should cover the changes you suggested @Stebalien — feedback is welcome! ☺️

@@ -290,6 +295,10 @@ func (i *gatewayHandler) getOrHeadHandler(w http.ResponseWriter, r *http.Request
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, if a 404 page is present, we should return 404 here. But I'm fine leaving that edge-case unfixed.

core/corehttp/gateway_handler.go Outdated Show resolved Hide resolved
In the same way that an `index.html` file is rendered, if one is present, when the
requested path is a directory, now an `ipfs-404.html` file is rendered if
the requested file is not present within the specified IPFS object.

`ipfs-404.html` files are looked for in the directory of the requested path and each
parent until one is found, falling back on the well-known 404 error message.

License: MIT
Signed-off-by: JP Hastings-Spital <jphastings@gmail.com>
@Stebalien Stebalien merged commit 1744eb2 into ipfs:master May 19, 2020
@Stebalien
Copy link
Member

🥳

Thanks for your patience on this, this has been a long time coming.

@Lekensteyn
Copy link
Contributor

If I am not mistaken, this change implies a lookup for ipfs-404.html on every attempt to obtain a directory listing, and subsequently breaks directory listings if the file is found. Is that intentional?

In core/corehttp/gateway_test.go, I think that the following tests are wrong:

{"/deeper/", "text/html", http.StatusNotFound, "Deep custom 404"},
{"/deeper", "text/html", http.StatusNotFound, "Deep custom 404"},

These and / should probably end up with a directory listing.

@jphastings
Copy link
Contributor Author

That was intentional by me — the approach that web servers like NGinx take is index.html goes first, a 404 page goes second, then a directory listing if there's no 404 defined. I worked to echo that here.

@Lekensteyn
Copy link
Contributor

That sounds rather unusual. By default directory indexes are disabled unless autoindex is changed. The error_page directive only affects the page shown on errors, it does not affect the routing logic (whether to display an index or not).

Did you have a custom config such as try_files $uri $uri/index.html =404; in mind?

With the current approach, there is no way to obtain a directory index if ipfs-404.html exists.

@jphastings
Copy link
Contributor Author

That's a good point; I guess it comes down to whether IPFS wants directory listings to always be possible to retrieve on the HTTP gateway.

Given this is my first PR to IPFS I don't feel like I'm qualified to make such a judgement call (beyond what I'd personally like) — so maybe we can lean on @Stebalien?

@Lekensteyn
Copy link
Contributor

If directory listings are not desirable, one could add an explicit index.html file to hide the contents. There is probably no point in using this custom 404 page feature for hiding files since people can just query the directory directly, that is where IPFS differs from nginx.

I am also new here and do not know if special files such as index.html and ipfs-404.html are documented anywhere, but that would be nice to have :-)

@Stebalien
Copy link
Member

Hm. Yeah, I didn't fully think through the ramifications of this. I actually have at least one case where I want to support custom 404 pages and directory listings (dist.ipfs.io).

Stebalien added a commit that referenced this pull request May 21, 2020
fixes #4233 (comment)

Basically, there's a trade-off here:

1. We can support directory listings while supporting 404 pages (this PR).
2. If a 404 page is present, directory listings don't work.

Given that option 1 is more flexible and users shouldn't be _too_ confused if
they land on a directory with no index.html page, I've gone with that option.
@DavidBurela
Copy link

DavidBurela commented Jun 2, 2020

I love that there is thinking around supporting 404 pages.
I've published my blog to IPFS/ENS and lacking 404 was something I wanted
e.g. http://blog.davidburela.eth/broken

But because it is a blog with pages in this folder format

/
  /2019
    /01
    /02
    /03
    ...
  /2020
    /01
    /02

it means I'd need to put a ipfs-404.html in every single subfolder. Maybe specifying it once at root of a merkle tree is enough?
<cid>/ipfs-404.html

@Stebalien
Copy link
Member

Specifying it once at the root should be enough. The gateway will search all parent directories up the path for an ipfs-404.html file.

ralendor pushed a commit to ralendor/go-ipfs that referenced this pull request Jun 6, 2020
fixes ipfs#4233 (comment)

Basically, there's a trade-off here:

1. We can support directory listings while supporting 404 pages (this PR).
2. If a 404 page is present, directory listings don't work.

Given that option 1 is more flexible and users shouldn't be _too_ confused if
they land on a directory with no index.html page, I've gone with that option.
ralendor pushed a commit to ralendor/go-ipfs that referenced this pull request Jun 6, 2020
fixes ipfs#4233 (comment)

Basically, there's a trade-off here:

1. We can support directory listings while supporting 404 pages (this PR).
2. If a 404 page is present, directory listings don't work.

Given that option 1 is more flexible and users shouldn't be _too_ confused if
they land on a directory with no index.html page, I've gone with that option.
ralendor pushed a commit to ralendor/go-ipfs that referenced this pull request Jun 8, 2020
fixes ipfs#4233 (comment)

Basically, there's a trade-off here:

1. We can support directory listings while supporting 404 pages (this PR).
2. If a 404 page is present, directory listings don't work.

Given that option 1 is more flexible and users shouldn't be _too_ confused if
they land on a directory with no index.html page, I've gone with that option.
ralendor pushed a commit to ralendor/go-ipfs that referenced this pull request Jun 8, 2020
fixes ipfs#4233 (comment)

Basically, there's a trade-off here:

1. We can support directory listings while supporting 404 pages (this PR).
2. If a 404 page is present, directory listings don't work.

Given that option 1 is more flexible and users shouldn't be _too_ confused if
they land on a directory with no index.html page, I've gone with that option.
ralendor pushed a commit to ralendor/go-ipfs that referenced this pull request Jun 8, 2020
fixes ipfs#4233 (comment)

Basically, there's a trade-off here:

1. We can support directory listings while supporting 404 pages (this PR).
2. If a 404 page is present, directory listings don't work.

Given that option 1 is more flexible and users shouldn't be _too_ confused if
they land on a directory with no index.html page, I've gone with that option.
ralendor pushed a commit to ralendor/go-ipfs that referenced this pull request Jun 8, 2020
fixes ipfs#4233 (comment)

Basically, there's a trade-off here:

1. We can support directory listings while supporting 404 pages (this PR).
2. If a 404 page is present, directory listings don't work.

Given that option 1 is more flexible and users shouldn't be _too_ confused if
they land on a directory with no index.html page, I've gone with that option.
@jphastings jphastings deleted the feat/gateway/pretty-404 branch May 2, 2022 16:14
hacdias pushed a commit to ipfs/boxo that referenced this pull request Jan 27, 2023
…-404

Gateway renders pretty 404 pages if available

This commit was moved from ipfs/kubo@1744eb2
hacdias pushed a commit to ipfs/boxo that referenced this pull request Jan 27, 2023
fixes ipfs/kubo#4233 (comment)

Basically, there's a trade-off here:

1. We can support directory listings while supporting 404 pages (this PR).
2. If a 404 page is present, directory listings don't work.

Given that option 1 is more flexible and users shouldn't be _too_ confused if
they land on a directory with no index.html page, I've gone with that option.


This commit was moved from ipfs/kubo@6a2fe0a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic/gateway Topic gateway
Projects
None yet
Development

Successfully merging this pull request may close these issues.