fix: add robots.txt to exclude gateway paths #330
Conversation
Context: #328
License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
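For reference, a minimal sketch of what a gateway-wide robots.txt along these lines could look like. The exact rules live in the PR diff; the paths below are assumptions based on the discussion, not the merged file.

```
# Hypothetical sketch: ask all crawlers to skip the content-addressed
# gateway paths so search engines stop traversing /ipfs/ and /ipns/.
User-agent: *
Disallow: /ipfs/
Disallow: /ipns/
```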
👍
Good call, I only see positives in excluding these paths from the machines 🤖
cc @olizilla for visibility, as it impacts the gateway
Related: you may find it interesting to review what traffic this will exclude from search indexes. Here is an export of the top inbound links/queries over the past 16 months: a total of 31.3m clicks and 2.07B impressions 😮
@cwaring Was there a significant drop in traffic coming from Google search results in the past few weeks or months? I think they tweaked their crawler, and the Wikipedia mirror on ipfs.io no longer shows up in search results for me. Still, merging robots.txt remains a good idea, as it would remove bogus load from the gateway when a crawler traverses the entire Wikipedia mirror, etc.
@lidel this is traffic over the last 3 months, so nothing substantial. Since activating GSC I'm also seeing a few takedown notices for copyrighted content under
I'm not clear on why we want to prevent content on IPFS from getting indexed?
@olizilla afaik everything starting with
I think as a general position we do want IPFS content to be indexed on search engines. The bots won't be guessing CIDs, so it's content that has been linked to from somewhere and, yes, in the case of Wikipedia, it will eventually follow all the links. But in general, if some site links to some content on IPFS, it should be indexed like any other.
What if we only exclude the main offender? I believe the main problem is the CID root of the Wikipedia mirror that was published without the proper meta tag: https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/*
Context on the meta tag: ipfs/distributed-wikipedia-mirror#48
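For context, the meta tag discussed in that issue is presumably the standard robots noindex directive. A hedged sketch of how a page in the mirror could opt out of indexing (hypothetical snippet, not the actual mirror markup):

```html
<!-- Hypothetical example: asks crawlers not to index this page or follow
     its links, independent of what the gateway's robots.txt allows. -->
<meta name="robots" content="noindex, nofollow">
```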
What is the problem with search engines indexing everything on all gateways? I thought we would want content to be indexed and findable on gateways as an intermediary step towards native protocol loading.
This has highlighted a few issues for me, some thoughts:
Keen to hear your ideas!
I agree with the raised concerns and think we need more data to make this decision.

Ad 1. Gateways are the best we can do in browsers right now. UX around

Ad 2. Are we tracking If it is not meaningful, we probably should open a PR to only exclude the immutable Wikipedia snapshot

Ad 3. We've been talking about a move to At this point I worry it may be too late to improve SEO of
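If the narrower route of excluding only the Wikipedia snapshot were chosen, a robots.txt along these lines might do it. This is a sketch under that assumption; the CID path is the one quoted above, and the rest is not a final rule set.

```
# Hypothetical narrower variant: only the immutable Wikipedia snapshot
# root is excluded; other gateway content stays crawlable.
User-agent: *
Disallow: /ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/
```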
Based on some discussion in the gateway team, this seems to be a way forward that has been brought up, as keeping
I would suggest doing this. Also, every time you search "something ipfs" you currently get Wikipedia results for that something, hosted on IPFS, rather than what you are looking for.
I don't think crawlers are behind much of the pain, but we could check: @lanzafame am I right that crawlers weren't anywhere close to being the top offenders?
Continued in #334
This PR adds `/robots.txt` and closes #328. cc @andrew