fix: add robots.txt to exclude gateway paths #330
Conversation
Context: #328
License: MIT
Signed-off-by: Marcin Rataj <lidel@lidel.org>
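For reference, a minimal sketch of what a gateway-wide robots.txt along these lines could look like. The exact rules live in the PR diff; the paths below are assumptions based on the discussion, not the merged file.

```
# Hypothetical sketch: ask all crawlers to skip the content-addressed
# gateway paths so search engines stop traversing /ipfs/ and /ipns/.
User-agent: *
Disallow: /ipfs/
Disallow: /ipns/
```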
👍
Good call, I only see positives in excluding these paths from the machines 🤖
cc @olizilla for visibility, as it impacts the gateway
Related: you may find it interesting to review what traffic this will exclude from search indexes. Here is an export of the top inbound links/queries over the past 16 months: a total of 31.3m clicks and 2.07B impressions 😮
@cwaring Was there a significant drop in traffic coming from Google search results in the past few weeks or months? I think they tweaked their crawler, and the Wikipedia mirror on ipfs.io no longer shows up in search results for me. Still, merging robots.txt remains a good idea, as it would remove bogus load from the gateway when a crawler traverses the entire Wikipedia mirror, etc.
@lidel this is traffic over the last 3 months, so nothing substantial. Since activating GSC I'm also seeing a few takedown notices for copyrighted content under
I'm not clear on why we want to prevent content on IPFS from getting indexed?
@olizilla afaik everything starting with
I think as a general position we do want IPFS content to be indexed on search engines. The bots won't be guessing CIDs, so it's content that has been linked to from somewhere and, yes, in the case of Wikipedia, it will eventually follow all the links. But in general, if some site links to some content on IPFS, it should be indexed like any other.
What if we only exclude the main offender? I believe the main problem is the CID root of the Wikipedia mirror that was published without the proper meta tag: https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/*
Context on the meta tag: ipfs/distributed-wikipedia-mirror#48
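For context, the meta tag discussed in that issue is presumably the standard robots noindex directive. A hedged sketch of how a page in the mirror could opt out of indexing (hypothetical snippet, not the actual mirror markup):

```html
<!-- Hypothetical example: asks crawlers not to index this page or follow
     its links, independent of what the gateway's robots.txt allows. -->
<meta name="robots" content="noindex, nofollow">
```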
What is the problem with search engines indexing everything on all gateways? I thought we would want content to be indexed and findable on gateways as an intermediary step towards native protocol loading.
This has highlighted a few issues for me, some thoughts:
Keen to hear your ideas!
I agree with the raised concerns and think we need more data to make this decision.

Ad 1. Gateways are the best we can do in browsers right now. UX around

Ad 2. Are we tracking If it is not meaningful, we probably should open a PR to only exclude the immutable Wikipedia snapshot

Ad 3. We've been talking about a move to At this point I worry it may be too late to improve SEO of
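If the narrower route of excluding only the Wikipedia snapshot were chosen, a robots.txt along these lines might do it. This is a sketch under that assumption; the CID path is the one quoted above, and the rest is not a final rule set.

```
# Hypothetical narrower variant: only the immutable Wikipedia snapshot
# root is excluded; other gateway content stays crawlable.
User-agent: *
Disallow: /ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/
```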
Based on some discussion in the gateway team, this seems to be a way forward that has been brought up, as keeping
I would suggest doing this. Also, every time you search "something ipfs" you currently get Wikipedia results for that something, hosted on IPFS, rather than what you are looking for.
I don't think crawlers are behind much of the pain, but we could check: @lanzafame am I right that crawlers weren't anywhere close to being the top offenders?
Continued in #334
This PR adds `/robots.txt` and closes #328. cc @andrew