Evaluate new URLs for pages that redirect #163

Closed
Mr0grog opened this issue Jun 29, 2021 · 1 comment

Mr0grog commented Jun 29, 2021

Late last year, I added the ability for pages to have multiple URLs associated with them (see edgi-govdata-archiving/web-monitoring-db#492, edgi-govdata-archiving/web-monitoring-db#793).

However, we don’t automatically create new PageUrl records or change a page’s canonical URL when we import versions that have redirects. There are some good reasons for this:

  • It’s really tough to determine programmatically what the “canonical” URL for something should be.

    • What we want is usually whatever the “vanity” URL for a page is, so if its meaning changes, we start seeing the new content. That’s not always consistent with what a site marks up as the canonical URL, if it includes that in its markup at all.
    • Because we want the “vanity” URL, that also means we can’t tell the difference between a vanity URL and a page that’s moved when redirects are involved. Technically, 302, 303, and 307 status codes might be a useful indicator here, but many sites don’t carefully distinguish between those redirects and 301/308 (permanent) redirects, so this isn’t a very reliable signal in practice.
  • It’s similarly difficult to determine if a redirect is taking you to the same (or at least equivalent) page at a new location, or to something totally different.

    • Sometimes servers respond with a redirect to a 404 URL instead of responding directly with a 404 status code. EPA’s “signpost” page that replaced its climate change site for years was a great example of this. In these cases, the 404 URL should not be a URL that belongs to the page, since it’s shared with a zillion other pages.
    • Sometimes, when a page has been removed, servers redirect to an entirely different site or to a parent page instead of responding with a 404 error. For example, if you visit https://www.census.gov/econ/services.html right now, you get redirected to https://www.census.gov/topics/business-economy.html, which is not an equivalent page. The same goes for most other https://www.census.gov/econ/* URLs.

However, if we look at pages in aggregate, we can probably do a little better.

  • If multiple pages redirect to the same URL, we know that URL is probably only equivalent to at most one of them. But if only one page is redirecting to a given URL, it might be reasonable to automatically add the new URL.
  • We might be able to identify patterns in sites that use temporary vs. permanent redirects well, or develop some better heuristics we can use here.
  • Ditto for identifying patterns in how <link rel="canonical"> is used, and where it might make sense to follow it.
  • We might find it’s useful to output a weekly or monthly report on redirects to analysts to evaluate and help make decisions about how we should adjust URLs.

The first step here is to develop some code to create a report about redirects, and see what might be useful to do from there.
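As a rough starting point, the aggregate check described above (is a redirect target shared by several pages, or unique to one?) could be sketched like this. The `Redirect` tuple, its field names, and `build_redirect_report` are hypothetical and made up for illustration; they are not part of web-monitoring-db, and real input would have to come from however we end up querying versions with redirects.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Redirect(NamedTuple):
    """Hypothetical record of one imported version (not a real DB model)."""
    page_url: str   # URL we currently have on record for the page
    final_url: str  # URL the capture ended up at after following redirects
    status: int     # status code of the first response (e.g. 301, 302)


def build_redirect_report(redirects: Iterable[Redirect]) -> list[dict]:
    """Group redirects by target URL and flag targets used by only one page."""
    sources_by_target = defaultdict(set)
    statuses_by_target = defaultdict(set)
    for r in redirects:
        if r.final_url == r.page_url:
            continue  # not actually a redirect; nothing to report
        sources_by_target[r.final_url].add(r.page_url)
        statuses_by_target[r.final_url].add(r.status)

    report = []
    for target in sorted(sources_by_target):
        sources = sorted(sources_by_target[target])
        report.append({
            "target": target,
            "sources": sources,
            "statuses": sorted(statuses_by_target[target]),
            # A target shared by several pages is probably a 404 page or a
            # parent page, so only a unique target is a candidate new URL.
            "candidate": len(sources) == 1,
        })
    return report


if __name__ == "__main__":
    # Made-up example data, not real captures.
    sample = [
        Redirect("https://example.gov/a", "https://example.gov/topics", 301),
        Redirect("https://example.gov/b", "https://example.gov/topics", 302),
        Redirect("https://example.gov/c", "https://example.gov/c-new", 301),
    ]
    for row in build_redirect_report(sample):
        print(row["target"], row["sources"], row["candidate"])
```

The `candidate` flag is only a hint for analysts reviewing the report. It shouldn’t be treated as enough to add a PageUrl automatically, since a unique target can still be a non-equivalent page.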

stale bot commented Jan 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

stale bot added the stale label Jan 9, 2022
stale bot closed this as completed Apr 16, 2022