Evaluate new URLs for pages that redirect #163

Mr0grog · 2021-06-29T06:19:12Z

Late last year, I added the ability for pages to have multiple URLs associated with them (see edgi-govdata-archiving/web-monitoring-db#492, edgi-govdata-archiving/web-monitoring-db#793).

However, we don’t automatically create new PageUrl records or change a page’s canonical URL when we import versions that have redirects. There are some good reasons for this:

It’s really tough to determine programmatically what the “canonical” URL for something should be.
- What we want is usually whatever the “vanity” URL for a page is, so if its meaning changes, we start seeing the new content. That’s not always consistent with what a site marks up as the canonical URL, if it includes that in its markup at all.
- Because we want the “vanity” URL, that also means we can’t tell the difference between a vanity URL and a page that’s moved when redirects are involved. Technically, 302, 303, and 307 status codes might be a useful indicator here, but many sites don’t carefully distinguish between those redirects and 301/308 (permanent) redirects, so this isn’t a very reliable signal in practice.
It’s similarly difficult to determine if a redirect is taking you to the same (or at least equivalent) page at a new location, or to something totally different.
- Sometimes servers respond with a redirect to a 404 URL instead of responding directly with a 404 status code. EPA’s “signpost” page that replaced its climate change site for years was a great example of this. In these cases, the 404 URL should not be a URL that belongs to the page, since it’s shared with a zillion other pages.
- Sometimes servers redirect to an entirely different site, or to a parent page instead of responding with a 404 error when a page has been removed. For example, if you visit https://www.census.gov/econ/services.html right now, you get redirected to https://www.census.gov/topics/business-economy.html, which is not an equivalent page. The same goes for most other https://www.census.gov/econ/* URLs.
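To make the signals discussed above concrete, here is a rough Python sketch of what could be recorded for each redirect. The function name `redirect_signals` and the field names are hypothetical, not part of any existing web-monitoring code, and none of these signals is reliable on its own; they are only inputs an analyst or a later heuristic might weigh.

```python
from urllib.parse import urlsplit

# Status codes as defined by HTTP; many sites do not use them carefully.
PERMANENT_STATUSES = {301, 308}
TEMPORARY_STATUSES = {302, 303, 307}


def redirect_signals(source_url, target_url, status):
    """Collect weak hints about a single redirect (hypothetical helper)."""
    source = urlsplit(source_url)
    target = urlsplit(target_url)

    if status in PERMANENT_STATUSES:
        kind = "permanent"
    elif status in TEMPORARY_STATUSES:
        kind = "temporary"
    else:
        kind = "unknown"

    # Treat the target path as a directory so "/a" and "/a/" compare the same.
    target_dir = target.path if target.path.endswith("/") else target.path + "/"
    same_host = source.hostname == target.hostname

    return {
        "kind": kind,
        "same_host": same_host,
        # True when the target looks like a parent directory of the source,
        # which often means "page removed" rather than "page moved".
        "target_is_ancestor": (
            same_host
            and source.path != target.path
            and source.path.startswith(target_dir)
        ),
    }
```

Note that the census.gov example above would not be flagged as an ancestor redirect by this check, since the target lives under a different path. That is part of why per-redirect signals alone are not enough.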

However, if we look at pages in aggregate, we can probably do a little better.

- If multiple pages redirect to the same URL, we know that URL is probably only equivalent to at most one of them. But if only one page is redirecting to a given URL, it might be reasonable to automatically add the new URL.
- We might be able to identify patterns in sites that are using temporary vs. permanent redirects well or develop some better heuristics we can use here.
- Ditto for identifying patterns in how <link rel="canonical"> is used, and where it might make sense to follow it.
- We might find it’s useful to output a weekly or monthly report on redirects to analysts to evaluate and help make decisions about how we should adjust URLs.
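As a starting point for the aggregate idea, the grouping could look something like the sketch below. It assumes we can get `(page_url, final_url)` pairs out of imported versions; the function names and the shape of the report are made up for illustration and don’t reflect any existing code.

```python
from collections import defaultdict


def group_by_redirect_target(redirects):
    """Map each redirect target to the set of page URLs that redirect to it.

    `redirects` is an iterable of (page_url, final_url) pairs
    (an assumed input format).
    """
    sources = defaultdict(set)
    for page_url, final_url in redirects:
        if page_url != final_url:  # skip versions that did not redirect
            sources[final_url].add(page_url)
    return sources


def redirect_report(redirects):
    """Split targets into likely new URLs and shared (suspicious) targets."""
    report = {"candidates": {}, "shared": {}}
    for target, pages in group_by_redirect_target(redirects).items():
        if len(pages) == 1:
            # Only one page redirects here, so it may be the page's new URL.
            report["candidates"][target] = next(iter(pages))
        else:
            # Many pages land here: probably a 404 page or a parent page.
            report["shared"][target] = sorted(pages)
    return report
```

Targets under `shared` are the ones that probably should not be added to any page automatically, while `candidates` would still need review by an analyst before anything changes.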

The first step here is to develop some code to create a report about redirects, and see what might be useful to do from there.

stale · 2022-01-09T01:12:12Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

stale bot added the stale label Jan 9, 2022

stale bot closed this as completed Apr 16, 2022
