However, we don’t automatically create new PageUrl records or change a page’s canonical URL when we import versions that have redirects. There are some good reasons for this:
It’s really tough to determine programmatically what the “canonical” URL for something should be.
What we want is usually whatever the “vanity” URL for a page is, so if its meaning changes, we start seeing the new content. That’s not always consistent with what a site marks up as the canonical URL, if it includes that in its markup at all.
Because we want the “vanity” URL, that also means we can’t tell the difference between a vanity URL and a page that’s moved when redirects are involved. Technically, 302, 303, and 307 status codes might be a useful indicator here, but many sites don’t carefully distinguish between those redirects and 301/308 (permanent) redirects, so this isn’t a very reliable signal in practice.
It’s similarly difficult to determine if a redirect is taking you to the same (or at least equivalent) page at a new location, or to something totally different.
Sometimes servers respond with a redirect to a 404 URL instead of responding directly with a 404 status code. EPA’s “signpost” page that replaced its climate change site for years was a great example of this. In these cases, the 404 URL should not be treated as a URL that belongs to the page, since it’s shared with a zillion other pages.
Sometimes servers redirect to an entirely different site, or to a parent page instead of responding with a 404 error when a page has been removed. For example, if you visit https://www.census.gov/econ/services.html right now, you get redirected to https://www.census.gov/topics/business-economy.html, which is not an equivalent page. The same goes for most other https://www.census.gov/econ/* URLs.
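To make the status-code point above concrete, here is a minimal Python sketch of how a redirect chain could be labeled as permanent or temporary. The function names and labels are hypothetical and not part of web-monitoring-db. As noted above, many sites don’t use these codes carefully, so the result is at best a weak hint.

```python
# Hypothetical helpers, not part of web-monitoring-db.
PERMANENT_REDIRECTS = {301, 308}
TEMPORARY_REDIRECTS = {302, 303, 307}


def classify_redirect(status):
    """Label a single HTTP status code by the kind of redirect it signals."""
    if status in PERMANENT_REDIRECTS:
        return "permanent"
    if status in TEMPORARY_REDIRECTS:
        return "temporary"
    return "other"


def summarize_chain(statuses):
    """Label a whole chain of responses.

    The chain counts as "permanent" only if every redirect hop in it is
    permanent; a single temporary hop makes the whole chain "temporary".
    """
    kinds = [classify_redirect(status) for status in statuses]
    hops = [kind for kind in kinds if kind != "other"]
    if not hops:
        return "none"
    if all(kind == "permanent" for kind in hops):
        return "permanent"
    return "temporary"
```

A chain like `[301, 302, 200]` would be labeled temporary here, which errs on the side of not treating the final URL as the page’s new home.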
However, if we look at pages in aggregate, we can probably do a little better.
If multiple pages redirect to the same URL, we know that URL is probably only equivalent to at most one of them. But if only one page is redirecting to a given URL, it might be reasonable to automatically add the new URL.
We might be able to identify patterns in sites that are using temporary vs. permanent redirects well, or develop some better heuristics we can use here.
Ditto for identifying patterns in how <link rel="canonical"> is used, and where it might make sense to follow it.
We might find it’s useful to output a weekly or monthly report on redirects to analysts to evaluate and help make decisions about how we should adjust URLs.
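For the `<link rel="canonical">` idea in the list above, a rough sketch of pulling the declared canonical URL out of a page’s HTML might look like the following. It uses only Python’s standard `html.parser`; the class and function names are hypothetical, and deciding whether to trust the value is left open, as it is above.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def find_canonical_url(html):
    """Return the declared canonical URL, or None if the page has none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

Comparing this value against the URL a version was actually captured from, across many pages on a site, is one way to see whether the site’s canonical markup is consistent enough to follow.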
The first step here is to develop some code to create a report about redirects, and see what might be useful to do from there.
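As a starting point, the report could be as simple as grouping observed redirects by where they end up. The sketch below assumes the input is a list of `(page_url, final_url)` pairs, which is a made-up shape for illustration and not the actual web-monitoring-db schema. It applies the aggregate heuristic described above: a target reached from exactly one page is a candidate for adding as a new URL, while a target shared by several pages is probably not equivalent to any of them.

```python
from collections import defaultdict


def build_redirect_report(observations):
    """Group redirects by target URL and suggest how to treat each target.

    `observations` is an iterable of (page_url, final_url) pairs. This input
    shape is an assumption made for the example.
    """
    sources_by_target = defaultdict(set)
    for page_url, final_url in observations:
        if page_url != final_url:  # skip pages that did not redirect
            sources_by_target[final_url].add(page_url)

    report = []
    for target, sources in sorted(sources_by_target.items()):
        suggestion = "candidate" if len(sources) == 1 else "shared-target"
        report.append({
            "target": target,
            "sources": sorted(sources),
            "suggestion": suggestion,
        })
    return report


if __name__ == "__main__":
    # Placeholder URLs, not real captures.
    example = [
        ("https://example.gov/econ/a.html", "https://example.gov/topics/"),
        ("https://example.gov/econ/b.html", "https://example.gov/topics/"),
        ("https://example.gov/old-name", "https://example.gov/new-name"),
    ]
    for row in build_redirect_report(example):
        print(row["suggestion"], row["target"], row["sources"])
```

The suggestions are only meant as input for analysts to review, not as something to apply automatically.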
Late last year, I added the ability for pages to have multiple URLs associated with them (see edgi-govdata-archiving/web-monitoring-db#492, edgi-govdata-archiving/web-monitoring-db#793).