This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.
In edgi-govdata-archiving/web-monitoring-db#492, we’re planning to update the -db project so our page records can finally handle having multiple URLs. On a first pass, that just means making it possible to model. However, we also have a lot of pages that actively tell us they are the same page using the `Link` header or the `<link>` element in the HTML `head`.
For example, we have 63 page records all representing the BLM’s Colorado oil and gas leases page. Here are two of them:
(In this case, the querystring just affects what sections of the page are collapsed or expanded with JavaScript; the HTML response is the same.)
These pages use the `Link` header to identify that they both belong to the same canonical URL:
They also use the HTML `<link>` element:
Some pages will probably only use one or the other, but it would be really good to key off of these in order to determine what page record a new version should belong to.
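To make the idea concrete, here is a rough sketch of how an importer might pull a canonical URL out of either signal. It is not code from -processing or -db. The function names (`canonical_from_link_header`, `canonical_from_html`, `canonical_url`) are made up for illustration, and the `Link` header parsing is deliberately simplified (for example, it does not handle a `<` inside a quoted parameter).

```python
import re
from html.parser import HTMLParser
from typing import Mapping, Optional
from urllib.parse import urljoin


def canonical_from_link_header(header_value: str, base_url: str) -> Optional[str]:
    """Return the rel="canonical" target from a Link header value, if any.

    Simplified parsing: each link is treated as "<target>" followed by its
    parameters, up to the next "<".
    """
    for match in re.finditer(r'<([^>]*)>([^<]*)', header_value):
        target, params = match.group(1), match.group(2)
        rel = re.search(r'\brel\s*=\s*"?([^";,]+)"?', params, re.IGNORECASE)
        # A rel value can list several relation types separated by spaces.
        if rel and "canonical" in rel.group(1).lower().split():
            return urljoin(base_url, target)
    return None


class _CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "canonical" in rels and attributes.get("href"):
            self.href = attributes["href"]


def canonical_from_html(html: str, base_url: str) -> Optional[str]:
    """Return the <link rel="canonical"> href, resolved against base_url."""
    parser = _CanonicalLinkParser()
    parser.feed(html)
    return urljoin(base_url, parser.href) if parser.href else None


def canonical_url(response_url: str, headers: Mapping[str, str], html: str) -> Optional[str]:
    """Prefer the Link header, then fall back to the HTML <link> element."""
    link_header = headers.get("Link") or headers.get("link") or ""
    return (canonical_from_link_header(link_header, response_url)
            or canonical_from_html(html, response_url))
```

Checking the header first and falling back to the HTML is an arbitrary choice in this sketch. When both are present and disagree, which one should win is still an open question.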
I’m adding this to the overview project because I’m not really sure whether the right home for this logic is in the Wayback import script in `-processing` or in the `-db` project as part of the import job. Also not sure whether it’s important to follow the link to the canonical URL and make sure it has the same response body.
Any thoughts, @danielballan?