Handle/utilize `canonical` link types for pages #139

Mr0grog · 2019-09-10T01:23:52Z

In edgi-govdata-archiving/web-monitoring-db#492, we’re planning to update the -db project so our page records can finally handle having multiple URLs. As a first pass, that just means making multiple URLs possible to model. However, we also have a lot of pages that actively tell us they are the same page, via `rel="canonical"` in the `Link` header or in a `<link>` element in the HTML head.

For example, we have 63 page records all representing the BLM’s Colorado oil and gas leases page. Here are two of them:

(In this case, the querystring just affects what sections of the page are collapsed or expanded with JavaScript; the HTML response is the same.)

These pages use the Link header to identify that they both belong to the same canonical URL:

```
GET https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/leasing/regional-lease-sales/colorado?qt-colorado_2014_oil_and_gas_lease_=1

HTTP/1.1 200 OK
Date: Tue, 10 Sep 2019 01:11:39 GMT
Content-Type: text/html; charset=utf-8
Link: <https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/leasing/regional-lease-sales/colorado>; rel="canonical",<https://www.blm.gov/node/6303>; rel="shortlink"
...more headers...
```
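To make the header case concrete, here is a rough sketch of pulling the `rel` targets out of a `Link` header value like the one above. The function name `parse_link_header` is made up for illustration and is not something that exists in -processing or -db. The parsing is deliberately simplified: it assumes parameter values contain no commas or semicolons.

```python
import re


def parse_link_header(value):
    """Return a dict mapping each rel type to the first URL that declares it.

    Hypothetical helper. Simplified parsing: assumes parameter values
    contain no commas or semicolons.
    """
    links = {}
    # Each link-value looks like: <URL>; name="value"; name="value"
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^,;]+)*)', value):
        url, params = match.groups()
        rel = re.search(r';\s*rel\s*=\s*"?([^";,]+)"?', params, re.IGNORECASE)
        if rel:
            # A single rel parameter may list several space-separated types.
            for rel_type in rel.group(1).lower().split():
                links.setdefault(rel_type, url)
    return links


header = (
    '<https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/leasing/'
    'regional-lease-sales/colorado>; rel="canonical",'
    '<https://www.blm.gov/node/6303>; rel="shortlink"'
)
links = parse_link_header(header)
print(links["canonical"])
```

A production version would probably want a real header parser rather than regular expressions, but this is enough to show what we would key off of.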

They also use the HTML <link> element:

```html
<link rel="canonical" href="https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/leasing/regional-lease-sales/colorado" />
<link rel="shortlink" href="https://www.blm.gov/node/6303" />
```
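The HTML case could be handled along the same lines. This is a minimal sketch using only the standard library's `html.parser`. The names `LinkRelParser` and `find_link_rels` are hypothetical, and for brevity it does not check that the `<link>` elements actually sit inside `<head>`.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collect the href of every <link> element, keyed by rel type."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags like <link ... />.
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        rel = attributes.get("rel") or ""
        if href:
            # rel may hold several space-separated types; keep the first href seen.
            for rel_type in rel.lower().split():
                self.links.setdefault(rel_type, href)


def find_link_rels(html):
    """Hypothetical helper: return {rel type: href} for an HTML document."""
    parser = LinkRelParser()
    parser.feed(html)
    parser.close()
    return parser.links


sample = """<html><head>
<link rel="canonical" href="https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/leasing/regional-lease-sales/colorado" />
<link rel="shortlink" href="https://www.blm.gov/node/6303" />
</head><body></body></html>"""
html_links = find_link_rels(sample)
print(html_links["canonical"])
```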

Some pages will probably only use one or the other, but it would be really good to key off of these in order to determine what page record a new version should belong to.
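One way to picture "keying off" these links: given whatever rel targets were found in the header and in the HTML, pick a single URL to look up the page record by, falling back to the URL that was actually requested. This is only a sketch. `resolve_canonical_url` is a made-up name, the two dicts are assumed to map rel types to URLs, and preferring the header over the HTML when both are present is an arbitrary choice here, not a decision anyone has made.

```python
from urllib.parse import urljoin


def resolve_canonical_url(response_url, header_links=None, html_links=None):
    """Pick the URL a new version should be filed under (hypothetical helper).

    header_links and html_links are assumed to be {rel type: URL} dicts.
    The header is checked first; that ordering is an assumption.
    """
    for links in (header_links or {}, html_links or {}):
        target = links.get("canonical")
        if target:
            # Canonical targets may be relative, so resolve against the response URL.
            return urljoin(response_url, target)
    # No canonical link declared: the page is its own key.
    return response_url


key = resolve_canonical_url(
    "https://example.gov/page?tab=1",
    header_links={"canonical": "https://example.gov/page"},
)
print(key)
```

With something like this, every querystring variant that declares the same canonical URL would resolve to the same key.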

I’m adding this to the overview project because I’m not really sure whether the right home for this logic is in the Wayback import script in -processing or in the -db project as part of the import job. Also not sure whether it’s important to follow the link to the canonical URL and make sure it has the same response body.
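If verifying the canonical URL does turn out to matter, the check itself could be as small as comparing body hashes. A sketch, wherever it ends up living: `canonical_matches` and the `fetch` callable are hypothetical (`fetch` stands in for however the importer would retrieve a body as bytes), and SHA-256 is used here only as an example digest.

```python
import hashlib


def body_hash(body: bytes) -> str:
    """Hex SHA-256 digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def canonical_matches(page_body: bytes, canonical_url: str, fetch) -> bool:
    """Return True if the canonical URL serves a byte-identical body.

    `fetch` is a hypothetical callable taking a URL and returning bytes.
    """
    return body_hash(fetch(canonical_url)) == body_hash(page_body)


# Stand-in for a real HTTP client, so the example runs without network access.
fake_responses = {"https://example.gov/page": b"<html>same</html>"}
same = canonical_matches(b"<html>same</html>", "https://example.gov/page", fake_responses.get)
print(same)
```

A byte-for-byte comparison would be strict: pages that embed timestamps or per-request tokens would fail it even when they are effectively the same page.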

Any thoughts, @danielballan?


Mr0grog · 2019-09-10T01:25:57Z

Updated above: Also not sure whether it’s important to follow the link to the canonical URL and verify it has the same response body.

stale · 2020-09-02T05:23:45Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

stale bot added the stale label Sep 2, 2020

stale bot closed this as completed Sep 19, 2020
