Handle/utilize canonical link types for pages #139

Closed
Mr0grog opened this issue Sep 10, 2019 · 2 comments

@Mr0grog
Member

Mr0grog commented Sep 10, 2019

In edgi-govdata-archiving/web-monitoring-db#492, we’re planning to update the -db project so that a page record can finally have multiple URLs. As a first pass, that just means making multiple URLs possible to model. However, we also have a lot of pages that actively tell us they are the same page, using either the Link header or the <link> element in the HTML head.

For example, we have 63 page records all representing the BLM’s Colorado oil and gas leases page. Here are two of them:

(In this case, the querystring just affects what sections of the page are collapsed or expanded with JavaScript; the HTML response is the same.)

These pages use the Link header to identify that they both belong to the same canonical URL:

GET https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/leasing/regional-lease-sales/colorado?qt-colorado_2014_oil_and_gas_lease_=1

HTTP/1.1 200 OK
Date: Tue, 10 Sep 2019 01:11:39 GMT
Content-Type: text/html; charset=utf-8
Link: <https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/leasing/regional-lease-sales/colorado>; rel="canonical",<https://www.blm.gov/node/6303>; rel="shortlink"
...more headers...
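
As a rough illustration of reading that header, here is a minimal Python sketch that pulls the `rel` types out of a `Link` value. The function name `parse_link_header` is hypothetical (it is not part of -processing or -db), and the regular expressions are a simplification of the full Web Linking grammar: they assume no `<` appears inside a quoted parameter value.

```python
import re

# One link-value: <URL> followed by everything up to the next "<".
# Simplified: assumes "<" never appears inside a quoted parameter.
LINK_VALUE = re.compile(r'<([^>]*)>([^<]*)')
# The rel parameter, either quoted (rel="a b") or a bare token (rel=a).
REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s;,"]+))', re.IGNORECASE)


def parse_link_header(value):
    """Map each rel type (lowercased) to the first URL that declares it."""
    links = {}
    for url, params in LINK_VALUE.findall(value):
        match = REL_PARAM.search(params)
        if not match:
            continue
        # A single rel parameter may list several space-separated types.
        rels = (match.group(1) or match.group(2) or '').lower().split()
        for rel in rels:
            links.setdefault(rel, url.strip())
    return links


if __name__ == '__main__':
    header = (
        '<https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/'
        'leasing/regional-lease-sales/colorado>; rel="canonical",'
        '<https://www.blm.gov/node/6303>; rel="shortlink"'
    )
    print(parse_link_header(header).get('canonical'))
```

A production importer would probably want a complete parser rather than regular expressions; this only shows the shape of the data we would be keying off.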

They also use the HTML <link> element:

<link rel="canonical" href="https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/leasing/regional-lease-sales/colorado" />
<link rel="shortlink" href="https://www.blm.gov/node/6303" />
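
The same information can be read from the markup. Below is a minimal sketch using Python's standard `html.parser`; the names `LinkRelParser` and `find_link_rels` are hypothetical. As a simplification, it collects every `<link>` element it sees and does not check that the element sits inside `<head>`.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collect the href of each <link> element, keyed by rel type."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, and routes
        # self-closing tags (<link ... />) through this method too.
        if tag != 'link':
            return
        attributes = dict(attrs)
        href = attributes.get('href')
        if not href:
            return
        # rel may hold several space-separated types; keep the first URL per type.
        for rel in (attributes.get('rel') or '').lower().split():
            self.links.setdefault(rel, href)


def find_link_rels(html):
    """Return a dict mapping rel type to href for the given HTML text."""
    parser = LinkRelParser()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == '__main__':
    sample = (
        '<html><head>'
        '<link rel="canonical" href="https://www.blm.gov/programs/energy-and-minerals/'
        'oil-and-gas/leasing/regional-lease-sales/colorado" />'
        '<link rel="shortlink" href="https://www.blm.gov/node/6303" />'
        '</head><body></body></html>'
    )
    print(find_link_rels(sample))
```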

Some pages will probably only use one or the other, but we should key off these to determine which page record a new version belongs to.
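
To make "key off of these" concrete, here is a hedged sketch of how an importer might pick a grouping key for each capture. Everything in it is an assumption for illustration: the names `choose_page_key` and `group_by_page`, the shape of the capture dicts, and the precedence (Link header first, then the HTML `<link>`, then the requested URL). Nothing here reflects how -db or -processing actually work.

```python
from collections import defaultdict
from urllib.parse import urldefrag, urljoin


def choose_page_key(requested_url, header_canonical=None, html_canonical=None):
    """Pick the URL to group a capture under (precedence is an assumption)."""
    declared = header_canonical or html_canonical
    if not declared:
        # No canonical declared: fall back to the URL we actually requested.
        return urldefrag(requested_url).url
    # Canonical URLs may be relative, so resolve against the requested URL,
    # then drop any fragment.
    return urldefrag(urljoin(requested_url, declared)).url


def group_by_page(captures):
    """Group capture URLs by their chosen page key.

    Each capture is a dict with 'url' and optional 'header_canonical'
    and 'html_canonical' entries (a hypothetical shape).
    """
    pages = defaultdict(list)
    for capture in captures:
        key = choose_page_key(
            capture['url'],
            capture.get('header_canonical'),
            capture.get('html_canonical'),
        )
        pages[key].append(capture['url'])
    return dict(pages)


if __name__ == '__main__':
    canonical = ('https://www.blm.gov/programs/energy-and-minerals/oil-and-gas/'
                 'leasing/regional-lease-sales/colorado')
    captures = [
        {'url': canonical + '?qt-colorado_2014_oil_and_gas_lease_=1',
         'header_canonical': canonical},
        {'url': canonical},
    ]
    for key, urls in group_by_page(captures).items():
        print(key, len(urls))
```

With this approach, the querystring variant and the bare URL end up under the same key, which is the behavior we want for the BLM example.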

I’m adding this to the overview project because I’m not really sure whether the right home for this logic is in the Wayback import script in -processing or in the -db project as part of the import job. Also not sure whether it’s important to follow the link to the canonical URL and make sure it has the same response body.

Any thoughts, @danielballan?

@Mr0grog
Member Author

Mr0grog commented Sep 10, 2019

Updated above: Also not sure whether it’s important to follow the link to the canonical URL and verify it has the same response body.
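
If we did decide to verify, one possible check is to compare a hash of the captured body with a hash of the body served at the canonical URL. The sketch below is only one option, and the names `body_digest`, `fetch_body`, and `canonical_matches` are hypothetical. A byte-for-byte comparison is strict, so pages with timestamps or other per-request content would fail it even when they are really the same page.

```python
import hashlib
from urllib.request import urlopen


def body_digest(body: bytes) -> str:
    """Return the SHA-256 hex digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def fetch_body(url, timeout=30):
    """Fetch a URL and return the raw response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def canonical_matches(capture_body: bytes, canonical_body: bytes) -> bool:
    """True when both bodies are byte-for-byte identical."""
    return body_digest(capture_body) == body_digest(canonical_body)


if __name__ == '__main__':
    # Offline demonstration with literal bodies; a real check would pass
    # fetch_body(capture_url) and fetch_body(canonical_url) instead.
    print(canonical_matches(b'<html>same</html>', b'<html>same</html>'))
    print(canonical_matches(b'<html>same</html>', b'<html>different</html>'))
```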

@stale

stale bot commented Sep 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot added the stale label Sep 2, 2020
@stale stale bot closed this as completed Sep 19, 2020