Allow pages to have multiple URLs #492

Closed · Mr0grog opened this issue Feb 25, 2019 · 5 comments · Fixed by #793

Comments

@Mr0grog (Member) commented Feb 25, 2019

This is an idea/open question that I’d like to resolve relatively quickly.

We’ve avoided figuring out how best to model pages that move or have multiple locations over time (and probably at the same time, too). While doing some routine updates today, I discovered that the domain http://www.cpc.noaa.gov/ was shut down earlier this month. It looks like all those pages have been redirecting to identical paths at https://www.cpc.ncep.noaa.gov/ for quite a while (months, maybe longer?), and the older domain has finally stopped responding.

In the long term, we need to model this correctly. In the short term, we need to keep monitoring these pages, so my quick-and-dirty suggestion is to add a notes text field to the Page record in the DB, update the url field for those pages, and add some human-readable text to the notes field describing the situation for those pages.

Thoughts?

/cc @jsnshrmn @danielballan @gretchengehrke

@Mr0grog (Member, Author) commented Feb 25, 2019

This may not be a huuuuge emergency, since the old domain appears to have shut down on February 10th, meaning we’ve only been failing to monitor these pages for 14 days (on the other hand, you could argue that makes it more urgent, I guess). We are also only monitoring 3 pages at the dead domain (and 53 at the newer domain).

@jsnshrmn (Contributor) commented:

As we discussed today, my feeling is that since this isn't on fire currently, we should take the time to do a fix that we like. I really like @Mr0grog's idea of storing the URLs for a page in a jsonb field, as that would allow us to handle page URLs as a simple list or as richer multidimensional data, and it wouldn't preclude indexing.
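
A minimal sketch of that jsonb idea as a Rails migration (the names here are hypothetical; this approach was ultimately superseded by the separate-table plan below):

```ruby
# Hypothetical sketch of the jsonb approach: store all of a page's URLs
# in a single column on pages. A GIN index supports containment queries
# (e.g. urls @> '["http://example.com/a"]'), so indexing isn't precluded.
class AddUrlsToPages < ActiveRecord::Migration[5.2]
  def change
    add_column :pages, :urls, :jsonb, null: false, default: []
    add_index :pages, :urls, using: :gin
  end
end
```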

Mr0grog changed the title from “Add notes field to Pages” to “Allow pages to have multiple URLs” on May 23, 2019
The stale bot added the stale label on Nov 20, 2019
The stale bot closed this as completed on Nov 27, 2019
Mr0grog added the never-stale label and removed the stale label on Nov 28, 2019
Mr0grog reopened this on Nov 28, 2019
edgi-govdata-archiving deleted a comment from the stale bot on Nov 28, 2019
@Mr0grog (Member, Author) commented Oct 21, 2020

After some other discussions and recent projects (e.g. quite a mess trying to generate a sane seed list for End of Term 2020), I think this is pretty important to get done now, and relatively quickly. Web-Monitoring-wide, I have some higher priorities, but this should be near the top of the list for -db.

Current Plan

Add a new model/table called PageUrl (model) / page_urls (table) with:

| Field | Type | Notes |
| --- | --- | --- |
| id | uuid (primary key) | |
| page_uuid | uuid (not null) | Foreign key to Page |
| url | text (not null) | A URL that the page can be reached from |
| url_key | text (not null) | SURT-formatted version of url |
| from_time | datetime (nullable) | Earliest time at which the page/url combination is valid. |
| to_time | datetime (nullable) | Latest time at which the page/url combination is valid. |

The primary key is really just for lookup/framework purposes. Rows are unique by (page_uuid, url).
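
A sketch of a migration for that table (assuming -db’s pages table uses a uuid primary key column named uuid; the Rails version tag is illustrative):

```ruby
# Sketch of the page_urls table described above. The id column is just
# for lookup/framework purposes; rows are meant to be unique by
# (page_uuid, url).
class CreatePageUrls < ActiveRecord::Migration[5.2]
  def change
    create_table :page_urls, id: :uuid do |t|
      t.uuid :page_uuid, null: false
      t.text :url, null: false
      t.text :url_key, null: false
      t.datetime :from_time
      t.datetime :to_time
      t.timestamps
    end

    # Assumes pages.uuid is the primary key column.
    add_foreign_key :page_urls, :pages, column: :page_uuid, primary_key: :uuid
    add_index :page_urls, [:page_uuid, :url], unique: true
    # Imports look pages up by SURT key, so index it too.
    add_index :page_urls, :url_key
  end
end
```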

I’m not sure we’ll fill in from_time/to_time very often or use them in a technical way, but this feels like important information to be able to keep track of when we have it. For example, the page at http://www.nrel.gov/about/himmel.html moved to https://www.nrel.gov/research/michael-himmel.html and then to https://www.nrel.gov/research/staff/michael-himmel.html, and none of the old URLs redirect. It would be good to have a place to track when old URLs stop being valid and new URLs start, if possible.

When we import pages, their URLs will be checked against this table, rather than Page#url/Page#url_key.
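
That import-time lookup might be as simple as this sketch (surt_key stands in for whatever SURT-canonicalization helper computes url_key; it is not an existing -db function):

```ruby
# Sketch: resolve an incoming URL to an existing page via page_urls
# instead of Page#url/Page#url_key.
def find_page_for(url)
  PageUrl.find_by(url_key: surt_key(url))&.page
end
```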

The url and url_key fields on Page will stay, and will be the current “canonical” URL for the page, insofar as we can determine it. Canonical should probably mean:

  • Latest redirect target that is at the same ETLD+1 as the current URL.
  • Code should carry a note that the type of redirect (permanent/temporary/found/see-other) should factor in, but we don’t currently track that info, so we can’t use it right now.
  • Check for rel="canonical" in the Link: header or in the page’s HTML and prefer that if it’s in the redirect chain.
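
Roughly sketched, that selection logic might look like this (etld_plus_one and the redirect-chain representation are assumptions for illustration, not existing -db code):

```ruby
# Sketch: choose a "canonical" URL from a redirect chain.
# `chain` is the list of URLs in redirect order; `canonical_hint` is a
# rel="canonical" value from the Link: header or the page's HTML, if any.
# `etld_plus_one` is a hypothetical helper, e.g. backed by the Public
# Suffix List.
def canonical_url(chain, canonical_hint: nil)
  # Prefer a rel="canonical" hint, but only if it's in the redirect chain.
  return canonical_hint if canonical_hint && chain.include?(canonical_hint)

  # Otherwise, take the latest redirect target on the same ETLD+1 as the
  # starting URL. (Redirect type should eventually factor in here, but we
  # don't track it yet.)
  origin = etld_plus_one(chain.first)
  chain.reverse.find { |url| etld_plus_one(url) == origin } || chain.first
end
```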

Add a Page#merge(*other_pages) method that merges other page models into the current one. It should move the versions to the target page, add the maintainers and tags to the target page, add any relevant URLs to the target page, and delete the merged page records. We have a function kind of like this in one of our old migrations, but discovering that two pages are the same will be an ongoing concern (we already know of a few thousand pages we track that are duplicates), so it deserves a better home than a migration.
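
A rough sketch of the shape of that method (association names like urls are assumptions based on the plan above; a real implementation would need more careful bookkeeping):

```ruby
class Page < ApplicationRecord
  # Sketch: merge other page records into this one, moving their versions,
  # maintainers, tags, and URLs here, then deleting the merged records.
  def merge(*other_pages)
    transaction do
      other_pages.each do |other|
        other.versions.update_all(page_uuid: uuid)
        other.maintainers.each { |m| maintainers << m unless maintainers.include?(m) }
        other.tags.each { |t| tags << t unless tags.include?(t) }
        # Keep the merged page's URL reachable by recording it here.
        urls.find_or_create_by!(url: other.url) if other.url
        other.destroy
      end
    end
  end
end
```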

@Mr0grog (Member, Author) commented Dec 4, 2020

Update: I partially implemented this when I discovered Postgres has a tsrange type that represents ranges between timestamps — somehow I’d never known about it, and it seemed perfect for this task.

Unfortunately, it has serious issues when it comes to time ranges that have no lower bound, have a lower bound of -infinity, or that exclude the lower bound. Basically, Rails really wants to represent the value as a Ruby Range, and that breaks down in three ways: you can’t mix floats and times in a Range (-infinity comes through as -Float::INFINITY, since Ruby’s Time type has no concept of +/-infinity); Rails tries to convert a missing bound to +/-infinity (and Ranges can’t be beginless before Ruby 2.7, which is the latest release and newer than the one we’re on); and Ranges that exclude their lower bound simply aren’t possible in Ruby. ¯\_(ツ)_/¯

On the other hand, there are some unpleasant hacks to make it work, either by monkey-patching Float or by monkey-patching ActiveRecord. More here: rails/rails#39833
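
For illustration, the core incompatibility is easy to reproduce in plain Ruby (a sketch, not -db code):

```ruby
# A tsrange with a -infinity bound decodes to -Float::INFINITY, but Ruby
# refuses a Range whose endpoints aren't mutually comparable:
(-Float::INFINITY..Time.now)
# => ArgumentError: bad value for range

# And Ruby Ranges can only exclude their *upper* bound (via `...`), so a
# tsrange like '(2017-01-01,2020-10-01]', which excludes its lower bound,
# has no Range equivalent at all.
```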


I said earlier that:

> Rows are unique by (page_uuid, url).

But I think this was probably wrong. It’s unlikely, but we might want to somehow model that a page was not available during a given time range, but was available before and after. Perhaps the URL pointed to a conceptually different page during that middle period:

| page | url | from_time | to_time |
| --- | --- | --- | --- |
| a | http://example.com/a | -infinity | 2017-01-01 00:00:00Z |
| b | http://example.com/a | 2017-01-01 00:00:00Z | 2020-10-01 00:00:00Z |
| a | http://example.com/a | 2020-10-01 00:00:00Z | infinity |

Or viewed as a timeline:

```
What page does "http://example.com/a" point to?

        | 2016       | 2017       | 2018       | 2019       | 2020       | 2021
        |            |            |            |            |            |
Page A  -------------o                                               x--------------
Page B               x-----------------------------------------------o
```

Ideally, we’d say (where timeframe is either [from_time, to_time) or a tsrange):

  1. A (page_uuid, url, timeframe) combination is unique.
  2. For a given url, no records should have overlapping timeframes.

If we use tsrange, Postgres can enforce condition (2) for us (see the sketch below). With the two-column approach, that isn’t really feasible; we could enforce condition (1), which would still allow overlaps, or just not worry about it. Talking with @danielballan a couple of days ago, it seems like just not worrying about it may honestly be the better solution for now.

As a related point, the exact times here will always be a guess, and getting new data from another archival source could change things, so it’s probably best to think of this timeframe info as anecdotal or advisory rather than canonical, reliable data. (That said, we might still enforce non-overlap in the Rails business logic layer, or provide warnings, or something, since overlaps will be confusing.)
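
For reference, here’s roughly what enforcing condition (2) could look like if we did go with tsrange (a sketch, assuming a timeframe tsrange column instead of from_time/to_time; the btree_gist extension is needed so the text url column can participate in a GiST exclusion constraint):

```ruby
# Sketch: a database-level guarantee that no two page_urls rows claim the
# same url for overlapping timeframes. Assumes a `timeframe tsrange`
# column on page_urls.
class AddPageUrlTimeframeExclusion < ActiveRecord::Migration[5.2]
  def up
    enable_extension 'btree_gist'
    execute <<~SQL
      ALTER TABLE page_urls
        ADD CONSTRAINT page_urls_no_overlapping_timeframes
        EXCLUDE USING gist (url WITH =, timeframe WITH &&);
    SQL
  end

  def down
    execute 'ALTER TABLE page_urls DROP CONSTRAINT page_urls_no_overlapping_timeframes;'
  end
end
```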


Finally! In the two-column approach, Postgres also allows +/-infinity for timestamps! We could use that instead of NULL, buuuuuut dealing with those values gets a little funny in Rails. When you read them, you get a float instead of a time (Ruby Times can’t represent +/-infinity; see the tsrange issues above). That’s a little messy but OK, except I also had trouble setting Float::INFINITY as a value. So while it seems like the right way to model this in Postgres, it’s a little iffy on the Rails side and we might want to avoid it.

@Mr0grog (Member, Author) commented Dec 4, 2020

And lest you think the above scenario, where different pages trade off ownership of a URL, is unlikely or unusual: we have several examples of it. Here’s one:

https://monitoring.envirodatagov.org/page/6cc0784b-eeaa-44f6-b6c3-4ce75f1c2497/58615510-6c67-4544-a483-3706002550b4..5f9b8e16-7403-4e3f-9e59-ccdf53a0dfa7

In April 2017, the EPA’s “Clean Power Plan” page at https://www.epa.gov/cleanpowerplan started redirecting to a new “Complying with President Trump's Executive Order on Energy Independence” page at https://www.epa.gov/Energy-Independence. It might not be unreasonable to classify this as two separate pages. Given the character of the current page, it also won’t be surprising if it changes back, gets removed, or changes to something else new under the Biden administration.

This is a phenomenon that the data should make more explorable.
