Allow pages to have multiple URLs #492

Closed · Mr0grog opened this issue Feb 25, 2019 · 5 comments · Fixed by #793

Comments

@Mr0grog (Member) commented Feb 25, 2019

This is an idea/open question that I’d like to resolve relatively quickly.

We’ve avoided figuring out how best to model pages that move or have multiple locations over time (and probably at the same time, too). While doing some routine updates today, I discovered that the domain http://www.cpc.noaa.gov/ was shut down earlier this month. It looks like all those pages have been redirecting to identical paths at https://www.cpc.ncep.noaa.gov/ for quite a while (months, maybe longer?), and the older domain has finally stopped responding.

In the long term, we need to model this correctly. In the short term, we need to keep monitoring these pages, so my quick-and-dirty suggestion is to add a notes text field to the Page record in the DB, update the url field for those pages, and add some human-readable text to the notes field describing the situation for those pages.

Thoughts?

/cc @jsnshrmn @danielballan @gretchengehrke

@Mr0grog (Member, Author) commented Feb 25, 2019

This may not be a huuuuge emergency, since the old domain appears to have shut down on February 10th, meaning we’ve only been failing to monitor these pages for 14 days (on the other hand, you could argue that makes it more urgent, I guess). We are also only monitoring 3 pages at the dead domain (and 53 at the newer domain).

@jsnshrmn (Contributor) commented:

As we discussed today, my feeling is that since this isn't on fire currently, we should take the time to do a fix that we like. I really like @Mr0grog's idea of storing the URLs for a page in a jsonb field, as that would allow us to handle page URLs as a simple list or as richer multidimensional data, and it wouldn't preclude indexing.
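
A minimal sketch of that jsonb idea as a Rails migration (the names here are hypothetical; this approach was ultimately superseded by the separate-table plan below):

```ruby
# Hypothetical sketch of the jsonb approach: store all of a page's URLs
# in a single column on pages. A GIN index supports containment queries
# (e.g. urls @> '["http://example.com/a"]'), so indexing isn't precluded.
class AddUrlsToPages < ActiveRecord::Migration[5.2]
  def change
    add_column :pages, :urls, :jsonb, null: false, default: []
    add_index :pages, :urls, using: :gin
  end
end
```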

Mr0grog changed the title from “Add notes field to Pages” to “Allow pages to have multiple URLs” on May 23, 2019
The stale bot added the stale label on Nov 20, 2019
The stale bot closed this as completed on Nov 27, 2019
Mr0grog added the never-stale label and removed the stale label on Nov 28, 2019
Mr0grog reopened this on Nov 28, 2019
edgi-govdata-archiving deleted a comment from the stale bot on Nov 28, 2019
@Mr0grog (Member, Author) commented Oct 21, 2020

After some other discussions and recent projects (e.g. quite a mess trying to generate a sane seed list for End of Term 2020), I think this is pretty important to get done now, and relatively quickly. Web-Monitoring-wide, I have some higher priorities, but this should be near the top of the list for -db.

Current Plan

Add a new model/table called PageUrl (model) / page_urls (table) with:

| Field | Type | Notes |
| --- | --- | --- |
| id | uuid (primary key) | |
| page_uuid | uuid (not null) | Foreign key to Page |
| url | text (not null) | A URL that the page can be reached from |
| url_key | text (not null) | SURT-formatted version of url |
| from_time | datetime (nullable) | Earliest time at which the page/url combination is valid. |
| to_time | datetime (nullable) | Latest time at which the page/url combination is valid. |

The primary key is really just for lookup/framework purposes. Rows are unique by (page_uuid, url).
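
A sketch of a migration for that table (assuming -db’s pages table uses a uuid primary key column named uuid; the Rails version tag is illustrative):

```ruby
# Sketch of the page_urls table described above. The id column is just
# for lookup/framework purposes; rows are meant to be unique by
# (page_uuid, url).
class CreatePageUrls < ActiveRecord::Migration[5.2]
  def change
    create_table :page_urls, id: :uuid do |t|
      t.uuid :page_uuid, null: false
      t.text :url, null: false
      t.text :url_key, null: false
      t.datetime :from_time
      t.datetime :to_time
      t.timestamps
    end

    # Assumes pages.uuid is the primary key column.
    add_foreign_key :page_urls, :pages, column: :page_uuid, primary_key: :uuid
    add_index :page_urls, [:page_uuid, :url], unique: true
    # Imports look pages up by SURT key, so index it too.
    add_index :page_urls, :url_key
  end
end
```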

I’m not sure we’ll fill in from_time/to_time very often or use them in a technical way, but this feels like important information to be able to keep track of when we have it. For example, the page at http://www.nrel.gov/about/himmel.html moved to https://www.nrel.gov/research/michael-himmel.html and then to https://www.nrel.gov/research/staff/michael-himmel.html, and none of the old URLs redirect. It would be good to have a place to track when old URLs stop being valid and new URLs start, if possible.

When we import pages, their URLs will be checked against this table, rather than Page#url/Page#url_key.
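
That import-time lookup might be as simple as this sketch (surt_key stands in for whatever SURT-canonicalization helper computes url_key; it is not an existing -db function):

```ruby
# Sketch: resolve an incoming URL to an existing page via page_urls
# instead of Page#url/Page#url_key.
def find_page_for(url)
  PageUrl.find_by(url_key: surt_key(url))&.page
end
```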

The url and url_key fields on Page will stay, and will be the current “canonical” URL for the page, insofar as we can determine it. Canonical should probably mean:

  • Latest redirect target that is at the same ETLD+1 as the current URL.
  • Code should carry a note that the type of redirect (permanent/temporary/found/see-other) should factor in, but we don’t currently track that info, so we can’t use it right now.
  • Check for rel="canonical" in the Link: header or in the page’s HTML and prefer that if it’s in the redirect chain.
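
Roughly sketched, that selection logic might look like this (etld_plus_one and the redirect-chain representation are assumptions for illustration, not existing -db code):

```ruby
# Sketch: choose a "canonical" URL from a redirect chain.
# `chain` is the list of URLs in redirect order; `canonical_hint` is a
# rel="canonical" value from the Link: header or the page's HTML, if any.
# `etld_plus_one` is a hypothetical helper, e.g. backed by the Public
# Suffix List.
def canonical_url(chain, canonical_hint: nil)
  # Prefer a rel="canonical" hint, but only if it's in the redirect chain.
  return canonical_hint if canonical_hint && chain.include?(canonical_hint)

  # Otherwise, take the latest redirect target on the same ETLD+1 as the
  # starting URL. (Redirect type should eventually factor in here, but we
  # don't track it yet.)
  origin = etld_plus_one(chain.first)
  chain.reverse.find { |url| etld_plus_one(url) == origin } || chain.first
end
```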

Add a Page#merge(*other_pages) method that merges other page models into the current one. It should move the versions to the target page, add the maintainers and tags to the target page, add any relevant URLs to the target page, and delete the merged page records. We have a function kind of like this in one of our old migrations, but discovering that two pages are the same will be an ongoing concern (we already know of a few thousand pages we track that are duplicates), so it deserves a better home than a migration.
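
A rough sketch of the shape of that method (association names like urls are assumptions based on the plan above; a real implementation would need more careful bookkeeping):

```ruby
class Page < ApplicationRecord
  # Sketch: merge other page records into this one, moving their versions,
  # maintainers, tags, and URLs here, then deleting the merged records.
  def merge(*other_pages)
    transaction do
      other_pages.each do |other|
        other.versions.update_all(page_uuid: uuid)
        other.maintainers.each { |m| maintainers << m unless maintainers.include?(m) }
        other.tags.each { |t| tags << t unless tags.include?(t) }
        # Keep the merged page's URL reachable by recording it here.
        urls.find_or_create_by!(url: other.url) if other.url
        other.destroy
      end
    end
  end
end
```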

@Mr0grog (Member, Author) commented Dec 4, 2020

Update: I partially implemented this when I discovered Postgres has a tsrange type that represents ranges between timestamps — somehow I’d never known about it, and it seemed perfect for this task.

Unfortunately, it has serious issues when it comes to time ranges that have no lower bound, have a lower bound of -infinity, or that exclude the lower bound. Basically, Rails really wants to represent the value as a Ruby Range, and that breaks down in three ways: you can’t mix floats and times in a Range (-infinity comes through as -Float::INFINITY, since Ruby’s Time type has no concept of +/-infinity); Rails tries to convert a missing bound to +/-infinity (and Ranges can’t be beginless before Ruby 2.7, which is the latest release and newer than the one we’re on); and Ranges that exclude their lower bound simply aren’t possible in Ruby. ¯\_(ツ)_/¯

On the other hand, there are some unpleasant hacks to make it work, either by monkey-patching Float or by monkey-patching ActiveRecord. More here: rails/rails#39833
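
For illustration, the core incompatibility is easy to reproduce in plain Ruby (a sketch, not -db code):

```ruby
# A tsrange with a -infinity bound decodes to -Float::INFINITY, but Ruby
# refuses a Range whose endpoints aren't mutually comparable:
(-Float::INFINITY..Time.now)
# => ArgumentError: bad value for range

# And Ruby Ranges can only exclude their *upper* bound (via `...`), so a
# tsrange like '(2017-01-01,2020-10-01]', which excludes its lower bound,
# has no Range equivalent at all.
```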


I said earlier that:

> Rows are unique by (page_uuid, url).

But I think this was probably wrong. It’s unlikely, but we might want to somehow model that a page was not available during a given time range, but was available before and after. Perhaps the URL pointed to a conceptually different page during that middle period:

| page | url | from_time | to_time |
| --- | --- | --- | --- |
| a | http://example.com/a | -infinity | 2017-01-01 00:00:00Z |
| b | http://example.com/a | 2017-01-01 00:00:00Z | 2020-10-01 00:00:00Z |
| a | http://example.com/a | 2020-10-01 00:00:00Z | infinity |

Or viewed as a timeline:

```
What page does "http://example.com/a" point to?

        | 2016       | 2017       | 2018       | 2019       | 2020       | 2021
        |            |            |            |            |            |
Page A  -------------o                                               x--------------
Page B               x-----------------------------------------------o
```

Ideally, we’d say (where timeframe is either [from_time, to_time) or a tsrange):

  1. A (page_uuid, url, timeframe) combination is unique.
  2. For a given url, no records should have overlapping timeframes.

If we use tsrange, Postgres can enforce condition (2) for us (see the sketch below). With the two-column approach, that isn’t really feasible; we could enforce condition (1), which would still allow overlaps, or just not worry about it. Talking with @danielballan a couple of days ago, it seems like just not worrying about it may honestly be the better solution for now.

As a related point, the exact times here will always be a guess, and getting new data from another archival source could change things, so it’s probably best to think of this timeframe info as anecdotal or advisory rather than canonical, reliable data. (That said, we might still enforce non-overlap in the Rails business logic layer, or provide warnings, or something, since overlaps will be confusing.)
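
For reference, here’s roughly what enforcing condition (2) could look like if we did go with tsrange (a sketch, assuming a timeframe tsrange column instead of from_time/to_time; the btree_gist extension is needed so the text url column can participate in a GiST exclusion constraint):

```ruby
# Sketch: a database-level guarantee that no two page_urls rows claim the
# same url for overlapping timeframes. Assumes a `timeframe tsrange`
# column on page_urls.
class AddPageUrlTimeframeExclusion < ActiveRecord::Migration[5.2]
  def up
    enable_extension 'btree_gist'
    execute <<~SQL
      ALTER TABLE page_urls
        ADD CONSTRAINT page_urls_no_overlapping_timeframes
        EXCLUDE USING gist (url WITH =, timeframe WITH &&);
    SQL
  end

  def down
    execute 'ALTER TABLE page_urls DROP CONSTRAINT page_urls_no_overlapping_timeframes;'
  end
end
```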


Finally! In the two-column approach, Postgres also allows +/-infinity for timestamps! We could use that instead of NULL, buuuuuut dealing with those values gets a little funny in Rails. When you read them, you get a float instead of a time (Ruby Times can’t represent +/-infinity; see the tsrange issues above). That’s a little messy but OK, except I also had trouble setting Float::INFINITY as a value. So while it seems like the right way to model this in Postgres, it’s a little iffy on the Rails side and we might want to avoid it.

@Mr0grog (Member, Author) commented Dec 4, 2020

And lest you think the above scenario, where different pages trade off ownership of a URL, is unlikely or unusual: we have several examples of it. Here’s one:

https://monitoring.envirodatagov.org/page/6cc0784b-eeaa-44f6-b6c3-4ce75f1c2497/58615510-6c67-4544-a483-3706002550b4..5f9b8e16-7403-4e3f-9e59-ccdf53a0dfa7

In April 2017, the EPA’s “Clean Power Plan” page at https://www.epa.gov/cleanpowerplan started redirecting to a new “Complying with President Trump's Executive Order on Energy Independence” page at https://www.epa.gov/Energy-Independence. It might not be unreasonable to classify this as two separate pages. Given the character of the current page, it also won’t be surprising if it changes back, gets removed, or changes to something else new under the Biden administration.

This is a phenomenon that the data should make more explorable.
