Allow pages to have multiple URLs #492
This may not be a huuuuge emergency, since the old domain appears to have shut down on February 10th, so we’ve been failing to monitor these pages for 14 days (on the other hand, you could consider that making it more urgent, I guess). We are also only monitoring 3 pages at the dead domain (and 53 at the newer domain).
As we discussed today, my feeling is that since this isn’t on fire currently, we should take the time to do a fix that we like. I really like @Mr0grog’s idea of storing the URLs for a page in a separate table.
After some other discussions and recent projects (e.g. quite a mess trying to generate a sane seed list for End of Term 2020), I think this is pretty important to get done now, and relatively quickly. Web-Monitoring-wide, I have some higher priorities, but this should be near the top of the list for -db.

**Current Plan**

Add a new model/table for a page’s URLs.
The primary key is really just for lookup/framework purposes; rows are unique by page and URL. I’m not sure we’ll fill in the time-range fields right away. When we import pages, their URLs will be checked against this table, rather than against the page’s single `url` field.
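As a rough sketch of the import-time lookup described above: check an incoming URL against the URL table instead of a single `Page#url` field. Everything here (`PageUrl`, `find_page_for_url`, the sample rows and times) is hypothetical, not the actual -db schema:

```ruby
# Hypothetical sketch: resolve an incoming capture's URL to a page by
# checking a page-URLs table instead of a single Page#url field.
require "time"

PageUrl = Struct.new(:page_id, :url, :from_time, :to_time, keyword_init: true)

# Rows are unique by (page_id, url); nil times mean "unbounded".
PAGE_URLS = [
  PageUrl.new(page_id: 1, url: "http://www.cpc.noaa.gov/products/outlook",
              from_time: nil, to_time: Time.parse("2021-02-10")),
  PageUrl.new(page_id: 1, url: "https://www.cpc.ncep.noaa.gov/products/outlook",
              from_time: nil, to_time: nil)
]

# Find the page that owned `url` at `time`, treating nil bounds as open.
def find_page_for_url(url, time = Time.now)
  row = PAGE_URLS.find do |r|
    r.url == url &&
      (r.from_time.nil? || r.from_time <= time) &&
      (r.to_time.nil? || time < r.to_time)
  end
  row&.page_id
end

puts find_page_for_url("https://www.cpc.ncep.noaa.gov/products/outlook") # prints 1
```

With this shape, both the dead domain and the new one resolve to the same page, and old captures at the dead domain still match via the historical row.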
Update: I had partially implemented this when I discovered Postgres has a built-in time-range column type that could store a URL’s active period in a single column. Unfortunately, it has serious issues when it comes to time ranges that have no lower bound or a lower bound of negative infinity. On the other hand, there are some unpleasant hacks (e.g. monkey-patching) to make it work. I said earlier that:
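For context on the alternative to a database-native range column: two nullable timestamp columns (`from_time`/`to_time`, hypothetical names) map neatly onto Ruby’s beginless/endless ranges (2.7+), where a nil bound just means unbounded. A sketch, not actual -db code:

```ruby
require "time"

# Build a Ruby Range from nullable from_time/to_time columns.
# A nil bound produces a beginless/endless range, i.e. unbounded,
# and we exclude the end so ranges behave like [from, to).
def active_range(from_time, to_time)
  Range.new(from_time, to_time, true)
end

shutdown = Time.parse("2021-02-10")

old_domain = active_range(nil, shutdown)  # active until the shutdown
new_domain = active_range(shutdown, nil)  # active from the shutdown on

puts old_domain.cover?(Time.parse("2020-06-01")) # prints true
puts old_domain.cover?(Time.parse("2021-03-01")) # prints false
puts new_domain.cover?(Time.parse("2021-03-01")) # prints true
```

The appeal of two columns is exactly that nil is a first-class "no bound" value, with none of the negative-infinity awkwardness described above.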
But I think this was probably wrong. It’s unlikely, but we might want to somehow model that a page was not available during a given time range, but was available before and after. Perhaps the URL pointed to a conceptually different page during that middle period:
Or viewed as a timeline:
Ideally, we’d say (where
If we use […] Finally! In the two-column approach, Postgres also allows […]
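The scenario above, where a URL trades off between conceptually different pages over time, can be modeled as per-page ownership ranges plus an overlap check. A plain-Ruby sketch with assumed names (`Ownership`, `owner_at`) and illustrative dates, not the actual schema:

```ruby
require "time"

Ownership = Struct.new(:page_id, :from_time, :to_time, keyword_init: true)

# Hypothetical history of one URL changing hands between two pages
# (nil = unbounded); dates are illustrative only.
HISTORY = [
  Ownership.new(page_id: 1, from_time: nil,
                to_time: Time.parse("2017-04-01")),
  Ownership.new(page_id: 2, from_time: Time.parse("2017-04-01"),
                to_time: Time.parse("2021-01-20")),
  Ownership.new(page_id: 1, from_time: Time.parse("2021-01-20"),
                to_time: nil)
]

# Which page owned the URL at a given time?
def owner_at(time)
  HISTORY.find do |o|
    (o.from_time.nil? || o.from_time <= time) &&
      (o.to_time.nil? || time < o.to_time)
  end&.page_id
end

# Two ownerships overlap if each starts before the other ends;
# this is the invariant the database would need to enforce.
def overlap?(a, b)
  (a.from_time.nil? || b.to_time.nil? || a.from_time < b.to_time) &&
    (b.from_time.nil? || a.to_time.nil? || b.from_time < a.to_time)
end

puts owner_at(Time.parse("2018-06-01"))                     # prints 2
puts HISTORY.combination(2).any? { |a, b| overlap?(a, b) }  # prints false
```

Note that a plain unique index can’t express the no-overlap invariant; it has to be checked over pairs of rows (or with a database-level range constraint).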
And lest you think the above scenario with different pages trading off ownership of a URL is unlikely or unusual, we have several examples of this. Here’s one: in April 2017, the EPA’s “Clean Power Plan” page at https://www.epa.gov/cleanpowerplan started redirecting to a new “Complying with President Trump's Executive Order on Energy Independence” page at https://www.epa.gov/Energy-Independence. It might not be unreasonable to classify this as two separate pages. Given the character of the current page, it also won’t be surprising if it changes back, gets removed, or changes to something else new under the Biden administration. This is a phenomenon that the data should make more explorable.
This is an idea/open question that I’d like to resolve relatively quickly.
We’ve avoided figuring out how to best model pages that move or have multiple locations over time (and probably at the same time, too). While doing some routine updates today, I discovered that the domain http://www.cpc.noaa.gov/ was just shut down earlier this month. It looks like all those pages have been redirecting to identical paths at https://www.cpc.ncep.noaa.gov/ for quite a while (months, maybe longer?) and they’ve finally stopped responding to the older domain.
In the long term, we need to actually model this correctly. In the short term, we need to keep monitoring these pages, so my quick-and-dirty suggestion is to add a `notes` text field to the Page record in the DB, update the `url` field for those pages, and add some human-readable text to the `notes` field describing the situation for those pages.

Thoughts?
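A sketch of how small that quick fix could be, using a stand-in `Page` struct and a hypothetical path (the real change would be an ActiveRecord migration adding `notes` to `pages`, plus a data update):

```ruby
require "uri"

# Stand-in for a Page record; the real thing is an ActiveRecord model.
Page = Struct.new(:url, :notes, keyword_init: true)

# One of the (hypothetical-path) pages at the dead domain.
page = Page.new(url: "http://www.cpc.noaa.gov/products/example", notes: nil)

# Quick-and-dirty fix: point `url` at the new domain (same path, now
# HTTPS) and record the old location in a human-readable `notes` field.
old_url = page.url
page.url = "https://www.cpc.ncep.noaa.gov#{URI(old_url).path}"
page.notes = "Previously located at #{old_url}; old domain shut down in February."

puts page.url # prints https://www.cpc.ncep.noaa.gov/products/example
```

This loses the machine-readable history (which is the whole point of the long-term fix), but it keeps monitoring working today.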
/cc @jsnshrmn @danielballan @gretchengehrke