Refactor `Version` Model #776
Actually, the more I think about it, the less I want to propose adding |
This was referenced Nov 11, 2020
In a discussion yesterday, @danielballan concurred with the ideas here, so I feel a little better about actually doing it. The removal of |
Things that still need doing here:
8 tasks
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue Jun 6, 2021
This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Jun 6, 2021
This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776. This doesn't update the fields we *send*. The DB will initially be backwards compatible with the current import format, so we can ship this first, *then* upgrade the DB without anything breaking.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-task-sheets that referenced this issue Jun 6, 2021
This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue Jun 17, 2021
This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Jun 17, 2021
This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776. This doesn't update the fields we *send*. The DB will initially be backwards compatible with the current import format, so we can ship this first, *then* upgrade the DB without anything breaking.
Mr0grog added a commit that referenced this issue Jun 17, 2021
Refactor the `Version` model to rename and move some confusing fields. This makes the following changes to `Version`:
- Rename `capture_url` → `url`
- Rename `uri` → `body_url`
- Rename `version_hash` → `body_hash`
- Add `headers` (a.k.a. move `source_metadata.headers` → `headers`)
Most of these names have been confusing in the past, and this helps align them with what people seem to more intuitively expect or makes them more clear. Renaming `capture_url` and moving `headers` also helps move `Version` more toward canonically representing a snapshot of an HTTP response, rather than just what data we had from Versionista (what drove much of our original design long ago).
This also includes a data migration script for moving the header data out of `source_metadata` and into the new column. It’ll be pretty slow, so I’ll run it as a Kubernetes job in production. Fixes #776.
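For anyone updating a client against these changes, the renames and the `headers` move listed in the commit message can be sketched on a plain record like this. This is only an illustration, not the migration script from the commit: `upgrade_version_record` and `RENAMES` are hypothetical names, records are assumed to be plain dicts, and using `None` for missing headers is an assumption.

```python
from copy import deepcopy

# Old field name -> new field name, as listed in the commit message.
RENAMES = {
    "capture_url": "url",
    "uri": "body_url",
    "version_hash": "body_hash",
}


def upgrade_version_record(old: dict) -> dict:
    """Return a copy of `old` that uses the new field names (hypothetical helper)."""
    new = deepcopy(old)  # never mutate the caller's record
    for old_name, new_name in RENAMES.items():
        if old_name in new:
            new[new_name] = new.pop(old_name)

    # Move `source_metadata.headers` up to a top-level `headers` field.
    # Assumption: a record without headers gets `None` rather than `{}`.
    if "headers" not in new:
        metadata = new.get("source_metadata") or {}
        new["headers"] = metadata.pop("headers", None)
    return new
```

Because the helper leaves already-upgraded records alone, it could be run over a mix of old-format and new-format data during the transition.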
The `Version` model has had a tortured history, starting as a local copy of Versionista’s data, and slowly evolving into something much different. Today it could probably use some cleanup.
Normally I’d say this is not worth the bugs it might introduce, but developer involvement here is waning, and I think it’s valuable to do the work to make the data easier to learn and understand, whether that’s to make future development by other programmers easier to get started with or to make dead, archived data easier to use.
Current fields:
✅ = all good as-is, ⚠️ = should change, ❌ = should remove, 🌱 = should add
- `uuid`: […] `(capture_time, capture_url, source_type)`
- `created_at`
- `updated_at`
- `capture_time`
- `capture_url`: […] `https://epa.gov/climate-change/`. I think we should rename this to `url`, which would be more concise and clearer. Component of the natural key.
- `source_type`: […] `internet_archive`. Component of the natural key.
- `status`
- `uri`: […] `body_url` for clarity. Everyone new gets tripped up on this one.
- `version_hash`: […] `body_hash` for clarity, and to go with the suggested rename of `uri` to `body_url` above.
- `source_metadata`: […] `Version`.
- `page_uuid`
- `title`
- `content_length`: […] `Content-Length` header). Doesn’t reside anywhere else in the DB, but is/can be automatically derived from the data found from the `uri` field. I’d like to suggest renaming this to `body_length` or `body_size` for consistency with the other `body_*` fields suggested above and to differentiate from the header of the same name, but don’t think it’s a big deal. The current name is not a problem.
- `media_type`: […] `Content-Type` header. I don’t think the name should change, although we should consider sniffing content and canonicalizing the type here instead of just reflecting the header. (See also #752)
- `media_type_parameters`: […] different […]
- `headers`: […] `Version` as a canonical record of an archived HTTP response, which is what it should really be at this point. We do not always have headers (so this must be nullable, or maybe an empty object?), but when we do they should be here instead of in `source_metadata`.
- 🌱 `charset`: Does not exist/should add. Character encoding of the HTTP response body. I’m less bullish on this one, since we can set it in the response headers of `uri`/`body_url` and, for HTML, it is often specified (and specified more correctly) in the document itself. It can also be retrieved from the headers in cases where we didn’t have to sniff it during import.
We also track redirects in `source_metadata` (so we know if the response is a direct response to `capture_url` or a response to a redirect from it), but I’m not sure that’s worth pulling up into the main model.
I’ve made a point of calling out DB-centric vs. derived vs. denormalized because I think these are particularly useful distinctions. If there weren’t practical concerns, I’d almost suggest the derived & denormalized data belongs in a second table that has a 1:1 relationship with versions. That way we are separating out the canonical record of an archived HTTP response from other useful data we want to make easily available about it.
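To make the relationship between the `Content-Type` header and the `media_type`, `media_type_parameters`, and proposed `charset` fields concrete, here is a rough sketch using only Python’s standard library. `split_content_type` is a hypothetical helper, not code from any of the web-monitoring projects, and it only reflects the header; it does none of the content sniffing or canonicalizing mentioned above.

```python
from email.message import Message


def split_content_type(header_value: str):
    """Split a Content-Type header value into (media_type, parameters, charset)."""
    message = Message()
    message["Content-Type"] = header_value

    # Lowercased "type/subtype"; the stdlib falls back to "text/plain"
    # when the value is malformed, which may not match what the DB does.
    media_type = message.get_content_type()

    # get_params() returns [(media_type, ""), (name, value), ...];
    # drop the first entry to keep only the real parameters.
    parameters = dict(message.get_params()[1:])

    # Lowercased charset parameter, or None when the header has none.
    charset = message.get_content_charset()
    return media_type, parameters, charset
```

Calling `split_content_type("text/html; charset=UTF-8")` separates the bare media type from its parameters, which is roughly the split the existing fields represent. A `charset` column would duplicate one of those parameters whenever the header supplied it.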
If it weren’t a total mess to do, I’d also suggest renaming `Version` to `Capture` or `Snapshot`, but I don’t think that’s worth the complex mess and migration headaches. Renaming fields is small potatoes in comparison.