Refactor `Version` Model #776

Mr0grog · 2020-11-11T09:15:26Z

The Version model has had a tortured history, starting as a local copy of Versionista’s data, and slowly evolving into something much different. Today it could probably use some cleanup.

Normally I’d say this is not worth the bugs it might introduce, but we developer involvement here is waning, and I think it’s valuable to do the work to make the data easier to learn and understand, whether that’s to make future development by other programmers easier to get started with or to make dead, archived data easier to use.

Current fields:

✅ = all good as-is, ⚠️ = should change, ❌ = should remove, 🌱 = should add

Status	Field	Category	Usage/Notes
✅	`uuid`	DB-centric	Acts as a convenient, single-field primary key, even though versions can also be uniquely identified by `(capture_time, capture_url, source_type)`
✅	`created_at`	DB-centric	Should never be modified; indicates when the record was created.
✅	`updated_at`	DB-centric	Indicates when the record was last modified. Only relevant insofar as we have non-canonical fields listed below.
✅	`capture_time`	Canonical	The time at which the version was originally captured in the source system. Component of the natural key.
⚠️	`capture_url`	Canonical	The URL this is a capture of, e.g. `https://epa.gov/climate-change/`. I think we should rename this to `url`, which would be more concise and clearer. Component of the natural key.
✅	`source_type`	Canonical	The system that originally captured this, e.g. `internet_archive`. Component of the natural key.
✅	`status`	Canonical	The status code of the HTTP response.
⚠️	`uri`	Canonical	The location of the captured HTTP response body. I think we should rename this to `body_url` for clarity. Everyone new gets tripped up on this one.
⚠️	`version_hash`	Canonical	Technically this is derived data, but it should be treated as canonical since we use it for fixity. I think we should rename this to `body_hash` for clarity, and to go with the suggested rename of `uri` to `body_url` above.
✅	`source_metadata`	Canonical	Arbitrary additional metadata from the source that we don’t have a good field for on `Version`.
✅	`page_uuid`	Derived	Note this is not canonical! Pages are really an arbitrary construct. We also want to start treating this as something user settable (rather than something exclusively calculated from underlying data), because pages having multiple URLs (#492) means pages are mergeable, and the page a version belongs to is a malleable concept.
✅	`title`	Derived	Title of the page/document/whatever, if we can figure it out.
⚠️	`content_length`	Derived	The actual size in bytes of the HTTP response body (not the `Content-Length` header). Doesn’t reside anywhere else in the DB, but is/can be automatically derived from the data found from the `uri` field. I’d like to suggest renaming this to `body_length` or `body_size` for consistency with the other `body_*` fields suggested above and to differentiate from the header of the same name, but don’t think it’s a big deal. The current name is not a problem.
✅	`media_type`	Derived	Currently the parsed media type from the `Content-Type` header. I don’t think the name should change, although we should consider sniffing content and canonicalizing the type here instead of just reflecting the header. (See also #752)
❌	`media_type_parameters`	Denormalized	This is complicated to use, not typically of much value, and not worth surfacing directly on the model. If someone ever has an actual use for it, they should get it by consulting the headers. We should remove it. (See also #752)
✅	`different`	Denormalized	This is the only field that merely denormalizes data found elsewhere in the DB, but we currently use it in most of our queries.
🌱	`headers`	Canonical	Does not exist/should add. The headers of the HTTP response. Having this goes a long way towards clarifying `Version` as a canonical record of an archived HTTP response, which is what it should really be at this point. We do not always have headers (so this must be nullable, or maybe an empty object?), but when we do they should be here instead of in `source_metadata`.
🌱	~~`charset`~~	Derived	Does not exist/should add. Character encoding of the HTTP response body. I’m less bullish on this one, since we can set it in the response headers of `uri`/`body_url` and, for HTML, it is often specified (and specified more correctly) in the document itself. It can also be retrieved from the headers in cases where we didn’t have to sniff it during import.

We also track redirects in source_metadata (so we know if the response is a direct response to capture_url or a response to a redirect from it), but I’m not sure that’s worth pulling up into the main model.

I’ve made a point of calling out DB-centric vs. derived vs. denormalized because I think these are particularly useful distinctions. If there weren’t practical concerns, I’d almost suggest the derived & denormalized data belongs in a second table that has a 1:1 relationship with versions. That way we are separating out the canonical record of an archived HTTP response from other useful data we want to make easily available about it.

If it weren’t a total mess to do, I’d also suggest renaming Version to Capture or Snapshot, but I don’t think that’s worth the complex mess and migration headaches. Renaming fields is small potatoes in comparison.

The text was updated successfully, but these errors were encountered:

Mr0grog · 2020-11-11T09:16:15Z

Actually, the more I think about it, the less I want to propose adding charset.

Mr0grog · 2020-11-12T21:52:54Z

In a discussion yesterday, @danielballan concurred with the ideas here, so I feel a little better about actually doing it.

The removal of media_type_parameters is already under way with #778.

Mr0grog · 2021-06-04T03:11:41Z

Things that still need doing here:

capture_url → url
uri → body_url
version_hash → body_hash
content_length → body_length
Add headers (a.k.a. source_metadata.headers → headers)

This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776.

This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776. This doesn't update the fields we *send*. The DB will initially be backwards compatible with the current import format, so we can ship this first, *then* upgrade the DB without anything breaking.

This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776.

This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776. This doesn't update the fields we *send*. The DB will initially be backwards compatible with the current import format, so we can ship this first, *then* upgrade the DB without anything breaking.

Refactor the `Version` model to rename and move some confusing fields. This makes the following changes to `Version`: - Rename `capture_url` → `url` - Rename `uri` → `body_url` - Rename `version_hash` → `body_hash` - Add `headers` (a.k.a. move `source_metadata.headers` → `headers`) Most of these names have been confusing in the past, and this helps align them with what people seem to more intuitively expect or makes them more clear. Renaming `capture_url` and moving `headers` also helps move `Version` more toward canonically representing a snapshot of an HTTP response, rather than just what data we had from Versionista (what drove much of our original design long ago). This also includes a data migration script for moving the header data out of `source_metadata` and into the new column. It’ll be pretty slow, so I’ll run it as a Kubernetes job in production. Fixes #776.

This was referenced Nov 11, 2020

Roadmap for the rest of 2020 edgi-govdata-archiving/web-monitoring#158

Closed

Refactor and normalize media type information #752

Closed

Mr0grog added API Changes to the public API Code Quality labels Nov 11, 2020

stale bot added the stale label Jun 3, 2021

stale bot removed the stale label Jun 4, 2021

edgi-govdata-archiving deleted a comment from stale bot Jun 4, 2021

Mr0grog mentioned this issue Jun 6, 2021

Refactor several poorly named fields on Version #856

Merged

8 tasks

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue Jun 6, 2021

Handle both old and new field names for versions

b174a1d

This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776.

Mr0grog mentioned this issue Jun 6, 2021

Handle both old and new field names for versions edgi-govdata-archiving/web-monitoring-ui#749

Merged

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-task-sheets that referenced this issue Jun 6, 2021

Handle both old and new field names for versions

9a0e1ff

This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776.

This was referenced Jun 6, 2021

Handle both old and new field names for versions edgi-govdata-archiving/web-monitoring-processing#727

Merged

Handle both old and new field names for versions edgi-govdata-archiving/web-monitoring-task-sheets#10

Merged

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ui that referenced this issue Jun 17, 2021

Handle both old and new field names for versions (#749)

16ce39c

This needs to be in place before doing edgi-govdata-archiving/web-monitoring-db#776.

Mr0grog closed this as completed in #856 Jun 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor `Version` Model #776

Refactor `Version` Model #776

Mr0grog commented Nov 11, 2020 •

edited

Loading

Mr0grog commented Nov 11, 2020

Mr0grog commented Nov 12, 2020

Mr0grog commented Jun 4, 2021

Refactor Version Model #776

Refactor Version Model #776

Comments

Mr0grog commented Nov 11, 2020 • edited Loading

Mr0grog commented Nov 11, 2020

Mr0grog commented Nov 12, 2020

Mr0grog commented Jun 4, 2021

Refactor `Version` Model #776

Refactor `Version` Model #776

Mr0grog commented Nov 11, 2020 •

edited

Loading