Refactor several poorly named fields on `Version` #856

Mr0grog · 2021-06-06T00:06:28Z

This is a first pass at the remaining items in #776. The Version model is pretty central to the whole system, so this is kind of a big and very tedious change. :\

This makes the following changes to Version:

- Rename capture_url → url
- Rename uri → body_url
- Rename version_hash → body_hash
- Maybe rename content_length → body_length
- Add headers (a.k.a. move source_metadata.headers → headers)
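
For illustration, the renames above can be sketched as a small Python function that converts a record from the old field names to the new ones. This is only a sketch, not the code in this PR: the function name `to_new_field_names` and the sample values are hypothetical, and it assumes the old `source_metadata.headers` copy is left in place, which the list above does not specify.

```python
from copy import deepcopy

# Old name -> new name, as listed in this PR.
RENAMES = {
    "capture_url": "url",
    "uri": "body_url",
    "version_hash": "body_hash",
}


def to_new_field_names(record: dict) -> dict:
    """Return a copy of `record` that uses the new Version field names."""
    result = deepcopy(record)
    for old, new in RENAMES.items():
        if old in result:
            value = result.pop(old)
            # If the record already carries the new name, keep that value.
            result.setdefault(new, value)
    # Copy source_metadata.headers up to a top-level `headers` field.
    # Assumption: the original copy under source_metadata is left untouched.
    headers = (result.get("source_metadata") or {}).get("headers")
    if headers is not None:
        result.setdefault("headers", headers)
    return result


# Hypothetical old-format record.
old = {
    "capture_url": "https://example.gov/",
    "uri": "https://storage.example/abc123",
    "version_hash": "abc123",
    "content_length": 1024,
    "source_metadata": {"headers": {"content-type": "text/html"}},
}
new = to_new_field_names(old)
```

Here `content_length` passes through unchanged, and the input record is not modified.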

Note that the content_length change hasn’t yet been implemented here. I’m a little on the fence. The current name is clear and references a well-known HTTP header. The proposed new name (body_length) is still clear and more concise, but departs from that well-known convention.

Remaining work to do here:

- Migration
- Docs
- Main code changes
- Add data migration to fill in headers from existing source_metadata.headers data
- Add tests to ensure imports based on the old format don't break
- Decide on content_length → body_length
- Update seed files
- Update downstream systems (web-monitoring-processing, web-monitoring-task-sheets, web-monitoring-ui, web-monitoring-changed-terms-analysis, web-monitoring-versionista-scraper; not sure these last two really matter since they are not actively used anymore)
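
The data migration item could look roughly like the following. This is a hedged Python sketch over in-memory rows rather than the actual migration: `backfill_headers`, `batches`, and the `batch_size` parameter are hypothetical names, and the rows are plain dicts standing in for database records. The point it illustrates is idempotency: rows that already have `headers` are skipped, so the backfill can be re-run safely.

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


def backfill_headers(rows, batch_size=1000):
    """Fill `headers` from `source_metadata.headers` where it is missing.

    Returns the number of rows changed. Running it a second time changes nothing.
    """
    changed = 0
    for batch in batches(rows, batch_size):
        for row in batch:
            if row.get("headers") is not None:
                continue  # Already migrated; leave as-is.
            headers = (row.get("source_metadata") or {}).get("headers")
            if headers is not None:
                row["headers"] = headers
                changed += 1
    return changed


# Hypothetical rows: one to fill, one with nothing to copy, one already filled.
rows = [
    {"id": 1, "source_metadata": {"headers": {"etag": "x"}}},
    {"id": 2, "source_metadata": {}},
    {"id": 3, "headers": {"etag": "y"}, "source_metadata": {"headers": {"etag": "old"}}},
]
first_run = backfill_headers(rows, batch_size=2)
second_run = backfill_headers(rows, batch_size=2)
```

Processing in batches matters more against a real database, where each batch would be its own query or transaction.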

Fixes #776.

Mr0grog · 2021-06-06T00:39:53Z

OK, I think I’m not going to change content_length. It’s pretty clear as-is, and the shortness/consistency of body_length is a really minor benefit vs. the issues with refactoring the API here.

Mr0grog · 2021-06-06T01:02:40Z

Migrated the seed file with the following quick jq commands:

$ cat db/seed_import.json | jq -c '. + {body_hash: .version_hash, body_url: .uri} | del(.version_hash) | del(.uri)' > db/seed_import.new.json
$ mv db/seed_import.new.json db/seed_import.json

(No need to update capture_url because it’s not used here.)
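
For anyone without jq handy, a rough Python equivalent of the filter above might look like this. It is a sketch under assumptions: it assumes the seed file holds one JSON object per line, the function names and sample record are hypothetical, and key order in the output may differ from what jq produces.

```python
import json


def migrate_line(line: str) -> str:
    """Rename version_hash -> body_hash and uri -> body_url in one JSON object."""
    record = json.loads(line)
    # jq's `.version_hash` yields null when the key is absent, hence pop(..., None).
    record["body_hash"] = record.pop("version_hash", None)
    record["body_url"] = record.pop("uri", None)
    # Compact separators mirror jq's -c output.
    return json.dumps(record, separators=(",", ":"), ensure_ascii=False)


def migrate_lines(lines):
    """Migrate every non-blank line of a line-delimited JSON file."""
    return [migrate_line(line) for line in lines if line.strip()]


# Hypothetical sample line in the old format.
sample = [
    '{"uri":"https://storage.example/abc","version_hash":"abc",'
    '"capture_time":"2021-01-01T00:00:00Z"}'
]
migrated = migrate_lines(sample)
```

To apply it to a file, read the lines from db/seed_import.json and write the results to a new file before moving it into place, as in the commands above.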

Mr0grog · 2021-06-06T01:25:16Z

Downstream update PRs:

Not going to worry about -versionista-scraper and -changed-terms-analysis since they are no longer in active use.

Mr0grog · 2021-06-06T02:03:54Z

OK, this should be good to go pending updates to downstream consumers.

Going to let this bake for a while and come back with fresh eyes and a careful review tomorrow or later in the week.

danielballan · 2021-06-09T17:50:53Z

Renames done in pairs like this that effectively change the meaning of a term can create hard-to-find bugs:

Rename capture_url → url
Rename uri → body_url

Are we sure we like capture_url -> url better enough to risk it? Alternative would be deprecating url entirely so that anywhere it's left over we know unambiguously that it's wrong and what it should be changed to.

Mr0grog · 2021-06-09T18:34:43Z

> Are we sure we like capture_url -> url better enough to risk it?

Ha! These are actually the ones I feel most confident about, even though you are right that they are the most technically risky. I feel confident about these for two reasons:

1. Technical issues are unlikely, because url and uri are still different, so any existing code won’t suddenly be doing the wrong thing.
2. This is resolving longstanding confusion and questions I’ve gotten based on the names; we’re aligning them with what people intuitively expect them to mean, which is the opposite of what they currently are:
   - capture_url is ambiguous about whether it represents the URL that was captured or the URL where the capture resides (another way to clarify this might be capture_url → captured_url). But I think people mostly expect url or uri to be the URL that was captured, which leads me to…
   - uri → body_url is similarly very clarifying because people often first expect that uri is representing the URL/URI that was captured, rather than where the capture is stored.
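
The first point can be checked mechanically. A minimal Python sketch, with a hypothetical helper `reused_names`: a set of renames only risks silently changing a field's meaning if some old name is reused as a new name. The `risky` mapping below is an invented counterexample for contrast, not something proposed in this PR.

```python
def reused_names(renames: dict) -> set:
    """Old names that also appear as new names, i.e. names whose meaning would change."""
    return set(renames) & set(renames.values())


# The renames in this PR.
actual = {"capture_url": "url", "uri": "body_url", "version_hash": "body_hash"}

# Hypothetical riskier variant: reuse `uri` for the captured URL.
risky = {"capture_url": "uri", "uri": "body_url"}

safe_result = reused_names(actual)
risky_result = reused_names(risky)
```

With the renames in this PR the result is empty, so leftover code that still reads uri or capture_url fails on a missing field instead of quietly reading a value with a different meaning.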

danielballan · 2021-06-16T21:20:57Z

Those justifications are convincing. I had actually missed the uri / url distinction in my first read. Now it's clear that's a nonissue.

Mr0grog · 2021-06-17T16:16:05Z

Going to merge this in an hour or so after weekly sheets are done building.

Mr0grog · 2021-06-17T21:27:43Z

Deployed to staging and ran the data migration there. Took 79 minutes, but otherwise worked great. Deploying to production now. :)

Now that edgi-govdata-archiving/web-monitoring-db#856 is merged and fully migrated, we no longer need backwards compatibility with the old schema.

Mr0grog added 5 commits June 5, 2021 15:52

Update documentation to show new Version fields

afbdd69

Add migration to rename fields

3c883b7

Update the code to handle renamed version fields

a3c16f2

OBEY RUBOCOP (mostly)

bb9e02e

Add data migration

2b3af40

Mr0grog added 2 commits June 5, 2021 17:41

Don't rename content_length

68f0ab5

Update seed data to use new field names

a33f3d9

I did this using jq:

$ cat db/seed_import.json | jq -c '. + {body_hash: .version_hash, body_url: .uri} | del(.version_hash) | del(.uri)' > db/seed_import.new.json
$ mv db/seed_import.new.json db/seed_import.json

Add tests for backwards-compatible imports

63a0337

Mr0grog marked this pull request as ready for review June 6, 2021 01:50

OBEY RUBOCOP

675df92

Mr0grog merged commit d971f19 into main Jun 17, 2021

Mr0grog deleted the 776-versions-but-with-slightly-better-names branch June 17, 2021 18:55

Mr0grog added a commit that referenced this pull request Jun 17, 2021

Release #856

69802f8

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jun 17, 2021

Deploy edgi-govdata-archiving/web-monitoring-db#856 to staging

80f322a

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jun 17, 2021

Deploy edgi-govdata-archiving/web-monitoring-db#856 to production

48b7eda
