Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor several poorly named fields on Version #856

Merged
merged 9 commits into from
Jun 17, 2021

Conversation

Mr0grog
Copy link
Member

@Mr0grog Mr0grog commented Jun 6, 2021

This is a first pass at the remaining items in #776. The Version model is pretty central to the whole system, so this is kind of a big and very tedious change. :\

This makes the following changes to Version:

  • Rename capture_urlurl
  • Rename uribody_url
  • Rename version_hashbody_hash
  • Maybe rename content_lengthbody_length
  • Add headers (a.k.a. move source_metadata.headersheaders)

Note that the content_length change hasn’t yet been implemented here. I’m a little on the fence. The current name is clear, and references a well known HTTP header. The proposed new name (body_length) is still clear and is more concise, but departs from well-known convention.

Remaining work to do here:

Fixes #776.

@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 6, 2021

OK, I think I’m not going to change content_length. It’s pretty clear as-is, and the shortness/consistency of body_length is a really minor benefit vs. the issues with refactoring the API here.

I did this using jq:

    $ cat db/seed_import.json | jq -c '. + {body_hash: .version_hash, body_url: .uri} | del(.version_hash) | del(.uri)' > db/seed_import.new.json
    $ mv db/seed_import.new.json db/seed_import.json
@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 6, 2021

Migrated the seed file with the following quick jq commands:

$ cat db/seed_import.json | jq -c '. + {body_hash: .version_hash, body_url: .uri} | del(.version_hash) | del(.uri)' > db/seed_import.new.json
$ mv db/seed_import.new.json db/seed_import.json

(No need to update capture_url because it’s not used here.)

@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 6, 2021

Downstream update PRs:

Not going to worry about -versionista-scraper and -changed-terms-analysis since they are no longer in active use.

@Mr0grog Mr0grog marked this pull request as ready for review June 6, 2021 01:50
@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 6, 2021

OK, this should be good to go pending updates to downstream consumers.

Going to let this bake for a while and come back with fresh eyes and and a careful review tomorrow or later in the week.

@danielballan
Copy link
Contributor

Renames done in pairs like this that effectively change the meaning of term can create hard-to-find bugs:

  • Rename capture_url → url
  • Rename uri → body_url

Are we sure we like capture_url -> url better enough to risk it? Alternative would be deprecating url entirely so that anywhere it's left over we know unambiguously that it's wrong and what it should be changed to.

@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 9, 2021

Are we sure we like capture_url -> url better enough to risk it?

Ha! These are actually the ones I feel most confident about, even though you are right that they are the most technically risky. I feel confident about these for two reasons:

  • Technical issues are unlikely, because url and uri are still different, so any existing code won’t suddenly be doing the wrong thing.
  • This is resolving longstanding confusion and questions I’ve gotten based on the names; we’re aligning them with what people intuitively expect them to mean, which is the opposite of what they currently are:
    • capture_url is ambiguous about whether it represents the URL that was captured or the URL where the capture resides (another way to clarify this might be capture_urlcaptured_url). But I think people mostly expect url or uri to be the URL that was captured, which leads me to…
    • uribody_url is similarly very clarifying because people often first expect that uri is representing the URL/URI that was captured, rather than where the capture is stored.

@danielballan
Copy link
Contributor

Those justifications are convincing. I had actually missed the uri / url distinction in my first read. Now it's clear that's a nonissue.

@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 17, 2021

Going to merge this in an hour or so after weekly sheets are done building.

@Mr0grog Mr0grog merged commit d971f19 into main Jun 17, 2021
@Mr0grog Mr0grog deleted the 776-versions-but-with-slightly-better-names branch June 17, 2021 18:55
Mr0grog added a commit that referenced this pull request Jun 17, 2021
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jun 17, 2021
@Mr0grog
Copy link
Member Author

Mr0grog commented Jun 17, 2021

Deployed to staging and and ran the data migration there. Took 79 minutes, but otherwise worked great. Deploying to production now. :)

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jun 17, 2021
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-task-sheets that referenced this pull request Jun 29, 2021
Now that edgi-govdata-archiving/web-monitoring-db#856 is merged and fully migrated, we no longer need backwards compatibility with the old schema.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor Version Model
2 participants