Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and normalize media type information #752

Closed
Mr0grog opened this issue Sep 8, 2020 · 1 comment · Fixed by #778
Closed

Refactor and normalize media type information #752

Mr0grog opened this issue Sep 8, 2020 · 1 comment · Fixed by #778
Labels
API Changes to the public API enhancement never-stale

Comments

@Mr0grog
Copy link
Member

Mr0grog commented Sep 8, 2020

Right now, we track media type information on versions with two fields:

  • media_type: The main type, e.g. text/html or application/pdf
  • media_type_parameters: Any additional parameters for the type, e.g. charset=utf-8; some-other=param

I didn’t feel like this was ideal when I set it up, but was having trouble thinking of what would be better. In hindsight, I think we should ideally refactor this to:

  • media: A normalized version of the main type. This is similar to the current media_type, BUT media types that are known to mean the same thing are normalized to a single name.

    For example, application/html, application/xhtml, and several others are all equivalent to text/html, the canonical type for HTML. What most applications really want to know in this case is “was it HTML?” and we should have a field that makes answering that easy. We have too much code all over the place that lists all the known “HTML” types and has to check against all of them. (There are other duplicate media types for non-HTML things; I’m less worried about handling those in the first attempt at this, although this should make those possible, too.)

  • charset (or encoding?): The character encoding of the response body. This information is by far and away the most commonly used parameter, and has special importance, making it worth separating out. It also applies to a huge variety of media types, so it’s more cross-cutting than almost any other parameter.

  • media_type: The full media type string, e.g. text/html; charset=utf-8; some-other=param. This is the true, canonical information that needs no interpretation or normalization (aside from the case normalization in Version.media_type should always be lower case #689).

    We might need to give this a different name so it doesn’t conflict with the existing media_type field. Maybe media_type_full or content_type?

This needs:

  • A schema migration to create the new fields.
  • A schema migration (later, in a separate PR) to remove the old fields.
  • A data migration to fill in the new fields. (This will be a long, slow job, which is why it’s separate from the schema migration. See lib/tasks/data/.)
  • Code in the Version model and import_versions_job to handle it.
  • Updates to analyze_change_job to take advantage of it.
@Mr0grog
Copy link
Member Author

Mr0grog commented Nov 11, 2020

Did some more thinking on this in #776. I’m now feeling like this should be much narrower:

  • Keep media_type as the main type, but don’t constrain it to whatever was in the header — it should be a normalized, potentially sniffed value (not going to add code to do sniffing or normalizing as part of this, but want to set up the expectation that this is the effective media type, not whatever was in the header). If people really need the data from the header, they can go grab it from source_metadata or from a new headers field. It’s not a common enough need to provide special support for it. What we really want here is the “effective” media type, without parameters.

    The main idea here is that if we’re going to bother with denormalizing something, we should always make it the most useful, cleaned up form.

  • Drop media_type_parameters. It’s not actually doing anything useful and is complex to get right.

  • Probably DO NOT add charset. It’s not as important as the media type and can be stored as metadata in S3, so it just comes as a header when requesting the body. If it was specified in the original headers, it’s still accessible through source_metadata or a new headers field.

  • DO NOT add media. We’re just going to change media_type to be what I originally suggested media should be (see first point in this comment).

Mr0grog added a commit that referenced this issue Nov 13, 2020
This removes the `Version#media_type_parameters` field (it wasn’t useful) and changes the `Versions#media_type` field to a cleaned-up, normalized, canonicalized field (instead of just reflecting whatever the HTTP response’s `Content-Type` header had). This makes it dramatically more useful without removing canonical information stored elsewhere.

Fixes #752.
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Nov 13, 2020
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Nov 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Changes to the public API enhancement never-stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant