Refactor and normalize media type information #752

Mr0grog · 2020-09-08T19:15:25Z

Right now, we track media type information on versions with two fields:

media_type: The main type, e.g. text/html or application/pdf
media_type_parameters: Any additional parameters for the type, e.g. charset=utf-8; some-other=param

I didn’t feel like this was ideal when I set it up, but was having trouble thinking of what would be better. In hindsight, I think we should ideally refactor this to:

media: A normalized version of the main type. This is similar to the current media_type, BUT media types that are known to mean the same thing are normalized to a single name.

For example, application/html, application/xhtml, and several others are all equivalent to text/html, the canonical type for HTML. What most applications really want to know in this case is “was it HTML?” and we should have a field that makes answering that easy. We have too much code all over the place that lists all the known “HTML” types and has to check against all of them. (There are other duplicate media types for non-HTML things; I’m less worried about handling those in the first attempt at this, although this should make those possible, too.)
charset (or encoding?): The character encoding of the response body. This information is by far and away the most commonly used parameter, and has special importance, making it worth separating out. It also applies to a huge variety of media types, so it’s more cross-cutting than almost any other parameter.
media_type: The full media type string, e.g. text/html; charset=utf-8; some-other=param. This is the true, canonical information that needs no interpretation or normalization (aside from the case normalization in Version.media_type should always be lower case #689).

We might need to give this a different name so it doesn’t conflict with the existing media_type field. Maybe media_type_full or content_type?

This needs:

A schema migration to create the new fields.
A schema migration (later, in a separate PR) to remove the old fields.
A data migration to fill in the new fields. (This will be a long, slow job, which is why it’s separate from the schema migration. See lib/tasks/data/.)
Code in the Version model and import_versions_job to handle it.
Updates to analyze_change_job to take advantage of it.

The text was updated successfully, but these errors were encountered:

Mr0grog · 2020-11-11T19:40:26Z

Did some more thinking on this in #776. I’m now feeling like this should be much narrower:

Keep media_type as the main type, but don’t constrain it to whatever was in the header — it should be a normalized, potentially sniffed value (not going to add code to do sniffing or normalizing as part of this, but want to set up the expectation that this is the effective media type, not whatever was in the header). If people really need the data from the header, they can go grab it from source_metadata or from a new headers field. It’s not a common enough need to provide special support for it. What we really want here is the “effective” media type, without parameters.

The main idea here is that if we’re going to bother with denormalizing something, we should always make it the most useful, cleaned up form.
Drop media_type_parameters. It’s not actually doing anything useful and is complex to get right.
Probably DO NOT add charset. It’s not as important as the media type and can be stored as metadata in S3, so it just comes as a header when requesting the body. If it was specified in the original headers, it’s still accessible through source_metadata or a new headers field.
DO NOT add media. We’re just going to change media_type to be what I originally suggested media should be (see first point in this comment).

This removes the `Version#media_type_parameters` field (it wasn’t useful) and changes the `Versions#media_type` field to a cleaned-up, normalized, canonicalized field (instead of just reflecting whatever the HTTP response’s `Content-Type` header had). This makes it dramatically more useful without removing canonical information stored elsewhere. Fixes #752.

We removed this functionality from the database in edgi-govdata-archiving/web-monitoring-db#752.

Mr0grog added API Changes to the public API never-stale enhancement labels Sep 8, 2020

This was referenced Nov 11, 2020

Refactor Version Model #776

Closed

Roadmap for the rest of 2020 edgi-govdata-archiving/web-monitoring#158

Closed

Mr0grog mentioned this issue Nov 12, 2020

Remove media_type_parameters #778

Merged

Mr0grog closed this as completed in #778 Nov 13, 2020

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Nov 13, 2020

Drop references to removed media_type_parameters

3ff4615

We removed this functionality from the database in edgi-govdata-archiving/web-monitoring-db#752.

Mr0grog mentioned this issue Nov 13, 2020

Drop references to removed media_type_parameters edgi-govdata-archiving/web-monitoring-processing#662

Merged

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-processing that referenced this issue Nov 13, 2020

Drop references to removed media_type_parameters (#662)

06e3e51

We removed this functionality from the database in edgi-govdata-archiving/web-monitoring-db#752.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor and normalize media type information #752

Refactor and normalize media type information #752

Mr0grog commented Sep 8, 2020

Mr0grog commented Nov 11, 2020 •

edited

Loading

Refactor and normalize media type information #752

Refactor and normalize media type information #752

Comments

Mr0grog commented Sep 8, 2020

Mr0grog commented Nov 11, 2020 • edited Loading

Mr0grog commented Nov 11, 2020 •

edited

Loading