Refactor and normalize media type information #752
Comments
Did some more thinking on this in #776. I’m now feeling like this should be much narrower:
This removes the `Version#media_type_parameters` field (it wasn’t useful) and changes the `Version#media_type` field to a cleaned-up, normalized, canonicalized field (instead of just reflecting whatever the HTTP response’s `Content-Type` header had). This makes it dramatically more useful without removing canonical information stored elsewhere. Fixes #752.
We removed this functionality from the database in edgi-govdata-archiving/web-monitoring-db#752.
Right now, we track media type information on versions with two fields:
- `media_type`: The main type, e.g. `text/html` or `application/pdf`
- `media_type_parameters`: Any additional parameters for the type, e.g. `charset=utf-8; some-other=param`
I didn’t feel like this was ideal when I set it up, but was having trouble thinking of what would be better. In hindsight, I think we should ideally refactor this to:

- `media`: A normalized version of the main type. This is similar to the current `media_type`, BUT media types that are known to mean the same thing are normalized to a single name.

  For example, `application/html`, `application/xhtml`, and several others are all equivalent to `text/html`, the canonical type for HTML. What most applications really want to know in this case is “was it HTML?” and we should have a field that makes answering that easy. We have too much code all over the place that lists all the known “HTML” types and has to check against all of them. (There are other duplicate media types for non-HTML things; I’m less worried about handling those in the first attempt at this, although this should make those possible, too.)

- `charset` (or `encoding`?): The character encoding of the response body. This information is by far the most commonly used parameter, and has special importance, making it worth separating out. It also applies to a huge variety of media types, so it’s more cross-cutting than almost any other parameter.

- `media_type`: The full media type string, e.g. `text/html; charset=utf-8; some-other=param`. This is the true, canonical information that needs no interpretation or normalization (aside from the case normalization in “`Version.media_type` should always be lower case” #689).

  We might need to give this a different name so it doesn’t conflict with the existing `media_type` field. Maybe `media_type_full` or `content_type`?

This needs:
- […] `lib/tasks/data/`.)
- […] `Version` model and `import_versions_job` to handle it.
- […] `analyze_change_job` to take advantage of it.
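
The proposed split into `media`, `charset`, and the full media type string could be sketched roughly as follows. This is a minimal Ruby illustration, not code from web-monitoring-db: `parse_media_info`, `MediaInfo`, and `MEDIA_TYPE_SYNONYMS` are hypothetical names, the synonym table contains only the two HTML aliases named above, and parameters are split naively on `;` (quoted values containing `;` are not handled).

```ruby
# Hypothetical sketch; none of these names exist in web-monitoring-db.

# Assumed synonym table. Only the two HTML aliases mentioned in this
# issue are listed; a real table would need the "several others" too.
MEDIA_TYPE_SYNONYMS = {
  'application/html' => 'text/html',
  'application/xhtml' => 'text/html'
}.freeze

# Holds the three proposed fields.
MediaInfo = Struct.new(:media, :charset, :media_type_full)

# Splits a raw Content-Type value into the three proposed fields.
def parse_media_info(content_type)
  # The full string is kept as received (only surrounding whitespace is
  # trimmed); the case normalization discussed in #689 is not applied here.
  full = content_type.to_s.strip

  main, *params = full.split(';').map(&:strip)
  main = main.to_s.downcase
  # Map known-equivalent types to a single canonical name.
  media = main.empty? ? nil : MEDIA_TYPE_SYNONYMS.fetch(main, main)

  # Pull out the first charset parameter, if there is one.
  charset = nil
  params.each do |param|
    name, value = param.split('=', 2)
    next unless name && value
    next unless name.strip.downcase == 'charset'

    charset = value.strip.delete('"').downcase
    break
  end

  MediaInfo.new(media, charset, full)
end

info = parse_media_info('application/xhtml; charset=UTF-8; some-other=param')
puts info.media            # text/html
puts info.charset          # utf-8
puts info.media_type_full  # application/xhtml; charset=UTF-8; some-other=param
```

With a field like `media`, the “was it HTML?” question becomes a single comparison (`info.media == 'text/html'`) instead of a check against every known HTML alias.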