Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review Inferred vs Assumed charsets #5757

Closed
gregw opened this issue Dec 6, 2020 · 1 comment · Fixed by #5807
Closed

Review Inferred vs Assumed charsets #5757

gregw opened this issue Dec 6, 2020 · 1 comment · Fixed by #5807
Assignees

Comments

@gregw
Copy link
Contributor

gregw commented Dec 6, 2020

Jetty version

= 9.4

Description

There is something a bit strange, at least in our language about inferred vs assumed character encodings.

In MimeType we say:

    /**
     * Access a mutable map of mime type to the charset inferred from that content type.
     * An inferred encoding is used by when encoding/decoding a stream and is
     * explicitly set in any metadata (eg Content-Type).
     *
     * @return Map of mime type to charset
     */
    public static Map<String, String> getInferredEncodings()
    {
        return __inferredEncodings;
    }

    /**
     * Access a mutable map of mime type to the charset assumed for that content type.
     * An assumed encoding is used by when encoding/decoding a stream, but is not
     * explicitly set in any metadata (eg Content-Type).
     *
     * @return Map of mime type to charset
     */
    public static Map<String, String> getAssumedEncodings()
    {
        return __assumedEncodings;
    }

Note both of these are different to a charset explicitly set in a content type (eg "text/plain;charset=UTF16").
An assumed charset is for "text/json", ie there is a charset there, but we don't need to explicitly set it in the content type.
An inferred charset are more like defaults for particular mime types. In encoding.properties we have:

# Mapping of mime type to inferred or assumed charset
# inferred charsets are used for encoding/decoding and explicitly set in Content-Type
# assumed charsets are used for encoding/decoding, but are not set in Content-Type
# In this file, assumed charsets are indicated with a leading '-'

text/html=utf-8
text/plain=iso-8859-1
text/xml=utf-8
application/xhtml+xml=utf-8
text/json=-utf-8
application/json=-utf-8
application/vnd.api+json=-utf-8

But in Response we have:

    private enum EncodingFrom
    {
        NOT_SET, INFERRED, SET_LOCALE, SET_CONTENT_TYPE, SET_CHARACTER_ENCODING
    }

Which suggests that INFERRED is different from SET_CONTENT_TYPE. So does a call to setContentType("text/json") result in INFERRED or SET_CONTENT_TYPE ? The charset is assumed in this case, so at the very least I think we have mixed up our language.

When we get the writer we do:

            //first try explicit char encoding
            String encoding = _characterEncoding;

            //try char set from mime type
            if (encoding == null)
            {
                if (_mimeType != null && _mimeType.isCharsetAssumed())
                    encoding = _mimeType.getCharsetString();
            }

            //try char set assumed from content type
            if (encoding == null)
            {
                encoding = MimeTypes.getCharsetAssumedFromContentType(_contentType);
            }
            
            //try char set inferred from content type
            if (encoding == null)
            {
                encoding = MimeTypes.getCharsetInferredFromContentType(_contentType);
                setCharacterEncoding(encoding, EncodingFrom.INFERRED);
            }
            
            //try any default char encoding for the context
            if (encoding == null)
            {
                Context context = _channel.getRequest().getContext();
                if (context != null)
                    encoding = context.getResponseCharacterEncoding();
            }
            
            //fallback to last resort iso-8859-1
            if (encoding == null)
            {
                encoding = StringUtil.__ISO_8859_1;
                setCharacterEncoding(encoding, EncodingFrom.INFERRED);
            }

So I think we are probably doing the right thing.... but perhaps our language is a bit mangled??? So some fresh eyes looking over this would be useful to at least improve the javadoc.

@gregw
Copy link
Contributor Author

gregw commented Dec 6, 2020

See also jakartaee/servlet#377

lachlan-roberts added a commit that referenced this issue Dec 9, 2020
Signed-off-by: Lachlan Roberts <lachlan@webtide.com>
lachlan-roberts added a commit that referenced this issue Dec 9, 2020
Signed-off-by: Lachlan Roberts <lachlan@webtide.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants