Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new_audit: add charset declaration audit #10284

Merged
merged 23 commits into from
Feb 19, 2020
Merged

new_audit: add charset declaration audit #10284

merged 23 commits into from
Feb 19, 2020

Conversation

Beytoven
Copy link
Contributor

We want to add an audit that checks whether or not the character encoding for a page is properly declared. The audit will pass if any of the following are true:

  1. Charset is declared in a meta tag on the page.
  2. Charset is declared as part of the content-type metadata.
  3. The page has a Byte Order Mark (BOM) set.

image

image

Addresses #10023

@Beytoven Beytoven requested a review from a team as a code owner January 30, 2020 22:19
@Beytoven Beytoven requested review from paulirish and removed request for a team January 30, 2020 22:19
Copy link
Member

@paulirish paulirish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nicely done. on the overall, the audit looks great. just a few nits and ideas.

const Audit = require('../audit.js');
const i18n = require('../../lib/i18n/i18n.js');
const MainResource = require('../../computed/main-resource.js');
const CONTENT_TYPE_HEADER = 'content-type';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yesterday we moved these down to above the class dfn. so i think you have a few more commits to push

@@ -0,0 +1,84 @@
/**
* @license Copyright 2016 Google Inc. All Rights Reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. its a lighthouse reviewer's favorite thing to call out. 🤕

/** Title of a Lighthouse audit that provides detail on if the charset is set properly for a page. This title is shown when the charset is defined correctly. */
title: 'Properly defines charset',
/** Title of a Lighthouse audit that provides detail on if the charset is set properly for a page. This title is shown when the charset meta tag is missing or defined too late in the page. */
failureTitle: 'Charset element is missing or occurs too late on the page',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
failureTitle: 'Charset element is missing or occurs too late on the page',
failureTitle: 'Charset declaration is missing or occurs too late in the HTML',

/** Title of a Lighthouse audit that provides detail on if the charset is set properly for a page. This title is shown when the charset meta tag is missing or defined too late in the page. */
failureTitle: 'Charset element is missing or occurs too late on the page',
/** Description of a Lighthouse audit that tells the user why the charset needs to be defined early on. */
description: 'A character encoding declaration is required whether it is done explicitly ' +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slight rewording. let's remove mention of the BOM here, as I dont think we actually want to recommend it. the linked resource takes care of mentioning it anyway.

A character encoding declaration is required. It can be done with a <meta> tag in the first 1024 bytes of the HTML or in the Content-Type HTTP response header.

*/
static audit(artifacts, context) {
const devtoolsLog = artifacts.devtoolsLogs[Audit.DEFAULT_PASS];
return MainResource.request({devtoolsLog, URL: artifacts.URL}, context)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these days we'd write this with async await instead. a little nicer since you can drop the indentation below. i'd recommend it

// Check the http header 'content-type' to see if charset is defined there
if (mainResource.responseHeaders) {
const contentTypeHeader = mainResource.responseHeaders
.find(header => header.name.toLowerCase() === CONTENT_TYPE_HEADER);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing too fancy about this const so i'd inline it here.


describe('Charset defined audit', () => {
it('succeeds when the page contains the charset meta tag', () => {
const finalUrl = 'https://example.com/';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we try to strike a weird balance between DRY and WET in our tests.

in this case i think this test would benefit from a getArtifacts() helper method that all of these cases can use. that way it'd be a much easier to see at at glance how each of the artifacts differ.

if you search for artifacts( in the lh-core/test/ folder you'll find a few diff tests that use this sort of pattern.

const MainResource = require('../../computed/main-resource.js');
const CONTENT_TYPE_HEADER = 'content-type';
const CHARSET_META_REGEX = /<meta.*charset="?.{1,}"?.*>/gm;
const CHARSET_HTTP_REGEX = /charset=.{1,}/gm;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mentioned you could add these consts to the module.exports, and that way you can write some unit tests against them.

iirc, you wrote these regexs to handle some extra fancy cases that the current unit tests dont cover. like charset= (empty string value). so i think it'd be worth having a unit test for each regex just to give it a handful of cases it should match and not match.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that way you can write some unit tests against them.

let me know if you have any questions on this. i'm thinking 1 test with like 10-ish assertions using various html variants.
regexes are hilarious so its good to test out all sorts of edge cases.

@@ -0,0 +1,185 @@
/**
* @license Copyright 2016 Google Inc. All Rights Reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @license Copyright 2016 Google Inc. All Rights Reserved.
* @license Copyright 2020 Google Inc. All Rights Reserved.

lighthouse-core/audits/dobetterweb/charset.js Outdated Show resolved Hide resolved
lighthouse-core/audits/dobetterweb/charset.js Outdated Show resolved Hide resolved
const MainResource = require('../../computed/main-resource.js');
const CONTENT_TYPE_HEADER = 'content-type';
const CHARSET_META_REGEX = /<meta.*charset="?.{1,}"?.*>/gm;
const CHARSET_HTTP_REGEX = /charset=.{1,}/gm;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that way you can write some unit tests against them.

let me know if you have any questions on this. i'm thinking 1 test with like 10-ish assertions using various html variants.
regexes are hilarious so its good to test out all sorts of edge cases.


describe('Charset defined audit', () => {
it('succeeds when the page contains the charset meta tag', () => {
const htmlContent = '<meta charset="utf-8" />';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make these htmlContent stubs a little more realistic i'd recommend a doctype and some content after the meta... like

<!doctype html><meta charset="utf-8" /><h1>hello

const htmlContent = '<meta charset="utf-8" />';
const artifacts = generateArtifacts(htmlContent);
const context = {computedCache: new Map()};
return CharsetDefinedAudit.audit(artifacts, context).then(auditResult => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah we can also go async/await here in the tests, too..

async () => { at the top and then down here something like

    const auditResult = await CharsetDefinedAudit.audit(artifacts, context);
    assert.equal(auditResult.score, 1);

static async audit(artifacts, context) {
const devtoolsLog = artifacts.devtoolsLogs[Audit.DEFAULT_PASS];
const mainResource = await MainResource.request({devtoolsLog, URL: artifacts.URL}, context);
let charsetIsSet = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. typical convention is to start boolean vars with is/ has / etc. so i'd just rename to isCharsetSet or isCharsetDefined

lighthouse-core/audits/dobetterweb/charset.js Show resolved Hide resolved
it('succeeds when the page contains the charset meta tag', () => {
const htmlContent = '<meta charset="utf-8" />';
const artifacts = generateArtifacts(htmlContent);
const context = {computedCache: new Map()};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a huge deal but you could also move this computedCache thing into generateArtifacts and then you'd have something like...

const {artifacts, context} = generateArtifacts(htmlContent);

.find(header => header.name.toLowerCase() === CONTENT_TYPE_HEADER);

if (contentTypeHeader) {
isCharsetSet = contentTypeHeader.value.match(CHARSET_HTTP_REGEX) !== null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super js nit but i prefer regex.test(str) vs str.match(regex) when you just need the boolean result

lighthouse-core/audits/dobetterweb/charset.js Outdated Show resolved Hide resolved
lighthouse-core/audits/dobetterweb/charset.js Show resolved Hide resolved
assert.equal(HTTP_REGEX.test('text/html; charset= '), false);
});

it('handles charset name validation correctly', () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supa hot 🔥

Copy link
Member

@paulirish paulirish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last remaining nits. lgtm!

* @fileoverview Audits a page to ensure charset it configured properly.
* It must be defined within the first 1024 bytes of the HTML document, defined in the HTTP header, or the document source starts with a BOM.
*
* TODO: It doesn't yet validate the encoding is a valid IANA charset name. https://www.iana.org/assignments/character-sets/character-sets.xhtml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it does now. at least for the html5 meta case. :)

you can drop this line but move the link down to L35

isCharsetSet = isCharsetSet || artifacts.MainDocumentContent.charCodeAt(0) === BOM_FIRSTCHAR;

// Check if charset-ish meta tag is defined within the first 1024 characters(~1024 bytes) of the HTML document
if (artifacts.MainDocumentContent.slice(0, 1024).match(CHARSET_HTML_REGEX) !== null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return (meta.charset && meta.charset.match(IANA_REGEX)) ||
(meta.httpEquiv === 'content-type' &&
meta.content &&
meta.content.match(CHARSET_HTTP_REGEX));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paulirish paulirish changed the title core: add charset declaration audit new_audit: add charset declaration audit Feb 19, 2020
@googlebot
Copy link

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@paulirish paulirish merged commit ce529a1 into master Feb 19, 2020
@paulirish paulirish deleted the charset_header branch February 19, 2020 21:50
@paulirish
Copy link
Member

🎉 🎉 🎉

@hsivonen
Copy link

Thank you!

mathiasbynens added a commit to mathiasbynens/lighthouse that referenced this pull request Feb 27, 2020
While it would be overkill to implement full-blown HTML/HTTP parsers, by
making the regular expressions case-insensitive can can reduce the amount
of false negatives for the charset audit.

This patch also applies some drive-by nits/simplifications.

Ref. GoogleChrome#10023, GoogleChrome#10284.
mathiasbynens added a commit to mathiasbynens/lighthouse that referenced this pull request Feb 27, 2020
While it would be overkill to implement full-blown HTML/HTTP parsers, by simply making the regular expressions case-insensitive we can reduce the amount of false negatives for the charset audit.

This patch also applies some drive-by nits/simplifications.

Ref. GoogleChrome#10023, GoogleChrome#10284.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants