new_audit: add charset declaration audit #10284

Beytoven · 2020-01-30T22:19:05Z

We want to add an audit that checks whether or not the character encoding for a page is properly declared. The audit will pass if any of the following are true:

Charset is declared in a meta tag on the page.
Charset is declared as part of the content-type metadata.
The page has a Byte Order Mark (BOM) set.
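
The three pass conditions above can be sketched as a single predicate. This is a simplified illustration rather than the audit's code: `hasCharsetDeclaration` and both regexes are made up for this example, and it assumes a BOM survives decoding as U+FEFF at the start of the string.

```javascript
// Illustrative only: the function name and both regexes are assumptions,
// not the ones added in this PR.
const BOM_FIRSTCHAR = 0xFEFF;
const META_CHARSET = /<meta[^>]*charset\s*=\s*["']?[\w-]+/i;
const HEADER_CHARSET = /charset\s*=\s*[\w-]+/i;

/**
 * @param {string} html Decoded source of the main document.
 * @param {string=} contentType Value of the Content-Type response header.
 * @return {boolean} True if any of the three declarations is present.
 */
function hasCharsetDeclaration(html, contentType = '') {
  const startsWithBom = html.charCodeAt(0) === BOM_FIRSTCHAR;
  const declaredInHeader = HEADER_CHARSET.test(contentType);
  // Only the start of the document counts for the meta tag.
  const declaredInMeta = META_CHARSET.test(html.slice(0, 1024));
  return startsWithBom || declaredInHeader || declaredInMeta;
}
```

Any one of the three signals is enough, which is why the checks are combined with `||`.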

Addresses #10023

paulirish

nicely done. overall, the audit looks great. just a few nits and ideas.

paulirish · 2020-01-31T21:10:12Z

lighthouse-core/audits/dobetterweb/charset.js

+const Audit = require('../audit.js');
+const i18n = require('../../lib/i18n/i18n.js');
+const MainResource = require('../../computed/main-resource.js');
+const CONTENT_TYPE_HEADER = 'content-type';


yesterday we moved these down to just above the class definition. so i think you have a few more commits to push

paulirish · 2020-01-31T21:11:02Z

lighthouse-core/audits/dobetterweb/charset.js

@@ -0,0 +1,84 @@
+/**
+ * @license Copyright 2016 Google Inc. All Rights Reserved.


the copyright year. it's a lighthouse reviewer's favorite thing to call out. 🤕

paulirish · 2020-01-31T21:13:10Z

lighthouse-core/audits/dobetterweb/charset.js

+  /** Title of a Lighthouse audit that provides detail on if the charset is set properly for a page. This title is shown when the charset is defined correctly. */
+  title: 'Properly defines charset',
+  /** Title of a Lighthouse audit that provides detail on if the charset is set properly for a page. This title is shown when the charset meta tag is missing or defined too late in the page. */
+  failureTitle: 'Charset element is missing or occurs too late on the page',


Suggested change

-  failureTitle: 'Charset element is missing or occurs too late on the page',
+  failureTitle: 'Charset declaration is missing or occurs too late in the HTML',

paulirish · 2020-01-31T21:16:48Z

lighthouse-core/audits/dobetterweb/charset.js

+  /** Title of a Lighthouse audit that provides detail on if the charset is set properly for a page. This title is shown when the charset meta tag is missing or defined too late in the page. */
+  failureTitle: 'Charset element is missing or occurs too late on the page',
+  /** Description of a Lighthouse audit that tells the user why the charset needs to be defined early on. */
+  description: 'A character encoding declaration is required whether it is done explicitly ' +


slight rewording. let's remove mention of the BOM here, as I don't think we actually want to recommend it. the linked resource takes care of mentioning it anyway.

A character encoding declaration is required. It can be done with a <meta> tag in the first 1024 bytes of the HTML or in the Content-Type HTTP response header.

paulirish · 2020-01-31T21:20:10Z

lighthouse-core/audits/dobetterweb/charset.js

+   */
+  static audit(artifacts, context) {
+    const devtoolsLog = artifacts.devtoolsLogs[Audit.DEFAULT_PASS];
+    return MainResource.request({devtoolsLog, URL: artifacts.URL}, context)


these days we'd write this with async/await instead. it's a little nicer since you can drop the indentation below. i'd recommend it
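
A side-by-side sketch of the two styles may help here. `MainResource` below is a hypothetical stand-in that resolves a canned record; the real computed artifact takes a devtools log and does considerably more. The scoring logic is also simplified for the comparison.

```javascript
// Hypothetical stand-in for Lighthouse's MainResource computed artifact.
const MainResource = {
  async request() {
    return {
      responseHeaders: [{name: 'Content-Type', value: 'text/html; charset=utf-8'}],
    };
  },
};

// Promise-chain style: the audit body is nested inside the callback.
function auditWithThen(artifacts, context) {
  return MainResource.request({URL: artifacts.URL}, context).then(mainResource => {
    const header = mainResource.responseHeaders
      .find(h => h.name.toLowerCase() === 'content-type');
    return {score: Number(Boolean(header && /charset=/i.test(header.value)))};
  });
}

// async/await style: same logic, one less level of indentation.
async function auditWithAwait(artifacts, context) {
  const mainResource = await MainResource.request({URL: artifacts.URL}, context);
  const header = mainResource.responseHeaders
    .find(h => h.name.toLowerCase() === 'content-type');
  return {score: Number(Boolean(header && /charset=/i.test(header.value)))};
}
```

Both functions return a promise, so callers do not need to change when the body is converted.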

paulirish · 2020-01-31T21:21:29Z

lighthouse-core/audits/dobetterweb/charset.js

+      // Check the http header 'content-type' to see if charset is defined there
+      if (mainResource.responseHeaders) {
+        const contentTypeHeader = mainResource.responseHeaders
+          .find(header => header.name.toLowerCase() === CONTENT_TYPE_HEADER);


nothing too fancy about this const so i'd inline it here.

paulirish · 2020-01-31T21:25:57Z

lighthouse-core/test/audits/dobetterweb/charset-test.js

+
+describe('Charset defined audit', () => {
+  it('succeeds when the page contains the charset meta tag', () => {
+    const finalUrl = 'https://example.com/';


we try to strike a weird balance between DRY and WET in our tests.

in this case i think this test would benefit from a getArtifacts() helper method that all of these cases can use. that way it'd be much easier to see at a glance how each of the artifacts differ.

if you search for artifacts( in the lh-core/test/ folder you'll find a few different tests that use this sort of pattern.
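
A minimal sketch of such a helper, assuming a trimmed artifact shape. The `mainResource` field and the default header value are inventions for this example; real Lighthouse tests would build a devtools log for the main resource instead.

```javascript
// Hypothetical helper: the artifact shape is trimmed to what a charset
// check reads and is not the shape Lighthouse tests actually use.
function getArtifacts(htmlContent, contentTypeValue = 'text/html') {
  const finalUrl = 'https://example.com/';
  return {
    URL: {finalUrl},
    MainDocumentContent: htmlContent,
    mainResource: {
      url: finalUrl,
      responseHeaders: [{name: 'content-type', value: contentTypeValue}],
    },
  };
}

// Each case now shows only what differs from the default.
const withMeta = getArtifacts('<!doctype html><meta charset="utf-8" /><h1>hello');
const withHeader = getArtifacts('<!doctype html><h1>hello', 'text/html; charset=utf-8');
const withNeither = getArtifacts('<!doctype html><h1>hello');
```

With the shared setup hidden in the helper, the arguments at each call site are the whole difference between cases.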

paulirish · 2020-01-31T21:30:14Z

lighthouse-core/audits/dobetterweb/charset.js

+const MainResource = require('../../computed/main-resource.js');
+const CONTENT_TYPE_HEADER = 'content-type';
+const CHARSET_META_REGEX = /<meta.*charset="?.{1,}"?.*>/gm;
+const CHARSET_HTTP_REGEX = /charset=.{1,}/gm;


i mentioned you could add these consts to the module.exports, and that way you can write some unit tests against them.

iirc, you wrote these regexes to handle some extra fancy cases that the current unit tests don't cover. like charset= (empty string value). so i think it'd be worth having a unit test for each regex just to give it a handful of cases it should match and not match.


paulirish · 2020-01-31T21:30:35Z

lighthouse-core/test/audits/dobetterweb/charset-test.js

@@ -0,0 +1,185 @@
+/**
+ * @license Copyright 2016 Google Inc. All Rights Reserved.


Suggested change

- * @license Copyright 2016 Google Inc. All Rights Reserved.
+ * @license Copyright 2020 Google Inc. All Rights Reserved.

paulirish · 2020-02-04T20:31:36Z

lighthouse-core/audits/dobetterweb/charset.js

+const MainResource = require('../../computed/main-resource.js');
+const CONTENT_TYPE_HEADER = 'content-type';
+const CHARSET_META_REGEX = /<meta.*charset="?.{1,}"?.*>/gm;
+const CHARSET_HTTP_REGEX = /charset=.{1,}/gm;


that way you can write some unit tests against them.

let me know if you have any questions on this. i'm thinking 1 test with like 10-ish assertions using various html variants.
regexes are hilarious so it's good to test out all sorts of edge cases.
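
One way to lay out that kind of test is a table of inputs and expected results per regex. The two patterns below are illustrative stand-ins, not the ones from the diff, and `mismatches` is a hypothetical helper; in the real test file each row would become an assertion.

```javascript
// Illustrative regexes, not the ones from the diff above.
const CHARSET_HTML_REGEX = /<meta[^>]*charset\s*=\s*["']?[\w-]+/i;
const CHARSET_HTTP_REGEX = /charset\s*=\s*[\w-]+/i;

// [input, should it match?]
const htmlCases = [
  ['<meta charset="utf-8" />', true],
  ['<meta charset=utf-8>', true],
  ["<meta charset='utf-8'>", true],
  ['<META CHARSET="UTF-8">', true],
  ['<meta http-equiv="Content-Type" content="text/html; charset=utf-8">', true],
  ['<meta charset="">', false],
  ['<meta charset=>', false],
  ['<meta name="viewport" content="width=device-width">', false],
  ['<p>charset="utf-8"</p>', false],
];

const httpCases = [
  ['text/html; charset=utf-8', true],
  ['text/html; charset=UTF-8', true],
  ['text/html;charset = utf-8', true],
  ['text/html', false],
  ['text/html; charset=', false],
  ['text/html; charset=  ', false],
];

// Returns the inputs whose result disagrees with the expectation.
function mismatches(regex, cases) {
  return cases
    .filter(([input, expected]) => regex.test(input) !== expected)
    .map(([input]) => input);
}
```

Returning the failing inputs rather than a boolean makes it obvious which edge case broke when a pattern is later loosened or tightened.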

paulirish · 2020-02-04T20:34:29Z

lighthouse-core/test/audits/dobetterweb/charset-test.js

+
+describe('Charset defined audit', () => {
+  it('succeeds when the page contains the charset meta tag', () => {
+    const htmlContent = '<meta charset="utf-8" />';


to make these htmlContent stubs a little more realistic i'd recommend a doctype and some content after the meta... like

<!doctype html><meta charset="utf-8" /><h1>hello

paulirish · 2020-02-04T20:36:31Z

lighthouse-core/test/audits/dobetterweb/charset-test.js

+    const htmlContent = '<meta charset="utf-8" />';
+    const artifacts = generateArtifacts(htmlContent);
+    const context = {computedCache: new Map()};
+    return CharsetDefinedAudit.audit(artifacts, context).then(auditResult => {


ah we can also go async/await here in the tests, too..

async () => { at the top and then down here something like

const auditResult = await CharsetDefinedAudit.audit(artifacts, context);
assert.equal(auditResult.score, 1);

paulirish · 2020-02-04T20:38:04Z

lighthouse-core/audits/dobetterweb/charset.js

+  static async audit(artifacts, context) {
+    const devtoolsLog = artifacts.devtoolsLogs[Audit.DEFAULT_PASS];
+    const mainResource = await MainResource.request({devtoolsLog, URL: artifacts.URL}, context);
+    let charsetIsSet = false;


nit. typical convention is to start boolean vars with is/has/etc. so i'd just rename to isCharsetSet or isCharsetDefined

paulirish · 2020-02-04T20:42:39Z

lighthouse-core/test/audits/dobetterweb/charset-test.js

+  it('succeeds when the page contains the charset meta tag', () => {
+    const htmlContent = '<meta charset="utf-8" />';
+    const artifacts = generateArtifacts(htmlContent);
+    const context = {computedCache: new Map()};


not a huge deal but you could also move this computedCache thing into generateArtifacts and then you'd have something like...

const {artifacts, context} = generateArtifacts(htmlContent);
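
A sketch of what that could look like, assuming a trimmed artifact shape (the fields shown are placeholders, not the full set the audit reads):

```javascript
// Hypothetical helper shape; the artifact fields are trimmed for illustration.
function generateArtifacts(htmlContent) {
  const artifacts = {
    URL: {finalUrl: 'https://example.com/'},
    MainDocumentContent: htmlContent,
  };
  // A fresh cache per call keeps test cases isolated from one another.
  const context = {computedCache: new Map()};
  return {artifacts, context};
}

const {artifacts, context} =
  generateArtifacts('<!doctype html><meta charset="utf-8" /><h1>hello');
```

Creating the `Map` inside the helper means no test can accidentally reuse another test's computed results.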

paulirish · 2020-02-15T00:17:46Z

lighthouse-core/audits/dobetterweb/charset.js

+        .find(header => header.name.toLowerCase() === CONTENT_TYPE_HEADER);
+
+      if (contentTypeHeader) {
+        isCharsetSet = contentTypeHeader.value.match(CHARSET_HTTP_REGEX) !== null;


super js nit but i prefer regex.test(str) vs str.match(regex) when you just need the boolean result
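
A short sketch of the difference. One thing worth keeping in mind when switching: the regexes in the earlier diff carry the `g` flag, and `test()` on a global regex is stateful because it resumes from `lastIndex`. The patterns below are copies made for this example, with the flags changed as noted.

```javascript
const value = 'text/html; charset=utf-8';

// Without the g flag, both forms answer the same yes/no question.
const CHARSET_REGEX = /charset=.{1,}/;
const viaMatch = value.match(CHARSET_REGEX) !== null;
const viaTest = CHARSET_REGEX.test(value);

// With the g flag, test() resumes from regex.lastIndex, so calling it
// twice on the same string gives different answers.
const GLOBAL_REGEX = /charset=.{1,}/g;
const firstCall = GLOBAL_REGEX.test(value);  // true, lastIndex moves to the end
const secondCall = GLOBAL_REGEX.test(value); // false, search starts at the end
```

Dropping the `g` flag is the simplest way to keep `test()` free of this state when only a boolean is needed.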

paulirish · 2020-02-15T00:21:49Z

lighthouse-core/test/audits/dobetterweb/charset-test.js

+    assert.equal(HTTP_REGEX.test('text/html; charset=  '), false);
+  });
+
+  it('handles charset name validation correctly', () => {


supa hot 🔥

paulirish

last remaining nits. lgtm!

paulirish · 2020-02-19T20:23:09Z

lighthouse-core/audits/dobetterweb/charset.js

+ * @fileoverview Audits a page to ensure charset it configured properly.
+ * It must be defined within the first 1024 bytes of the HTML document, defined in the HTTP header, or the document source starts with a BOM.
+ *
+ * TODO: It doesn't yet validate the encoding is a valid IANA charset name. https://www.iana.org/assignments/character-sets/character-sets.xhtml


it does now. at least for the html5 meta case. :)

you can drop this line but move the link down to L35

paulirish · 2020-02-19T20:25:00Z

lighthouse-core/audits/dobetterweb/charset.js

+    isCharsetSet = isCharsetSet || artifacts.MainDocumentContent.charCodeAt(0) === BOM_FIRSTCHAR;
+
+    // Check if charset-ish meta tag is defined within the first 1024 characters(~1024 bytes) of the HTML document
+    if (artifacts.MainDocumentContent.slice(0, 1024).match(CHARSET_HTML_REGEX) !== null) {


same nit about .test vs .match
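
On the "1024 characters (~1024 bytes)" comment in the hunk above: `slice(0, 1024)` counts UTF-16 code units, so the two only line up when everything before the meta tag is ASCII. The sketch below measures the byte offset instead, assuming the document is served as UTF-8; `metaCharsetByteOffset` is a hypothetical helper, not part of the audit.

```javascript
// Hypothetical helper: byte offset of the first charset-bearing <meta>,
// assuming the document is encoded as UTF-8.
function metaCharsetByteOffset(html) {
  const match = /<meta[^>]*charset/i.exec(html);
  if (!match) return -1;
  // TextEncoder always produces UTF-8 bytes.
  return new TextEncoder().encode(html.slice(0, match.index)).length;
}

const asciiDoc = '<!doctype html><meta charset="utf-8"><h1>hello';

// 600 two-byte characters ahead of the meta tag: it sits at character
// index 630 but at byte offset 1230.
const accentedDoc =
  '<!doctype html><title>' + '\u00e9'.repeat(600) + '</title><meta charset="utf-8">';
```

For `accentedDoc`, a character-based slice would find the tag while a byte-based limit would not, which is the gap the "~" in the comment glosses over.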

paulirish · 2020-02-19T20:25:04Z

lighthouse-core/audits/dobetterweb/charset.js

+        return (meta.charset && meta.charset.match(IANA_REGEX)) ||
+          (meta.httpEquiv === 'content-type' &&
+          meta.content &&
+          meta.content.match(CHARSET_HTTP_REGEX));


same nit about .test vs .match
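
A sketch of the same check written with `.test`. Both patterns are assumptions made for this example (the PR's `IANA_REGEX` and `CHARSET_HTTP_REGEX` may differ), and `hasCharsetMeta` is a hypothetical wrapper around the expression in the hunk.

```javascript
// Assumed patterns for illustration; the PR's regexes may differ.
const IANA_REGEX = /^[a-zA-Z0-9\-_:.()]{2,}$/;
const CHARSET_HTTP_REGEX = /charset\s*=\s*[a-zA-Z0-9\-_:.()]{2,}/i;

/**
 * @param {Array<{charset?: string, httpEquiv?: string, content?: string}>} metaElements
 * @return {boolean}
 */
function hasCharsetMeta(metaElements) {
  return metaElements.some(meta =>
    // Keep the truthiness guards: test(undefined) would be run against the
    // string 'undefined', which the first pattern accepts.
    (Boolean(meta.charset) && IANA_REGEX.test(meta.charset)) ||
    (meta.httpEquiv === 'content-type' &&
      Boolean(meta.content) &&
      CHARSET_HTTP_REGEX.test(meta.content)));
}
```

Unlike `.match`, which returns an array or `null`, this version always yields a real boolean.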

googlebot · 2020-02-19T20:30:56Z

A Googler has manually verified that the CLAs look good.

(Googler, please make sure the reason for overriding the CLA status is clearly documented in these comments.)

ℹ️ Googlers: Go here for more info.

googlebot · 2020-02-19T20:47:22Z

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

googlebot · 2020-02-19T21:22:30Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

paulirish · 2020-02-19T21:50:41Z

🎉 🎉 🎉

hsivonen · 2020-02-26T07:07:57Z

Thank you!

While it would be overkill to implement full-blown HTML/HTTP parsers, by simply making the regular expressions case-insensitive we can reduce the amount of false negatives for the charset audit. This patch also applies some drive-by nits/simplifications. Ref. GoogleChrome#10023, GoogleChrome#10284.

Beytoven added 5 commits January 27, 2020 17:39

WIP: Adding charset audit

a4f7f95

Finishing where I left off. Added the 3 charset checks

c7e0bce

Adding unit tests. So far not so good

f70bbae

Finishing unit tests

7138424

Update regex to fail when charset set to empty

40cbf99

Beytoven requested a review from a team as a code owner January 30, 2020 22:19

Beytoven requested review from paulirish and removed request for a team January 30, 2020 22:19

googlebot added the cla: yes label Jan 30, 2020

Add a couple more tests

f7b7f78

vercel bot deployed to Preview January 30, 2020 22:21 View deployment

Update description to include 'Learn More' link

040d371

vercel bot deployed to Preview January 30, 2020 22:27 View deployment

Fix lint errors

7d0aa10

vercel bot deployed to Preview January 30, 2020 23:08 View deployment

paulirish requested changes Jan 31, 2020

Address some of PR comments

593a9ca

vercel bot deployed to Preview January 31, 2020 23:17 View deployment

Update sample.json

c203014

vercel bot deployed to Preview January 31, 2020 23:19 View deployment

Refactor unit tests

b3373a2

vercel bot deployed to Preview February 1, 2020 00:35 View deployment

Fix build errors

e8d6a59

vercel bot deployed to Preview February 4, 2020 19:38 View deployment

devtools-bot added the needs-priority label Feb 4, 2020

paulirish requested changes Feb 4, 2020

Add regex tests

0819a00

vercel bot deployed to Preview February 4, 2020 21:46 View deployment

Fix typo

fd8d954

vercel bot deployed to Preview February 5, 2020 19:14 View deployment

googlebot added the cla: no label Feb 14, 2020

paulirish reviewed Feb 15, 2020

vercel bot deployed to Preview February 18, 2020 19:17 View deployment

vercel bot deployed to Preview February 18, 2020 20:27 View deployment

paulirish approved these changes Feb 19, 2020

paulirish added the new_audit label Feb 19, 2020

paulirish changed the title ~~core: add charset declaration audit~~ new_audit: add charset declaration audit Feb 19, 2020

paulirish added cla: yes and removed cla: no labels Feb 19, 2020

vercel bot deployed to Preview February 19, 2020 20:47 View deployment

googlebot added cla: no and removed cla: yes labels Feb 19, 2020

Beytoven and others added 4 commits February 19, 2020 13:18

Fix dbw smoke test

2d45a80

Apply suggestions from code review

154aa10

Co-Authored-By: Paul Irish <paulirish@google.com>

Add charset and httpEquiv attributes to MetaElements

4e85a84

Address nits

925c202

Beytoven force-pushed the charset_header branch from b72030a to 925c202 Compare February 19, 2020 21:22

vercel bot deployed to Preview February 19, 2020 21:22 View deployment

googlebot added cla: yes and removed cla: no labels Feb 19, 2020

paulirish merged commit ce529a1 into master Feb 19, 2020

paulirish deleted the charset_header branch February 19, 2020 21:50

mathiasbynens mentioned this pull request Feb 27, 2020

core(charset audit): loosen CHARSET_HTML_REGEX and CHARSET_HTTP_REGEX #10389

Merged
