Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ability to use sanitizers to identify required AMP components from output buffer #875

Closed
westonruter opened this issue Jan 17, 2018 · 14 comments
Assignees
Milestone

Comments

@westonruter
Copy link
Member

Theme support currently has the following logic that us run on the buffered output:

https://github.com/Automattic/amp-wp/blob/e60cb152a50e2ffbf5504d41aee859ad1d0e0baa/includes/class-amp-theme-support.php#L448-L455

This is not ideal. Normally when the sanitizer run, they identify the AMP component scripts that are required once they detect that a given AMP component is present in the page. When the sanitizer is not involved in sanitizing however, then the plugin currently requires manual inclusion of the required AMP components (cf. #874):

https://github.com/Automattic/amp-wp/blob/e60cb152a50e2ffbf5504d41aee859ad1d0e0baa/includes/amp-post-template-actions.php#L120-L131

This isn't good. Merely the presence of an <amp-analytics> element in the response should automatically result in the amp-analytics AMP component script being included. The same should go for when a form is included in a template, which currently results in issues like #802.

So one way or another, we need to identify which elements are used in the page and then add the required AMP components. With #857 we now have the output buffering so we could just take the entire output and feed it into the sanitizers once, instead of doing so once for each instance of the_content. It would certainly be simpler to do this, but my main concern is what would be more performant.

If we didn't sanitize the entire buffered output, then we'd need another mechanism to look through the raw unparsed HTML to identify whether a given element is present which needs an AMP component. Searching unparsed HTML is going to be less reliable than looking at the DOM.

I've written more about this in https://github.com/Automattic/amp-wp/pull/871/files#r162198204

@westonruter westonruter added this to the v0.7 milestone Jan 17, 2018
@westonruter
Copy link
Member Author

I made a little script to test out what specifically the performance impacts are: https://gist.github.com/westonruter/3a51a10013cd58367689201b69eb8fcd#file-test-sanitizer-performance-php

$ wp eval-file test-sanitizer-performance.php 30 together
Article size: 115 KB
Article count: 30
HTML size: 3 MB
Sanitize mode: together
Time elapsed: 16.694 seconds
Peak memory usage: 41,295 KB
$ wp eval-file test-sanitizer-performance.php 30 separate
Article size: 115 KB
Article count: 30
HTML size: 3 MB
Sanitize mode: separate
Time elapsed: 2.855 seconds
Peak memory usage: 35,415 KB

So, we can see that it is way faster currently to parse the DOM and sanitize the_content separately rather than to combine them all and sanitize them once. In addition to being faster, the peak memory usage is lower.

Here are the results from running the sanitizer against an increasing number of articles at a time, and we can see that running the sanitizer on each content separately is O(n) whereas running sanitizer on everything together is O(n^2):

Articles Weight Separate Time Together Time
5 576 KB 0.520 secs 0.594 secs
10 1,153 KB 1.018 secs 1.765 secs
15 1,729 KB 1.614 secs 4.037 secs
20 2,305 KB 1.994 secs 7.138 secs
25 2,882 KB 2.751 secs 12.791 secs
30 3,458 KB 3.053 secs 17.013 secs

Based on this, I think we should not sanitizer the entire output buffer as it will be very poor for performance, at least how the sanitizers are written today.

@westonruter
Copy link
Member Author

But, according to http://httparchive.org/interesting.php#bytesHtmlDoc less than 1% of HTML documents are greater than 60K in size. That's a lot smaller than I expected. So maybe it is a premature optimization to sanitize content separately.

@westonruter
Copy link
Member Author

westonruter commented Jan 18, 2018

See also http://beta.httparchive.org/reports/page-weight#bytesHtml where we can see that 90% of HTML responses are ~110KB or less in weight.

If I reduce the sample post size down to 9KB then having 20 such articles on a post results in these data points:

$ wp eval-file test-sanitizer-performance.php 20 separate
Article size: 9 KB
Article count: 20
HTML size: 173 KB
Sanitize mode: separate
Time elapsed: 0.137 seconds
Peak memory usage: 31,743 KB
$ wp eval-file test-sanitizer-performance.php 20 together
Article size: 9 KB
Article count: 20
HTML size: 173 KB
Sanitize mode: together
Time elapsed: 0.146 seconds
Peak memory usage: 31,919 KB

Having 20 articles on a page would be a lot for WordPress, as the default is 10:

$ wp eval-file test-sanitizer-performance.php 10 separate
Article size: 9 KB
Article count: 10
HTML size: 87 KB
Sanitize mode: separate
Time elapsed: 0.087 seconds
Peak memory usage: 31,743 KB
$ wp eval-file test-sanitizer-performance.php 10 together
Article size: 9 KB
Article count: 10
HTML size: 87 KB
Sanitize mode: together
Time elapsed: 0.094 seconds
Peak memory usage: 31,831 KB

In both cases there is about a 5% increase in time elapsed when sanitizing all together vs sanitizing each article separately.

I'm thinking now that we could go ahead with sanitizing the entire response and then also look at the sanitizing algorithms to see how we can improve performance to make sanitizing a document more of an O(n) operation than O(n^2).

@DavidCramer
Copy link
Contributor

@westonruter Additionally, much of this content will be delivered from the Google cache so the speed decrease will only be on refreshes and reloads.

@westonruter
Copy link
Member Author

westonruter commented Jan 18, 2018

@DavidCramer as a proof of concept, this is the minimal patch that would accomplish the document sanitization:

diff --git a/includes/class-amp-theme-support.php b/includes/class-amp-theme-support.php
index 2adbbc3..70de05a 100644
--- a/includes/class-amp-theme-support.php
+++ b/includes/class-amp-theme-support.php
@@ -187,8 +187,6 @@ class AMP_Theme_Support {
 		 */
 		add_action( 'template_redirect', array( __CLASS__, 'start_output_buffering' ), 0 );
 
-		add_filter( 'the_content', array( __CLASS__, 'filter_the_content' ), PHP_INT_MAX );
-
 		// @todo Add character conversion.
 	}
 
@@ -440,20 +438,9 @@ class AMP_Theme_Support {
 	/**
 	 * Determine required AMP scripts.
 	 *
-	 * @param string $html Output HTML.
 	 * @return string Scripts to inject into the HEAD.
 	 */
-	public static function get_amp_component_scripts( $html ) {
-
-		// @todo This should be integrated with the existing Sanitizer classes so that duplication is not done here.
-		$amp_components = array(
-			'amp-form' => array(
-				'pattern' => '#<(form|input)\b#i',
-				'source'  => 'https://cdn.ampproject.org/v0/amp-form-0.1.js',
-			),
-			// @todo Add more.
-		);
-
+	public static function get_amp_component_scripts() {
 		$amp_scripts = self::$amp_scripts;
 
 		foreach ( self::$embed_handlers as $embed_handler ) {
@@ -463,12 +450,6 @@ class AMP_Theme_Support {
 			);
 		}
 
-		foreach ( $amp_components as $component => $props ) {
-			if ( preg_match( $props['pattern'], $html ) ) {
-				$amp_scripts[ $component ] = $props['source'];
-			}
-		}
-
 		/**
 		 * Filters AMP component scripts before they are injected onto the output buffer for the response.
 		 *
@@ -509,10 +490,18 @@ class AMP_Theme_Support {
 	 */
 	public static function finish_output_buffering( $output ) {
 
+		$output = preg_replace_callback(
+			'#(<body[^>]*>)(.+)(</body>)#is',
+			function( $matches ) {
+				return $matches[1] . self::filter_the_content( $matches[2] ) . $matches[3];
+			},
+			$output
+		);
+
 		// Inject required scripts.
 		$output = preg_replace(
 			'#' . preg_quote( self::COMPONENT_SCRIPTS_PLACEHOLDER, '#' ) . '#',
-			self::get_amp_component_scripts( $output ),
+			self::get_amp_component_scripts(),
 			$output,
 			1
 		);

Naturally it would need to be adapted for PHP 5.2.

It would be better to sanitize the entire HTML element as opposed to just the body, but this currently fails because AMP_DOM_Utils::get_content_from_dom() is used by AMP_Content_Sanitizer::sanitize(), so it wraps HTML in HTML, causing the problem. If we modified AMP_Content_Sanitizer::sanitize() to take an existing DOMDocument object as opposed to an HTML string then that could be an answer.

Aside: It seems DOMDocument is mutation the CDATA sections in script tags output by the Categories widget with dropdowns. There may be a sanitization bug there.

@DavidCramer
Copy link
Contributor

@westonruter This is what I intended with the changes here https://github.com/Automattic/amp-wp/pull/871/files/4067b5955e4f39571f1c2f781c2245c5dce19ef7#diff-c461a4a24e0d1380626efb83c2c11c2c

It adds a get_dom method so that you can provide the HTML wrapped or it wrapps it automatically.

@westonruter
Copy link
Member Author

@DavidCramer got it. Makes sense!

@westonruter
Copy link
Member Author

westonruter commented Jan 19, 2018

If we want to run the sanitizer on HTML outside of the body then we'll need to remove these lines (among other changes):

https://github.com/Automattic/amp-wp/blob/24471652035185c14cb5652c0ae9972ba9349aa3/bin/amphtml-update.py#L384-L386

@ThierryA
Copy link
Collaborator

ThierryA commented Jan 19, 2018

Adding my 2 cents here, if we can find a way to sanitize the entire response would be ideal. Here are the reasons I have in mind:

  • One well architect bullet proof logic for all
  • No risk of missing WP Vanilla content injection
  • Also applies to the content injected by plugins
  • Will make it easy to display plugin warning upon activation

I indeed share your concerns about the potential performance, which is probably going to be a deciding factor. @westonruter when running the tests, did you get a chance to identify what eats the most performs, for example is it to load the entire document or nodes loop/search? It would be good to dig deeper to find what is the most resource heavy and explore alternatives to speed it up.

@westonruter
Copy link
Member Author

The performance problem is due to the document traversal, I believe. I haven't profiled it to double check that. However, @amedina shared that the average AMP document size is 80kb. So I think we should go ahead with using sanitizer for the entire buffered output and then optimize the sanitizer logic.

@ThierryA
Copy link
Collaborator

The average size being 80kb is encouraging, though we should make sure it is scalable. From my perspective an analogy would be the number of WP posts, the average in the world could be 200, though it is scalable to work with millions of post for publishers sites.

@westonruter
Copy link
Member Author

I'm working on this still.

@westonruter
Copy link
Member Author

Completed in #929.

@kienstra
Copy link
Contributor

No Functional Testing Needed

Feel free to comment if you have another opinion, but I moved this out of the 'QA' column. This could possibly use more technical testing, but I don't see a need for functional testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants