Improve ability to use sanitizers to identify required AMP components from output buffer #875
Comments
I made a little script to test out what specifically the performance impacts are: https://gist.github.com/westonruter/3a51a10013cd58367689201b69eb8fcd#file-test-sanitizer-performance-php
So we can see that it is currently much faster to parse the DOM and sanitize each piece of content separately. Running the sanitizer against an increasing number of articles at a time shows that sanitizing each content separately is O(n), whereas running the sanitizer on everything together is O(n^2).
Based on this, I think we should not sanitize the entire output buffer, as it will be very poor for performance, at least with how the sanitizers are written today. |
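To make the comparison concrete, here is a minimal sketch of the kind of harness involved. It is not the linked gist: `visit_all_elements()` and `compare_strategies()` are hypothetical names, and a plain `DOMDocument` parse plus an XPath query stands in for the real sanitizers, so the timings it prints say nothing about the plugin itself.

```php
<?php
// Hypothetical harness: DOMDocument parsing stands in for the real AMP sanitizers.

/**
 * Parse an HTML fragment and count every element inside the body.
 */
function visit_all_elements( $html ) {
	$dom = new DOMDocument();
	// Collect libxml warnings (e.g. unknown tags) instead of emitting them.
	libxml_use_internal_errors( true );
	$dom->loadHTML( '<html><body>' . $html . '</body></html>' );
	libxml_clear_errors();

	$xpath = new DOMXPath( $dom );
	return $xpath->query( '//body//*' )->length;
}

/**
 * Time "one parse per article" against "one parse for everything".
 */
function compare_strategies( array $articles ) {
	$start          = microtime( true );
	$separate_count = 0;
	foreach ( $articles as $article ) {
		$separate_count += visit_all_elements( $article );
	}
	$separate_time = microtime( true ) - $start;

	$start          = microtime( true );
	$combined_count = visit_all_elements( implode( '', $articles ) );
	$combined_time  = microtime( true ) - $start;

	return compact( 'separate_count', 'separate_time', 'combined_count', 'combined_time' );
}

// Sample data is made up for illustration.
$article = '<div><p>Sample <b>post</b> content.</p></div>';
$results = compare_strategies( array_fill( 0, 20, $article ) );
printf(
	"separate: %d elements in %.5fs, combined: %d elements in %.5fs\n",
	$results['separate_count'],
	$results['separate_time'],
	$results['combined_count'],
	$results['combined_time']
);
```

Both strategies should visit the same number of elements; only the elapsed time is expected to differ, and swapping in the actual sanitizer call is what would reproduce the numbers discussed here.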
But, according to http://httparchive.org/interesting.php#bytesHtmlDoc less than 1% of HTML documents are greater than 60K in size. That's a lot smaller than I expected. So maybe it is a premature optimization to sanitize content separately. |
See also http://beta.httparchive.org/reports/page-weight#bytesHtml where we can see that 90% of HTML responses are ~110KB or less in weight. If I reduce the sample post size down to 9KB then having 20 such articles on a page results in these data points:
Having 20 articles on a page would be a lot for WordPress, as the default is 10:
In both cases there is about a 5% increase in time elapsed when sanitizing all together vs sanitizing each article separately. I'm thinking now that we could go ahead with sanitizing the entire response and then also look at the sanitizing algorithms to see how we can improve performance to make sanitizing a document more of an O(n) operation than O(n^2). |
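As a rough illustration of what an O(n) pass could look like, the sketch below walks the parsed document once and records which AMP components it meets. This is an assumption-laden example, not how the plugin's sanitizers work: `collect_amp_component_scripts()` is a hypothetical name, the `0.1` version suffix is simply copied from the `amp-form` URL in the theme support code, and it ignores the fact that some `amp-*` elements (such as `amp-img`) are built in and need no extension script.

```php
<?php
// Hypothetical single-pass collector; not part of the plugin.

/**
 * Walk the document once and map each detected AMP component to a script URL.
 */
function collect_amp_component_scripts( $html ) {
	$dom = new DOMDocument();
	// Collect libxml warnings about unknown amp-* tags instead of emitting them.
	libxml_use_internal_errors( true );
	$dom->loadHTML( $html );
	libxml_clear_errors();

	$scripts = array();
	foreach ( $dom->getElementsByTagName( '*' ) as $element ) {
		$name = $element->nodeName;

		// Forms and inputs are served by the amp-form component.
		if ( 'form' === $name || 'input' === $name ) {
			$name = 'amp-form';
		}

		if ( 0 === strpos( $name, 'amp-' ) && ! isset( $scripts[ $name ] ) ) {
			// Assumed URL pattern and version, modeled on the amp-form URL.
			$scripts[ $name ] = 'https://cdn.ampproject.org/v0/' . $name . '-0.1.js';
		}
	}
	return $scripts;
}

$html = '<html><body><amp-analytics type="x"></amp-analytics><form><input name="q"></form></body></html>';
print_r( collect_amp_component_scripts( $html ) );
```

Each element is looked at exactly once and the lookup per element is constant time, so the cost grows linearly with the number of elements in the response.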
@westonruter Additionally, much of this content will be delivered from the Google cache so the speed decrease will only be on refreshes and reloads. |
@DavidCramer as a proof of concept, this is the minimal patch that would accomplish the document sanitization:
diff --git a/includes/class-amp-theme-support.php b/includes/class-amp-theme-support.php
index 2adbbc3..70de05a 100644
--- a/includes/class-amp-theme-support.php
+++ b/includes/class-amp-theme-support.php
@@ -187,8 +187,6 @@ class AMP_Theme_Support {
*/
add_action( 'template_redirect', array( __CLASS__, 'start_output_buffering' ), 0 );
- add_filter( 'the_content', array( __CLASS__, 'filter_the_content' ), PHP_INT_MAX );
-
// @todo Add character conversion.
}
@@ -440,20 +438,9 @@ class AMP_Theme_Support {
/**
* Determine required AMP scripts.
*
- * @param string $html Output HTML.
* @return string Scripts to inject into the HEAD.
*/
- public static function get_amp_component_scripts( $html ) {
-
- // @todo This should be integrated with the existing Sanitizer classes so that duplication is not done here.
- $amp_components = array(
- 'amp-form' => array(
- 'pattern' => '#<(form|input)\b#i',
- 'source' => 'https://cdn.ampproject.org/v0/amp-form-0.1.js',
- ),
- // @todo Add more.
- );
-
+ public static function get_amp_component_scripts() {
$amp_scripts = self::$amp_scripts;
foreach ( self::$embed_handlers as $embed_handler ) {
@@ -463,12 +450,6 @@ class AMP_Theme_Support {
);
}
- foreach ( $amp_components as $component => $props ) {
- if ( preg_match( $props['pattern'], $html ) ) {
- $amp_scripts[ $component ] = $props['source'];
- }
- }
-
/**
* Filters AMP component scripts before they are injected onto the output buffer for the response.
*
@@ -509,10 +490,18 @@ class AMP_Theme_Support {
*/
public static function finish_output_buffering( $output ) {
+ $output = preg_replace_callback(
+ '#(<body[^>]*>)(.+)(</body>)#is',
+ function( $matches ) {
+ return $matches[1] . self::filter_the_content( $matches[2] ) . $matches[3];
+ },
+ $output
+ );
+
// Inject required scripts.
$output = preg_replace(
'#' . preg_quote( self::COMPONENT_SCRIPTS_PLACEHOLDER, '#' ) . '#',
- self::get_amp_component_scripts( $output ),
+ self::get_amp_component_scripts(),
$output,
1
);
Naturally it would need to be adapted for PHP 5.2. It would be better to sanitize the entire HTML element as opposed to just the Aside: It seems |
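On the PHP 5.2 point: anonymous functions are not available before PHP 5.3, so the callback in the patch would have to become a named static method. A minimal sketch of that adaptation follows. `Example_Theme_Support` and `filter_body_match()` are hypothetical names, and `filter_the_content()` here is only a stand-in that prepends a marker comment.

```php
<?php
// Hypothetical stand-in for AMP_Theme_Support; only the callback wiring is the point.
class Example_Theme_Support {

	/**
	 * Stand-in for the real filter: prepend a marker so the effect is visible.
	 */
	public static function filter_the_content( $content ) {
		return '<!--sanitized-->' . $content;
	}

	/**
	 * Named callback for preg_replace_callback(), usable without closures.
	 */
	public static function filter_body_match( $matches ) {
		return $matches[1] . self::filter_the_content( $matches[2] ) . $matches[3];
	}

	/**
	 * Run the filter over everything between the body tags.
	 */
	public static function finish_output_buffering( $output ) {
		return preg_replace_callback(
			'#(<body[^>]*>)(.+)(</body>)#is',
			array( __CLASS__, 'filter_body_match' ),
			$output
		);
	}
}

echo Example_Theme_Support::finish_output_buffering(
	'<html><body class="home"><p>Hi</p></body></html>'
);
// prints: <html><body class="home"><!--sanitized--><p>Hi</p></body></html>
```

The regular expression is unchanged from the patch; only the callback form differs.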
@westonruter This is what I intended with the changes here https://github.com/Automattic/amp-wp/pull/871/files/4067b5955e4f39571f1c2f781c2245c5dce19ef7#diff-c461a4a24e0d1380626efb83c2c11c2c It adds a |
@DavidCramer got it. Makes sense! |
If we want to run the sanitizer on HTML outside of the |
Adding my 2 cents here: finding a way to sanitize the entire response would be ideal. Here are the reasons I have in mind:
I do share your concerns about the potential performance impact, which is probably going to be a deciding factor. @westonruter when running the tests, did you get a chance to identify what costs the most performance? For example, is it loading the entire document, or looping over and searching the nodes? It would be good to dig deeper to find what is the most resource-heavy and explore alternatives to speed it up. |
The performance problem is due to the document traversal, I believe. I haven't profiled it to double check that. However, @amedina shared that the average AMP document size is 80kb. So I think we should go ahead with using sanitizer for the entire buffered output and then optimize the sanitizer logic. |
The average size being 80kb is encouraging, though we should make sure it is scalable. From my perspective, an analogy would be the number of WP posts: the worldwide average could be 200, yet WordPress scales to millions of posts for publishers' sites. |
I'm working on this still. |
Completed in #929. |
No Functional Testing Needed
Feel free to comment if you have another opinion, but I moved this out of the 'QA' column. This could possibly use more technical testing, but I don't see a need for functional testing. |
Theme support currently has the following logic that is run on the buffered output:
https://github.com/Automattic/amp-wp/blob/e60cb152a50e2ffbf5504d41aee859ad1d0e0baa/includes/class-amp-theme-support.php#L448-L455
This is not ideal. Normally when the sanitizers run, they identify the required AMP component scripts as soon as they detect that a given AMP component is present in the page. When the sanitizers are not involved, however, the plugin currently requires manual inclusion of the required AMP components (cf. #874):
https://github.com/Automattic/amp-wp/blob/e60cb152a50e2ffbf5504d41aee859ad1d0e0baa/includes/amp-post-template-actions.php#L120-L131
This isn't good. Merely the presence of an <amp-analytics> element in the response should automatically result in the amp-analytics AMP component script being included. The same should go for when a form is included in a template, which currently results in issues like #802.
So one way or another, we need to identify which elements are used in the page and then add the required AMP components. With #857 we now have the output buffering so we could just take the entire output and feed it into the sanitizers once, instead of doing so once for each instance of the_content. It would certainly be simpler to do this, but my main concern is what would be more performant.
If we didn't sanitize the entire buffered output, then we'd need another mechanism to look through the raw unparsed HTML to identify whether a given element is present which needs an AMP component. Searching unparsed HTML is going to be less reliable than looking at the DOM.
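A small sketch of why the raw-HTML search is less reliable: the pattern below is the `amp-form` one from the theme support code, while `needs_amp_form_by_regex()` and `needs_amp_form_by_dom()` are hypothetical helpers written only for this comparison. Markup that sits inside an HTML comment matches the regex but produces no element in the DOM.

```php
<?php
// Hypothetical helpers; neither is part of the plugin.

/**
 * Raw-HTML check using the amp-form pattern from the theme support code.
 */
function needs_amp_form_by_regex( $html ) {
	return 1 === preg_match( '#<(form|input)\b#i', $html );
}

/**
 * DOM check: only actual form/input elements count.
 */
function needs_amp_form_by_dom( $html ) {
	$dom = new DOMDocument();
	// Collect libxml warnings instead of emitting them.
	libxml_use_internal_errors( true );
	$dom->loadHTML( $html );
	libxml_clear_errors();

	return $dom->getElementsByTagName( 'form' )->length > 0
		|| $dom->getElementsByTagName( 'input' )->length > 0;
}

// The only "form" here is commented out, so no amp-form script is needed.
$html = '<html><body><!-- <form action="/old"></form> --><p>No forms here.</p></body></html>';

var_dump( needs_amp_form_by_regex( $html ) ); // bool(true): false positive
var_dump( needs_amp_form_by_dom( $html ) );   // bool(false)
```

The same kind of false positive can come from markup quoted inside a script string or an attribute value, which is the sort of case a parsed document sidesteps.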
I've written more about this in https://github.com/Automattic/amp-wp/pull/871/files#r162198204