feat: meta-tags audit #382
base: main
Conversation
This PR will trigger a minor release when merged.
async function fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log) {
  const object = await getObjectFromKey(s3Client, bucketName, key, log);
  if (!object?.scrapeResult?.tags || typeof object.scrapeResult.tags !== 'object') {
The ?. optional chaining is great for preventing errors, but after checking for object?.scrapeResult?.tags, the typeof object.scrapeResult.tags !== 'object' check could be unnecessary:
if (!object?.scrapeResult?.tags)
Since the tags object comes from an external source, validating that it is an object safeguards against type errors.
Also, we have isObject or isNonEmptyObject in spacecat-shared-utils.
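As a sketch of what that guard could look like: the helper names isObject and isNonEmptyObject come from spacecat-shared-utils as mentioned above, but the implementations below are local stand-ins for illustration, not the library's actual code.

```javascript
// Local stand-ins for the shared-utils helpers (assumed behavior, not the
// library's real implementation).
const isObject = (value) => value !== null
  && typeof value === 'object'
  && !Array.isArray(value);

const isNonEmptyObject = (value) => isObject(value) && Object.keys(value).length > 0;

// Guarding the scraped payload with the helper instead of a manual typeof check:
function hasValidTags(object) {
  return isNonEmptyObject(object?.scrapeResult?.tags);
}
```

This rejects null, arrays, and non-object payloads in one call while keeping the optional chaining for missing intermediate properties.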
[pageUrl]: {
  title: object.scrapeResult.tags.title,
  description: object.scrapeResult.tags.description,
  h1: object.scrapeResult.tags.h1 || [],
Use optional chaining to avoid potential errors if h1 doesn't exist in tags:
h1: object.scrapeResult.tags?.h1 || [],
Since we have already checked that the tags object exists (line 22), we can safely use object.scrapeResult.tags here.
src/metatags/handler.js
if (!site) {
  return notFound('Site not found');
}
if (!site.isLive()) {
  log.info(`Site ${siteId} is not live`);
  return ok();
}
Can we combine this into a single return?
if (!site || !site.isLive()) {
log.info(`Site ${siteId} not found or not live`);
return notFound('Site not found or not live');
}
In this case, we won't be able to identify which exact error the audit encountered: whether the site is not found or not live. We need this info in the response so that the user knows exactly what went wrong.
More explicit guard statements are preferable.
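A minimal sketch of the explicit-guard style being argued for, with notFound and ok as illustrative stand-ins for the handler's real response helpers (assumed shapes, not the actual spacecat implementations):

```javascript
// Stand-ins for the real response helpers (assumed shapes for illustration).
const notFound = (message) => ({ status: 404, message });
const ok = () => ({ status: 200 });

// Each failure mode gets its own check and its own message, so the response
// tells the caller exactly which condition failed. Returns null when no
// guard trips, letting the caller proceed with the audit.
function guardSite(site, siteId, log) {
  if (!site) {
    return notFound('Site not found');
  }
  if (!site.isLive()) {
    log.info(`Site ${siteId} is not live`);
    return ok();
  }
  return null;
}
```

Note the two branches also return different statuses (404 vs 200), which a combined `!site || !site.isLive()` check could not express.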
for (const key of scrapedObjectKeys) {
  // eslint-disable-next-line no-await-in-loop
  const pageMetadata = await fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log);
  if (pageMetadata) {
    Object.assign(extractedTags, pageMetadata);
  }
}
Using await inside a loop can lead to performance issues as it blocks the iteration. Consider refactoring to Promise.all() for parallel execution, improving efficiency.
const pageMetadataPromises = scrapedObjectKeys.map(key => fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log));
const pageMetadataResults = await Promise.all(pageMetadataPromises);
pageMetadataResults.forEach(pageMetadata => {
if (pageMetadata) {
Object.assign(extractedTags, pageMetadata);
}
});
The reason I'm doing sequential invocations instead of parallel is that in parallel execution, all S3 objects would be fetched into memory simultaneously, which could lead to exceeding the allocated memory. With sequential calls, the Node.js garbage collector has more opportunities to clean up memory after each fetchAndProcessPageObject invocation finishes, reducing the risk of high memory usage. The audits complete in less than 8 seconds, so the execution time remains within acceptable limits.
Maybe Promise.all is worth a test, as I doubt we'll quickly exceed the allocated 4 GB for the top 200 pages; that would allow a DOM size of roughly 20 MB per page. An additional consideration would be rate limiting towards the S3 API, though.
try {
  const params = {
    Bucket: bucketName,
    Prefix: prefix,
    MaxKeys: 1000,
  };
  const data = await s3Client.send(new ListObjectsV2Command(params));
  data?.Contents?.forEach((obj) => {
    objectKeys.push(obj.Key);
  });
  log.info(`Fetched ${objectKeys.length} keys from S3 for bucket ${bucketName} and prefix ${prefix}`);
If more than 1000 objects are expected, you should handle pagination by repeatedly calling ListObjectsV2 with ContinuationToken:
let continuationToken = null;
do {
const params = {
Bucket: bucketName,
Prefix: prefix,
MaxKeys: 1000,
ContinuationToken: continuationToken,
};
const data = await s3Client.send(new ListObjectsV2Command(params));
data?.Contents?.forEach((obj) => {
objectKeys.push(obj.Key);
});
continuationToken = data.IsTruncated ? data.NextContinuationToken : null;
} while (continuationToken);
The ListObjectsV2 command doesn't return the object contents, just object metadata such as the key, so memory-wise it is not much. The ListObjectsV2 API has a limit of 1,000 objects per request, and since we are scraping top pages only, which are expected to be < 200, we should be good here in my opinion. Let me know your thoughts.
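For completeness, the ContinuationToken loop suggested above can be written against an injected send function, so the pagination logic is exercisable without S3 credentials. listAllKeys is a hypothetical helper; in the handler, send would wrap s3Client.send(new ListObjectsV2Command(params)).

```javascript
// Collect all keys under a prefix, following ContinuationToken pagination.
// `send` is injected (e.g. (params) => s3Client.send(new ListObjectsV2Command(params)))
// so the loop can be tested with a fake client.
async function listAllKeys(send, bucketName, prefix) {
  const objectKeys = [];
  let continuationToken;
  do {
    // eslint-disable-next-line no-await-in-loop
    const data = await send({
      Bucket: bucketName,
      Prefix: prefix,
      MaxKeys: 1000,
      ContinuationToken: continuationToken,
    });
    data?.Contents?.forEach((obj) => objectKeys.push(obj.Key));
    continuationToken = data?.IsTruncated ? data.NextContinuationToken : undefined;
  } while (continuationToken);
  return objectKeys;
}
```

Even if top pages stay under 200 today, the loop costs one extra request only when truncation actually occurs, so it is a cheap safeguard.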
This PR adds a meta-tags audit that audits top pages' tags, specifically the title, description, and H1 tags. The tags are scraped by content-scraper and stored in S3; the audit worker fetches the tags from S3 and performs checks to detect whether any of the three tags is missing or suboptimal in length.
JIRA: https://jira.corp.adobe.com/browse/SITES-23343
Related Docs
https://wiki.corp.adobe.com/display/AEMSites/SEO+%7C+Title%2C+Description+and+H1+Tags+Optimisation
Related Issues
adobe/spacecat-api-service#470
https://github.com/adobe/spacecat-content-scraper/pull/137