feat: meta-tags audit #382
base: main
Conversation
This PR will trigger a minor release when merged.
async function fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log) {
  const object = await getObjectFromKey(s3Client, bucketName, key, log);
  if (!object?.scrapeResult?.tags || typeof object.scrapeResult.tags !== 'object') {
The ?. optional chaining is great for preventing errors, but after checking for object?.scrapeResult?.tags, the typeof object.scrapeResult.tags !== 'object' check could be unnecessary:
if (!object?.scrapeResult?.tags)
Since the tags object comes from an external source, validating that it is an object safeguards against type errors.
Also, we have isObject or isNonEmptyObject in spacecat-shared-utils.
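As a sketch of what that guard could look like: the helper names isObject and isNonEmptyObject come from spacecat-shared-utils as mentioned above, but the implementations below are local stand-ins for illustration, not the library's actual code.

```javascript
// Local stand-ins for the shared-utils helpers (assumed behavior, not the
// library's real implementation).
const isObject = (value) => value !== null
  && typeof value === 'object'
  && !Array.isArray(value);

const isNonEmptyObject = (value) => isObject(value) && Object.keys(value).length > 0;

// Guarding the scraped payload with the helper instead of a manual typeof check:
function hasValidTags(object) {
  return isNonEmptyObject(object?.scrapeResult?.tags);
}
```

This rejects null, arrays, and non-object payloads in one call while keeping the optional chaining for missing intermediate properties.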
[pageUrl]: {
  title: object.scrapeResult.tags.title,
  description: object.scrapeResult.tags.description,
  h1: object.scrapeResult.tags.h1 || [],
Use optional chaining to avoid potential errors if h1 doesn't exist in tags:
h1: object.scrapeResult.tags?.h1 || [],
Since we have already checked that the tags object exists (line 22), we can safely use object.scrapeResult.tags here.
src/metatags/handler.js
if (!site) {
  return notFound('Site not found');
}
if (!site.isLive()) {
  log.info(`Site ${siteId} is not live`);
  return ok();
}
Can we combine this into a single return?
if (!site || !site.isLive()) {
log.info(`Site ${siteId} not found or not live`);
return notFound('Site not found or not live');
}
In this case, we won't be able to identify which exact error the audit encountered: whether the site is not found or not live. We need this info in the response so that the user knows exactly what went wrong.
More explicit guard statements are preferable.
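A minimal sketch of the explicit-guard style being argued for, with notFound and ok as illustrative stand-ins for the handler's real response helpers (assumed shapes, not the actual spacecat implementations):

```javascript
// Stand-ins for the real response helpers (assumed shapes for illustration).
const notFound = (message) => ({ status: 404, message });
const ok = () => ({ status: 200 });

// Each failure mode gets its own check and its own message, so the response
// tells the caller exactly which condition failed. Returns null when no
// guard trips, letting the caller proceed with the audit.
function guardSite(site, siteId, log) {
  if (!site) {
    return notFound('Site not found');
  }
  if (!site.isLive()) {
    log.info(`Site ${siteId} is not live`);
    return ok();
  }
  return null;
}
```

Note the two branches also return different statuses (404 vs 200), which a combined `!site || !site.isLive()` check could not express.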
for (const key of scrapedObjectKeys) {
  // eslint-disable-next-line no-await-in-loop
  const pageMetadata = await fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log);
  if (pageMetadata) {
    Object.assign(extractedTags, pageMetadata);
  }
}
Using await inside a loop can lead to performance issues as it blocks the iteration. Consider refactoring to Promise.all() for parallel execution, improving efficiency.
const pageMetadataPromises = scrapedObjectKeys.map(key => fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log));
const pageMetadataResults = await Promise.all(pageMetadataPromises);
pageMetadataResults.forEach(pageMetadata => {
if (pageMetadata) {
Object.assign(extractedTags, pageMetadata);
}
});
The reason I'm doing sequential invocations instead of parallel is that in parallel execution, all S3 objects would be fetched into memory simultaneously, which could lead to exceeding the allocated memory. With sequential calls, the Node.js garbage collector has more opportunities to clean up memory after each fetchAndProcessPageObject invocation finishes, reducing the risk of high memory usage. The audits complete in less than 8 seconds, so the execution time remains within acceptable limits.
Maybe Promise.all is worth a test, as I doubt we'll quickly exceed the allocated 4 GB for the top 200 pages; that would allow a DOM size of roughly 20 MB per page. An additional consideration would be rate limiting towards the S3 API, though.
try {
  const params = {
    Bucket: bucketName,
    Prefix: prefix,
    MaxKeys: 1000,
  };
  const data = await s3Client.send(new ListObjectsV2Command(params));
  data?.Contents?.forEach((obj) => {
    objectKeys.push(obj.Key);
  });
  log.info(`Fetched ${objectKeys.length} keys from S3 for bucket ${bucketName} and prefix ${prefix}`);
If more than 1000 objects are expected, you should handle pagination by repeatedly calling ListObjectsV2 with ContinuationToken:
let continuationToken = null;
do {
const params = {
Bucket: bucketName,
Prefix: prefix,
MaxKeys: 1000,
ContinuationToken: continuationToken,
};
const data = await s3Client.send(new ListObjectsV2Command(params));
data?.Contents?.forEach((obj) => {
objectKeys.push(obj.Key);
});
continuationToken = data.IsTruncated ? data.NextContinuationToken : null;
} while (continuationToken);
The ListObjectsV2 command doesn't return the object contents, just object metadata such as the key, so memory-wise it is not much. The ListObjectsV2 API has a limit of 1,000 objects per request, and since we are scraping top pages only, which are expected to be < 200, we should be good here in my opinion. Let me know your thoughts.
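For completeness, the ContinuationToken loop suggested above can be written against an injected send function, so the pagination logic is exercisable without S3 credentials. listAllKeys is a hypothetical helper; in the handler, send would wrap s3Client.send(new ListObjectsV2Command(params)).

```javascript
// Collect all keys under a prefix, following ContinuationToken pagination.
// `send` is injected (e.g. (params) => s3Client.send(new ListObjectsV2Command(params)))
// so the loop can be tested with a fake client.
async function listAllKeys(send, bucketName, prefix) {
  const objectKeys = [];
  let continuationToken;
  do {
    // eslint-disable-next-line no-await-in-loop
    const data = await send({
      Bucket: bucketName,
      Prefix: prefix,
      MaxKeys: 1000,
      ContinuationToken: continuationToken,
    });
    data?.Contents?.forEach((obj) => objectKeys.push(obj.Key));
    continuationToken = data?.IsTruncated ? data.NextContinuationToken : undefined;
  } while (continuationToken);
  return objectKeys;
}
```

Even if top pages stay under 200 today, the loop costs one extra request only when truncation actually occurs, so it is a cheap safeguard.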
This PR adds a meta-tags audit that audits top pages' tags, specifically the title, description, and H1 tags. The tags are scraped by content-scraper and stored in S3; the audit worker fetches the tags from S3 and performs checks to detect whether any of the three tags is missing or suboptimal in length.
JIRA: https://jira.corp.adobe.com/browse/SITES-23343
Related Docs
https://wiki.corp.adobe.com/display/AEMSites/SEO+%7C+Title%2C+Description+and+H1+Tags+Optimisation
Related Issues
adobe/spacecat-api-service#470
https://github.com/adobe/spacecat-content-scraper/pull/137