
[ BUG ] generated sitemaps are truncated #362

Open
omar-dulaimi opened this issue Jun 5, 2021 · 3 comments

Comments

@omar-dulaimi

Describe the bug
Hello,

I noticed that generated sitemaps get truncated by the time they reach their destination (S3). For example, one of the files is 3.5 MB, but on S3 it's only 80 KB. When I generate them locally, they come out correctly.

Expected behavior
I expect the sitemaps to be uploaded at their full size.

Context:

  • Library Version: v6.1.1

Additional context

Here's an overview of my code:

    console.log("inside handler....");
    await clearGeneratedSitemapsFromTmpDir();
    const sms = new SitemapAndIndexStream({
      limit: 10000,
      getSitemapStream: (i) => {
        const sitemapStream = new SitemapStream({
          lastmodDateOnly: true,
        });

        const linkPath = `/sitemap-${i + 1}.xml`;
        const writePath = `/tmp/${linkPath}`;
        sitemapStream.pipe(createWriteStream(resolve(writePath)));
        return [new URL(linkPath, hostName).toString(), sitemapStream];
      },
    });

    const data = await generateSiteMap();
    sms.pipe(createWriteStream(resolve("/tmp/sitemap-index.xml")));
    data.forEach((item) => sms.write(item));
    sms.end();
    await uploadToS3();
    await clearGeneratedSitemapsFromTmpDir();
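
For context, the upload step looks roughly like this (a simplified sketch, not my exact helper - it just reads the generated files back from /tmp and puts them on S3, assuming AWS SDK for JavaScript v3 and a placeholder bucket name):

    // simplified sketch of the upload helper (not the exact code),
    // assuming AWS SDK v3 and that the generated files live in /tmp
    import { readdir, readFile } from "fs/promises";
    import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

    const s3 = new S3Client({});

    async function uploadToS3() {
      const files = (await readdir("/tmp")).filter((f) => f.endsWith(".xml"));
      for (const file of files) {
        await s3.send(
          new PutObjectCommand({
            Bucket: "my-sitemap-bucket", // placeholder bucket name
            Key: file,
            Body: await readFile(`/tmp/${file}`),
            ContentType: "application/xml",
          })
        );
      }
    }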

What I'm thinking is that ending the stream doesn't complete immediately, so the files are uploaded while they are still only partially written.

    sms.end();
    await uploadToS3()

So does sms.end() wait until all files are done?
Or do I need to sleep a couple of seconds after calling it?
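
Would something like this be the right way to wait - collecting every write stream created in getSitemapStream and awaiting them all (plus the index file) before uploading? Just a sketch, I haven't tested it (generateSiteMap, uploadToS3, hostName, and the imports are the same as in the snippet above):

    import { promisify } from "util";
    import { finished } from "stream";
    const finishedAsync = promisify(finished);

    const fileStreams = []; // every write stream we create, so we can await them all

    const sms = new SitemapAndIndexStream({
      limit: 10000,
      getSitemapStream: (i) => {
        const sitemapStream = new SitemapStream({ lastmodDateOnly: true });
        const linkPath = `/sitemap-${i + 1}.xml`;
        const ws = createWriteStream(resolve(`/tmp/${linkPath}`));
        fileStreams.push(ws);
        sitemapStream.pipe(ws);
        return [new URL(linkPath, hostName).toString(), sitemapStream];
      },
    });

    const indexFile = createWriteStream(resolve("/tmp/sitemap-index.xml"));
    sms.pipe(indexFile);

    const data = await generateSiteMap();
    data.forEach((item) => sms.write(item));
    sms.end();

    // first wait for the index file; by then every sitemap stream has been created
    await finishedAsync(indexFile);
    // then wait for each individual sitemap file to be flushed and closed
    await Promise.all(fileStreams.map((ws) => finishedAsync(ws)));

    await uploadToS3();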

@omar-dulaimi
Author

Any ideas @derduher?

@omar-dulaimi
Author

I did this as a workaround for now, until I get a better solution:


import { createReadStream, createWriteStream } from "fs";
import { resolve } from "path";
import { lineSeparatedURLsToSitemapOptions } from "sitemap";

const data = await generateSiteMap();
const logger = createWriteStream(resolve("/tmp/all-urls.json.txt"), {
  flags: "a",
});
data.forEach((el) => {
  logger.write(JSON.stringify(el));
  logger.write("\n");
});
// wait for the URL file to be fully flushed before reading it back
await new Promise((fulfill) => logger.end(fulfill));

// sms is the SitemapAndIndexStream from the first snippet
const stream = lineSeparatedURLsToSitemapOptions(
  createReadStream(resolve("/tmp/all-urls.json.txt"))
)
  .pipe(sms)
  .pipe(createWriteStream(resolve("/tmp/sitemap-index.xml")));

await new Promise((fulfill) => stream.on("finish", fulfill));
await uploadToS3();
await clearGeneratedSitemapsFromTmpDir();

@huntharo
Contributor

@omar-dulaimi - "What I'm thinking is that ending the stream doesn't complete immediately, so the files are uploaded while they are still only partially written." - Yes, that is correct.

Unfortunately, most of the documentation examples are contrived and do not show how to wait for a file to be completely loaded or written: they just pipe the input to the output without any code that waits for that to finish. This is an area that needs some documentation improvements.

Here is how to most easily wait for the local sitemap file to be written and closed before reading it and uploading it to S3:

import { promisify } from 'util';
import { finished } from 'stream';
const finishedAsync = promisify(finished);

// Break out the write stream creation so we can access it
const sitemapFile = createWriteStream(resolve("/tmp/sitemap-index.xml"));
// Your code
sms.pipe(sitemapFile); // Your code pointing to the new variable
// Your code
sms.end();

// This is the magic bit - we don't want to wait just for sms to emit finish...
// we want to wait until the file stream has received that and closed the file
await finishedAsync(sitemapFile);

// Now the file should have all the contents and sending it to S3 will result in a non-truncated file

Note: if you pipe the sitemap through gzip, you still need to await the file being finished, not the gzip stream being finished.
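
A rough sketch of that gzip case (the gzip stream sits in the middle of the pipeline, but the file stream is still the last thing to close; sms and the item-writing are the same as above):

import { createGzip } from 'zlib';
import { createWriteStream } from 'fs';
import { resolve } from 'path';
import { promisify } from 'util';
import { finished } from 'stream';
const finishedAsync = promisify(finished);

// gzipped index file - example path
const sitemapFile = createWriteStream(resolve("/tmp/sitemap-index.xml.gz"));
sms.pipe(createGzip()).pipe(sitemapFile);

// ... write your items and call sms.end() as before ...
sms.end();

// await the file stream, not the gzip stream
await finishedAsync(sitemapFile);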
