Is there a hard limit on maximum number of pages that Gatsby can build? #20338
Comments
Can you try increasing the allowed memory for Node.js? The flag is --max-old-space-size, and there are two ways to pass it.
The first will apply the memory limit to all child processes (Gatsby will spawn a few):
NODE_OPTIONS="--max-old-space-size=8192" gatsby build
The second will apply the memory limit only to the top-level process:
node --max-old-space-size=8192 node_modules/.bin/gatsby build
Assuming you have enough memory (my example is 8 GB, adjust accordingly), does that work? :) |
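One way to confirm that the flag actually reached a Node.js process is to print V8's heap limit from inside it. This is a minimal sketch, not part of Gatsby; the helper name heapLimitMb is hypothetical, and it assumes the script is started with the same NODE_OPTIONS as the build.

```javascript
const v8 = require("v8");

// Hypothetical helper: returns V8's current heap size limit in megabytes.
function heapLimitMb() {
  return Math.round(v8.getHeapStatistics().heap_size_limit / (1024 * 1024));
}

console.log(`V8 heap limit: ~${heapLimitMb()} MB`);
```

Saved under any file name and run with NODE_OPTIONS="--max-old-space-size=8192", it should report a value close to 8192; without the flag it reports the default for that Node.js version and machine.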
The limit suggested in that SO link is 16m, not 160k ( |
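For reference, the arithmetic behind the 16m figure can be checked directly. A small sketch, assuming the 2^24 number quoted from the StackOverflow answer and the approximate page count from the original post:

```javascript
// 2^24 is the Map size figure quoted from the StackOverflow answer.
const mapLimit = 2 ** 24;

// Approximate page count from the original post.
const pages = 150000;

// How many times the page count fits into that figure.
const headroom = Math.floor(mapLimit / pages);

console.log(mapLimit); // 16777216
console.log(headroom); // 111
```

So even if that figure were a hard cap, 150k entries would use less than 1% of it.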
Yeah, I can get the benchmark site work with But in my project it's very slow. It had created about 5700 pages in 11 minutes before I killed it. At this rate, it's gonna take more than 4 hours to create 150K pages. Is this expected? I guess the create pages benchmark is so fast as it doesn't have any data / query to run. Also, where do you see the 2^28 for max size? It's mentioned in the first statement that 2^24 is the implementation-defined limit, no? What am I missing? 🤔 I'm not sure if this really affects |
Short answer: not really, certainly not for local data without images, but who knows. Long answer: Ok. I'll have a look at it next week (most of us in core are on holiday last and this week, please bear with us). I'm sure we can sort this out. In general there are four aspects at play for the perf;
In your case I'm guessing it's all about the graphql. If I'm reading it right it took 16 minutes just to bootstrap.
That said, you did surface an interesting future problem with the Map limitation. I still don't expect you to be hitting that (but who knows), regardless we'll be needing to take that into account as we're going to scale up. So thank you for that. Is there any way for me to access this repro so we can investigate? Private or anonymized / generated data is fine for me. I mostly need the real world scale. |
(Sorry, |
Thank you @pvdz for the detailed response. I know I'm not great at math, but wouldn't have imagined mistaking 16m for 160k 🤦♂ Sorry about that 😛 I tried doing schema customization, but was getting an error with that. Not sure if using So the dataset that I'm using is Indian Financial System Code (used in banks in India for money transfer etc.) and it's a public dataset. You can find my repo here -- https://github.com/tsriram/ifsc/. I'm wondering if it's actually a good idea to build this as a static website or I could have a client only routes in Gatsby & build an API to get the data 🤔 |
I can repro. The run queries step is dog slow. It was already going to be my first point of investigation, this just confirms it even more so :) The speed of queries slows down the more pages we run so something's going on that shouldn't be affected by absolute page count. Thanks for the repro, very helpful! One immediate tip to speed things up: there's an implicit index on Another is that the inference may still be running. That's an unintended bug we're currently already investigating. So if that's the case a fix may also improve your build times. |
Fwiw, it's about 4 hours and 45 minutes :p On my benchmark machine, anyways. I'm looking into details. I'm convinced the time for run queries can be improved. It should not scale down as the number of pages scales up and at 150k pages there are 10 queries/second while at 2k pages it's about 150 queries/second. So that's very fishy :p Anyways, just wanted to let you know I'm looking into it. Will report back soon.
|
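The query rates quoted above translate into build times as follows. This is only back-of-the-envelope arithmetic, assuming one query per page and a constant rate; minutesFor is a hypothetical helper.

```javascript
// Hypothetical helper: minutes needed to run `pages` queries at `qps` queries/second.
function minutesFor(pages, qps) {
  return pages / qps / 60;
}

const pages = 150000;

console.log(minutesFor(pages, 10));             // 250
console.log(minutesFor(pages, 150).toFixed(1)); // 16.7
```

250 minutes is about 4.2 hours, the same ballpark as the build time reported above, while the 2k-page rate of 150 queries/second would finish the same site in under 17 minutes.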
Wow, thanks much @pvdz 👍 Looking forward to what you come up with :) |
Still working on this. I'm very convinced fixing this bottleneck will have a huge impact on any large site.
|
You might think I have forgotten about this. But you'd be wrong. And happy. Debugging the problem in this build turned out to be a deep rabbit hole and it took me some time to get in, and out of it. But, happy to report I can build your site in ~10 minutes now.
You'll have to wait a bit before you can do this but there are some fixes / workarounds upcoming. The basic gist is that the way nodes are looked up has a shortcut for querying by After that, the run queries step drops to ~10 minutes (down from 257 minutes, or 4.2 hours, as you can see above). Which makes me very happy :d The wait for you is now for me to polish this fix, make sure the generic assumptions hold (is your site a one-off or are most sites like yours?) and then we should be good to go. |
No, I was pretty sure you'd be working on this :) Improvement in the run query step sounds like a great success. I actually wanted to spend some time on this and try figure out what's going on, but you could only do so much with a toddler 👶 around. Looking forward to the fix. Thank you again, @pvdz :) |
Now #20609 has landed in master. This is the part from us you'll need to see improvements. (It still needs to be published, so if you're not comfortable building from source, it usually doesn't take long to get published.) The other change is to your repo. It's changing the index from
src/templates/ifsc.tsx:
export const query = graphql`
- query($slug: String!) {
- allIfscCsv(filter: { fields: { slug: { eq: $slug } } }) {
+ query($id: String!) {
+ allIfscCsv(filter: { id: { eq: $id } }) {
edges {
node {
ifsc
gatsby-node.js:
const result = await graphql(`
query {
allIfscCsv {
edges {
node {
+ id
fields {
slug
}
and later in that file:
result.data.allIfscCsv.edges.forEach(({ node }) => {
createPage({
path: node.fields.slug,
component: path.resolve(`./src/templates/ifsc.tsx`),
context: {
- slug: node.fields.slug
+ id: node.id,
}
});
});
I think that should suffice. With that, the If you want counting stats while building for your pages (hey that's 60 seconds less of looking at an idle screen) you can copy paste my whole change, which will use a progress bar for the
-exports.createPages = async ({ graphql, actions }) => {
+exports.createPages = async ({ graphql, actions, reporter }) => {
+ const progress = reporter.createProgress(`ifsc/gatsby-node.js`);
+ console.time("(ifsc) total exports.createPages");
+ console.time("(ifsc) initial graphql query");
+ progress.setStatus("initial graphql query");
+
const { createPage } = actions;
const result = await graphql(`
query {
allIfscCsv {
edges {
node {
+ id
fields {
slug
}
@@ -36,13 +42,38 @@ exports.createPages = async ({ graphql, actions }) => {
}
}
`);
+ console.timeEnd("(ifsc) initial graphql query");
+
+ console.time("(ifsc) created pages");
+
+ progress.start();
+ progress.total = result.data.allIfscCsv.edges.length - 1;
+ let start = Date.now();
+ progress.setStatus(
+ "Calling createPage for " + result.data.allIfscCsv.edges.length + " pages"
+ );
result.data.allIfscCsv.edges.forEach(({ node }) => {
createPage({
path: node.fields.slug,
component: path.resolve(`./src/templates/ifsc.tsx`),
context: {
- slug: node.fields.slug
+ id: node.id,
+ // slug: node.fields.slug
}
});
+ progress.tick(1);
});
+ progress.setStatus(
+ "Called createPage for " +
+ (result.data.allIfscCsv.edges.length - 1) +
+ " pages at " +
+ (result.data.allIfscCsv.edges.length - 1) /
+ ((Date.now() - start) / 1000) +
+ " pages/s"
+ );
+ progress.done();
+ console.timeEnd("(ifsc) created pages");
+ console.timeEnd("(ifsc) total exports.createPages");
+ progress.setStatus("createPages finished");
}; |
Note that the createPages step runs at roughly 4k page/s here. The "run queries" was running at 10 q/s before. Changing it to id (properly) changed that to roughly 70 q/s. Applying a shortcut in Gatsby raised that to roughly 600~800 q/s. It's all about loops :) And these show themselves pretty quickly at scale. Thank you for your example! |
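To illustrate why "it's all about loops", here is a rough sketch contrasting a per-query scan over all nodes with a lookup in an index keyed by id. It is not Gatsby's actual implementation; the node shape, the helper names findBySlug and findById, and the node count are all made up for the demonstration.

```javascript
// Hypothetical nodes; the shape only loosely mirrors the data in this thread.
const nodes = Array.from({ length: 50000 }, (_, i) => ({
  id: `node-${i}`,
  fields: { slug: `bank/state/city/branch-${i}` },
}));

// Filtering by slug has to test every node: O(n) work per query.
function findBySlug(slug) {
  return nodes.filter((node) => node.fields.slug === slug);
}

// The index is built once; each lookup by id is then O(1) on average.
const nodesById = new Map(nodes.map((node) => [node.id, node]));

function findById(id) {
  return nodesById.get(id);
}

console.time("200 slug scans");
for (let i = 0; i < 200; i++) findBySlug(`bank/state/city/branch-${i}`);
console.timeEnd("200 slug scans");

console.time("200 id lookups");
for (let i = 0; i < 200; i++) findById(`node-${i}`);
console.timeEnd("200 id lookups");
```

The scan cost grows with both the number of queries and the number of nodes, which matches the observation that queries got slower as the page count went up.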
This is awesome, @pvdz 🎉 I just upgraded the Gatsby version locally and it started building very fast, but crashed at about 70% with an out-of-memory error. Now if I run the build with It's the same result every time I run build -- fails at ~70% with default memory and gets stuck in building schema with I'll close this issue as you've fixed the build time. Thank you again for your help 👍 |
Hm ok. Fwiw the build completes here (indeed with increased memory). Node v8.16.2, I don't think there's anything else of significance to the env right now. If you can't push it forward please post a new issue about it :) Although perhaps waiting for an actual release with this fix, perhaps that helps. Dunno. |
Okay, for the curious, I ran build with
This is weird 🙄 |
@tsriram You should sanitize path here: https://github.com/tsriram/ifsc/blob/master/gatsby-node.js#L58 It is highly likely that your data contains characters that are not allowed in a file name |
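A sketch of what sanitizing could look like for this kind of slug, assuming ASCII-only output is acceptable. The helper sanitizeSegment is hypothetical and the sample values are illustrative, not taken from the dataset.

```javascript
// Hypothetical helper: make one path segment safe to use as a file name.
function sanitizeSegment(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse every run of other characters into one dash
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
}

// Illustrative values only.
const slug = ["State Bank of India", "Tamil Nadu", "Chennai", "T. Nagar (Main)"]
  .map(sanitizeSegment)
  .join("/");

console.log(slug); // state-bank-of-india/tamil-nadu/chennai/t-nagar-main
```

Each segment is sanitized separately so that the "/" separators between bank, state, city and branch survive.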
ftr (more for ourselves than anything else); currently this is how long a build took me after the last changes, on a non-stabilized machine (I usually run these on a stable benchmarking machine):
|
I wanted to try out the new hotness and with a bit of fiddling got it working too. It built in 316 seconds on my 16 inch macbook pro 🔥 One thing we could fix here is I had to manually truncate overly long paths to successfully write them to file (macs only allow 255 characters in file paths). We should just do that automatically. I thought I'd made an issue for this before but I'll make a new one.
My diff:
diff --git a/gatsby-node.js b/gatsby-node.js
index 2aa38d1..e0d26c3 100644
--- a/gatsby-node.js
+++ b/gatsby-node.js
@@ -1,5 +1,6 @@
-const slugify = text => text.replace(/ /g, "-").toLowerCase();
+const slugify = require(`slug`);
const path = require("path");
+const strhash = require(`string-hash`);
exports.onCreateNode = ({ node, actions }) => {
if (node.internal.type === "IfscCsv") {
@@ -11,12 +12,18 @@ exports.onCreateNode = ({ node, actions }) => {
const citySlug = slugify(city);
const branchSlug = slugify(branch);
- const slug = `${bankSlug}/${stateSlug}/${citySlug}/${branchSlug}-branch`;
+ let slug = `${bankSlug}/${stateSlug}/${citySlug}/${branchSlug}-branch`;
+
+ if (slug.length > 200) {
+ console.log(`old slug`, slug);
+ slug = slug.slice(0, 100) + strhash(slug.slice(100));
+ console.log(`new slug`, slug);
+ }
createNodeField({
node,
name: `slug`,
- value: slug
+ value: slug,
});
}
};
@@ -58,8 +65,8 @@ exports.createPages = async ({ graphql, actions, reporter }) => {
path: node.fields.slug,
component: path.resolve(`./src/templates/ifsc.tsx`),
context: {
- id: node.id
- }
+ id: node.id,
+ },
});
progress.tick(1);
});
diff --git a/package.json b/package.json
index 50601e8..88c18bd 100644
--- a/package.json
+++ b/package.json
@@ -25,7 +25,9 @@
"gatsby-transformer-csv": "^2.1.21",
"node-sass": "^4.13.0",
"react": "^16.12.0",
- "react-dom": "^16.12.0"
+ "react-dom": "^16.12.0",
+ "slug": "^2.1.0",
+ "string-hash": "^1.1.3"
},
"devDependencies": {
"@aleung/csvsplit": "^2.0.0", |
Just a note: there is only a limit of 255 for each filename, but the path itself has no limit. See https://en.wikipedia.org/wiki/Comparison_of_file_systems |
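If the limit applies per file name rather than per path, truncation can be done segment by segment instead of on the whole slug. A minimal sketch under that assumption; truncateSegment and truncatePath are hypothetical helpers, and the slugs are assumed to be ASCII so that one character is one byte.

```javascript
const crypto = require("crypto");

const MAX_SEGMENT_BYTES = 255; // per-file-name limit mentioned above

// Hypothetical helper: shorten one segment, keeping it unique with a hash suffix.
function truncateSegment(segment, maxBytes = MAX_SEGMENT_BYTES) {
  if (Buffer.byteLength(segment, "utf8") <= maxBytes) return segment;
  const hash = crypto.createHash("sha1").update(segment).digest("hex").slice(0, 8);
  // Assumes ASCII, so slicing by characters is the same as slicing by bytes.
  return `${segment.slice(0, maxBytes - hash.length - 1)}-${hash}`;
}

// Hypothetical helper: apply the limit to every segment of a slash-separated slug.
function truncatePath(slug) {
  return slug
    .split("/")
    .map((segment) => truncateSegment(segment))
    .join("/");
}

console.log(truncatePath(`bank/${"x".repeat(300)}`).split("/")[1].length); // 255
```

Hashing the full original segment means two long branch names that share a prefix still end up at different paths.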
I'm trying to build a site with ~150k pages (probably more than this when I get closer to finishing) using Gatsby with a CSV file as data source. I initially had a sample dataset with about 100 rows in a CSV file and developed my initial pages and it worked. When I tried running gatsby build with all 150k rows, the build was getting stuck in the "source and transform nodes" step.
As suggested by @KyleAMathews, I split the large CSV into multiple files (varied number of rows based on data) and the build now finishes "source and transform nodes" in about 100s, but fails with a heap out of memory error.
I also tried running the create pages benchmark site with 125k pages and it fails with the same error too, while it builds the site in less than 2 minutes for 100k pages.
I tried figuring out the underlying issue myself. From the page creation docs, I reached the pages reducer and found that we use a JavaScript Map for the state.
I was wondering if there's a hard limit on the number of items that can be set in a Map. From this StackOverflow answer, it looks like we can set only up to 2^24 (roughly 167k) items in a Map. I'm not very sure what else this redux state holds, but if it's storing only the pages, does ~167k become a hard limit for the number of pages that Gatsby can build?
There are a lot of places where we use Map in the Gatsby source code. It's probably one of them causing this out of memory error?