large data sets: 1.x has issues that 0.11.1, 0.12.1 do not #2226

Closed · SignpostMarv opened this issue Feb 18, 2022 · 35 comments
Labels: bug: memory (Memory limits)

SignpostMarv commented Feb 18, 2022

Describe the bug
I've a semi-open site generator project that squishes gigabytes of data sources down to about 6.9k pages to be processed by eleventy in two ways:

  1. the legacy markdown repo
  2. the more up-to-date json data source (pagination ftw!)

0.11.1 handles the 6.9k documents & 15.6 MB json file without issue; 1.0.0 falls over in a similar fashion to that described in #695.

To Reproduce
The site generator is semi-open in that the source is available at https://github.com/Satisfactory-Clips-Archive/Media-Search-Archive, but it's not feasible to stash 2.7gb+ of source data into the repo, so the repro steps aren't readily reproducible by anyone who doesn't have the data set.

While the method mentioned in #695 of specifying --max-old-space-size does move the goalposts somewhat, it still falls over with 8gb assigned.

Steps to reproduce the behaviour:

  1. npm run build or ./node_modules/.bin/eleventy --config=./.eleventy.pages.js
  2. watch & wait

Expected behaviour
1.x to handle 6.9k markdown documents or 6.9k json data file entries as reliably as 0.11.x does

Screenshots

<--- Last few GCs --->

[16544:000001D55D359230]   168029 ms: Mark-sweep 4038.5 (4130.3) -> 4024.3 (4130.3) MB, 3347.8 / 0.0 ms  (average mu = 0.138, current mu = 0.017) task scavenge might not succeed
[16544:000001D55D359230]   171431 ms: Mark-sweep 4039.7 (4131.5) -> 4025.8 (4132.0) MB, 3345.7 / 0.0 ms  (average mu = 0.081, current mu = 0.016) task scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
 1: 00007FF70641DF0F v8::internal::CodeObjectRegistry::~CodeObjectRegistry+113567
 2: 00007FF7063AD736 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+67398
 3: 00007FF7063AE5ED node::OnFatalError+301
 4: 00007FF706DA0CAE v8::Isolate::ReportExternalAllocationLimitReached+94
 5: 00007FF706D8B2FD v8::Isolate::Exit+653
 6: 00007FF706C2EC5C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1468
 7: 00007FF706C3AC57 v8::internal::Heap::PublishPendingAllocations+1159
 8: 00007FF706C37C3A v8::internal::Heap::PageFlagsAreConsistent+2874
 9: 00007FF706C2B919 v8::internal::Heap::CollectGarbage+2153
10: 00007FF706BDC315 v8::internal::IndexGenerator::~IndexGenerator+22133
11: 00007FF70633F0AF X509_STORE_CTX_get_lookup_certs+4847
12: 00007FF70633DA46 v8::CFunctionInfo::HasOptions+16150
13: 00007FF70647C27B uv_async_send+331
14: 00007FF70647BA0C uv_loop_init+1292
15: 00007FF70647BBAA uv_run+202
16: 00007FF70644ABD5 node::SpinEventLoop+309
17: 00007FF706365BC3 v8::internal::UnoptimizedCompilationInfo::feedback_vector_spec+52419
18: 00007FF7063E3598 node::Start+232
19: 00007FF70620F88C CRYPTO_memcmp+342300
20: 00007FF707322AC8 v8::internal::compiler::RepresentationChanger::Uint32OverflowOperatorFor+14488
21: 00007FFB71217034 BaseThreadInitThunk+20
22: 00007FFB71402651 RtlUserThreadStart+33

Environment:

  • OS and Version: Win 10, running tool via git bash
  • Eleventy Version: 1.0.0

Additional context

  • 6.9k markdown files (it still falls over even if the markdown path is excluded, due to the json file)
  • 6.9k objects in a pretty-printed json file, totalling 15.6mb spread over 280k lines
pdehaan (Contributor) commented Feb 18, 2022

Wow, that definitely wins for one of the larger sites/datasets I've seen in Eleventy!

You mentioned v0.11.1 and v1.0.0, but have you tried in v0.12.1 (which seems to be ~5 months newer than 0.11.x)? I'm curious if we can determine roughly where this may have changed/broke without having access to the ~2.7 GB of required data files.

npm info @11ty/eleventy time --json | grep -Ev "(canary|beta)" | tail -5

  "0.11.1": "2020-10-22T18:40:22.846Z",
  "0.12.0": "2021-03-19T19:24:27.860Z",
  "0.12.1": "2021-03-19T19:55:13.306Z",
  "1.0.0": "2022-01-08T20:27:32.789Z",

SignpostMarv (Author):

@pdehaan trying that now 👍

p.s. the data isn't exactly confidential, it's just more of a "I don't wanna have to spam up the git repo" thing :P

pdehaan (Contributor) commented Feb 18, 2022

p.s. the data isn't exactly confidential, it's just more of a "I don't wanna have to spam up the git repo" thing :P

Oh, no worries. I totally don't want to download 2.7 GB of data unless… nope, I just really don't want to download roughly 1,989 floppy disks' worth of data.

Although now I kind of want to add a "kb_to_floppy_disk" custom filter in Eleventy and represent all file sizes in relation to how many 3.5" floppy disks would be needed.
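For what it's worth, such a filter would only be a few lines in an Eleventy config; here's a throwaway sketch (the filter name and the 1.44 MB-per-disk figure are my own assumptions, not anything that ships with Eleventy):

// .eleventy.js : throwaway sketch of the joke filter; names are made up.
const KB_PER_FLOPPY = 1440; // 3.5" high-density disk, ~1.44 MB

module.exports = function (eleventyConfig) {
  // Usage in a template: {{ someSizeInKb | kb_to_floppy_disk }}
  eleventyConfig.addFilter("kb_to_floppy_disk", (sizeInKb) => {
    const disks = Math.ceil(sizeInKb / KB_PER_FLOPPY);
    return `${disks} floppy disk${disks === 1 ? "" : "s"}`;
  });
};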

SignpostMarv (Author):

It's the subtitles and video pages for about 5.9k youtube videos. (not sure how I've got 1k more transcriptions than I have clips 🤷‍♂️)

You mentioned v0.11.1 and v1.0.0, but have you tried in v0.12.1

that completes as expected, although I haven't diffed the output to see if there are any changes/bugs etc.

pdehaan (Contributor) commented Feb 18, 2022

that completes as expected, although I haven't diffed the output to see if there are any changes/bugs etc.

So, I think you're saying:
✔️ 0.11.1
✔️ 0.12.1
❌ 1.0.0
❓ 1.0.1-canary.3 (untested)

Doubting this has already been fixed in 1.0.1-canary builds, but if you were looking to try the sharpest of cutting edge builds, you could try npm i @11ty/eleventy@canary. 🔪

npm info @11ty/eleventy dist-tags --json

{
  "latest": "1.0.0",
  "beta": "1.0.0-beta.10",
  "canary": "1.0.1-canary.3"
}

SignpostMarv (Author):

<--- Last few GCs --->

[16756:0000024C7C0F9B80]   144598 ms: Mark-sweep (reduce) 4067.7 (4143.4) -> 4067.1 (4143.9) MB, 7589.3 / 0.0 ms  (average mu = 0.141, current mu = 0.001) allocation failure scavenge might not succeed
[16756:0000024C7C0F9B80]   151443 ms: Mark-sweep (reduce) 4068.3 (4144.1) -> 4067.8 (4144.6) MB, 6831.9 / 0.1 ms  (average mu = 0.080, current mu = 0.002) allocation failure scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 00007FF70641DF0F v8::internal::CodeObjectRegistry::~CodeObjectRegistry+113567
 2: 00007FF7063AD736 v8::internal::MicrotaskQueue::GetMicrotasksScopeDepth+67398
 3: 00007FF7063AE5ED node::OnFatalError+301
 4: 00007FF706DA0CAE v8::Isolate::ReportExternalAllocationLimitReached+94
 5: 00007FF706D8B2FD v8::Isolate::Exit+653
 6: 00007FF706C2EC5C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1468
 7: 00007FF706C2C151 v8::internal::Heap::CollectGarbage+4257
 8: 00007FF706C29AC0 v8::internal::Heap::AllocateExternalBackingStore+1904
 9: 00007FF706C464E0 v8::internal::FreeListManyCached::Reset+1408
10: 00007FF706C46B95 v8::internal::Factory::AllocateRaw+37
11: 00007FF706C5AB7A v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithFiller+90
12: 00007FF706C5AE63 v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithMap+35
13: 00007FF706A689A6 v8::internal::HashTable<v8::internal::NameDictionary,v8::internal::NameDictionaryShape>::EnsureCapacity<v8::internal::Isolate>+246
14: 00007FF706A6E88E v8::internal::BaseNameDictionary<v8::internal::NameDictionary,v8::internal::NameDictionaryShape>::Add+110
15: 00007FF70697AE68 v8::internal::Runtime::GetObjectProperty+1624
16: 00007FF706E33281 v8::internal::SetupIsolateDelegate::SetupHeap+513585
17: 0000024C0028643A
$ ./node_modules/.bin/eleventy --version
1.0.1-canary.3

SignpostMarv changed the title from "large data sets: 1.x has issues that 0.11.1 does not" to "large data sets: 1.x has issues that 0.11.1, 0.12.1 do not" on Feb 18, 2022
SignpostMarv (Author) commented Feb 18, 2022

I totally don't want to download 2.7 GB of data unless…

@pdehaan the problematic json source is only 2.7mb gzipped (in case one wanted to produce a bare-minimum reproducible case), although I suspect one could bulk-generate random test data for an array of objects with this structure (a rough generator sketch follows the example below) & it'd do the trick:

    {
        "id": "yt-0pKBBrBp9tM",
        "url": "https:\/\/youtu.be\/0pKBBrBp9tM",
        "date": "2022-02-15",
        "dateTitle": "February 15th, 2022 Livestream",
        "title": "State of Dave",
        "description": "00:00 Intro\n00:11 Presentation on Update 6\n01:23 Just simmering\n02:04 Recapping last week\n02:24 Hot Potato Save File\n04:53 Outro\n05:26 One more thing!",
        "topics": [
            "PLbjDnnBIxiEo8RlgfifC8OhLmJl8SgpJE"
        ],
        "other_parts": false,
        "is_replaced": false,
        "is_duplicate": false,
        "has_duplicates": false,
        "seealsos": false,
        "transcript": [
            /*
            this is an array of strings that could technically be structured objects, but are generally
            strings ranging from single words up to full paragraphs, with this example having about 5-7kb of strings in total
            */
        ],
        "like_count": 7,
        "video_object": {
            "@context": "https:\/\/schema.org",
            "@type": "VideoObject",
            "name": "State of Dave",
            "description": "00:00 Intro\n00:11 Presentation on Update 6\n01:23 Just simmering\n02:04 Recapping last week\n02:24 Hot Potato Save File\n04:53 Outro\n05:26 One more thing!",
            "thumbnailUrl": "https:\/\/img.youtube.com\/vi\/BBrBp9tM\/hqdefault.jpg",
            "contentUrl": "https:\/\/youtu.be\/0pKBBrBp9tM",
            "url": [
                "https:\/\/youtu.be\/0pKBBrBp9tM",
                "https:\/\/archive.satisfactory.video\/transcriptions\/yt-0pKBBrBp9tM\/"
            ],
            "uploadDate": "2022-02-15"
        }
    }
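A rough sketch of what such a bulk generator might look like (entirely hypothetical script and output path, not the archive's actual tooling), writing N randomised entries of the shape above:

// generate-test-data.js : hypothetical sketch; e.g. `node generate-test-data.js 9000`
const fs = require("fs");
const crypto = require("crypto");

const count = Number(process.argv[2] || 9000);
const words = (n) =>
  Array.from({ length: n }, () => crypto.randomBytes(4).toString("hex")).join(" ");

const entries = Array.from({ length: count }, (_, i) => {
  const id = `yt-${crypto.randomBytes(6).toString("hex")}`;
  return {
    id,
    url: `https://youtu.be/${id.slice(3)}`,
    date: "2022-02-15",
    dateTitle: "February 15th, 2022 Livestream",
    title: `Entry ${i}`,
    description: words(30),
    topics: ["PLbjDnnBIxiEo8RlgfifC8OhLmJl8SgpJE"],
    other_parts: false,
    is_replaced: false,
    is_duplicate: false,
    has_duplicates: false,
    seealsos: false,
    // roughly 5-7 KB of transcript strings per entry, as described above
    transcript: Array.from({ length: 40 }, () => words(20)),
    like_count: Math.floor(Math.random() * 100),
    video_object: {
      "@context": "https://schema.org",
      "@type": "VideoObject",
      name: `Entry ${i}`,
      uploadDate: "2022-02-15",
    },
  };
});

// written pretty-printed to mirror the real data file; the path is an assumption
fs.writeFileSync("src/_data/test.json", JSON.stringify(entries, null, 2));
console.log(`${count} entries written`);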

p.s. this is the template that's in use, in case the problem is a combination of data size and the template rather than data size alone: https://github.com/Satisfactory-Clips-Archive/Media-Search-Archive/blob/d5040ac3a42f8eca9517931812892d493b81d326/11ty/pages/transcriptions.njk

SignpostMarv (Author):

@pdehaan working on an isolated test case; I've managed to trigger the bug in 0.12, and will check at what point 0.12 succeeds where 1.0 fails.

SignpostMarv (Author):

@pdehaan isolated test case currently fails on 0.11, 0.12, and 1.0 at about 21980 entries: https://github.com/SignpostMarv/11ty-eleventy-issue-2226

usage: git checkout ${branch} && npm install && node ./generate ${number} && ./node_modules/.bin/eleventy

The data & templates aren't as complex as those in the media-search-archive repo; I'll take a second pass at making this more complex if it's not useful enough to let you experiment with avoiding the heap-out-of-memory issue.

SignpostMarv (Author):

test.json.gz
p.s. because the generator is currently non-seeded, please find attached the gzipped test.json file that all three versions currently fail on

SignpostMarv (Author):

@pdehaan including the markdown repo as a source across all three versions definitely suggests it's templating- or data-related, rather than input-related, as all three versions can handle 7k of just straight-up markdown files. Will amend further in the near future and keep you apprised.

SignpostMarv (Author):

@pdehaan bit of a delay with further investigation; I've started converting the runtime-generated data to pre-baked data, and it looks like having the 131k-line json data file in memory is what causes the problems.

SignpostMarv (Author):

@pdehaan I've updated the test-case repo so that it fails on 1.0 with 9k entries (node ./generate.js 9000) but runs on 0.11 and 0.12 without issue.

esheehan-gsl:

I'm hitting this problem as well. I have a site (only about 1,600 pages) that builds fine with Eleventy 0.12.0, but when I upgraded to 1.0.0 I get out-of-memory errors.

I've got a global data file (JS) that pulls data from a database (about 660 rows of data) and uses pagination to create one page for each entry from the database. If I shut the database off so that those pages don't get built, the build runs fine with 1.0.0.

I can work around the issue by increasing Node's max memory thus:

NODE_OPTIONS=--max_old_space_size=8192 npm run build

Not sure what happened with 1.0.0 that increased the memory usage this much (with pagination, or global data?) but it'd be great to get it back down.
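For context, that setup is roughly this shape; a simplified sketch with hypothetical file and helper names, not the actual site's code. Eleventy exposes whatever the global data file returns (an async function is fine) as `datasets`, and a paginated template with `data: datasets` / `size: 1` then builds one page per row:

// src/_data/datasets.js : simplified sketch; file and helper names are made up.
const { queryDatabase } = require("../../lib/db"); // hypothetical DB helper

module.exports = async function () {
  // ~660 rows, each with 30+ fields (HTML fragments, video paths, categories, ...)
  const rows = await queryDatabase("SELECT * FROM datasets");
  return rows;
};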

pdehaan (Contributor) commented Mar 30, 2022

/summon @zachleat
Possible performance regression between 0.12.x and 1.0.

Thanks @SignpostMarv, I'll try fetching the ZIP file from #2226 (comment) and see if it will build on my laptop locally (disclaimer: it's a higher end M1 MacBook Pro, so results may differ).

@esheehan-gsl How complex is your content from your database? (is it Liquid or Markdown? etc)
I've toyed with creating an "11ty lorem ipsum" blog generator in the past which just creates X pages based on some global variable, so I can poke at performance issues like this with bigger sites. But sometimes it comes down more to how many filters and plugins are in play and the general complexity of the site, rather than just 600 pages vs 6000 pages (which can be frustrating).

esheehan-gsl:

How complex is your content from your database? (is it Liquid or Markdown? etc)

There are quite a few fields coming from the database, probably over 30. Some of it is HTML, some of it is just metadata (paths to video files, categories) that gets rendered on the page.

If it helps, it's used to build these pages: https://sos.noaa.gov/catalog/datasets/721/

SignpostMarv (Author):

Thanks @SignpostMarv, I'll try fetching the ZIP file from #2226 (comment) and see if it will build on my laptop locally (disclaimer: it's a higher end M1 MacBook Pro, so results may differ).

@pdehaan to clarify, the zip file isn't needed, as the problem is replicable at a lower volume of generated pages (9k + supplementary data) rather than the zip file's higher volume (21.9k with no supplementary data).

pdehaan (Contributor) commented Mar 31, 2022

I created https://github.com/pdehaan/11ty-lorem which can generate 20k pages (in ~21s).
If I bump it to around ~22k pages, it seems to hit memory issues (on Eleventy v1.0.0).

SignpostMarv (Author):

@pdehaan could you now grab the supplementary data file from my test repo (or generate something similar) and see how much lower you have to drop the page count?

zachleat (Member) commented May 3, 2022

Howdy y’all, there are a few issues to organize here so just to keep things concise I am opening #2360 to coordinate this one. Please follow along there!

SignpostMarv (Author) commented May 7, 2022

@zachleat tracking updates specific to the test repo here, rather than on new/open tickets:

80000

40000

as above, except for:

  • 2.0.0-canary.9, succeeds

zachleat reopened this on May 12, 2022
zachleat (Member):

What are the success conditions here? Is 80K the goal?

SignpostMarv (Author):

@zachleat I was basing the test cases on your Google spreadsheet; one assumes that if it succeeds at 40k it'll succeed at the other sizes you found.

p.s. I'm not sure if the 80k "too many open files" thing should be counted as a new issue or a won't-fix?

SignpostMarv (Author):

success @ 50k + 55k + 59k + 59.9k + 59.92k + 59.925k + 59.928k + 59.929k, too many open files @ 60k + 59.99k + 59.95k + 59.93k

A couple things that I'm noticing:

  • whether it succeeds or fails, eleventy seems to hang on the last file for a while before spitting out the stats / completion messages
  • when it errors out with too many open files, eleventy reports 0 files written; is 11ty holding these in tmp / memory somewhere, rather than writing the output in batches?

adamlaki:

Is there any progress here? I also have a bigger JSON source (5.6MB with 270k rows) that makes circa 17k pages. On my local setup, I can build it with --max_old_space_size in ~5 minutes, but on Netlify it breaks with the heap limit.

On another topic: do you have any tips on importing this amount without breaking? Is an external database a better idea?

Thank you!

SignpostMarv (Author):

On another topic: do you have any tips on importing this amount without breaking? Is an external database a better idea?

Thank you!

The most terrible option would be to duplicate templates & split the data up.

adamlaki:

Yeah, that is something that came to my mind, too, but it will kill the pagination and the collection as a whole. It would be cool if we could break these files into smaller pieces and source them under the same collection or something similar.

For some reason, I could build it on Netlify without the error (maybe it needed time for the NODE_OPTIONS, or it had a better day, I am not sure, unfortunately), but it is still complicated to plan around knowing this problem. And my demo is quite plain, almost only data, with the biggest extra being an SVG image-inliner shortcode for the icons.

Thank you for the feedback. I'll update if there's anything worthwhile.

SignpostMarv (Author):

Yeah, that is something that came to my mind, too, but it will kill the pagination and the collection as a whole. It would be cool if we could break these files into smaller pieces and source them under the same collection or something similar.

If you're referring to pagination links, one assumes that if you're taking steps to have data automatically split, you can have pagination values automatically generated "correctly"?

adamlaki:

Breaking the source file beforehand could work for me if I could handle it as one collection at import. Still, it's much more editorial work to manage, but at least there's no hacking at the template level. For the pagination (to connect two sources): I think you can offset the second source's pagination, but you still end up with two unrelated data groups and more administration and hacky solutions.
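A minimal sketch of that split-then-recombine idea, assuming the big JSON has been pre-split into chunk files in a hypothetical data-chunks/ directory; a global JS data file then glues them back into a single `configs` array so pagination and collections keep working unchanged:

// src/_data/configs.js : sketch only; directory layout and names are assumptions.
const fs = require("fs");
const path = require("path");

// e.g. data-chunks/configs-000.json, configs-001.json, ... each holding an array
const CHUNK_DIR = path.join(__dirname, "..", "..", "data-chunks");

module.exports = function () {
  return fs
    .readdirSync(CHUNK_DIR)
    .filter((file) => file.endsWith(".json"))
    .sort()
    .flatMap((file) =>
      JSON.parse(fs.readFileSync(path.join(CHUNK_DIR, file), "utf8"))
    );
};

This keeps the editorial split at the file level while templates still see one collection, though it doesn't reduce peak memory by itself, since the merged array still ends up in memory.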

SignpostMarv (Author):

I've yet to revisit upgrades on mine since migrating away from the mixed markdown + json sources to eliminate the markdown source 🤔

Mcdostone commented Oct 28, 2023

I'm facing a similar issue. I have a 1.9GB JSON file (src/_data/configs.json) containing an array of 591,494 objects.

---
pagination:
  data: configs
  size: 1
  alias: config
permalink: "{{ config.permalink }}"
eleventyComputed: {
  title: "{{ config.data.symbol }}"
}
---

Hello {{ config.data.symbol }}

Unfortunately, it doesn't generate any HTML output. After some debugging steps, I realized that this.ctx.configs is empty and I have no visible errors in the console. I tried to increase the heap size (--max_old_space_size=8192) but that didn't help.

I reduced the size of the src/_data/configs.json file, and it turns out Eleventy works fine when the file size is below ~500MB.

Operating system: macOS Ventura, M1 Pro, 16GB
Eleventy version: 2.0.1

d3v1an7 commented Jul 22, 2024

Anyone landing here in 2024:

If using WebC:

  1. Check for nested webc:for (gets expensive quick)
  2. Switch from @html to @raw where possible (using @html once in the base layout seems to be sufficient!)

And/or:

  1. Just switch to v3! :)

I really didn't want to bump up RAM; we're still just in the 1,000s-of-assets range (with relatively chunky objects). Switched to v3 and haven't bumped into RAM issues since. Also much faster:

v2.0.2-alpha.2: Wrote 6871 files in 264.93 seconds (38.6ms each)
v3.0.0-alpha.17: Wrote 6871 files in 178.43 seconds (26.0ms each)

After also removing the nested for and switching to raw, it's now around:

v3.0.0-alpha.17: Wrote 6871 files in 92.99 seconds (13.5ms each)

adamlaki:

Hey @d3v1an7,

for me, it is still present, but somehow Netlify pushes it through (19k pages), although the live output is a bit buggy. On my local setup, I use a different data set with fewer records.

It is good news that v3 could solve it; I plan to migrate in the future.

Thanks for the update!

zachleat (Member):

Note that #2627 and #3272 shipped with 3.0 as well. Going to close this one for now. If you encounter it in 3.0 please open a new issue!

zachleat (Member):

~/Temp/eleventy-2226 (main ✘)✖✹✭ ᐅ npm run build

> build
> npx @11ty/eleventy --quiet

Heap allocated 0.17 GB
Free 0.38
[11ty] Wrote 20000 files in 3.42 seconds (0.2ms each, v3.0.0)
~/Temp/eleventy-2226 (main ✘)✖✹✭ ᐅ npm run generate      

> generate
> node generate 40000

40000 written
~/Temp/eleventy-2226 (main ✘)✖✹✭ ᐅ npm run build   

> build
> npx @11ty/eleventy --quiet

Heap allocated 0.33 GB
Free 0.39
[11ty] Wrote 40000 files in 6.59 seconds (0.2ms each, v3.0.0)
~/Temp/eleventy-2226 (main ✘)✖✹✭ ᐅ npm run generate

> generate
> node generate 80000

80000 written
~/Temp/eleventy-2226 (main ✘)✖✹✭ ᐅ npm run build   

> build
> npx @11ty/eleventy --quiet

Heap allocated 0.64 GB
Free 0.36
[11ty] Wrote 80000 files in 12.19 seconds (0.2ms each, v3.0.0)
~/Temp/eleventy-2226 (main ✘)✖✹✭ ᐅ npm run generate

> generate
> node generate 160000

160000 written
~/Temp/eleventy-2226 (main ✘)✖✹✭ ᐅ npm run build   

> build
> npx @11ty/eleventy --quiet

Heap allocated 1.3 GB
Free 0.28
[11ty] Wrote 160000 files in 26.39 seconds (0.2ms each, v3.0.0)

Confirmed that https://github.com/SignpostMarv/11ty-eleventy-issue-2226 worked up to 160k files on 3.0.0.
