
fix: speed up parsing long lists #2302

Merged: 10 commits merged into markedjs:master on Dec 2, 2021

Conversation

@calculuschild (Contributor) commented Nov 24, 2021

Marked version:

v3.0.0 +

Description

The List tokenizer was using a regex to capture the potential next list item, and then split that captured text line by line to determine whether it had proper indentation, etc., and whether each line should be part of the current list item.

The problem is that the captured text was literally the entire rest of the document, so for every potential list item we were capturing the whole document and then splitting it into lines. That is roughly O(n) splitting work for each of the n list items, or O(n²) overall, which is why longer documents spent the majority of their time just splitting the document into lines over and over.

Here's the offending Regex, and notice that (?:\\n[^\\n]*)* matches everything to the end of the document:

const itemRegex = new RegExp(`^( {0,3}${bull})((?: [^\\n]*| *)(?:\\n[^\\n]*)*(?:\\n|$))`);

This PR changes that so we only capture the first line (with its bullet point), and once we verify that it is a candidate for starting a new list item, we just traverse the src one line at a time. No more mass line-splitting when we really only need to look at one line at a time anyway.
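For comparison, a first-line-only pattern in the spirit of this change (an illustrative sketch, not necessarily the exact expression that was merged) simply drops the (?:\n[^\n]*)* tail so the match stops at the first newline:

const itemRegex = new RegExp(`^( {0,3}${bull})((?: [^\\n]*)?(?:\\n|$))`);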

Offending line-splitting line where 90% of processing time was spent:

lines = cap[2].split('\n');
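With the capture narrowed to the first line, the follow-up lines can be consumed one at a time instead. Roughly, the idea looks like this (a simplified sketch, not the merged code, which also handles indentation, blank lines, loose lists, and nested blocks; lineBelongsToItem is a hypothetical stand-in for those continuation checks):

// Sketch: extend the item line by line instead of splitting the whole document.
let raw = cap[0];                       // first line only, matched by the narrowed itemRegex
let remainder = src.slice(raw.length);  // everything after that first line
while (remainder.length > 0) {
  const newlineIndex = remainder.indexOf('\n');
  const line = newlineIndex === -1 ? remainder : remainder.slice(0, newlineIndex + 1);
  if (!lineBelongsToItem(line)) {       // hypothetical helper: indentation/continuation rules
    break;
  }
  raw += line;
  remainder = remainder.slice(line.length);
}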

Needs a bit of cleanup, but it's passing tests.

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

vercel bot commented Nov 24, 2021

This pull request is being automatically deployed with Vercel.
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/5cTrJbPLiAZvhvjqo23kEq9L2pBJ
✅ Preview: https://markedjs-git-fork-calculuschild-fixon2scalingforlists-markedjs.vercel.app

@@ -1 +0,0 @@
*foo __bar *baz bim__ bam*
Member

Why was this test removed? Is it failing with this PR?

Member

Never mind, I see this was an accidental addition.

Contributor Author

That was a test I added while playing around trying to debug my program, and I accidentally committed it to the PR.

@calculuschild changed the title from "Fixes applied. Passing all tests, but needs cleanup." to "Fix List Tokenizer O(n²) time on long lists." on Nov 24, 2021
Comment on lines +1 to +4
module.exports = {
markdown: '- a\n'.repeat(10000),
html: `<ul>${'<li>a</li>'.repeat(10000)}</ul>`
};
Member

I can confirm this speeds up parsing long lists.

This test takes about 2 seconds on my machine on master and about 100 ms with this PR.
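For anyone who wants to reproduce a rough comparison locally, a minimal timing script along these lines should work (assuming a marked version that exposes marked.parse, e.g. v4+; absolute numbers will vary by machine):

// Compare by running this once on master and once on this branch.
const { marked } = require('marked');

const markdown = '- a\n'.repeat(10000);

console.time('parse 10000-item list');
marked.parse(markdown);
console.timeEnd('parse 10000-item list');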

Contributor Author

Hooray!

@calculuschild (Contributor Author) commented

I want to go over this code once more before it gets merged. I have a nagging sense that there's still some debris left behind that I should clean up, but exactly where is eluding me at the moment.

Resolved review threads: lib/marked.cjs (outdated)
@calculuschild (Contributor Author) commented Nov 27, 2021

@UziTech Cleaned up the logic the way I wanted. It passes the specs but is failing the Snyk security check here and I'm not sure why.

Otherwise, this is now ready to be merged.

Edit: Ah, there was a merge conflict somewhere setting it off. All fixed now!

Resolved review threads: src/Tokenizer.js, lib/marked.esm.js, lib/marked.umd.js, lib/marked.cjs (outdated)
@UziTech changed the title from "Fix List Tokenizer O(n²) time on long lists." to "fix: speed up parsing long lists" on Dec 2, 2021
@UziTech UziTech merged commit e0005d8 into markedjs:master Dec 2, 2021
github-actions bot pushed a commit that referenced this pull request Dec 2, 2021
## 4.0.6 (v4.0.5...v4.0.6) (2021-12-02)

### Bug Fixes

* speed up parsing long lists (#2302) (e0005d8)
github-actions bot commented Dec 2, 2021

🎉 This PR is included in version 4.0.6 🎉

The release is available on:

Your semantic-release bot 📦🚀

Successfully merging this pull request may close these issues.

Performance of parse() is 15 times worse in releases after 2.1.3