
Add missing TagData_dealloc #303

Merged (1 commit, Apr 6, 2023)
Conversation

declerambaul (Contributor)

Fix for the memory leak that causes issues on certain wikitext (e.g. lots of nested tags, often vandalism); see #286.

I didn't investigate the algorithm itself, and while the leak was due to a missing dealloc, these problematic wikitexts still take a long time to parse. I am attaching some details below; they might be helpful if somebody wants to look at optimizing the search tree algorithm itself.
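For context, the leak is observable from Python alone: before the fix, repeatedly tokenizing the same input makes the process's peak memory grow on every iteration. A minimal sketch (assuming a Linux environment and the same bad.txt sample from the linked issue; the loop count is arbitrary):

import resource
from mwparserfromhell.parser._tokenizer import CTokenizer

wikitext = open('bad.txt').read()
for i in range(5):
    CTokenizer().tokenize(wikitext)
    # ru_maxrss is reported in kilobytes on Linux; with the missing
    # dealloc it keeps climbing, after the fix it plateaus
    print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)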

Created a script that parses a single "bad" wikitext (the one from the linked issue):

# Tokenize the problematic sample directly with the C tokenizer
from mwparserfromhell.parser._tokenizer import CTokenizer

wikitext = open('bad.txt').read()
tok = CTokenizer()
tok.tokenize(wikitext)
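The slow parse mentioned above can be measured independently of the leak; a minimal timing sketch under the same assumptions:

import time
from mwparserfromhell.parser._tokenizer import CTokenizer

wikitext = open('bad.txt').read()
start = time.perf_counter()
CTokenizer().tokenize(wikitext)
print(f"tokenize took {time.perf_counter() - start:.1f}s")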

The valgrind logs point to the problematic method. Setting PYTHONMALLOC=malloc makes CPython use the raw system allocator instead of its pymalloc arenas, so valgrind can attribute each allocation:

PYTHONMALLOC=malloc valgrind --leak-check=yes --track-origins=yes --log-file=valgrind-log.txt python helling.py

==17814== 24,718,216 (2,038,992 direct, 22,679,224 indirect) bytes in 42,479 blocks are definitely lost in loss record 8,248 of 8,248
==17814==    at 0x483877F: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==17814==    by 0x8DBF003: TagData_new (tag_data.c:39)
==17814==    by 0x8DC3BBC: Tokenizer_really_parse_tag (tok_parse.c:1765)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC2249: Tokenizer_parse_bold (tok_parse.c:2015)
==17814==    by 0x8DC2249: Tokenizer_parse_style (tok_parse.c:2176)
==17814==    by 0x8DC2249: Tokenizer_parse (tok_parse.c:3006)
==17814==    by 0x8DC2249: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)

@earwig (Owner) commented Apr 6, 2023

Thank you!

@elukey commented Sep 21, 2023

@declerambaul thanks a lot from the ML Team @ Wikimedia. We were investigating a memory leak in our model servers using mwparserfromhell, and once we found the culprit it was really nice to see that a fix was already in place and ready to go. It would have taken us ages to provide a fix like yours!

Thanks also to @earwig for the quick review/release cycle!
