
Add missing TagData_dealloc #303

Merged (1 commit, Apr 6, 2023)
Conversation

declerambaul (Contributor)

Fix for the memory leak that causes issues on certain wikitext (e.g. lots of nested tags, often vandalism); see #286.

I didn't investigate the algorithm itself, and while the leak was due to a missing dealloc, these problematic wikitexts still take a long time to parse. I am attaching some details below; they might be helpful if somebody wants to look at optimizing the search tree algorithm itself.
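For context, the leak is observable from Python alone: before the fix, repeatedly tokenizing the same input makes the process's peak memory grow on every iteration. A minimal sketch (assuming a Linux environment and the same bad.txt sample from the linked issue; the loop count is arbitrary):

import resource
from mwparserfromhell.parser._tokenizer import CTokenizer

wikitext = open('bad.txt').read()
for i in range(5):
    CTokenizer().tokenize(wikitext)
    # ru_maxrss is reported in kilobytes on Linux; with the missing
    # dealloc it keeps climbing, after the fix it plateaus
    print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)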

Created a script that parses a single "bad" wikitext (the one from the linked issue):

# Tokenize the problematic sample directly with the C tokenizer
from mwparserfromhell.parser._tokenizer import CTokenizer

wikitext = open('bad.txt').read()
tok = CTokenizer()
tok.tokenize(wikitext)
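The slow parse mentioned above can be measured independently of the leak; a minimal timing sketch under the same assumptions:

import time
from mwparserfromhell.parser._tokenizer import CTokenizer

wikitext = open('bad.txt').read()
start = time.perf_counter()
CTokenizer().tokenize(wikitext)
print(f"tokenize took {time.perf_counter() - start:.1f}s")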

The valgrind logs point to the problematic method. Setting PYTHONMALLOC=malloc makes CPython use the raw system allocator instead of its pymalloc arenas, so valgrind can attribute each allocation:

PYTHONMALLOC=malloc valgrind --leak-check=yes --track-origins=yes --log-file=valgrind-log.txt python helling.py

==17814== 24,718,216 (2,038,992 direct, 22,679,224 indirect) bytes in 42,479 blocks are definitely lost in loss record 8,248 of 8,248
==17814==    at 0x483877F: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==17814==    by 0x8DBF003: TagData_new (tag_data.c:39)
==17814==    by 0x8DC3BBC: Tokenizer_really_parse_tag (tok_parse.c:1765)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC2249: Tokenizer_parse_bold (tok_parse.c:2015)
==17814==    by 0x8DC2249: Tokenizer_parse_style (tok_parse.c:2176)
==17814==    by 0x8DC2249: Tokenizer_parse (tok_parse.c:3006)
==17814==    by 0x8DC2249: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)

@earwig (Owner) commented Apr 6, 2023

Thank you!

@elukey commented Sep 21, 2023

@declerambaul thanks a lot from the ML Team @ Wikimedia. We were investigating a memory leak in our model servers using mwparserfromhell, and once we found the culprit it was really nice to see that a fix was already in place and ready to go. It would have taken us ages to provide a fix like yours!

Thanks also to @earwig for the quick review/release cycle!
