Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

heap corruption while parsing huge comment #99581

Closed
9001 opened this issue Nov 18, 2022 · 8 comments
Closed

heap corruption while parsing huge comment #99581

9001 opened this issue Nov 18, 2022 · 8 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@9001
Copy link

9001 commented Nov 18, 2022

Crash report

A very large comment in heapcrpt.py causes tokenizer.c to perform an illegal write, leading to heap corruption and crashing the interpreter

Error messages

Linux/glibc: double free or corruption (!prev)
Windows: 0xc0000374 in event viewer

Your environment

Reproduced on cpython 3.10.0, 3.10.8, 3.12.0a2
Reproduced on fedora 35 (x64), windows 10 (x64, 17763.316)

Not reproduced on cpython 3.9.15
Not visibly reproduced on macos 10.13

Linked PRs

@9001 9001 added the type-crash A hard crash of the interpreter, possibly with a core dump label Nov 18, 2022
@sobolevn sobolevn added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 18, 2022
@sweeneyde
Copy link
Member

Thanks for the report. This is a backtrace I got from gdb for 3.11:

sween@DESKTOP-3GGN2FL:~/311$ gdb --args ./python /mnt/c/Users/sween/Source/Repos/cpython2/cpython/heapcrpt.py
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./python...
warning: File "/home/sween/311/python-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /home/sween/311/python-gdb.py
line to your configuration file "/home/sween/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/home/sween/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
(gdb) r
Starting program: /home/sween/311/python /mnt/c/Users/sween/Source/Repos/cpython2/cpython/heapcrpt.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Debug memory block at address p=0x7ffff37f6810: API 'm'
    810936 bytes originally requested
    The 7 pad bytes at p-7 are FORBIDDENBYTE, as expected.
    The 8 pad bytes at tail=0x7ffff38bc7c8 are not all FORBIDDENBYTE (0xfd):
        at tail+0: 0x00 *** OUCH
        at tail+1: 0xfd
        at tail+2: 0xfd
        at tail+3: 0xfd
        at tail+4: 0xfd
        at tail+5: 0xfd
        at tail+6: 0xfd
        at tail+7: 0xfd
    Data at p: 00 42 5a 68 35 31 41 59 ... 42 41 c2 ae 6c 34 54 0a

Enable tracemalloc to get the memory block allocation traceback

Fatal Python error: _PyMem_DebugRawFree: bad trailing pad byte
Python runtime state: initialized

Current thread 0x00007ffff71f4bc0 (most recent call first):
  <no Python frame>

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7243859 in __GI_abort () at abort.c:79
#2  0x0000555555d87db2 in fatal_error_exit (status=-1) at Python/pylifecycle.c:2625
#3  fatal_error (fd=fd@entry=2, header=header@entry=1,
    prefix=prefix@entry=0x555555fbd800 <__func__.15853> "_PyMem_DebugRawFree",
    msg=msg@entry=0x555555fbd3c0 "bad trailing pad byte", status=status@entry=-1)
    at Python/pylifecycle.c:2806
#4  0x0000555555d87e62 in _Py_FatalErrorFunc (
    func=func@entry=0x555555fbd800 <__func__.15853> "_PyMem_DebugRawFree",
    msg=msg@entry=0x555555fbd3c0 "bad trailing pad byte") at Python/pylifecycle.c:2822
#5  0x0000555555b0610c in _PyMem_DebugCheckAddress (
    func=func@entry=0x555555fbd800 <__func__.15853> "_PyMem_DebugRawFree", api=<optimized out>,
    p=0x7ffff37f6810) at Objects/obmalloc.c:2761
#6  0x0000555555b063d3 in _PyMem_DebugRawFree (ctx=ctx@entry=0x5555562500b0 <_PyMem_Debug+48>,
    p=p@entry=0x7ffff37f6810) at Objects/obmalloc.c:2571
#7  0x0000555555b06d32 in _PyMem_DebugFree (ctx=0x5555562500b0 <_PyMem_Debug+48>, ptr=0x7ffff37f6810)
    at Objects/obmalloc.c:2708
#8  0x0000555555b086e3 in PyMem_Free (ptr=<optimized out>) at Objects/obmalloc.c:652
#9  0x00005555559c4817 in _PyTokenizer_Free (tok=tok@entry=0x61e000029490) at Parser/tokenizer.c:811
#10 0x00005555558993de in _PyPegen_run_parser_from_file_pointer (fp=fp@entry=0x615000010480,
    start_rule=start_rule@entry=257, filename_ob=filename_ob@entry=0x60d00001e970,
    enc=enc@entry=0x0, ps1=ps1@entry=0x0, ps2=ps2@entry=0x0, flags=0x7fffffffd4a0, errcode=0x0,
    arena=0x6080001a54b0) at Parser/pegen.c:914
#11 0x00005555559bac9b in _PyParser_ASTFromFile (fp=fp@entry=0x615000010480,
    filename_ob=filename_ob@entry=0x60d00001e970, enc=enc@entry=0x0, mode=mode@entry=257,
    ps1=ps1@entry=0x0, ps2=ps2@entry=0x0, flags=0x7fffffffd4a0, errcode=0x0, arena=0x6080001a54b0)
    at Parser/peg_api.c:26
#12 0x0000555555d9379c in pyrun_file (fp=fp@entry=0x615000010480,
    filename=filename@entry=0x60d00001e970, start=start@entry=257,
    globals=globals@entry=0x6080000e6540, locals=locals@entry=0x6080000e6540,
    closeit=closeit@entry=1, flags=0x7fffffffd4a0) at Python/pythonrun.c:1621
#13 0x0000555555d9ac15 in _PyRun_SimpleFileObject (fp=fp@entry=0x615000010480,
    filename=filename@entry=0x60d00001e970, closeit=closeit@entry=1,
    flags=flags@entry=0x7fffffffd4a0) at Python/pythonrun.c:440
#14 0x0000555555d9af51 in _PyRun_AnyFileObject (fp=fp@entry=0x615000010480,
    filename=filename@entry=0x60d00001e970, closeit=closeit@entry=1,
    flags=flags@entry=0x7fffffffd4a0) at Python/pythonrun.c:79
#15 0x0000555555deba48 in pymain_run_file_obj (program_name=program_name@entry=0x6080001a4f30,
    filename=filename@entry=0x60d00001e970, skip_source_first_line=<optimized out>)
    at Modules/main.c:360
#16 0x0000555555dec222 in pymain_run_file (config=config@entry=0x5555564cd2a0 <_PyRuntime+59904>)
    at Modules/main.c:379
#17 0x0000555555dee89d in pymain_run_python (exitcode=exitcode@entry=0x7fffffffd730)
    at Modules/main.c:601
#18 0x0000555555deea19 in Py_RunMain () at Modules/main.c:680
#19 0x0000555555deec09 in pymain_main (args=args@entry=0x7fffffffd880) at Modules/main.c:710
#20 0x0000555555deef70 in Py_BytesMain (argc=2, argv=0x7fffffffda18) at Modules/main.c:734
#21 0x0000555555894166 in main (argc=<optimized out>, argv=<optimized out>) at ./Programs/python.c:15

I think it might be particular to running a file directly, i.e., python heapcrpt.py, since I couldn't reproduce with python -c "import heapcrpt"

The patch below stopped the error, but I'm assuming it would be better to find the actual source of the buffer overflow.

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index a5cfb659b4..2d23e5aa09 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -359,7 +359,7 @@ tok_reserve_buf(struct tok_state *tok, Py_ssize_t size)
         Py_ssize_t start = tok->start == NULL ? -1 : tok->start - tok->buf;
         Py_ssize_t line_start = tok->start == NULL ? -1 : tok->line_start - tok->buf;
         Py_ssize_t multi_line_start = tok->multi_line_start - tok->buf;
-        newbuf = (char *)PyMem_Realloc(newbuf, newsize);
+        newbuf = (char *)PyMem_Realloc(newbuf, newsize + 1);
         if (newbuf == NULL) {
             tok->done = E_NOMEM;
             return 0;

From adding assert(tok->inp < tok->end), I believe the actual overflow is here:

*tok->inp = '\0';

cc @pablogsal or @lysnikolaou

@pablogsal
Copy link
Member

Will try to take a look this night or tomorrow 👍

@9001
Copy link
Author

9001 commented Nov 19, 2022

Awesome :>
One detail about the final line in the file -- the crash disappears if any single byte above \x80 is replaced with something below \x80. Maybe this is related to charset conversion somehow?
This is also why the file is # coding: latin-1 rather than utf-8, since pypy would panic on the invalid byte sequences.

@pablogsal
Copy link
Member

Valgrind complains here:

==103320== Invalid write of size 1
==103320==    at 0x272851: tok_underflow_file (tokenizer.c:1005)
==103320==    by 0x272B73: tok_nextc (tokenizer.c:1079)
==103320==    by 0x27313F: tok_get (tokenizer.c:1453)
==103320==    by 0x274AF1: _PyTokenizer_Get (tokenizer.c:2158)
==103320==    by 0x203525: _PyPegen_fill_token (pegen.c:201)
==103320==    by 0x2038F3: _PyPegen_expect_token (pegen.c:355)
==103320==    by 0x2082F4: _tmp_7_rule (parser.c:23975)
==103320==    by 0x203897: _PyPegen_lookahead (pegen.c:346)
==103320==    by 0x26AFDA: compound_stmt_rule (parser.c:2003)
==103320==    by 0x26BA55: statement_rule (parser.c:1349)
==103320==    by 0x26BE31: _loop1_3_rule (parser.c:23733)
==103320==    by 0x26C00C: statements_rule (parser.c:1305)
==103320==  Address 0x58cf758 is 0 bytes after a block of size 810,936 alloc'd
==103320==    at 0x4846CC3: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==103320==    by 0x2E1E77: _PyMem_RawRealloc (obmalloc.c:65)
==103320==    by 0x2E3E24: PyMem_Realloc (obmalloc.c:610)
==103320==    by 0x270BFF: tok_reserve_buf (tokenizer.c:374)
==103320==    by 0x272684: tok_readline_recode (tokenizer.c:416)
==103320==    by 0x272826: tok_underflow_file (tokenizer.c:988)
==103320==    by 0x272B73: tok_nextc (tokenizer.c:1079)
==103320==    by 0x27313F: tok_get (tokenizer.c:1453)
==103320==    by 0x274AF1: _PyTokenizer_Get (tokenizer.c:2158)
==103320==    by 0x203525: _PyPegen_fill_token (pegen.c:201)
==103320==    by 0x2038F3: _PyPegen_expect_token (pegen.c:355)
==103320==    by 0x2082F4: _tmp_7_rule (parser.c:23975)

@pablogsal
Copy link
Member

The problem is that we are assuming that we can happily add a \n and a NULL terminator, but that's not true if we are at the end of the line.

@pablogsal
Copy link
Member

There are still some mysteries because we are supposedly allocating for that. Some assumption somewhere is wrong

@pablogsal
Copy link
Member

I will make a PR as soon as I get a better understanding on how the problem is being triggered

pablogsal added a commit to pablogsal/cpython that referenced this issue Nov 19, 2022
pablogsal added a commit to pablogsal/cpython that referenced this issue Nov 20, 2022
…ines that fill the available buffer

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Nov 20, 2022
…ines that fill the available buffer

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Nov 20, 2022
…ines that fill the available buffer

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
pablogsal added a commit that referenced this issue Nov 20, 2022
pablogsal added a commit to pablogsal/cpython that referenced this issue Nov 20, 2022
…pying lines that fill the available buffer (pythonGH-99605)

(cherry picked from commit e13d1d9)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Nov 20, 2022
…ines that fill the available buffer (pythonGH-99605)

(cherry picked from commit e13d1d9)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal added a commit to pablogsal/cpython that referenced this issue Nov 20, 2022
…pying lines that fill the available buffer (pythonGH-99605).

(cherry picked from commit e13d1d9)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
miss-islington added a commit that referenced this issue Nov 20, 2022
…hat fill the available buffer (GH-99605)

(cherry picked from commit e13d1d9)

Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
pablogsal added a commit that referenced this issue Nov 20, 2022
@hauntsaninja
Copy link
Contributor

Thanks everyone, looks like this has been fixed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
Projects
None yet
Development

No branches or pull requests

5 participants