Segfault with non-initialized codecs #142662

@shadchin

Description

Crash report

What happened?

The problem has occurred since commit f8290df (Python 3.13+).

Minimal reproducer:

  • Check out the cpython repository
  • ./configure
  • make
  • cd Programs
  • echo "# -*- coding: UTF -*-" > crash.py
  • ./_freeze_module crash crash.py crash.h
  • Result: Segmentation fault

Backtrace from gdb:

Program received signal SIGSEGV, Segmentation fault.
PyDict_GetItemRef (op=0x0, key='utf', result=result@entry=0x7fffffffdb70) at ./Include/object.h:795
795         return ((flags & feature) != 0);
(gdb) bt
#0  PyDict_GetItemRef (op=0x0, key='utf', result=result@entry=0x7fffffffdb70) at ./Include/object.h:795
#1  0x00005555557ab5a7 in _PyCodec_Lookup (encoding=encoding@entry=0x7ffff7bc4050 "UTF") at Python/codecs.c:164
#2  0x00005555557ac2e2 in _PyCodec_LookupTextEncoding (alternate_command=0x55555593d93d "codecs.decode()", encoding=0x7ffff7bc4050 "UTF") at Python/codecs.c:525
#3  codec_getitem_checked (index=1, alternate_command=0x55555593d93d "codecs.decode()", encoding=0x7ffff7bc4050 "UTF") at Python/codecs.c:574
#4  _PyCodec_TextDecoder (encoding=0x7ffff7bc4050 "UTF") at Python/codecs.c:590
#5  _PyCodec_DecodeText (object=object@entry=<memoryview at remote 0x7ffff7b88280>, encoding=encoding@entry=0x7ffff7bc4050 "UTF", errors=errors@entry=0x0) at Python/codecs.c:612
#6  0x000055555573fb39 in PyUnicode_Decode (s=s@entry=0x7ffff7b5dfb0 "# -*- coding: UTF -*-\n", size=<optimized out>, encoding=encoding@entry=0x7ffff7bc4050 "UTF", errors=errors@entry=0x0)
    at Objects/unicodeobject.c:3712
#7  0x000055555574007f in PyUnicode_Decode (s=s@entry=0x7ffff7b5dfb0 "# -*- coding: UTF -*-\n", size=<optimized out>, encoding=<optimized out>, encoding@entry=0x7ffff7bc4050 "UTF", errors=<optimized out>,
    errors@entry=0x0) at Objects/unicodeobject.c:3730
#8  0x000055555560f706 in _PyTokenizer_translate_into_utf8 (str=str@entry=0x7ffff7b5dfb0 "# -*- coding: UTF -*-\n", enc=0x7ffff7bc4050 "UTF") at Parser/tokenizer/helpers.c:206
#9  0x000055555560ecfc in decode_str (preserve_crlf=<optimized out>, tok=0x555555b7d510, single=<optimized out>, input=<optimized out>) at Parser/tokenizer/string_tokenizer.c:103
#10 _PyTokenizer_FromString (str=<optimized out>, exec_input=<optimized out>, preserve_crlf=<optimized out>) at Parser/tokenizer/string_tokenizer.c:125
#11 0x00005555555da1e7 in _PyPegen_run_parser_from_string (str=str@entry=0x555555b4c4a0 "# -*- coding: UTF -*-\n", start_rule=start_rule@entry=257, filename_ob=filename_ob@entry='<frozen crash>',
    flags=flags@entry=0x0, arena=arena@entry=0x7ffff7b5df70) at Parser/pegen.c:1054
#12 0x000055555560a0e6 in _PyParser_ASTFromString (str=str@entry=0x555555b4c4a0 "# -*- coding: UTF -*-\n", filename=filename@entry='<frozen crash>', mode=mode@entry=257, flags=flags@entry=0x0,
    arena=arena@entry=0x7ffff7b5df70) at Parser/peg_api.c:13
#13 0x0000555555826df5 in Py_CompileStringObject (optimize=0, flags=0x0, start=257, filename='<frozen crash>', str=0x555555b4c4a0 "# -*- coding: UTF -*-\n") at Python/pythonrun.c:1517
#14 Py_CompileStringExFlags (str=str@entry=0x555555b4c4a0 "# -*- coding: UTF -*-\n", filename_str=filename_str@entry=0x555555b4c2a0 "<frozen crash>", start=start@entry=257, flags=flags@entry=0x0,
    optimize=optimize@entry=0) at Python/pythonrun.c:1545
#15 0x00005555555c5398 in compile_and_marshal (text=0x555555b4c4a0 "# -*- coding: UTF -*-\n", name=0x7fffffffe2ed "crash") at Programs/_freeze_module.c:117
#16 main (argc=<optimized out>, argv=<optimized out>) at Programs/_freeze_module.c:231

If built with --with-pydebug, an assertion fails instead:

_freeze_module: Python/codecs.c:149: _PyCodec_Lookup: Assertion `interp->codecs.initialized' failed.

This problem is uncommon, but if you build your own analog of _freeze_module, it will segfault on source files whose declared encoding ends up in the codec registry lookup. We found the following cases in our repository: UTF, U8 :)

Before commit f8290df, if the codec registry was not initialized, _PyCodec_Lookup tried to initialize it and returned NULL if that failed:

    if (interp->codec_search_path == NULL && _PyCodecRegistry_Init()) {
        return NULL;
    }
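
The guard above follows a lazy-initialization pattern: the lookup sets up the registry on first use and reports a failed setup through a NULL return. Below is a minimal standalone sketch of that pattern, not CPython's code. The names `lazy_state`, `registry_init`, and `lookup_lazy` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-interpreter codec state. */
struct lazy_state {
    const char *search_path;  /* NULL until the registry is initialized */
    int init_should_fail;     /* knob to simulate an initialization failure */
};

/* Initialize the registry. Returns 0 on success, -1 on failure. */
static int registry_init(struct lazy_state *st)
{
    if (st->init_should_fail) {
        return -1;
    }
    st->search_path = "builtin";
    return 0;
}

/* Look up an encoding, initializing the registry on first use.
   Returns NULL if the registry cannot be initialized. */
static const char *lookup_lazy(struct lazy_state *st, const char *encoding)
{
    assert(st != NULL && encoding != NULL);  /* guards programming errors only */
    if (st->search_path == NULL && registry_init(st)) {
        return NULL;
    }
    return encoding;  /* placeholder for the real registry lookup */
}
```

With this shape, a caller that never initialized the state explicitly still gets either a valid result or a NULL it can handle, and never reaches the lookup with a NULL `search_path`.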

My naive solution is to replace the assert with a runtime check that returns NULL, similar to the old behavior:

@@ -138,6 +138,9 @@ PyObject *_PyCodec_Lookup(const char *encoding)
     }

     PyInterpreterState *interp = _PyInterpreterState_GET();
-    assert(interp->codecs.initialized);
+    if (!interp->codecs.initialized) {
+        return NULL;
+    }

     /* Convert the encoding to a normalized Python string: all
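
The effect of swapping the assert for a runtime check can be sketched in isolation. This is a standalone illustration, not CPython's code: `checked_state` and `lookup_checked` are hypothetical names. The `last_error` field is an addition of this sketch so the caller has something to report, whereas the patch above returns NULL without recording anything.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-interpreter codec state. */
struct checked_state {
    int initialized;
    const char *last_error;   /* set when a lookup fails, NULL otherwise */
};

/* Look up an encoding without attempting initialization.
   An uninitialized registry is reported as an ordinary failure
   in every build, instead of only being caught by a debug assert. */
static const char *lookup_checked(struct checked_state *st, const char *encoding)
{
    assert(st != NULL && encoding != NULL);  /* guards programming errors only */
    if (!st->initialized) {
        st->last_error = "codec registry is not initialized";
        return NULL;
    }
    st->last_error = NULL;
    return encoding;  /* placeholder for the real registry lookup */
}
```

Unlike the lazy-initialization variant, this version never initializes the registry itself, so callers must treat NULL as a normal outcome.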

CPython versions tested on:

3.13, 3.14, 3.15, CPython main branch

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.15.0a2+ (heads/main:c98182be8d4, Dec 13 2025, 16:49:21) [GCC 9.4.0]

Metadata

    Labels

    interpreter-core (Objects, Python, Grammar, and Parser dirs)
    type-crash (A hard crash of the interpreter, possibly with a core dump)
