Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace _Py_IDENTIFIER() with statically initialized objects. #90699

Closed
ericsnowcurrently opened this issue Jan 26, 2022 · 42 comments
Closed

Replace _Py_IDENTIFIER() with statically initialized objects. #90699

ericsnowcurrently opened this issue Jan 26, 2022 · 42 comments
Assignees
Labels
3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@ericsnowcurrently
Copy link
Member

ericsnowcurrently commented Jan 26, 2022

BPO 46541
Nosy @vstinner, @tiran, @ericsnowcurrently, @serhiy-storchaka, @corona10, @erlend-aasland, @kumaraditya303, @notarealdeveloper
PRs
  • bpo-46541: Replace core use of _Py_IDENTIFIER() with statically initialized global objects. #30928
  • bpo-46541: Generate the global objects initializer. #30941
  • bpo-46712: Share global string identifiers in deepfreeze #31261
  • bpo-46541: Regen the global objects using PYTHON_FOR_REGEN. #31344
  • bpo-46541: Discover the global strings. #31346
  • bpo-46541: Replace _Py_IDENTIFIER with _Py_ID in sqlite3 #31351
  • bpo-46541: Remove usage of _Py_IDENTIFIER from dbms modules #31358
  • bpo-46541: Drop the c-analyzer check for orphaned global strings. #31363
  • bpo-46541: Scan Fewer Files in generate_global_objects.py #31364
  • bpo-46541: Remove usage of _Py_IDENTIFIER from csv module #31372
  • bpo-46541: Remove usage of _Py_IDENTIFIER from mmap module #31375
  • bpo-46541: Remove usage of _Py_IDENTIFIER from array module #31376
  • bpo-46210: Fix deadlock in print. #30310
  • bpo-46541: Remove usage of _Py_IDENTIFIER from pyexpat #31468
  • bpo-46541: Remove usage of _Py_IDENTIFIER from multibytecodec #31475
  • bpo-46541: remove usage of _Py_IDENTIFIER from _ssl module #31599
  • bpo-46541: Remove unnecessary Py_VISIT (GH-31608) #31608
  • bpo-46541: Remove unneeded visits from sqlite3 #31609
  • bpo-46541: Remove usage of _Py_IDENTIFIER from lzma module #31683
  • bpo-46541: Add a comment about when to use _Py_DECLARE_STR(). #32063
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ericsnowcurrently'
    closed_at = None
    created_at = <Date 2022-01-26.22:01:04.421>
    labels = ['interpreter-core', '3.11']
    title = 'Replace _Py_IDENTIFIER() with statically initialized objects.'
    updated_at = <Date 2022-03-23.15:52:58.371>
    user = 'https://github.com/ericsnowcurrently'

    bugs.python.org fields:

    activity = <Date 2022-03-23.15:52:58.371>
    actor = 'eric.snow'
    assignee = 'eric.snow'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2022-01-26.22:01:04.421>
    creator = 'eric.snow'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46541
    keywords = ['patch']
    message_count = 28.0
    messages = ['411799', '411800', '411910', '411987', '412229', '412405', '412488', '412868', '413047', '413086', '413246', '413274', '413311', '413312', '413323', '413333', '413334', '413337', '413355', '413374', '413384', '413653', '414178', '414182', '414265', '414274', '414534', '415881']
    nosy_count = 8.0
    nosy_names = ['vstinner', 'christian.heimes', 'eric.snow', 'serhiy.storchaka', 'corona10', 'erlendaasland', 'kumaraditya', 'notarealdeveloper']
    pr_nums = ['30928', '30941', '31261', '31344', '31346', '31351', '31358', '31363', '31364', '31372', '31375', '31376', '30310', '31468', '31475', '31599', '31608', '31609', '31683', '32063']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue46541'
    versions = ['Python 3.11']

    @ericsnowcurrently
    Copy link
    Member Author

    _Py_Identifier has been useful but at this point there is a faster and simpler approach we could take as a replacement: statically initialize the objects as fields on _PyRuntimeState and reference them directly through a macro.

    This would involve the following:

    • add a PyUnicodeObject field (not a pointer) to _PyRuntimeStatefor each string that currently uses_Py_IDENTIFIER()`
    • initialize each object as part of the static initializer for _PyRuntimeState
    • make each "immortal" (e.g. start with a really high refcount)
    • add a macro to look up a given string
    • update each location that currently uses _Py_IDENTIFIER() to use the new macro instead

    As part of this, we would also do the following:

    • get rid of all C-API functions with _Py_Identifer parameters
    • get rid of the old runtime state related to identifiers
    • get rid of _Py_Identifier, _Py_IDENTIFIER(), etc.

    (Note that there are several hundred uses of _Py_IDENTIFIER(), including a number of duplicates.)

    Pros:

    • reduces indirection (and extra calls) for C-API using the strings (making the code easier to understand and speeding it up)
    • the objects are referenced from a fixed address in the static data section (speeding things up and allowing the C compiler to optimize better)
    • there is no lazy allocation (or lookup, etc.) so there are fewer possible failures when the objects get used (thus less error return checking)
    • simplifies the runtime state
    • saves memory (at little, at least)
    • the approach for per-interpreter is simpler (if needed)
    • reduces the number of static variables in any given C module
    • reduces the number of functions in the ("private") C-API
    • "deep frozen" modules can use these strings
    • other commonly-used strings could be pre-allocated by adding _PyRuntimeState fields for them

    Cons:

    • churn
    • adding a string to the list requires modifying a separate file from the one where you actually want to use the string
    • strings can get "orphaned" (we could prevent this with a check in make check)
    • some PyPI packages may rely on _Py_IDENTIFIER() (even though it is "private" C-API)
    • some strings may never get used for any given ./python invocation

    Note that with a basic partial implementation (GH-30928) I'm seeing a 1% improvement in performance (see faster-cpython/ideas#230).

    @ericsnowcurrently ericsnowcurrently added the 3.11 only security fixes label Jan 26, 2022
    @ericsnowcurrently ericsnowcurrently self-assigned this Jan 26, 2022
    @ericsnowcurrently ericsnowcurrently added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.11 only security fixes labels Jan 26, 2022
    @ericsnowcurrently ericsnowcurrently self-assigned this Jan 26, 2022
    @ericsnowcurrently ericsnowcurrently added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Jan 26, 2022
    @ericsnowcurrently
    Copy link
    Member Author

    ## Background ##

    _Py_Identifier (and _Py_IDENTIFIER(), etc.) was added in 2011 [1][2] for several reasons:

    • provide a consistent approach for a common optimization: caching C-string based string objects
    • facilitate freeing those objects during runtime finalization

    The solution involved using a static variable defined, using _Py_IDENTIFIER() near the code that needed the string. The variable (a _Py_Identifier) would hold the desired C string (statically initialized) and the corresponding (lazily created) PyUnicodeObject. The code where the _Py_Identifier was defined would then pass it to specialized versions of various C-API that would normally consume a C string or PyUnicodeObject. Then that code would use either the C-string or the object (creating and caching it first if not done already). This approach decentralized the caching but also provided a single tracking mechanism that made it easier to clean up the objects.

    Over the last decade a number of changes were made, including recent changes to make the identifiers per-interpreter and to use a centralized cache.

    [1] afe55bb
    [2] https://mail.python.org/archives/list/python-dev@python.org/message/FRUTTE47JO2XN3LXV2J4VB5A5VILILLA/

    @ericsnowcurrently
    Copy link
    Member Author

    New changeset 247480a by Eric Snow in branch 'main':
    bpo-46541: Generate the global objects initializer. (gh-30941)
    247480a

    @vstinner
    Copy link
    Member

    _Py_Identifier has been useful but at this point there is a faster and simpler approach we could take as a replacement: statically initialize the objects as fields on _PyRuntimeState and reference them directly through a macro.

    This change is going to break projects in the wild. Yes, people use the _Py_IDENTIFIER(), _PyUnicode_FromId() and other "Id" variant of many functions in 3rd party projects.

    Is it possible to keep runtime initialization if this API is used by 3rd party code?

    @ericsnowcurrently
    Copy link
    Member Author

    If necessary, we can keep _Py_IDENTIFIER() (and the functions). Regardless, we can stop using it internally.

    @ericsnowcurrently
    Copy link
    Member Author

    FYI, I've posted to python-dev for feedback before proceeding:

    https://mail.python.org/archives/list/python-dev@python.org/thread/DNMZAMB4M6RVR76RDZMUK2WRLI6KAAYS/

    @ericsnowcurrently
    Copy link
    Member Author

    (thanks Victor: https://mail.python.org/archives/list/python-dev@python.org/message/7RMLIJHUWVBZFV747TFEHOE6LNBVQSMM/)

    3rd party use of _Py_IDENTIFIER():

    All of them should be trivial to drop _Py_IDENTIFIER() without any real performance impact or mess.

    Also, the following implies that PyPy has some sort of _Py_IDENTIFIER() support: https://github.com/benhoyt/scandir/blob/3396aa4155ffde8600a0e9ca50d5872569169b5d/_scandir.c#L51.

    @ericsnowcurrently
    Copy link
    Member Author

    New changeset 81c7204 by Eric Snow in branch 'main':
    bpo-46541: Replace core use of _Py_IDENTIFIER() with statically initialized global objects. (gh-30928)
    81c7204

    @lazka
    Copy link
    Mannequin

    lazka mannequin commented Feb 11, 2022

    Sorry if off topic, but I noticed that CPython doesn't deprecate macros in code, while with gcc/clang it's possible to show compiler warnings for them using some pragma magic:

    $ gcc a.c
    a.c: In function 'main':
    a.c:29:13: warning: Deprecated pre-processor symbol
       29 |     PySomethingDeprecated (0);
          |             ^~~~~~~~~~~~~~~~~~
    a.c:30:13: warning: Deprecated pre-processor symbol: replace with "SomethingCompletelyDifferent"
       30 |     PySomethingDeprecated2 (42);
          |             ^~~~~~~~~~~~~~~~~~~~

    Here is how glib implements this for example: https://gist.github.com/lazka/4749c74249a3918a059d944040aca4a3

    Maybe that makes getting rid of them easier in the long run?

    @ericsnowcurrently
    Copy link
    Member Author

    On Fri, Feb 11, 2022 at 1:36 AM Christoph Reiter <report@bugs.python.org> wrote:

    Sorry if off topic, but I noticed that CPython doesn't deprecate macros in code, while with gcc/clang it's possible to show compiler warnings for them using some pragma magic:
    [snip]
    Maybe that makes getting rid of them easier in the long run?

    That's a good question. We do have Py_DEPRECATED() (in
    Include/pyport.h), which is used for symbols. I'm not sure anyone has
    given much thought to deprecating macros, but it's probably worth
    considering. I recommend that you post something about this to
    python-dev@python.org.

    FWIW, here are other explanations of how to deprecate macros:

    @ericsnowcurrently
    Copy link
    Member Author

    With core code sorted out, stdlib and 3rd party extension modules are left to sort out.

    I see the following possibilities:

    1. leave _Py_IDENTIFIER() alone (it is already disallowed in core code)
    2. change _Py_IDENTIFIER() to create static string objects (then get rid of global state)
    3. get rid of _Py_IDENTIFIER()
      a. provide a public alternative (needs a PEP)
      b. first work with 3rd party projects to stop using it

    @ericsnowcurrently
    Copy link
    Member Author

    New changeset 12360aa by Eric Snow in branch 'main':
    bpo-46541: Discover the global strings. (gh-31346)
    12360aa

    @ericsnowcurrently
    Copy link
    Member Author

    New changeset 6c89589 by Eric Snow in branch 'main':
    bpo-46541: Drop the check for orphaned global strings. (gh-31363)
    6c89589

    @ericsnowcurrently
    Copy link
    Member Author

    New changeset 4d8a515 by Eric Snow in branch 'main':
    bpo-46541: Scan Fewer Files in generate_global_objects.py (gh-31364)
    4d8a515

    @corona10
    Copy link
    Member

    New changeset e59309b by Dong-hee Na in branch 'main':
    bpo-46541: Remove usage of _Py_IDENTIFIER from dbms modules (GH-31358)
    e59309b

    @corona10
    Copy link
    Member

    New changeset d64f3ca by Dong-hee Na in branch 'main':
    bpo-46541: Remove usage of _Py_IDENTIFIER from csv module (GH-31372)
    d64f3ca

    @corona10
    Copy link
    Member

    New changeset b207711 by Erlend Egeberg Aasland in branch 'main':
    bpo-46541: Replace _Py_IDENTIFIER with _Py_ID in sqlite3 (GH-31351)
    b207711

    @corona10
    Copy link
    Member

    New changeset e8a19b0 by Dong-hee Na in branch 'main':
    bpo-46541: Remove usage of _Py_IDENTIFIER from mmap module (GH-31375)
    e8a19b0

    @ericsnowcurrently
    Copy link
    Member Author

    (from #31376 (comment))

    [corona10]

    Should we create the separate bpo issue if module changes are too noisy?

    I think it's fine to use the one issue. There are only 26 modules with NEEDS_PY_IDENTIFIER and none have much risk with the change. So it's unlikely that we'll get any discussion about any specific module. If we do, we can easily create an issue then for the module in question. If one of the modules had many uses of _Py_IDENTIFIER() then it might make sense to split it out, but IIRC none of them have very many.

    @corona10
    Copy link
    Member

    I think it's fine to use the one issue.

    Great! and thank you for all your efforts :)

    @corona10
    Copy link
    Member

    New changeset 8cb5f70 by Dong-hee Na in branch 'main':
    bpo-46541: Remove usage of _Py_IDENTIFIER from array module (GH-31376)
    8cb5f70

    @corona10
    Copy link
    Member

    New changeset 2b86616 by Dong-hee Na in branch 'main':
    bpo-46541: Remove usage of _Py_IDENTIFIER from pyexpat (GH-31468)
    2b86616

    @tiran
    Copy link
    Member

    tiran commented Feb 28, 2022

    New changeset 088dd76 by Dong-hee Na in branch 'main':
    bpo-46541: Remove unnecessary Py_VISIT (GH-31608)
    088dd76

    @corona10
    Copy link
    Member

    New changeset c32aef4 by Erlend Egeberg Aasland in branch 'main':
    bpo-46541: Remove unneeded visits from sqlite3 (GH-31609)
    c32aef4

    @kumaraditya303
    Copy link
    Contributor

    ctypes and datetime are similar cases which already use internal C API so I'll change it to use _Py_ID.

    @ericsnowcurrently
    Copy link
    Member Author

    I'm okay with that, though ideally we would use public API as much as possible and limit _Py_ID() to performance-sensitive code. That said, I favor getting this done over being excessively careful about how much we use _Py_ID(). 🙂

    @kumaraditya303
    Copy link
    Contributor

    I'm okay with that, though ideally we would use public API as much as possible and limit _Py_ID() to performance-sensitive code. That said, I favor getting this done over being excessively careful about how much we use _Py_ID(). 🙂

    practicality beats purity :) We can of course reconsider this when these modules will be ported to multiphase heap types.

    @kumaraditya303
    Copy link
    Contributor

    kumaraditya303 commented Nov 3, 2022

    PR #99067 removes the remaining.

    @kumaraditya303
    Copy link
    Contributor

    FYI, once #99067 is merged, I'll disable using _Py_IDENTIFIER in core code. With that completed I consider this done and deprecation/removal for _Py_IDENTIFIER can be discussed separately.

    @ericsnowcurrently
    Copy link
    Member Author

    We will address Programs/_testembed.c in the issue/PR where we remove _Py_IDENTIFIER() (or decide to not).

    @ericsnowcurrently
    Copy link
    Member Author

    Thanks for doing all that work, @kumaraditya303!

    @ericsnowcurrently
    Copy link
    Member Author

    We can close this once gh-99210 lands.

    @ericsnowcurrently
    Copy link
    Member Author

    This side of things is done. We can create a new issue about maybe getting rid of _Py_IDENTIFIER() and what to do about third-party usage in the community.

    @kumaraditya303
    Copy link
    Contributor

    kumaraditya303 commented Nov 9, 2022

    Thanks to everyone!

    @ericsnowcurrently: It would be nice to update the c analyzer tool as a lot of globals are removed as part of this.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs)
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants