Slow import times due to always importing optional packages #1460

Closed
mattwthompson opened this issue Feb 8, 2022 · 3 comments · Fixed by #1492

@mattwthompson
Contributor

Hi, thanks to the developers and maintainers for creating and supporting this software.

I've noticed some slow (> 1 second) import times and used Tuna to isolate it as the Pandas import.

$ python
Python 3.9.10 | packaged by conda-forge | (main, Feb  1 2022, 21:28:27)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> from pint import UnitRegistry
>>> 'pandas' in sys.modules
True
>>>
$ conda list | grep pandas
pandas                    1.4.0            py39h4d6be9b_0    conda-forge
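
(For reference, the profile below can be reproduced with CPython's built-in import timer and Tuna; the log file name is arbitrary:)

$ python -X importtime -c "from pint import UnitRegistry" 2> pint_imports.log
$ tuna pint_imports.log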

Here is Tuna's output of a script containing only from pint import UnitRegistry:

[image: Tuna profile of the script's import time, dominated by the pandas import]

It seems to come from here, the result of #959, in which pandas is imported if available (alongside some other packages) whether or not it's used. Note that I don't have pint-pandas installed, so I'm not trying to take advantage of that functionality.

I assume the thinking there is "if it's not needed, the user shouldn't have it installed," but Pandas is used by so many packages that it's hard to avoid it creeping in. Most of my use cases, i.e. tracking small amounts of data or something as simple as providing a custom unit registry, don't use Pandas, so the ~1 second of import time is an unwanted side effect.

@jthielen
Contributor

jthielen commented Feb 8, 2022

I assume the thinking there is "if it's not needed, the user shouldn't have it installed," [...]

That's not the intention with these imports. Instead, we need a list of types that Pint is not allowed to wrap: it's not about enabling optional features so much as ensuring that erroneous behavior cannot arise with whatever packages a user happens to have in their environment (e.g., we need to make sure that attempting to stick a pandas Series inside a pint Quantity always fails). Having a "deny list" rather than an "allow list" was a decision made early on, given Pint's relatively high place in the generally agreed-upon type casting hierarchy, although it will likely be revisited once consensus is reached on pydata/duck-array-discussion#3.
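
For context, the eager "deny list" pattern being described looks roughly like this (an illustrative sketch, not Pint's exact compat code; the names are placeholders):

# Eager "deny list": optional packages are imported at Pint import time
# just so their types can be registered, even if the user never uses them.
upcast_types = []

try:
    from pandas import DataFrame, Series  # pays the full pandas import cost
    upcast_types.extend([DataFrame, Series])
except ImportError:
    pass

def is_upcast_type(obj):
    # An isinstance check against real classes requires the imports above.
    return isinstance(obj, tuple(upcast_types))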

That being said, using type checks against the actual array types themselves still opens us up to those other libraries' slow import times also slowing down Pint. This style of check has caused circular import issues in the past as well (see pydata/xarray#4751 (comment)), so regardless of whether there is an "allow list" or a "deny list," I think we should move away from isinstance checks and see if fully qualified class name strings would work. @keewis, do you have suggestions for this?

@keewis
Contributor

keewis commented Feb 8, 2022

not really, besides pushing pydata/duck-array-discussion#3 forward.

For now, the code for fully_qualified_name could be something like:

def fully_qualified_name(obj):
    t = type(obj)
    module = t.__module__
    name = t.__qualname__

    if module is None or module == "builtins":  # "__builtin__" was the Python 2 name
        return name

    return f"{module}.{name}"

and we'd define a tuple / set of strings to compare against. upcast_types = ("xarray.core.dataarray.DataArray", ...), for example?
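
For illustration, a deny-list check built on that helper could then avoid importing anything at all (a minimal sketch; the set contents and function name are placeholders):

upcast_types = {
    "xarray.core.dataarray.DataArray",
    "pandas.core.series.Series",
}

def is_upcast_type(obj):
    # Pure string comparison: neither xarray nor pandas is imported here.
    return fully_qualified_name(obj) in upcast_types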

@hgrecco
Owner

hgrecco commented Mar 1, 2022

I like @keewis's proposal. One improvement would be to "delay" the import until the fully qualified name matches. The rationale would be as follows:

  1. we compare by fully qualified name,
  2. if it does not match, do nothing,
  3. if there is a match, we import the actual package (which should already be imported by the user, because we are actually getting an object) and we double-check.

Something like this (disregard the variable names):

from importlib import import_module

# Maps fully qualified names to None (not yet resolved), an import
# callable, or the resolved class once it has been seen.
upcast_types = {"fully.qualified.name": None}

def is_upcast_type_by_name(other):
    fqn = fully_qualified_name(other)
    try:
        importer = upcast_types[fqn]
    except KeyError:
        return False
    if isinstance(importer, type):
        cls = importer  # already resolved on a previous call
    elif callable(importer):
        cls = importer()
    else:
        module_name, class_name = fqn.rsplit(".", 1)
        cls = getattr(import_module(module_name), class_name)

    # Cache the resolved class so later checks skip the import machinery.
    upcast_types[fqn] = cls
    # This is to check we are importing the same thing
    # and avoid weird problems. Maybe instead of returning False
    # we should raise an error.
    return isinstance(other, cls)

def is_upcast_type(other):
    if type(other) in upcast_types.values():
        return True
    return is_upcast_type_by_name(other)

and in compat.py:

# Meaning: "we know of its existence and we can handle it".
# Alternatively, we could also allow a callable that does a more complex import.
upcast_types["xarray.core.dataarray.DataArray"] = None
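
Under this sketch, the behavior would be something like the following (assuming xarray is installed; hypothetical usage, not tested code):

import xarray as xr

da = xr.DataArray([1, 2, 3])
# The name matches, the (already-loaded) module is fetched via
# import_module, the class is cached, and isinstance confirms it.
assert is_upcast_type(da)

# A plain float matches no registered name, so nothing gets imported.
assert not is_upcast_type(3.14)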
