Slow import times due to always importing optional packages #1460

Closed
mattwthompson opened this issue Feb 8, 2022 · 3 comments · Fixed by #1492

@mattwthompson
Contributor

Hi, thanks to the developers and maintainers for creating and supporting this software.

I've noticed some slow (> 1 second) import times and used Tuna to isolate it as the Pandas import.

$ python
Python 3.9.10 | packaged by conda-forge | (main, Feb  1 2022, 21:28:27)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> from pint import UnitRegistry
>>> 'pandas' in sys.modules
True
>>>
$ conda list | grep pandas
pandas                    1.4.0            py39h4d6be9b_0    conda-forge
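
(For reference, the profile below can be reproduced with CPython's built-in import timer and Tuna; the log file name is arbitrary:)

$ python -X importtime -c "from pint import UnitRegistry" 2> pint_imports.log
$ tuna pint_imports.log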

Here is Tuna's output of a script containing only from pint import UnitRegistry:

[image: Tuna profile of the script's import time, dominated by the pandas import]

It seems to come from here, the result of #959, in which pandas is imported if available (alongside some other packages) whether or not it's used. Note that I don't have pint-pandas installed, so I'm not trying to take advantage of that functionality.

I assume the thinking there is "if it's not needed, the user shouldn't have it installed," but Pandas is used by so many packages that it's hard to avoid it creeping in. Most of my use cases, i.e. tracking small amounts of data or something as simple as providing a custom unit registry, don't use Pandas, so the ~1 second of import time is an unwanted side effect.

@jthielen
Contributor

jthielen commented Feb 8, 2022

I assume the thinking there is "if it's not needed, the user shouldn't have it installed," [...]

That's not the intention with these imports. Instead, we need a list of types that Pint is not allowed to wrap: it's not about enabling optional features so much as ensuring that erroneous behavior cannot arise with whatever packages a user happens to have in their environment (e.g., we need to make sure that attempting to stick a pandas Series inside a pint Quantity always fails). Having a "deny list" rather than an "allow list" was a decision made early on, given Pint's relatively high place in the generally agreed-upon type casting hierarchy, although it will likely be revisited once consensus is reached on pydata/duck-array-discussion#3.
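
For context, the eager "deny list" pattern being described looks roughly like this (an illustrative sketch, not Pint's exact compat code; the names are placeholders):

# Eager "deny list": optional packages are imported at Pint import time
# just so their types can be registered, even if the user never uses them.
upcast_types = []

try:
    from pandas import DataFrame, Series  # pays the full pandas import cost
    upcast_types.extend([DataFrame, Series])
except ImportError:
    pass

def is_upcast_type(obj):
    # An isinstance check against real classes requires the imports above.
    return isinstance(obj, tuple(upcast_types))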

That being said, using type checks against the actual array types themselves still opens us up to those other libraries' slow import times also slowing down Pint. This style of check has caused circular import issues in the past as well (see pydata/xarray#4751 (comment)), so regardless of whether there is an "allow list" or a "deny list," I think we should move away from isinstance checks and see if fully qualified class name strings would work. @keewis, do you have suggestions for this?

@keewis
Contributor

keewis commented Feb 8, 2022

not really, besides pushing pydata/duck-array-discussion#3 forward.

For now, the code for fully_qualified_name could be something like:

def fully_qualified_name(obj):
    t = type(obj)
    module = t.__module__
    name = t.__qualname__

    if module is None or module == "builtins":  # "__builtin__" was the Python 2 name
        return name

    return f"{module}.{name}"

and we'd define a tuple / set of strings to compare against. upcast_types = ("xarray.core.dataarray.DataArray", ...), for example?
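
For illustration, a deny-list check built on that helper could then avoid importing anything at all (a minimal sketch; the set contents and function name are placeholders):

upcast_types = {
    "xarray.core.dataarray.DataArray",
    "pandas.core.series.Series",
}

def is_upcast_type(obj):
    # Pure string comparison: neither xarray nor pandas is imported here.
    return fully_qualified_name(obj) in upcast_types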

@hgrecco
Owner

hgrecco commented Mar 1, 2022

I like @keewis's proposal. One improvement would be to "delay" the import until the fully qualified name matches. The rationale would be as follows:

  1. we compare by fully qualified name,
  2. if it does not match, do nothing,
  3. if there is a match, we import the actual package (which should already be imported by the user, because we are actually getting an object) and we double-check.

Something like this (disregard the variable names):

from importlib import import_module

# Maps fully qualified names to None (not yet resolved), an import
# callable, or the resolved class once it has been seen.
upcast_types = {"fully.qualified.name": None}

def is_upcast_type_by_name(other):
    fqn = fully_qualified_name(other)
    try:
        importer = upcast_types[fqn]
    except KeyError:
        return False
    if isinstance(importer, type):
        cls = importer  # already resolved on a previous call
    elif callable(importer):
        cls = importer()
    else:
        module_name, class_name = fqn.rsplit(".", 1)
        cls = getattr(import_module(module_name), class_name)

    # Cache the resolved class so later checks skip the import machinery.
    upcast_types[fqn] = cls
    # This is to check we are importing the same thing
    # and avoid weird problems. Maybe instead of returning False
    # we should raise an error.
    return isinstance(other, cls)

def is_upcast_type(other):
    if type(other) in upcast_types.values():
        return True
    return is_upcast_type_by_name(other)

and in compat.py:

# Meaning: "we know of its existence and we can handle it".
# Alternatively, we could also allow a callable that does a more complex import.
upcast_types["xarray.core.dataarray.DataArray"] = None
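
Under this sketch, the behavior would be something like the following (assuming xarray is installed; hypothetical usage, not tested code):

import xarray as xr

da = xr.DataArray([1, 2, 3])
# The name matches, the (already-loaded) module is fetched via
# import_module, the class is cached, and isinstance confirms it.
assert is_upcast_type(da)

# A plain float matches no registered name, so nothing gets imported.
assert not is_upcast_type(3.14)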
