[BUG] apply with a UDF that references the pandas module and another module fails to find __import__ with cudf.pandas #15548

Closed
blue-cat-whale opened this issue Apr 17, 2024 · 2 comments · Fixed by #15569
Labels
bug Something isn't working cudf.pandas Issues specific to cudf.pandas Python Affects Python cuDF API.

Comments


My code runs correctly without cudf. When I install cudf, it reports a 'NotImplementedError'. Which part of the code caused the problem? Is there a roadmap to implement it?

The error:

-------------------- program starts -----------------------
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 888, in _fast_slow_function_call
    fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs)
                             ^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 1007, in _fast_arg
    return _transform_arg(arg, "_fsproxy_fast", seen)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 934, in _transform_arg
    return tuple(_transform_arg(a, attribute_name, seen) for a in arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 934, in <genexpr>
    return tuple(_transform_arg(a, attribute_name, seen) for a in arg)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 917, in _transform_arg
    typ = getattr(arg, attribute_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 528, in _fsproxy_fast
    self._fsproxy_wrapped = self._fsproxy_slow_to_fast()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 786, in _fsproxy_slow_to_fast
    args, kwargs = _fast_arg(args), _fast_arg(kwargs)
                   ^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 1007, in _fast_arg
    return _transform_arg(arg, "_fsproxy_fast", seen)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 934, in _transform_arg
    return tuple(_transform_arg(a, attribute_name, seen) for a in arg)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 934, in <genexpr>
    return tuple(_transform_arg(a, attribute_name, seen) for a in arg)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 917, in _transform_arg
    typ = getattr(arg, attribute_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 528, in _fsproxy_fast
    self._fsproxy_wrapped = self._fsproxy_slow_to_fast()
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 174, in _fsproxy_slow_to_fast
    return slow_to_fast(self._fsproxy_wrapped)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/core/dataframe.py", line 8011, in from_pandas
    return DataFrame.from_pandas(obj, nan_as_null=nan_as_null)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/core/dataframe.py", line 5383, in from_pandas
    data = {
           ^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/core/dataframe.py", line 5384, in <dictcomp>
    col_name: column.as_column(
              ^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/core/column/column.py", line 1923, in as_column
    raise NotImplementedError("not supported")
NotImplementedError: not supported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 256, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
            ^^^^^^^^^
  File "/home/working/code/nn/tmp2.py", line 49, in my_func_single
    df_padding = my_df.apply(my_apply,axis=1,bias=branchIndex,n_l=name_list)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 837, in __call__
    result, _ = _fast_slow_function_call(
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 902, in _fast_slow_function_call
    result = func(*slow_args, **slow_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/cudf/pandas/fast_slow_proxy.py", line 30, in call_operator
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/pandas/core/frame.py", line 10361, in apply
    return op.apply().__finalize__(self, method="apply")
           ^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/share/.virtualenvs/cudf/lib64/python3.11/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/working/code/nn/tmp2.py", line 31, in my_apply
    t_now = datetime.strptime(df['Minute'], '%H:%M:%S')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '__import__'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/working/code/nn/tmp2.py", line 70, in <module>
    main()
  File "/home/working/code/nn/tmp2.py", line 67, in main
    my_func()
  File "/home/working/code/nn/tmp2.py", line 61, in my_func
    for obj in r:
  File "/usr/lib64/python3.11/concurrent/futures/process.py", line 606, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
KeyError: '__import__'

And here is my code:

from datetime import datetime, timedelta, date
import numpy as np
try:
    import cudf.pandas
    cudf.pandas.install()
except:
    print('cudf.pandas load failed')
import pandas as pd
from random import randint
import json, sys, os
from cudf.pandas.module_accelerator import disable_module_accelerator

from functools import partial
from concurrent.futures import ProcessPoolExecutor as Pool
from multiprocessing import set_start_method


def data_generation(nRows: int):
################## unimportant, for reproducing purpose ###################
# This function generates the dataframe obj, which has 5 columns, and the data are sorted by WorkingDay and Minute ascendingly
    my_df = pd.DataFrame(data={'WorkingDay': ['2019-01-02', '2018-01-02', '2019-05-02', '2020-01-02', '2021-01-02'], 'name': ['albert', 'alex', 'alice', 'ben', 'bob'], 'Minute': ['09:00:00', '09:20:00', '08:00:00', '07:00:00', '09:30:00'], 'aaa': np.random.rand(5), 'bbb': np.random.rand(5)})
    my_df = pd.concat([my_df for i in range(int(nRows/5))], axis=0)
    my_df['WorkingDay'] = my_df['WorkingDay'].map(lambda x: (date(randint(2010,2020), randint(1,4), randint(1,5))).strftime('%Y-%m-%d'))
    my_df['Minute'] = np.random.permutation(my_df['Minute'].values)
    my_df = my_df.sort_values(by=['WorkingDay', 'Minute'], inplace=False).reset_index(drop=True,inplace=False)
    return my_df


def my_apply(df, bias: int, n_l: list):
    df_padding = None
    t_now = datetime.strptime(df['Minute'], '%H:%M:%S')
    for i in range(2):
        df_padding = pd.concat([df_padding,df],axis=1)
        df_padding.loc[df_padding.index[-1],'aaa'] = df['aaa'] + i
        df_padding.loc[df_padding.index[-1],'name'] = n_l[i]
        df_padding.loc[df_padding.index[-1],'bbb'] = df['bbb'] + bias
        t_now = t_now+timedelta(minutes=2)
        df_padding.loc[df_padding.index[-1],'Minute'] = t_now.strftime('%H:%M:%S')
    return df_padding.transpose()


def my_func_single(branchIndex: int):
    my_df = data_generation(20-5*branchIndex)
    my_df[['WorkingDay','name','Minute']] = my_df[['WorkingDay','name','Minute']].astype('string')
    name_list = ['a_albert', 'b_bob', 'c_chris', 'd_dave']
# data generated
# -------------------------- The problem comes from below ------------------------
    with disable_module_accelerator():
        df_padding = my_df.apply(my_apply,axis=1,bias=branchIndex,n_l=name_list)
        df_padding = df_padding.T.dropna().reset_index(drop=True)
        df_padding = pd.concat([r for r in df_padding],axis=0).reset_index(drop=True)
    return df_padding
    #return df_padding.values, list(df_padding.index)


def my_func():
    set_start_method('spawn')
    my_func_partial = partial(my_func_single)
    with Pool(max_workers=2) as pool:
        r = pool.map(my_func_partial, range(3))
    for obj in r:
        #print('df has length: {}.'.format(obj))
        print('df has length: {}.'.format(obj.shape[0]))


def main():
    print('-------------------- program starts -----------------------')
    my_func()


if __name__ == '__main__':
    main() 

I'm using cudf-cu12==24.4.0 and pandas==2.2.1.

@blue-cat-whale blue-cat-whale added the question Further information is requested label Apr 17, 2024

bdice commented Apr 17, 2024

@mroeschke Would you have insight here? I know you’ve looked at datetimes and as_column refactoring lately. It looks like we’re hitting this:

raise NotImplementedError("not supported")
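The chained tracebacks in this report ("During handling of the above exception, another exception occurred") are what a fast-then-slow fallback produces when the slow path also fails. Below is a minimal sketch of that pattern in plain Python. The names `fast_slow_call`, `fast_only_ints`, `slow_anything`, and `slow_broken` are hypothetical and only illustrate the idea; this is not cudf's actual implementation.

```python
def fast_slow_call(fast_fn, slow_fn, *args, **kwargs):
    """Hypothetical helper: try fast_fn, fall back to slow_fn on failure."""
    try:
        return fast_fn(*args, **kwargs), True
    except Exception:
        # The slow path runs inside the handler, so any error it raises is
        # chained onto the fast-path error via __context__.
        return slow_fn(*args, **kwargs), False


def fast_only_ints(x):
    # Stand-in for an accelerated path that supports only some inputs.
    if not isinstance(x, int):
        raise NotImplementedError("not supported")
    return x * 2


def slow_anything(x):
    # Stand-in for the general-purpose path.
    return x + x


def slow_broken(x):
    # Stand-in for a slow path that fails for an unrelated reason.
    raise KeyError("__import__")


fast_result = fast_slow_call(fast_only_ints, slow_anything, 3)
fallback_result = fast_slow_call(fast_only_ints, slow_anything, "ab")

try:
    fast_slow_call(fast_only_ints, slow_broken, "ab")
except KeyError as err:
    # Name of the fast-path error that the KeyError was raised "during handling of".
    chained = type(err.__context__).__name__

print(fast_result)      # (6, True)
print(fallback_result)  # ('abab', False)
print(chained)          # NotImplementedError
```

Read this way, the NotImplementedError in the first traceback is the fast path giving up, and the KeyError is a separate failure on the slow path.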

@mroeschke
Contributor

Thanks for the report. Regarding @bdice's comment about the NotImplementedError: the astype("string") call in my_func_single is not supported yet, because cudf.pandas cannot faithfully round-trip these types (#14149). Using astype(str) should achieve a similar result if your data doesn't have NAs.
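As a rough illustration of the difference between the two casts, here is a small sketch in plain pandas (without cudf.pandas loaded). The sample values are made up for illustration, and the dtype returned by astype(str) depends on the pandas version (object dtype on pandas 2.x, which the reporter is using).

```python
import pandas as pd

# Hypothetical sample data with no missing values.
minutes = pd.Series(["09:00:00", "09:20:00", "08:00:00"])

as_string = minutes.astype("string")  # pandas extension StringDtype
as_str = minutes.astype(str)          # object dtype on pandas 2.x

print(as_string.dtype)
print(as_str.dtype)

# With no NAs in the data, both casts hold the same Python strings.
same_values = as_string.tolist() == as_str.tolist()
print(same_values)  # True
```

If the data does contain missing values, the two casts can represent them differently, which is why the suggestion above is limited to data without NAs.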

However, once that is fixed, I think we're hitting an actual bug when a UDF references the pandas module and another module from the global namespace (possibly related to #14482).

In [1]: %load_ext cudf.pandas
   ...: import pandas as pd
   ...: from datetime import datetime
   ...: 
   ...: def my_apply(df, bias: int):
   ...:     datetime.strptime(df['Minute'], '%H:%M:%S')
   ...:     return pd.to_numeric(1)
   ...: 
   ...: my_df = pd.DataFrame({'Minute': ['09:00:00']})
   ...: my_df.apply(my_apply,axis=1, bias=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/python/cudf/cudf/pandas/fast_slow_proxy.py:889, in _fast_slow_function_call(func, *args, **kwargs)
    888 fast_args, fast_kwargs = _fast_arg(args), _fast_arg(kwargs)
--> 889 result = func(*fast_args, **fast_kwargs)
    890 if result is NotImplemented:
    891     # try slow path

File ~/python/cudf/cudf/pandas/fast_slow_proxy.py:30, in call_operator(fn, args, kwargs)
     29 def call_operator(fn, args, kwargs):
---> 30     return fn(*args, **kwargs)

File ~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
    117 libnvtx_pop_range(self.domain.handle)

File ~python/cudf/cudf/core/dataframe.py:4603, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   4601     raise ValueError("The `result_type` kwarg is not yet supported.")
-> 4603 return self._apply(func, _get_row_kernel, *args, **kwargs)

File ~/miniforge3/envs/cudf-dev/lib/python3.11/contextlib.py:81, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     80 with self._recreate_cm():
---> 81     return func(*args, **kwds)

File ~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
    117 libnvtx_pop_range(self.domain.handle)

File ~/python/cudf/cudf/core/indexed_frame.py:3446, in IndexedFrame._apply(self, func, kernel_getter, *args, **kwargs)
   3445 if kwargs:
-> 3446     raise ValueError("UDFs using **kwargs are not yet supported.")
   3447 try:

ValueError: UDFs using **kwargs are not yet supported.

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
Cell In[1], line 10
      7     return pd.to_numeric(1)
      9 my_df = pd.DataFrame({'Minute': ['09:00:00']})
---> 10 my_df.apply(my_apply,axis=1, bias=1)

File ~/python/cudf/cudf/pandas/fast_slow_proxy.py:837, in _CallableProxyMixin.__call__(self, *args, **kwargs)
    836 def __call__(self, *args, **kwargs) -> Any:
--> 837     result, _ = _fast_slow_function_call(
    838         # We cannot directly call self here because we need it to be
    839         # converted into either the fast or slow object (by
    840         # _fast_slow_function_call) to avoid infinite recursion.
    841         # TODO: When Python 3.11 is the minimum supported Python version
    842         # this can use operator.call
    843         call_operator,
    844         self,
    845         args,
    846         kwargs,
    847     )
    848     return result

File ~/python/cudf/cudf/pandas/fast_slow_proxy.py:902, in _fast_slow_function_call(func, *args, **kwargs)
    900         slow_args, slow_kwargs = _slow_arg(args), _slow_arg(kwargs)
    901         with disable_module_accelerator():
--> 902             result = func(*slow_args, **slow_kwargs)
    903 return _maybe_wrap_result(result, func, *args, **kwargs), fast

File ~/python/cudf/cudf/pandas/fast_slow_proxy.py:30, in call_operator(fn, args, kwargs)
     29 def call_operator(fn, args, kwargs):
---> 30     return fn(*args, **kwargs)

File ~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/core/frame.py:10361, in DataFrame.apply(self, func, axis, raw, result_type, args, by_row, engine, engine_kwargs, **kwargs)
  10347 from pandas.core.apply import frame_apply
  10349 op = frame_apply(
  10350     self,
  10351     func=func,
   (...)
  10359     kwargs=kwargs,
  10360 )
> 10361 return op.apply().__finalize__(self, method="apply")

File ~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/core/apply.py:916, in FrameApply.apply(self)
    913 elif self.raw:
    914     return self.apply_raw(engine=self.engine, engine_kwargs=self.engine_kwargs)
--> 916 return self.apply_standard()

File ~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/core/apply.py:1063, in FrameApply.apply_standard(self)
   1061 def apply_standard(self):
   1062     if self.engine == "python":
-> 1063         results, res_index = self.apply_series_generator()
   1064     else:
   1065         results, res_index = self.apply_series_numba()

File ~/miniforge3/envs/cudf-dev/lib/python3.11/site-packages/pandas/core/apply.py:1081, in FrameApply.apply_series_generator(self)
   1078 with option_context("mode.chained_assignment", None):
   1079     for i, v in enumerate(series_gen):
   1080         # ignore SettingWithCopy here in case the user mutates
-> 1081         results[i] = self.func(v, *self.args, **self.kwargs)
   1082         if isinstance(results[i], ABCSeries):
   1083             # If we have a view on v, we need to make a copy because
   1084             #  series_generator will swap out the underlying data
   1085             results[i] = results[i].copy(deep=False)

Cell In[1], line 6, in my_apply(df, bias)
      5 def my_apply(df, bias: int):
----> 6     datetime.strptime(df['Minute'], '%H:%M:%S')
      7     return pd.to_numeric(1)

KeyError: '__import__'

vs

In [1]: %load_ext cudf.pandas
   ...: import pandas as pd
   ...: from datetime import datetime
   ...: 
   ...: def my_apply(df, bias: int):
   ...:     datetime.strptime(df['Minute'], '%H:%M:%S')
   ...:     return 1
   ...: 
   ...: my_df = pd.DataFrame({'Minute': ['09:00:00']})
   ...: my_df.apply(my_apply,axis=1, bias=1)
Out[1]: 
0    1
dtype: int64

@mroeschke mroeschke added bug Something isn't working Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas and removed question Further information is requested labels Apr 17, 2024
@mroeschke mroeschke changed the title [QST] Can't we use datetime module with cudf? [BUG] apply with a UDF that references the pandas module and another module fails to find __import__ with cudf.pandas Apr 17, 2024
@galipremsagar galipremsagar added this to the Proxying - cudf.pandas milestone Apr 18, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 19, 2024
closes #15548

`_replace_closurevars` creates a new function by replacing objects with their fast versions. When creating the new function, it populates `globals` from the result of `inspect.getclosurevars`, but I don't think that comprehensively returns _all_ the globals accessible to the function (`function.__globals__`).

To minimize the change, the "fast globals" are still sourced from `inspect.getclosurevars`, and they are applied as updates on top of `old_function.__globals__` when creating the new function.
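The gap between `inspect.getclosurevars` and `function.__globals__` can be seen with the standard library alone. The sketch below is an illustration under assumptions, not the code from the fix: `udf` and `rebuild` are hypothetical names, and `rebuild` simply layers a dictionary of replacement globals over the original function's `__globals__` before constructing a new function with `types.FunctionType`.

```python
import inspect
import types
from datetime import datetime


def udf(value):
    # The body names exactly one global: `datetime`.
    return datetime.strptime(value, "%H:%M:%S").minute


# getclosurevars reports only the globals that the function body names.
referenced_globals = dict(inspect.getclosurevars(udf).globals)


def rebuild(func, replacements):
    """Hypothetical helper: copy func, overriding selected globals."""
    # Start from the complete namespace (which includes __builtins__),
    # then apply the replacements on top.
    new_globals = {**func.__globals__, **replacements}
    return types.FunctionType(
        func.__code__,
        new_globals,
        name=func.__name__,
        argdefs=func.__defaults__,
        closure=func.__closure__,
    )


rebuilt = rebuild(udf, referenced_globals)

print(sorted(referenced_globals))             # ['datetime']
print("__builtins__" in referenced_globals)   # False
print("__builtins__" in rebuilt.__globals__)  # True
print(rebuilt("09:20:00"))                    # 20
```

A namespace built only from the `getclosurevars` result would lack entries such as `__builtins__` that the function never names directly but that code it calls may still rely on.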

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15569

4 participants