BUG: modin on ray produce error with empty dataframes #5430

Closed
3 tasks done
Egor-Krivov opened this issue Dec 13, 2022 · 5 comments · Fixed by #5442
Labels
bug 🦗 Something isn't working P0 Highest priority tasks requiring immediate fix pandas concordance 🐼 Functionality that does not match pandas

Comments

@Egor-Krivov
Contributor

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

pd.DataFrame({'item': [1, 2], 'week': [1, 2]}).query('10 < week').groupby('item').size()

Issue Description

This code raises an error with Modin on Ray but works on pandas. My benchmark uses slightly different code, which likewise works on pandas but fails on Modin.

Expected Behavior

Modin on ray should behave like pandas.
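
For reference, here is a minimal pandas-only sketch of the expected result. It assumes plain pandas is imported as `pd` (the reproducer above omits the import), and the variable names are only illustrative:

```python
import pandas as pd

# Same data as in the reproducer above.
df = pd.DataFrame({"item": [1, 2], "week": [1, 2]})

# No row satisfies the condition, so the filtered frame is empty.
filtered = df.query("10 < week")

# pandas returns an empty integer Series here instead of raising.
result = filtered.groupby("item").size()
print(result)
```

Replacing the import with `import modin.pandas as pd` should, once the bug is fixed, give the same empty Series.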

Error Logs

(MODIN-ON-RAY)
pd.DataFrame({'item': [1, 2], 'week': [1, 2]}).query('10 < week').groupby('item').size()
UserWarning: Distributing <class 'dict'> object. This may take some time. 
UserWarning: `DataFrame.__getitem__` for empty DataFrame is not currently supported by PandasOnRay, defaulting to pandas implementation.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
*** IndexError: list index out of range 

(PANDAS) 
>>> pd.DataFrame({'item': [1, 2], 'week': [1, 2]}).query('10 < week').groupby('item').size()
Series([], dtype: int64)

Installed Versions

UserWarning: Setuptools is replacing distutils.

INSTALLED VERSIONS

commit : c30ab4c
python : 3.8.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-124-generic
Version : #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US
LOCALE : en_US.ISO8859-1

Modin dependencies

modin : 0.7.3+1359.gc30ab4c1
ray : 2.0.1
dask : None
distributed : None
hdk : present

pandas dependencies

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.6.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

@Egor-Krivov Egor-Krivov added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Dec 13, 2022
@Egor-Krivov
Contributor Author

Surprisingly, in my case the problem appears with a non-empty dataframe after query. For some reason, calling .iloc[:10000000000000] right after query resolves the issue.

@vnlitvinov
Collaborator

@Egor-Krivov you seem to have quite a funny Modin version reported, how did you install that?

cc @dchigarev I wonder if your recent groupby().size() PR fixed that...

I've run the reproducer on my Windows laptop, and I'm seeing a different error here:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Vass\ponder\modin\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Vass\ponder\modin\modin\pandas\dataframe.py", line 480, in groupby
    if by is not None and by in self._query_compiler.get_index_names(axis):
  File "C:\Vass\ponder\modin\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Vass\ponder\modin\modin\core\storage_formats\base\query_compiler.py", line 3464, in get_index_names
    return self.get_axis(axis).names
AttributeError: 'list' object has no attribute 'names'

@vnlitvinov vnlitvinov added pandas concordance 🐼 Functionality that does not match pandas P0 Highest priority tasks requiring immediate fix and removed Triage 🩹 Issues that need triage labels Dec 13, 2022
@Egor-Krivov
Contributor Author

Indeed, my version is strange. I will reproduce my installation and check what will change after that.

@dchigarev
Collaborator

There are two distinct problems causing this behavior that I found so far (hope I won't find more :D):

  1. When Modin's logic produces an empty dataframe (e.g. filtering) it writes an empty list into the frame's .index attribute instead of an empty pandas.Index object. This obviously causes problems when trying to treat a df.index as an actual index:
    >>> import modin.pandas as pd
    >>> res = pd.DataFrame({"a": [1, 2, 3]}).query("a > 200")
    >>> res.index
    []
    >>> type(res.index)
    <class 'list'>
    >>> res.index.name
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'list' object has no attribute 'name'
    This happens because the logic that computes indices unconditionally filters out empty ones, thus returning an empty list instead of an index in case all of the partitions returned a zero-length index.
    # filter empty indexes
    total_idx = list(filter(len, new_idx))
    if len(total_idx) > 0:
        # TODO FIX INFORMATION LEAK!!!!1!!1!!
        total_idx = total_idx[0].append(total_idx[1:])
    return total_idx, new_idx

    I've fixed this exact problem in FIX-#5436: Fix '.index' extraction for an empty frame  #5431
  2. Modin falls through to an actual groupby execution on an empty frame when the underlying query compiler is lazy (e.g. has delayed indices). The groupby implementation was not designed to work on empty frames (it was assumed that empty-frame cases would always default to pandas); however, in the reproducer an empty dataframe sneaks into the low-level implementation and breaks everything. Here's how this happens:
    a. .query sets the resulting index cache to None, which makes the query compiler lazy.
    b. The next time Modin checks whether it can default to pandas on an empty frame, it skips the check because the frame has delayed executions [1]
    c. We end up in a groupby execution on an empty frame.
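
To illustrate the first problem, here is a rough sketch of combining per-partition indexes so that the result is always a pandas.Index, even when every partition is empty. This is not the actual change from the PR mentioned above; `combine_partition_indexes` is a hypothetical helper name, and only plain pandas is assumed:

```python
import pandas as pd


def combine_partition_indexes(new_idx):
    """Hypothetical helper: merge per-partition indexes into one pandas.Index."""
    # Drop zero-length indexes, as the snippet quoted above does.
    non_empty = [idx for idx in new_idx if len(idx) > 0]
    if non_empty:
        # Index.append accepts a list of further indexes to concatenate.
        total_idx = non_empty[0].append(non_empty[1:])
    elif new_idx:
        # Every partition is empty: reuse one of the empty indexes so the
        # result is still an Index (keeping its dtype and name), not a list.
        total_idx = new_idx[0]
    else:
        # No partitions at all: fall back to a fresh empty Index.
        total_idx = pd.Index([])
    return total_idx, new_idx


# All partitions returned a zero-length index, as in the filtering case.
parts = [
    pd.Index([], dtype="int64", name="a"),
    pd.Index([], dtype="int64", name="a"),
]
total, _ = combine_partition_indexes(parts)
print(isinstance(total, pd.Index), len(total), total.name)
```

With a result like this, attribute lookups such as `.name` or `.names` keep working on the empty frame's index instead of raising AttributeError on a list.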

@Egor-Krivov
Contributor Author

Maybe this error log will help (this was not an empty dataframe):

Traceback (most recent call last):
  File "run_modin_tests.py", line 142, in <module>
    main()
  File "run_modin_tests.py", line 138, in main
    run_benchmark_task(args)
  File "run_modin_tests.py", line 88, in run_benchmark_task
    run_benchmarks(
  File "/localdisk/ekrivov/hm/omniscripts/utils/utils.py", line 788, in run_benchmarks
    benchmark_results = run_benchmark(parameters)
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/week_processing_benchmark.py", line 121, in run_benchmark
    main(raw_data_path=raw_data_path)
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/week_processing_benchmark.py", line 98, in main
    feature_engieering(week=week)
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/week_processing_benchmark.py", line 54, in feature_engieering
    week_candidates = make_one_week_candidates(
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/candidates.py", line 408, in make_one_week_candidates
    candidates = create_candidates(
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/candidates.py", line 302, in create_candidates
    candidates_dept = create_candidates_category_popular(
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/candidates.py", line 148, in create_candidates_category_popular
    tr = tr.groupby("item").size().reset_index(name="volume")
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/pandas/groupby.py", line 666, in size
    result = work_object._wrap_aggregation(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/pandas/groupby.py", line 1083, in _wrap_aggregation
    query_compiler=qc_method(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2699, in groupby_size
    result = self._groupby_dict_reduce(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2796, in _groupby_dict_reduce
    return GroupByReduce.register(map_dict, reduce_dict, **kwargs)(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/algebra/groupby.py", line 68, in <lambda>
    return lambda *args, **kwargs: cls.caller(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/algebra/groupby.py", line 348, in caller
    new_modin_frame = query_compiler._modin_frame.groupby_reduce(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 126, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 3106, in groupby_reduce
    new_partitions = self._partition_mgr_cls.groupby_reduce(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 244, in groupby_reduce
    mapped_partitions = cls.broadcast_apply(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 58, in wait
    result = func(cls, *args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 373, in broadcast_apply
    [
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 374, in <listcomp>
    [
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 380, in <listcomp>
    else rt_axis_parts[row_idx].list_of_blocks
IndexError: list index out of range

dchigarev added a commit to dchigarev/modin that referenced this issue Dec 15, 2022
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
YarShev pushed a commit that referenced this issue Jan 17, 2023
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>