BUG: modin on ray produce error with empty dataframes #5430

Egor-Krivov · 2022-12-13T12:14:55Z

Modin version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd

pd.DataFrame({'item': [1, 2], 'week': [1, 2]}).query('10 < week').groupby('item').size()

Issue Description

This code raises an error with Modin on Ray but works on pandas. My benchmark uses slightly different code, which likewise works on pandas but fails on Modin.

Expected Behavior

Modin on ray should behave like pandas.
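
As a rough illustration of the expected behavior, the reproducer can be wrapped in a helper that takes the dataframe module as an argument, so the same code runs against pandas and, where available, Modin. The helper name `run_reproducer` is made up for this sketch; only the pandas call is exercised here.

```python
import pandas


def run_reproducer(pd):
    """Run the reproducer from this issue against the given dataframe module."""
    df = pd.DataFrame({"item": [1, 2], "week": [1, 2]})
    # The filter matches no rows, so groupby runs on an empty frame.
    return df.query("10 < week").groupby("item").size()


# pandas serves as the reference result: an empty Series.
expected = run_reproducer(pandas)
print(len(expected))
```

With Modin installed, passing `modin.pandas` to the same helper should give a result of the same length once the bug is fixed.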

Error Logs

(MODIN-ON-RAY)
pd.DataFrame({'item': [1, 2], 'week': [1, 2]}).query('10 < week').groupby('item').size()
UserWarning: Distributing <class 'dict'> object. This may take some time. 
UserWarning: `DataFrame.__getitem__` for empty DataFrame is not currently supported by PandasOnRay, defaulting to pandas implementation.
UserWarning: Distributing <class 'pandas.core.series.Series'> object. This may take some time.
*** IndexError: list index out of range 

(PANDAS) 
>>> pd.DataFrame({'item': [1, 2], 'week': [1, 2]}).query('10 < week').groupby('item').size()
Series([], dtype: int64)

Installed Versions

UserWarning: Setuptools is replacing distutils.

INSTALLED VERSIONS

commit : c30ab4c
python : 3.8.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-124-generic
Version : #140-Ubuntu SMP Thu Aug 4 02:23:37 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US
LOCALE : en_US.ISO8859-1

Modin dependencies

modin : 0.7.3+1359.gc30ab4c1
ray : 2.0.1
dask : None
distributed : None
hdk : present

pandas dependencies

pandas : 1.5.1
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.6.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.5
brotli :
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

Egor-Krivov · 2022-12-13T12:30:03Z

Surprisingly, in my case the problem also appears when the result of query is not empty. For some reason, calling .iloc[:10000000000000] right after query works around the issue.
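
For reference, a small pandas-only check shows that the slice in this workaround does not change the data: a stop value larger than the frame length is simply clamped. The frame contents below are made up for illustration, and why the slice helps under Modin is not established in this thread.

```python
import pandas as pd

df = pd.DataFrame({"item": [1, 2, 3], "week": [1, 2, 30]})
filtered = df.query("10 < week")

# A slice whose stop exceeds the frame length keeps every row.
sliced = filtered.iloc[:10000000000000]
print(sliced.equals(filtered))
```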

vnlitvinov · 2022-12-13T13:11:37Z

@Egor-Krivov you seem to have quite a funny Modin version reported, how did you install that?

cc @dchigarev I wonder if your recent groupby().size() PR fixed that...

I've run the reproducer on my Windows laptop, and I'm seeing a different error here:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Vass\ponder\modin\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Vass\ponder\modin\modin\pandas\dataframe.py", line 480, in groupby
    if by is not None and by in self._query_compiler.get_index_names(axis):
  File "C:\Vass\ponder\modin\modin\logging\logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "C:\Vass\ponder\modin\modin\core\storage_formats\base\query_compiler.py", line 3464, in get_index_names
    return self.get_axis(axis).names
AttributeError: 'list' object has no attribute 'names'

Egor-Krivov · 2022-12-13T13:25:05Z

Indeed, my version is strange. I will redo my installation and check whether anything changes.

dchigarev · 2022-12-13T15:02:59Z

There are two distinct problems causing this behavior that I found so far (hope I won't find more :D):

1. When Modin's logic produces an empty dataframe (e.g. by filtering), it writes an empty list into the frame's .index attribute instead of an empty pandas.Index object. This causes problems as soon as code treats df.index as an actual index:

>>> import modin.pandas as pd
>>> res = pd.DataFrame({"a": [1, 2, 3]}).query("a > 200")
>>> res.index
[]
>>> type(res.index)
<class 'list'>
>>> res.index.name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'name'

This happens because the logic that computes indices unconditionally filters out empty ones, thus returning an empty list instead of an index in case all of the partitions returned a zero-length index.

modin/modin/core/dataframe/pandas/partitioning/partition_manager.py

Lines 883 to 888 in 4114183

    
    # filter empty indexes
    total_idx = list(filter(len, new_idx))
    if len(total_idx) > 0:
        # TODO FIX INFORMATION LEAK!!!!1!!1!!
        total_idx = total_idx[0].append(total_idx[1:])
    return total_idx, new_idx

I've fixed this exact problem in FIX-#5436: Fix '.index' extraction for an empty frame #5431
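
The failure mode described above can be sketched in isolation with plain pandas. The function names `combine_indexes_buggy` and `combine_indexes_guarded` are hypothetical, and the guard is only one possible approach, not necessarily the change made in the fix PR.

```python
import pandas as pd


def combine_indexes_buggy(new_idx):
    # Mirrors the quoted snippet: empty indexes are filtered out first.
    total_idx = list(filter(len, new_idx))
    if len(total_idx) > 0:
        total_idx = total_idx[0].append(total_idx[1:])
    # If every partition index was empty, this is still a plain list.
    return total_idx


def combine_indexes_guarded(new_idx):
    # Hypothetical guard: always hand back a pandas.Index.
    non_empty = list(filter(len, new_idx))
    if non_empty:
        return non_empty[0].append(non_empty[1:])
    # All partitions were empty: reuse one of the empty indexes.
    return new_idx[0] if len(new_idx) > 0 else pd.Index([])


parts = [pd.Index([], dtype="int64"), pd.Index([], dtype="int64")]
buggy = combine_indexes_buggy(parts)
guarded = combine_indexes_guarded(parts)
print(type(buggy).__name__, type(guarded).__name__)
```

Returning one of the empty partition indexes, rather than a fresh `pd.Index([])`, keeps whatever dtype and name the partitions already carry.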

2. Modin falls through to an actual groupby execution on an empty frame when the underlying query compiler is lazy (e.g. has delayed indices). The groupby implementation was not designed to work on empty frames (it was believed that empty-frame cases would always default to pandas). In the reproducer, however, an empty dataframe sneaks into the low-level implementation and breaks everything. Here's how this happens:
a. .query sets the resulting index cache to None, which makes the query compiler lazy.
b. The next time Modin checks whether it can default to pandas on an empty frame, it skips the check because the frame has delayed executions [1].
c. We end up in a groupby execution on an empty frame.
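
Steps a to c can be illustrated with a toy model in plain Python. Everything here (`ToyFrame`, `index_cache`, `choose_groupby_path`) is a hypothetical stand-in chosen for the sketch and does not correspond to Modin's actual classes or checks.

```python
class ToyFrame:
    """Toy stand-in for a frame with an optional cached index (not Modin's API)."""

    def __init__(self, rows, index_cache):
        self.rows = rows
        self.index_cache = index_cache  # None means "not computed yet" (lazy)

    def query(self, predicate):
        # Step a: filtering drops the cached index, so the result is lazy.
        kept = [row for row in self.rows if predicate(row)]
        return ToyFrame(kept, index_cache=None)


def choose_groupby_path(frame):
    # Step b: the emptiness check only happens when the index is already cached.
    if frame.index_cache is not None and len(frame.index_cache) == 0:
        return "default to pandas"
    # Step c: a lazy frame reaches the low-level path even when it is empty.
    return "low-level groupby"


eager_empty = ToyFrame([], index_cache=[])
source = ToyFrame([{"week": 1}, {"week": 2}], index_cache=[0, 1])
lazy_empty = source.query(lambda row: row["week"] > 10)

print(choose_groupby_path(eager_empty))
print(choose_groupby_path(lazy_empty))
```

In this model the two frames hold the same (empty) data, yet only the one with a cached index takes the safe path, which is the gap described in the scenario above.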

Egor-Krivov · 2022-12-14T10:36:39Z

Maybe this error log will help (this was not an empty dataframe):

Traceback (most recent call last):
  File "run_modin_tests.py", line 142, in <module>
    main()
  File "run_modin_tests.py", line 138, in main
    run_benchmark_task(args)
  File "run_modin_tests.py", line 88, in run_benchmark_task
    run_benchmarks(
  File "/localdisk/ekrivov/hm/omniscripts/utils/utils.py", line 788, in run_benchmarks
    benchmark_results = run_benchmark(parameters)
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/week_processing_benchmark.py", line 121, in run_benchmark
    main(raw_data_path=raw_data_path)
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/week_processing_benchmark.py", line 98, in main
    feature_engieering(week=week)
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/week_processing_benchmark.py", line 54, in feature_engieering
    week_candidates = make_one_week_candidates(
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/candidates.py", line 408, in make_one_week_candidates
    candidates = create_candidates(
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/candidates.py", line 302, in create_candidates
    candidates_dept = create_candidates_category_popular(
  File "/localdisk/ekrivov/hm/omniscripts/hm_fashion_recs/candidates.py", line 148, in create_candidates_category_popular
    tr = tr.groupby("item").size().reset_index(name="volume")
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/pandas/groupby.py", line 666, in size
    result = work_object._wrap_aggregation(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/pandas/groupby.py", line 1083, in _wrap_aggregation
    query_compiler=qc_method(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2699, in groupby_size
    result = self._groupby_dict_reduce(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 2796, in _groupby_dict_reduce
    return GroupByReduce.register(map_dict, reduce_dict, **kwargs)(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/algebra/groupby.py", line 68, in <lambda>
    return lambda *args, **kwargs: cls.caller(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/algebra/groupby.py", line 348, in caller
    new_modin_frame = query_compiler._modin_frame.groupby_reduce(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 126, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 3106, in groupby_reduce
    new_partitions = self._partition_mgr_cls.groupby_reduce(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 244, in groupby_reduce
    mapped_partitions = cls.broadcast_apply(
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 58, in wait
    result = func(cls, *args, **kwargs)
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 373, in broadcast_apply
    [
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 374, in <listcomp>
    [
  File "/nfs/site/home/ekrivov/large/miniconda3/envs/hm/lib/python3.8/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 380, in <listcomp>
    else rt_axis_parts[row_idx].list_of_blocks
IndexError: list index out of range

Egor-Krivov added the bug 🦗 and Triage 🩹 labels Dec 13, 2022

vnlitvinov added the pandas concordance 🐼 and P0 labels and removed the Triage 🩹 label Dec 13, 2022

dchigarev mentioned this issue Dec 13, 2022

FIX-#5436: Fix '.index' extraction for an empty frame #5431

Merged

dchigarev added a commit to dchigarev/modin that referenced this issue Dec 15, 2022

FIX-modin-project#5430: make groupby work on empty frames

b5962d0

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev mentioned this issue Dec 15, 2022

FIX-#5430: Make groupby work on empty frames #5442

Merged

YarShev closed this as completed in #5442 Jan 17, 2023

YarShev pushed a commit that referenced this issue Jan 17, 2023

FIX-#5430: Make groupby work on empty frames (#5442)

4351b8b

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
