BUG: groupby with timestamp in `by` raises KeyError #5099

mvashishtha · 2022-10-06T04:28:31Z

Modin version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd

df = pd.DataFrame({'timestamp': [pd.to_datetime(1490195805, unit='s')], 'numeric': [0]})
print(df.groupby('timestamp').mean())

Issue Description

I get KeyError in Modin, but the groupby works in pandas.

Expected Behavior

pandas gives output like

                     numeric
timestamp
2017-03-22 15:16:45      0.0

Error Logs

UserWarning: Distributing <class 'dict'> object. This may take some time.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/indexes/base.py:3800, in Index.get_loc(self, key, method, tolerance)
   3799 try:
-> 3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'timestamp'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Input In [5], in <cell line: 4>()
      1 import modin.pandas as pd
      3 df = pd.DataFrame({'timestamp': [pd.to_datetime(1490195805, unit='s')], 'numeric': [0]})
----> 4 print(df.groupby('timestamp').mean())

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/pandas/groupby.py:138, in DataFrameGroupBy.mean(self, numeric_only)
    136 def mean(self, numeric_only=None):
    137     return self._check_index(
--> 138         self._wrap_aggregation(
    139             type(self._query_compiler).groupby_mean,
    140             numeric_only=numeric_only,
    141             agg_kwargs=dict(numeric_only=numeric_only),
    142         )
    143     )

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/pandas/groupby.py:1095, in DataFrameGroupBy._wrap_aggregation(self, qc_method, numeric_only, agg_args, agg_kwargs, **kwargs)
   1091 else:
   1092     groupby_qc = self._query_compiler
   1094 result = type(self._df)(
-> 1095     query_compiler=qc_method(
   1096         groupby_qc,
   1097         by=self._by,
   1098         axis=self._axis,
   1099         groupby_kwargs=self._kwargs,
   1100         agg_args=agg_args,
   1101         agg_kwargs=agg_kwargs,
   1102         drop=self._drop,
   1103         **kwargs,
   1104     )
   1105 )
   1106 if self._squeeze:
   1107     return result.squeeze()

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/core/storage_formats/pandas/query_compiler.py:2635, in PandasQueryCompiler.groupby_mean(self, by, axis, groupby_kwargs, agg_args, agg_kwargs, drop)
   2616 result = GroupByReduce.register(
   2617     lambda dfgb, **kwargs: pandas.concat(
   2618         [dfgb.sum(**kwargs), dfgb.count()],
   (...)
   2631     drop=drop,
   2632 )
   2634 if len(datetime_cols) > 0:
-> 2635     result = result.astype({col: dtype for col, dtype in datetime_cols.items()})
   2636 return result

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/core/storage_formats/pandas/query_compiler.py:1537, in PandasQueryCompiler.astype(self, col_dtypes, **kwargs)
   1536 def astype(self, col_dtypes, **kwargs):
-> 1537     return self.__constructor__(self._modin_frame.astype(col_dtypes))

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:124, in lazy_metadata_decorator.<locals>.decorator.<locals>.run_f_on_minimally_updated_metadata(self, *args, **kwargs)
    122     elif apply_axis == "rows":
    123         obj._propagate_index_objs(axis=0)
--> 124 result = f(self, *args, **kwargs)
    125 if apply_axis is None and not transpose:
    126     result._deferred_index = self._deferred_index

File ~/software_sources/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:1146, in PandasDataframe.astype(self, col_dtypes)
   1143 for i, column in enumerate(columns):
   1144     dtype = col_dtypes[column]
   1145     if (
-> 1146         not isinstance(dtype, type(self.dtypes[column]))
   1147         or dtype != self.dtypes[column]
   1148     ):
   1149         # Update the new dtype series to the proper pandas dtype
   1150         try:
   1151             new_dtype = np.dtype(dtype)

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/series.py:982, in Series.__getitem__(self, key)
    979     return self._values[key]
    981 elif key_is_scalar:
--> 982     return self._get_value(key)
    984 if is_hashable(key):
    985     # Otherwise index.get_value will raise InvalidIndexError
    986     try:
    987         # For labels that don't resolve as scalars like tuples and frozensets

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/series.py:1092, in Series._get_value(self, label, takeable)
   1089     return self._values[label]
   1091 # Similar to Index.get_value, but we do not fall back to positional
-> 1092 loc = self.index.get_loc(label)
   1093 return self.index._get_values_for_loc(self, loc, label)

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:
-> 3802     raise KeyError(key) from err
   3803 except TypeError:
   3804     # If we have a listlike key, _check_indexing_error will raise
   3805     #  InvalidIndexError. Otherwise we fall through and re-raise
   3806     #  the TypeError.
   3807     self._check_indexing_error(key)

KeyError: 'timestamp'

Installed Versions

INSTALLED VERSIONS
------------------
commit : 621bc10
python : 3.10.4.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.16.0
ray : 2.0.0
dask : 2022.7.1
distributed : 2022.7.1
hdk : None

pandas dependencies

pandas : 1.5.0
numpy : 1.23.2
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 61.2.0
pip : 22.2.2
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : 0.8.1
fsspec : 2022.7.1
gcsfs : None
matplotlib : 3.5.2
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.7
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2022.7.1
scipy : 1.9.0
snappy : None
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : None
xarray : 2022.6.0
xlrd : 2.0.1
xlwt : None
zstandard : None
tzdata : None


mvashishtha · 2022-10-06T04:32:46Z

The problem is that we convert the timestamp column in `by` to int64 because we think we need to take its mean. We don't: it is a key for the groupby, not a value. The key column is not present in the groupby result, so we get a KeyError when `PandasQueryCompiler.groupby_mean` tries to convert the type back to a datetime type (the `result.astype(...)` call in the traceback above).

I have confirmed that this is a regression due to a76e2a1.
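The mechanism described in this comment can be reproduced in plain pandas. The sketch below is only an illustration under stated assumptions: `groupby_mean_roundtrip` and its `exclude_keys` flag are hypothetical names, not Modin's code and not the change made in #5140. It casts datetime columns to int64, takes the groupby mean, and casts the recorded columns back. When the key column is included in the round trip, the final `astype` raises the same `KeyError: 'timestamp'`.

```python
import pandas as pd


def groupby_mean_roundtrip(df, by, exclude_keys):
    """Hypothetical helper: cast datetime columns to int64, take the groupby
    mean, then cast the recorded columns back to their datetime dtype."""
    keys = [by] if isinstance(by, str) else list(by)

    # Record which columns are datetimes; optionally leave the keys alone.
    datetime_cols = {
        col: dtype
        for col, dtype in df.dtypes.items()
        if pd.api.types.is_datetime64_any_dtype(dtype)
        and not (exclude_keys and col in keys)
    }

    as_int = df
    if datetime_cols:
        as_int = df.astype({col: "int64" for col in datetime_cols})

    # The key becomes the index of the result, so it is no longer a column.
    result = as_int.groupby(by).mean()

    # Cast back. This raises KeyError if a recorded column is missing.
    if datetime_cols:
        result = result.astype(datetime_cols)
    return result


df = pd.DataFrame(
    {"timestamp": [pd.to_datetime(1490195805, unit="s")], "numeric": [0]}
)

try:
    groupby_mean_roundtrip(df, "timestamp", exclude_keys=False)
except KeyError as err:
    print("KeyError:", err)

print(groupby_mean_roundtrip(df, "timestamp", exclude_keys=True))
```

Leaving the key out of the round trip is one way to avoid the error. Filtering the cast-back mapping to the columns still present in the result would also avoid it, but the index would then stay int64.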

YarShev · 2022-10-26T18:42:25Z

Closed in #5140.

mvashishtha added the labels bug 🦗 (Something isn't working), P1 (Important tasks that we should complete soon), and pandas.groupby on Oct 6, 2022

mvashishtha mentioned this issue Oct 6, 2022

BUG: KeyError for TimeGrouper with df.group_by() #5091

Closed

3 tasks

billiam-wang added a commit to billiam-wang/modin that referenced this issue Oct 19, 2022

FIX-modin-project#5099: Fix PandasQueryCompiler.groupby_mean with timestamp in by

879ff23

Signed-off-by: Bill Wang <billiam@ponder.io>

mvashishtha pushed a commit that referenced this issue Oct 26, 2022

FIX-#5099: Fix PandasQueryCompiler.groupby_mean with timestamp in by (#5140)

3cc33a2

Signed-off-by: Bill Wang <billiam@ponder.io>

YarShev closed this as completed Oct 26, 2022

YarShev mentioned this issue Oct 26, 2022

FIX-#5099: Fix PandasQueryCompiler.groupby_mean with timestamp in by #5140

Merged

7 tasks
