BUG: groupby with timestamp in by raises KeyError #5099

Closed
3 tasks done
mvashishtha opened this issue Oct 6, 2022 · 2 comments · Fixed by #5140
Labels
bug 🦗 Something isn't working P1 Important tasks that we should complete soon pandas.groupby

Comments

@mvashishtha
Collaborator

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd

df = pd.DataFrame({'timestamp': [pd.to_datetime(1490195805, unit='s')], 'numeric': [0]})
print(df.groupby('timestamp').mean())

Issue Description

I get a KeyError in Modin, but the same groupby works in pandas.

Expected Behavior

pandas gives output like

                     numeric
timestamp
2017-03-22 15:16:45      0.0

Error Logs

UserWarning: Distributing <class 'dict'> object. This may take some time.
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/indexes/base.py:3800, in Index.get_loc(self, key, method, tolerance)
   3799 try:
-> 3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'timestamp'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Input In [5], in <cell line: 4>()
      1 import modin.pandas as pd
      3 df = pd.DataFrame({'timestamp': [pd.to_datetime(1490195805, unit='s')], 'numeric': [0]})
----> 4 print(df.groupby('timestamp').mean())

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/pandas/groupby.py:138, in DataFrameGroupBy.mean(self, numeric_only)
    136 def mean(self, numeric_only=None):
    137     return self._check_index(
--> 138         self._wrap_aggregation(
    139             type(self._query_compiler).groupby_mean,
    140             numeric_only=numeric_only,
    141             agg_kwargs=dict(numeric_only=numeric_only),
    142         )
    143     )

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/pandas/groupby.py:1095, in DataFrameGroupBy._wrap_aggregation(self, qc_method, numeric_only, agg_args, agg_kwargs, **kwargs)
   1091 else:
   1092     groupby_qc = self._query_compiler
   1094 result = type(self._df)(
-> 1095     query_compiler=qc_method(
   1096         groupby_qc,
   1097         by=self._by,
   1098         axis=self._axis,
   1099         groupby_kwargs=self._kwargs,
   1100         agg_args=agg_args,
   1101         agg_kwargs=agg_kwargs,
   1102         drop=self._drop,
   1103         **kwargs,
   1104     )
   1105 )
   1106 if self._squeeze:
   1107     return result.squeeze()

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/core/storage_formats/pandas/query_compiler.py:2635, in PandasQueryCompiler.groupby_mean(self, by, axis, groupby_kwargs, agg_args, agg_kwargs, drop)
   2616 result = GroupByReduce.register(
   2617     lambda dfgb, **kwargs: pandas.concat(
   2618         [dfgb.sum(**kwargs), dfgb.count()],
   (...)
   2631     drop=drop,
   2632 )
   2634 if len(datetime_cols) > 0:
-> 2635     result = result.astype({col: dtype for col, dtype in datetime_cols.items()})
   2636 return result

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/core/storage_formats/pandas/query_compiler.py:1537, in PandasQueryCompiler.astype(self, col_dtypes, **kwargs)
   1536 def astype(self, col_dtypes, **kwargs):
-> 1537     return self.__constructor__(self._modin_frame.astype(col_dtypes))

File ~/software_sources/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/software_sources/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:124, in lazy_metadata_decorator.<locals>.decorator.<locals>.run_f_on_minimally_updated_metadata(self, *args, **kwargs)
    122     elif apply_axis == "rows":
    123         obj._propagate_index_objs(axis=0)
--> 124 result = f(self, *args, **kwargs)
    125 if apply_axis is None and not transpose:
    126     result._deferred_index = self._deferred_index

File ~/software_sources/modin/modin/core/dataframe/pandas/dataframe/dataframe.py:1146, in PandasDataframe.astype(self, col_dtypes)
   1143 for i, column in enumerate(columns):
   1144     dtype = col_dtypes[column]
   1145     if (
-> 1146         not isinstance(dtype, type(self.dtypes[column]))
   1147         or dtype != self.dtypes[column]
   1148     ):
   1149         # Update the new dtype series to the proper pandas dtype
   1150         try:
   1151             new_dtype = np.dtype(dtype)

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/series.py:982, in Series.__getitem__(self, key)
    979     return self._values[key]
    981 elif key_is_scalar:
--> 982     return self._get_value(key)
    984 if is_hashable(key):
    985     # Otherwise index.get_value will raise InvalidIndexError
    986     try:
    987         # For labels that don't resolve as scalars like tuples and frozensets

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/series.py:1092, in Series._get_value(self, label, takeable)
   1089     return self._values[label]
   1091 # Similar to Index.get_value, but we do not fall back to positional
-> 1092 loc = self.index.get_loc(label)
   1093 return self.index._get_values_for_loc(self, loc, label)

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:
-> 3802     raise KeyError(key) from err
   3803 except TypeError:
   3804     # If we have a listlike key, _check_indexing_error will raise
   3805     #  InvalidIndexError. Otherwise we fall through and re-raise
   3806     #  the TypeError.
   3807     self._check_indexing_error(key)

KeyError: 'timestamp'

Installed Versions

INSTALLED VERSIONS
------------------
commit : 621bc10
python : 3.10.4.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.16.0
ray : 2.0.0
dask : 2022.7.1
distributed : 2022.7.1
hdk : None

pandas dependencies

pandas : 1.5.0
numpy : 1.23.2
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 61.2.0
pip : 22.2.2
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.4.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : 1.0.9
fastparquet : 0.8.1
fsspec : 2022.7.1
gcsfs : None
matplotlib : 3.5.2
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.7
pyarrow : 8.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2022.7.1
scipy : 1.9.0
snappy : None
sqlalchemy : 1.4.39
tables : 3.7.0
tabulate : None
xarray : 2022.6.0
xlrd : 2.0.1
xlwt : None
zstandard : None
tzdata : None

@mvashishtha mvashishtha added bug 🦗 Something isn't working P1 Important tasks that we should complete soon pandas.groupby labels Oct 6, 2022
@mvashishtha
Collaborator Author

The problem is that we convert the `by` column `timestamp` to int64 because we think we have to take its mean. We don't, because it is a key of the groupby and not a value. The key column is not among the columns of the groupby result, so we get a KeyError when we try to convert it back to a datetime type (the `result.astype(...)` call in `PandasQueryCompiler.groupby_mean` shown in the traceback).

I have confirmed that this is a regression due to a76e2a1.
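
To illustrate the mechanism described above, here is a minimal sketch in plain pandas (not Modin's internals, and not necessarily how #5140 fixes it). The names `datetime_cols` and `cast_back` are only illustrative, and filtering by `result.columns` is an assumed workaround. It shows that the groupby key ends up in the index of the result, so looking it up by column name fails, and that restricting the cast-back mapping to surviving columns avoids the lookup.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "timestamp": [pd.to_datetime(1490195805, unit="s")],
        "numeric": [0],
    }
)

# Datetime columns of the input frame, including the groupby key.
datetime_cols = {
    col: dtype
    for col, dtype in df.dtypes.items()
    if pd.api.types.is_datetime64_any_dtype(dtype)
}

result = df.groupby("timestamp").mean()

# The key became the index, so it is not a column of the result,
# and a per-column dtype lookup for it raises KeyError.
try:
    result.dtypes["timestamp"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Assumed workaround: only cast back the columns that are still present.
cast_back = {
    col: dtype for col, dtype in datetime_cols.items() if col in result.columns
}
if cast_back:
    result = result.astype(cast_back)
```

In this example `cast_back` is empty because the only datetime column is the key, so no `astype` call is made and the result keeps `timestamp` as its index.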

billiam-wang added a commit to billiam-wang/modin that referenced this issue Oct 19, 2022
…estamp in by

Signed-off-by: Bill Wang <billiam@ponder.io>
mvashishtha pushed a commit that referenced this issue Oct 26, 2022
#5140)

Signed-off-by: Bill Wang <billiam@ponder.io>
@YarShev
Collaborator

YarShev commented Oct 26, 2022

Closed in #5140.
