BUG: df.describe fails for DataFrame created with add_prefix #4808

Closed
pyrito opened this issue Aug 11, 2022 · 2 comments · Fixed by #4809

@pyrito
Collaborator

pyrito commented Aug 11, 2022

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): macOS 12.2.1
  • Modin version (modin.__version__): 0.15.0+75.g3f985ed6
  • Python version: 3.9.12
  • Code we can use to reproduce:
import modin.pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(20, 16)).add_prefix('col')

# This fails
df.describe()

Describe the problem

df.describe() fails on a DataFrame that has its columns set with add_prefix.

Source code / logs

Stack trace
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 df.describe()

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/pandas/base.py:1172, in BasePandasDataset.describe(self, percentiles, include, exclude, datetime_is_numeric)
   1169 else:
   1170     percentiles = np.array([0.25, 0.5, 0.75])
   1171 return self.__constructor__(
-> 1172     query_compiler=self._query_compiler.describe(
   1173         percentiles=percentiles,
   1174         include=include,
   1175         exclude=exclude,
   1176         datetime_is_numeric=datetime_is_numeric,
   1177     )
   1178 )

File ~/Documents/modin/modin/logging/logger_decorator.py:128, in enable_logging.<locals>.decorator.<locals>.run_and_log(*args, **kwargs)
    113 """
    114 Compute function with logging if Modin logging is enabled.
    115 
   (...)
    125 Any
    126 """
    127 if LogMode.get() == "disable":
--> 128     return obj(*args, **kwargs)
    130 logger = get_logger()
    131 logger_level = getattr(logger, log_level)

File ~/Documents/modin/modin/core/storage_formats/pandas/query_compiler.py:1563, in PandasQueryCompiler.describe(self, **kwargs)
   1560 def describe(self, **kwargs):
   1561     # Use pandas to calculate the correct columns
   1562     empty_df = (
-> 1563         pandas.DataFrame(columns=self.columns)
   1564         .astype(self.dtypes)
   1565         .describe(**kwargs)
   1566     )
   1567     new_index = empty_df.index
   1569     # Note: `describe` convert timestamp type to object type
   1570     # which results in the loss of two values in index: `first` and `last`
   1571     # for empty DataFrame.

File ~/opt/anaconda3/envs/modin/lib/python3.9/site-packages/pandas/core/generic.py:5879, in NDFrame.astype(self, dtype, copy, errors)
   5877 for col_name in dtype.keys():
   5878     if col_name not in self:
-> 5879         raise KeyError(
   5880             "Only a column name can be used for the "
   5881             "key in a dtype mappings argument. "
   5882             f"'{col_name}' not found in columns."
   5883         )
   5885 # GH#44417 cast to Series so we can use .iat below, which will be
   5886 #  robust in case we
   5887 from pandas import Series

@pyrito pyrito added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Aug 11, 2022
@mvashishtha mvashishtha changed the title df.describe fails for DataFrame created with add_prefix BUG: df.describe fails for DataFrame created with add_prefix Aug 11, 2022
@mvashishtha mvashishtha removed the Triage 🩹 Issues that need triage label Aug 11, 2022
@mvashishtha
Collaborator

I get the same error on my MacBook at Modin commit 3f985ed.

@pyrito
Collaborator Author

pyrito commented Aug 11, 2022

It looks like the error occurs because add_prefix calls rename, and rename does not update the labels of the DataFrame's dtypes, so the dtypes still carry the old column names while the columns carry the new ones. I suspect we have other dtype issues as well, since df_equals, which our tests rely on heavily, does not check dtypes. See: #3804. We should make the df_equals fix a priority.
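The mismatch can be reproduced with plain pandas, without Modin. The snippet below is only a sketch of the mechanism, not Modin's actual code path: `stale_dtypes` stands in for dtypes whose labels were not updated after the rename, and `dtype_labels_match` is a hypothetical helper showing the kind of check a dtype-aware test comparison could make.

```python
import numpy as np
import pandas

# A frame with the default integer column labels 0..3.
original = pandas.DataFrame(np.random.randn(5, 4))

# Assumption for illustration: these dtypes keep the OLD labels (0..3),
# mimicking dtypes that were not relabelled after the rename.
stale_dtypes = original.dtypes

# The column labels after add_prefix("col"): col0..col3.
renamed_columns = original.add_prefix("col").columns


def dtype_labels_match(columns, dtypes):
    """Hypothetical check: do the dtype labels agree with the column labels?"""
    return list(columns) == list(dtypes.index)


# Same pattern as the failing frame in the traceback: build an empty frame
# from the new columns, then cast it with a dtype mapping keyed by old labels.
try:
    pandas.DataFrame(columns=renamed_columns).astype(stale_dtypes)
    raised = False
    message = ""
except KeyError as err:
    raised = True
    message = str(err)

# Relabelling the dtypes to match the new columns makes the cast succeed.
fresh_dtypes = stale_dtypes.copy()
fresh_dtypes.index = renamed_columns
empty = pandas.DataFrame(columns=renamed_columns).astype(fresh_dtypes)
```

Here `dtype_labels_match(renamed_columns, stale_dtypes)` is false and `dtype_labels_match(renamed_columns, fresh_dtypes)` is true, which is the condition a test helper would need to assert to catch this class of bug.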

@mvashishtha mvashishtha self-assigned this Aug 11, 2022
mvashishtha pushed a commit to mvashishtha/modin that referenced this issue Aug 11, 2022
Signed-off-by: mvashishtha <mahesh@ponder.io>
anmyachev pushed a commit that referenced this issue Aug 12, 2022
Signed-off-by: mvashishtha <mahesh@ponder.io>
Co-authored-by: Karthik Velayutham <karthik.velayutham@gmail.com>
YarShev pushed a commit that referenced this issue Sep 6, 2022
Signed-off-by: mvashishtha <mahesh@ponder.io>
Co-authored-by: Karthik Velayutham <karthik.velayutham@gmail.com>