Modin is slower than Pandas on filters when a Series is used as a mask
#4268
Comments
@dchigarev could you please have a look at
Connected with #1903
Note: in our current configuration, unless a user explicitly calls
Another note: initializing a worker requires importing
Here's my current init line after engine was configured:
pd.DataFrame(range(cfg.CpuCount.get() * cfg.MinPartitionSize().get())).to_numpy()
…op labels Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
…masks (#4753) Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
System information
modin.__version__: fc539c3
Describe the problem
Modin is slower than Pandas on filters when a Series is used as a mask.
The results for the Ray execution engine are as follows:
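The benchmark script itself is not included in this report. For reference, here is a minimal sketch of the kind of filter being measured, not the original benchmark. The helper name `time_mask_filter`, the random data, and the small row count are all assumptions made for illustration; the issue's log refers to a frame of shape (100m, 10). The sketch imports plain pandas so it runs anywhere; replacing the import with `import modin.pandas as pd` should exercise the Modin path.

```python
import time

import numpy as np
import pandas as pd  # assumption: swap for `import modin.pandas as pd` to time Modin


def time_mask_filter(n_rows, n_cols=10, seed=0):
    """Filter a frame with a boolean Series mask.

    Returns (filtered_frame, mask, elapsed_seconds).
    Hypothetical helper; not part of Modin or pandas.
    """
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.integers(0, 100, size=(n_rows, n_cols)))
    mask = pd.Series(rng.integers(0, 2, size=n_rows).astype(bool))

    start = time.perf_counter()
    result = df[mask]  # __getitem__ with a boolean Series as the key
    elapsed = time.perf_counter() - start
    return result, mask, elapsed


# Small size chosen for illustration only.
result, mask, elapsed = time_mask_filter(100_000)
print(f"kept {len(result)} of {len(mask)} rows in {elapsed:.3f}s")
```

Timing only the `df[mask]` line keeps frame construction out of the measurement, which matters here because the log shows `DataFrame.__init__` taking several seconds on its own.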
According to the logs, the main part of execution time (>75%) is here:
modin/modin/core/dataframe/pandas/dataframe/dataframe.py
Lines 628 to 630 in c17dde7
Also, there is a to_pandas call in this flow here:
modin/modin/core/storage_formats/pandas/query_compiler.py
Lines 2132 to 2136 in c17dde7
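To make the two steps above concrete, here is a rough sketch of what filtering by a materialized boolean mask involves in a row-partitioned frame: the mask is turned into global row positions, and those positions are then split into per-block local indices. This is an illustration under assumptions, not Modin's actual code. The function name `split_positions_by_block` and the `block_lengths` argument are hypothetical, and the vectorized `np.searchsorted` lookup is just one possible way to do the mapping.

```python
import numpy as np


def split_positions_by_block(positions, block_lengths):
    """Map sorted global row positions to {block_index: local_positions}.

    Illustrative only; not Modin's implementation.
    """
    positions = np.asarray(positions)
    lengths = np.asarray(block_lengths)
    # Cumulative end offset of every block, e.g. lengths [3, 3, 2] -> [3, 6, 8].
    ends = np.cumsum(lengths)
    starts = ends - lengths
    # For each position, find the block whose [start, end) range contains it.
    block_ids = np.searchsorted(ends, positions, side="right")
    result = {}
    for block in np.unique(block_ids):
        in_block = positions[block_ids == block]
        result[int(block)] = in_block - starts[block]
    return result


mask = np.array([True, False, True, True, False, False, True, True])
positions = np.flatnonzero(mask)  # global positions of the rows to keep
blocks = split_positions_by_block(positions, block_lengths=[3, 3, 2])
for block, local in blocks.items():
    print(block, local.tolist())
```

With 112 row partitions and a mask over 100m rows, the cost of this position-to-block mapping grows with the number of selected rows, which is consistent with it dominating the filter time in the log.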
Log of execution for shape (100m, 10), NPartitions=112
2022-02-25,08:29:55.250: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.500: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:04.501: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:04.501: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:04.501: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.501: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.501: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.502: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.521: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.153: START::PANDAS-API::Series.__init__
2022-02-25,08:30:05.358: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:05.358: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.359: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.359: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:05.359: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:05.359: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:05.359: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::PandasQueryCompiler.columnarize
2022-02-25,08:30:05.359: END::PANDAS-API::PandasQueryCompiler.columnarize
2022-02-25,08:30:05.359: START::PANDAS-API::PandasDataframe._validate_set_axis
2022-02-25,08:30:05.360: END::PANDAS-API::PandasDataframe._validate_set_axis
2022-02-25,08:30:05.360: START::PANDAS-API::PandasDataframe.synchronize_labels
2022-02-25,08:30:05.360: END::PANDAS-API::PandasDataframe.synchronize_labels
2022-02-25,08:30:05.360: END::PANDAS-API::Series.__init__
2022-02-25,08:30:05.361: START::PANDAS-API::BasePandasDataset.__getitem__
2022-02-25,08:30:05.361: START::PANDAS-API::BasePandasDataset.__len__
2022-02-25,08:30:05.361: END::PANDAS-API::BasePandasDataset.__len__
2022-02-25,08:30:05.361: START::PANDAS-API::DataFrame._getitem
2022-02-25,08:30:05.361: START::PANDAS-API::BaseQueryCompiler.has_multiindex
2022-02-25,08:30:05.361: END::PANDAS-API::BaseQueryCompiler.has_multiindex
2022-02-25,08:30:05.361: START::PANDAS-API::PandasQueryCompiler.getitem_array
2022-02-25,08:30:05.361: START::PANDAS-API::PandasQueryCompiler.to_pandas
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe.to_pandas
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe._propagate_index_objs
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.362: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.363: END::PANDAS-API::PandasDataframe._propagate_index_objs
2022-02-25,08:30:05.880: END::PANDAS-API::PandasDataframe.to_pandas
2022-02-25,08:30:05.881: END::PANDAS-API::PandasQueryCompiler.to_pandas
2022-02-25,08:30:06.836: START::PANDAS-API::PandasQueryCompiler.getitem_row_array
2022-02-25,08:30:06.836: START::PANDAS-API::PandasDataframe.mask
2022-02-25,08:30:07.581: START::PANDAS-API::PandasDataframe._get_dict_of_block_index # Possible bottleneck
2022-02-25,08:30:17.776: END::PANDAS-API::PandasDataframe._get_dict_of_block_index
2022-02-25,08:30:18.255: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:18.255: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:18.255: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:18.255: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasDataframe.mask
2022-02-25,08:30:18.289: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasQueryCompiler.getitem_row_array
2022-02-25,08:30:18.290: END::PANDAS-API::PandasQueryCompiler.getitem_array
2022-02-25,08:30:18.290: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:18.290: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame._getitem
2022-02-25,08:30:18.290: END::PANDAS-API::BasePandasDataset.__getitem__
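A log in this START/END format can be turned into per-call durations with a small stack-based parser, which makes the bottleneck easy to confirm. The sketch below is a hypothetical helper, not part of Modin's logging. It assumes entries are properly nested, one per line, and that anything after a `#` is an annotation. Only a four-line excerpt of the log above is embedded.

```python
from datetime import datetime

# Excerpt of the log above (assumption: one entry per line).
LOG_LINES = """\
2022-02-25,08:30:06.836: START::PANDAS-API::PandasDataframe.mask
2022-02-25,08:30:07.581: START::PANDAS-API::PandasDataframe._get_dict_of_block_index # Possible bottleneck
2022-02-25,08:30:17.776: END::PANDAS-API::PandasDataframe._get_dict_of_block_index
2022-02-25,08:30:18.289: END::PANDAS-API::PandasDataframe.mask
""".splitlines()


def call_durations(lines):
    """Return [(name, seconds)] for every START/END pair, in END order."""
    stack = []
    durations = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing annotations
        if not line:
            continue
        stamp, event = line.split(": ", 1)
        when = datetime.strptime(stamp, "%Y-%m-%d,%H:%M:%S.%f")
        kind, _, name = event.split("::", 2)
        if kind == "START":
            stack.append((name, when))
        else:
            started_name, started_at = stack.pop()
            if started_name != name:
                raise ValueError(f"unbalanced log: {started_name} vs {name}")
            durations.append((name, (when - started_at).total_seconds()))
    return durations


durations = call_durations(LOG_LINES)
for name, seconds in durations:
    print(f"{seconds:8.3f}s  {name}")
```

On this excerpt, `_get_dict_of_block_index` accounts for 10.195 s of the 11.453 s spent in `PandasDataframe.mask`.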