Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modin is slower than Pandas on filters when Series is used as a mask #4268

Closed
prutskov opened this issue Feb 25, 2022 · 3 comments · Fixed by #4753
Closed

Modin is slower than Pandas on filters when Series is used as a mask #4268

prutskov opened this issue Feb 25, 2022 · 3 comments · Fixed by #4753
Assignees
Labels
Performance 🚀 Performance related issues and pull requests.

Comments

@prutskov
Copy link
Contributor

prutskov commented Feb 25, 2022

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Modin version (modin.__version__): fc539c3
  • Python version: 3.8.11
  • Code we can use to reproduce:
from time import time as timer

import numpy as np
# import pandas as pd
import modin.pandas as pd
import modin.config as cfg
import ray
ray.init()
cfg.BenchmarkMode.put(True)

nrows = 1_000_000_000
ncols = 10

data = {f"col{i}": np.random.rand(nrows) for i in range(ncols)}
df = pd.DataFrame(data)

mask = pd.Series(np.random.choice(a=[True, False], size=nrows))

t = timer()
df2 = df[mask]
print(f'mask time: {timer() - t} s')

Describe the problem

Modin is slower than Pandas on filters when Series is used as a mask.

The results for Ray execution engine are follows:

Shape (100k, 10) (1m, 10) (10m, 10) (100m, 10) (1b, 10)
modin 0.259 0.349 1.343 13.135 134.799
modin(NPartitions=1) 0.291 0.143 0.279 3.152 31.944
pandas 0.004 0.033 0.329 3.806 70.779

According to the logs, the main part of execution time (>75%) is here:

row_partitions_list = self._get_dict_of_block_index(
0, sorted_row_positions, are_indices_sorted=True
)

Also, we have to_pandas call in this flow here:
def getitem_array(self, key):
# TODO: dont convert to pandas for array indexing
if isinstance(key, type(self)):
key = key.to_pandas().squeeze(axis=1)
if is_bool_indexer(key):

Log of execution for shape (100m, 10), NPartitions=112
2022-02-25,08:29:55.250: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.500: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:04.501: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:04.501: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:04.501: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.501: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.501: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.502: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.521: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.153: START::PANDAS-API::Series.__init__
2022-02-25,08:30:05.358: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:05.358: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.359: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.359: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:05.359: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:05.359: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:05.359: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::PandasQueryCompiler.columnarize
2022-02-25,08:30:05.359: END::PANDAS-API::PandasQueryCompiler.columnarize
2022-02-25,08:30:05.359: START::PANDAS-API::PandasDataframe._validate_set_axis
2022-02-25,08:30:05.360: END::PANDAS-API::PandasDataframe._validate_set_axis
2022-02-25,08:30:05.360: START::PANDAS-API::PandasDataframe.synchronize_labels
2022-02-25,08:30:05.360: END::PANDAS-API::PandasDataframe.synchronize_labels
2022-02-25,08:30:05.360: END::PANDAS-API::Series.__init__
2022-02-25,08:30:05.361: START::PANDAS-API::BasePandasDataset.__getitem__
2022-02-25,08:30:05.361: START::PANDAS-API::BasePandasDataset.__len__
2022-02-25,08:30:05.361: END::PANDAS-API::BasePandasDataset.__len__
2022-02-25,08:30:05.361: START::PANDAS-API::DataFrame._getitem
2022-02-25,08:30:05.361: START::PANDAS-API::BaseQueryCompiler.has_multiindex
2022-02-25,08:30:05.361: END::PANDAS-API::BaseQueryCompiler.has_multiindex
2022-02-25,08:30:05.361: START::PANDAS-API::PandasQueryCompiler.getitem_array
2022-02-25,08:30:05.361: START::PANDAS-API::PandasQueryCompiler.to_pandas
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe.to_pandas
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe._propagate_index_objs
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.362: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.363: END::PANDAS-API::PandasDataframe._propagate_index_objs
2022-02-25,08:30:05.880: END::PANDAS-API::PandasDataframe.to_pandas
2022-02-25,08:30:05.881: END::PANDAS-API::PandasQueryCompiler.to_pandas
2022-02-25,08:30:06.836: START::PANDAS-API::PandasQueryCompiler.getitem_row_array
2022-02-25,08:30:06.836: START::PANDAS-API::PandasDataframe.mask
2022-02-25,08:30:07.581: START::PANDAS-API::PandasDataframe._get_dict_of_block_index # Possible bottleneck
2022-02-25,08:30:17.776: END::PANDAS-API::PandasDataframe._get_dict_of_block_index
2022-02-25,08:30:18.255: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:18.255: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:18.255: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:18.255: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasDataframe.mask
2022-02-25,08:30:18.289: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasQueryCompiler.getitem_row_array
2022-02-25,08:30:18.290: END::PANDAS-API::PandasQueryCompiler.getitem_array
2022-02-25,08:30:18.290: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:18.290: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame._getitem
2022-02-25,08:30:18.290: END::PANDAS-API::BasePandasDataset.__getitem__
@prutskov prutskov added the Performance 🚀 Performance related issues and pull requests. label Feb 25, 2022
@vnlitvinov
Copy link
Collaborator

@dchigarev could you please have a look at _get_dict_of_block_index() as IIRC you were the last to touch that?

@prutskov
Copy link
Contributor Author

Connected with #1903

@dchigarev dchigarev self-assigned this Mar 15, 2022
@vnlitvinov vnlitvinov self-assigned this Jul 13, 2022
@vnlitvinov
Copy link
Collaborator

Note: in our current configuration, unless a user explicitly calls ray.init() without arguments, Ray will get a runtiem_env startup option. When this option is passed, it amongst other effects leads to Ray not starting workers until something needs to be run on that worker.

Another note: initializing a worker requires importing pandas, ray and modin, which on my Windows laptop take anywhere from 1.5 to 2 seconds. So this init time skews the measurements a lot.

Here's my current init line after engine was configured:

pd.DataFrame(range(cfg.CpuCount.get() * cfg.MinPartitionSize().get())).to_numpy()

vnlitvinov added a commit to vnlitvinov/modin that referenced this issue Aug 2, 2022
…op labels

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
mvashishtha pushed a commit that referenced this issue Aug 16, 2022
…masks (#4753)

Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance 🚀 Performance related issues and pull requests.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants