Modin is slower than Pandas on filters when `Series` is used as a mask #4268

prutskov · 2022-02-25T10:20:00Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
Modin version (modin.__version__): fc539c3
Python version: 3.8.11
Code we can use to reproduce:

from time import time as timer

import numpy as np
# import pandas as pd
import modin.pandas as pd
import modin.config as cfg
import ray
ray.init()
cfg.BenchmarkMode.put(True)

nrows = 1_000_000_000
ncols = 10

data = {f"col{i}": np.random.rand(nrows) for i in range(ncols)}
df = pd.DataFrame(data)

mask = pd.Series(np.random.choice(a=[True, False], size=nrows))

t = timer()
df2 = df[mask]
print(f'mask time: {timer() - t} s')

Describe the problem

Modin is slower than Pandas on filters when Series is used as a mask.

The results for the Ray execution engine are as follows (time in seconds):

Shape	(100k, 10)	(1m, 10)	(10m, 10)	(100m, 10)	(1b, 10)
modin	0.259	0.349	1.343	13.135	134.799
modin(NPartitions=1)	0.291	0.143	0.279	3.152	31.944
pandas	0.004	0.033	0.329	3.806	70.779

According to the logs, most of the execution time (>75%) is spent here:

modin/modin/core/dataframe/pandas/dataframe/dataframe.py

Lines 628 to 630 in c17dde7

        row_partitions_list = self._get_dict_of_block_index(
            0, sorted_row_positions, are_indices_sorted=True
        )
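For context, the job of this step is to map each global row position to the partition that holds it, plus an offset inside that partition. A minimal sketch of such a mapping with NumPy follows. It is not Modin's implementation; `positions_by_partition` and `row_lengths` are hypothetical names used only for illustration.

```python
import numpy as np


def positions_by_partition(row_lengths, sorted_positions):
    """Group sorted global row positions by the partition that holds them.

    Hypothetical helper for illustration; not Modin's implementation.
    Returns {partition_index: array of partition-local positions}.
    """
    positions = np.asarray(sorted_positions, dtype=np.int64)
    # bounds[i] is the global position of the first row of partition i;
    # bounds[-1] is the total row count.
    bounds = np.concatenate(([0], np.cumsum(row_lengths)))
    # For each partition boundary, find where it falls in the sorted positions.
    cuts = np.searchsorted(positions, bounds, side="left")
    result = {}
    for part_idx in range(len(row_lengths)):
        start, stop = cuts[part_idx], cuts[part_idx + 1]
        if start < stop:
            # Shift global positions to offsets inside this partition.
            result[part_idx] = positions[start:stop] - bounds[part_idx]
    return result


row_lengths = [4, 4, 2]  # three row partitions, 10 rows in total
mask = np.array([True, False, True, False, False, True, True, False, False, True])
groups = positions_by_partition(row_lengths, np.flatnonzero(mask))
print({k: v.tolist() for k, v in groups.items()})
```

In this sketch the lookup is a single `np.searchsorted` over the partition boundaries, so its cost grows with the number of partitions rather than requiring a Python-level step per selected row.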

Also, there is a to_pandas call in this flow:

modin/modin/core/storage_formats/pandas/query_compiler.py

Lines 2132 to 2136 in c17dde7

    def getitem_array(self, key):
        # TODO: dont convert to pandas for array indexing
        if isinstance(key, type(self)):
            key = key.to_pandas().squeeze(axis=1)
        if is_bool_indexer(key):
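The `to_pandas()` call above materializes the whole mask in the driver process. One way to avoid that is to split the mask along the same row boundaries as the frame and filter each partition independently. The sketch below uses plain pandas chunks and a thread pool as stand-ins for Modin partitions and Ray tasks. `filter_partition` and `masked_select` are hypothetical names, and this is an illustration of the idea rather than code from Modin.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def filter_partition(pair):
    """Apply one mask chunk to the frame chunk covering the same rows."""
    frame_part, mask_part = pair
    return frame_part[mask_part.to_numpy()]


def masked_select(frame_parts, mask_parts):
    """Filter every row partition with its own mask chunk, then concatenate.

    Assumes the frame and the mask are split at identical row boundaries.
    """
    with ThreadPoolExecutor() as pool:
        filtered = list(pool.map(filter_partition, zip(frame_parts, mask_parts)))
    return pd.concat(filtered)


rng = np.random.default_rng(0)
df = pd.DataFrame({f"col{i}": rng.random(1_000) for i in range(3)})
mask = pd.Series(rng.choice([True, False], size=len(df)))

# Stand-in for row partitions: four contiguous chunks with shared boundaries.
bounds = [0, 250, 500, 750, 1_000]
frame_parts = [df.iloc[a:b] for a, b in zip(bounds, bounds[1:])]
mask_parts = [mask.iloc[a:b] for a, b in zip(bounds, bounds[1:])]

result = masked_select(frame_parts, mask_parts)
print(result.equals(df[mask]))
```

The key assumption is that the mask and the frame share partition boundaries. If they do not, the mask would first have to be repartitioned to match, which this sketch does not cover.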

Log of execution for shape (100m, 10), NPartitions=112

2022-02-25,08:29:55.250: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.500: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:04.501: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:04.501: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:04.501: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.501: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.501: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:04.501: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.502: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:04.521: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.153: START::PANDAS-API::Series.__init__
2022-02-25,08:30:05.358: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:05.358: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.359: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.359: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:05.359: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:05.359: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:05.359: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:05.359: START::PANDAS-API::PandasQueryCompiler.columnarize
2022-02-25,08:30:05.359: END::PANDAS-API::PandasQueryCompiler.columnarize
2022-02-25,08:30:05.359: START::PANDAS-API::PandasDataframe._validate_set_axis
2022-02-25,08:30:05.360: END::PANDAS-API::PandasDataframe._validate_set_axis
2022-02-25,08:30:05.360: START::PANDAS-API::PandasDataframe.synchronize_labels
2022-02-25,08:30:05.360: END::PANDAS-API::PandasDataframe.synchronize_labels
2022-02-25,08:30:05.360: END::PANDAS-API::Series.__init__
2022-02-25,08:30:05.361: START::PANDAS-API::BasePandasDataset.__getitem__
2022-02-25,08:30:05.361: START::PANDAS-API::BasePandasDataset.__len__
2022-02-25,08:30:05.361: END::PANDAS-API::BasePandasDataset.__len__
2022-02-25,08:30:05.361: START::PANDAS-API::DataFrame._getitem
2022-02-25,08:30:05.361: START::PANDAS-API::BaseQueryCompiler.has_multiindex
2022-02-25,08:30:05.361: END::PANDAS-API::BaseQueryCompiler.has_multiindex
2022-02-25,08:30:05.361: START::PANDAS-API::PandasQueryCompiler.getitem_array
2022-02-25,08:30:05.361: START::PANDAS-API::PandasQueryCompiler.to_pandas
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe.to_pandas
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe._propagate_index_objs
2022-02-25,08:30:05.361: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.362: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:05.363: END::PANDAS-API::PandasDataframe._propagate_index_objs
2022-02-25,08:30:05.880: END::PANDAS-API::PandasDataframe.to_pandas
2022-02-25,08:30:05.881: END::PANDAS-API::PandasQueryCompiler.to_pandas
2022-02-25,08:30:06.836: START::PANDAS-API::PandasQueryCompiler.getitem_row_array
2022-02-25,08:30:06.836: START::PANDAS-API::PandasDataframe.mask
2022-02-25,08:30:07.581: START::PANDAS-API::PandasDataframe._get_dict_of_block_index # Possible bottleneck
2022-02-25,08:30:17.776: END::PANDAS-API::PandasDataframe._get_dict_of_block_index
2022-02-25,08:30:18.255: START::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:18.255: START::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:18.255: END::PANDAS-API::PandasDataframe._filter_empties
2022-02-25,08:30:18.255: END::PANDAS-API::PandasDataframe.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasDataframe.mask
2022-02-25,08:30:18.289: START::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasQueryCompiler.__init__
2022-02-25,08:30:18.289: END::PANDAS-API::PandasQueryCompiler.getitem_row_array
2022-02-25,08:30:18.290: END::PANDAS-API::PandasQueryCompiler.getitem_array
2022-02-25,08:30:18.290: START::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:18.290: START::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame.__setattr__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame.__init__
2022-02-25,08:30:18.290: END::PANDAS-API::DataFrame._getitem
2022-02-25,08:30:18.290: END::PANDAS-API::BasePandasDataset.__getitem__
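The share of time spent in `_get_dict_of_block_index` can be checked directly from the timestamps in this log. A small sketch, using only the four timestamps that bracket `BasePandasDataset.__getitem__` and `PandasDataframe._get_dict_of_block_index`:

```python
from datetime import datetime

FMT = "%Y-%m-%d,%H:%M:%S.%f"


def seconds_between(start, end):
    """Elapsed seconds between two timestamps in the log's format."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the log above.
getitem = seconds_between("2022-02-25,08:30:05.361", "2022-02-25,08:30:18.290")
block_index = seconds_between("2022-02-25,08:30:07.581", "2022-02-25,08:30:17.776")

print(f"__getitem__: {getitem:.3f} s")
print(f"_get_dict_of_block_index: {block_index:.3f} s ({block_index / getitem:.0%})")
```

This gives about 10.2 s out of 12.9 s, roughly 79%, which is consistent with the ">75%" estimate above.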


vnlitvinov · 2022-02-25T16:23:37Z

@dchigarev could you please have a look at _get_dict_of_block_index() as IIRC you were the last to touch that?

prutskov · 2022-03-11T11:44:09Z

Connected with #1903

vnlitvinov · 2022-07-18T11:32:13Z

Note: in our current configuration, unless a user explicitly calls ray.init() without arguments, Ray gets a runtime_env startup option. Among other effects, passing this option means Ray does not start a worker until something needs to run on it.

Another note: initializing a worker requires importing pandas, ray and modin, which on my Windows laptop takes anywhere from 1.5 to 2 seconds. So this init time skews the measurements a lot.

Here's my current init line after engine was configured:

pd.DataFrame(range(cfg.CpuCount.get() * cfg.MinPartitionSize.get())).to_numpy()
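The effect of a warm-up call on measurements can be illustrated without Ray. The sketch below is a generic timing helper that runs a throwaway call before measuring, so one-time startup cost is excluded from the reported time. `timed_after_warmup` and the simulated `lazy_worker_call` are hypothetical and only stand in for a lazily started worker.

```python
from time import perf_counter, sleep

_worker_started = False


def lazy_worker_call():
    """Simulate a worker that pays a one-time startup cost on first use."""
    global _worker_started
    if not _worker_started:
        sleep(0.2)  # stand-in for importing pandas, ray and modin
        _worker_started = True
    return sum(range(1_000))


def timed_after_warmup(fn, warmup=1):
    """Run fn `warmup` times untimed, then return (result, seconds) for one timed call."""
    for _ in range(warmup):
        fn()
    start = perf_counter()
    result = fn()
    return result, perf_counter() - start


result, elapsed = timed_after_warmup(lazy_worker_call)
print(f"steady-state call: {elapsed:.4f} s")
```

Without the warm-up pass, the simulated 0.2 s startup would be counted in the first measurement, which is the kind of skew described above.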


prutskov added the Performance 🚀 Performance related issues and pull requests. label Feb 25, 2022

prutskov mentioned this issue Mar 11, 2022

Series.drop() is 10x slower on Modin than on Pandas #3844

Closed

prutskov mentioned this issue Mar 11, 2022

Modin performance enhancements #4315

Closed

dchigarev self-assigned this Mar 15, 2022

vnlitvinov self-assigned this Jul 13, 2022

prutskov mentioned this issue Jul 22, 2022

PERF-#3844: Improve perf of drop operation #4694

Merged


vnlitvinov added a commit to vnlitvinov/modin that referenced this issue Aug 2, 2022

FEAT-modin-project#4268: Allow ModinDataframe.broadcast_apply() to dr…

a235a3f

…op labels Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>

vnlitvinov mentioned this issue Aug 2, 2022

PERF-#4268: Implement partition-parallel __getitem__ for bool Series masks #4753

Merged


mvashishtha closed this as completed in #4753 Aug 16, 2022

mvashishtha pushed a commit that referenced this issue Aug 16, 2022

PERF-#4268: Implement partition-parallel __getitem__ for bool Series …

bd326f1

…masks (#4753) Signed-off-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
