
Avoid dataframe deep copy while creating dataset #4596

Closed · Tracked by #5153
memeplex opened this issue Sep 5, 2021 · 15 comments · Fixed by #5225 or #5254

@memeplex commented Sep 5, 2021

Summary

The python API does the following while creating a dataset from a dataframe with cat cols:

        if len(cat_cols):  # cat_cols is list
            data = data.copy()  # not alter origin DataFrame
            data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})

By default, copy() performs a deep copy (copying all data and indices), but AFAICS there is no need for that here, and it can trigger a memory-expensive operation.

Motivation

Avoid extra memory usage. The sequence of steps here will end up with three copies of the data:

  • the original dataframe
  • the duplicated dataframe (transient)
  • the dataset

Description

Use copy(deep=False):

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    x=pd.Categorical(["a", "b", "b"]),
    y=[0, 1., 0],
))
df2 = df.copy(deep=False)
df2[["x"]] = df2[["x"]].apply(lambda x: x.cat.codes).replace({-1: np.nan})
print(df)
print(df2)
print(df2.values)
   x    y
0  a  0.0
1  b  1.0
2  b  0.0

   x    y
0  0  0.0
1  1  1.0
2  1  0.0

[[0. 0.]
 [1. 1.]
 [1. 0.]]
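
As a quick sanity check (a sketch added for illustration, not part of the original report), np.shares_memory confirms that the shallow copy reuses the untouched column's buffer while a deep copy duplicates it; exact results can differ across pandas versions, e.g. with copy-on-write enabled:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    x=pd.Categorical(["a", "b", "b"]),
    y=[0, 1., 0],
))
shallow = df.copy(deep=False)
deep = df.copy()  # deep=True is the default

# The shallow copy reuses the existing float64 buffer of column "y",
# while the deep copy allocates a fresh one (pre copy-on-write pandas).
print(np.shares_memory(df["y"].to_numpy(), shallow["y"].to_numpy()))  # expected: True
print(np.shares_memory(df["y"].to_numpy(), deep["y"].to_numpy()))     # expected: False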
@jameslamb (Collaborator)

Thanks very much! I think it will take a bit of investigation to test whether such a change can be made safely. Are you interested in working on this and submitting a pull request?

@memeplex (Author) commented Sep 6, 2021

Hi @jameslamb, yes, I'm rather busy right now but I've already set a reminder to start working on this. It's a very small change anyway, but I understand we have to ensure it's safe. AFAICS it is: if the code that follows modified the dataframe in an undesirable way, then a fortiori it would be a problem for the code path that doesn't take the len(cat_cols) branch; and this branch only replaces some columns, which is exactly the kind of change a shallow copy prevents from propagating to the original dataframe.
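
A tiny sketch of that argument (illustrative only, not from the thread): whole-column assignment on a shallow copy rebinds the column inside the copy and leaves the original frame's data untouched.

import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical(["a", "b", "a"]), "num": [1.0, 2.0, 3.0]})
shallow = df.copy(deep=False)

# Replacing the whole column installs a new array in the copy only.
shallow["cat"] = shallow["cat"].cat.codes

print(df["cat"].tolist())       # ['a', 'b', 'a'] -> original untouched
print(shallow["cat"].tolist())  # [0, 1, 0]       -> codes live only in the copy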

@jameslamb (Collaborator)

Sounds great, thanks very much!

@guolinke (Collaborator) commented Mar 2, 2022

Thank you @memeplex, indeed, this copy could be safely removed. PR is very welcome.

jameslamb mentioned this issue Apr 14, 2022
@jmoralez (Collaborator)

I believe it's safe to add deep=False in that copy since we don't modify the original arrays when replacing the categorical features with their codes, and all the tests pass.

In terms of memory savings, I ran the following script, which mimics what we do in _data_from_pandas; it defines 100 numerical features and 2 categorical features.

sample_script.py
import argparse

import numpy as np
import pandas as pd


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--deep', action='store_true')
    args = parser.parse_args()

    N = 100_000
    X = np.random.rand(N, 100)
    df = pd.DataFrame(X)
    cat1 = np.random.choice(['a', 'b'], N)
    df['cat1'] = cat1
    df['cat2'] = np.random.choice(['a', 'b', 'c'], N)
    cat_cols = ['cat1', 'cat2']
    df[cat_cols] = df[cat_cols].astype('category')
    df2 = df.copy(deep=args.deep)
    df2[cat_cols] = df2[cat_cols].apply(lambda c: c.cat.codes)
    res = df2.astype('float64').values
    np.testing.assert_equal(df['cat1'], cat1)
    np.testing.assert_equal(df2['cat1'], df['cat1'].cat.codes.values)

I profiled this script with scalene and got the following results (I only show the relevant lines):

  • scalene --reduced-profile --cli sample_script.py --deep (max memory: 410.37MB, growth rate: 78%):

    Line │ Memory peak │ Copy (MB/s) │ sample_script.py
      13 │         77M │             │ X = np.random.rand(N, 100)
      20 │         75M │             │ df2 = df.copy(deep=args.deep)
      22 │        157M │         224 │ res = df2.astype('float64').values

  • scalene --reduced-profile --cli sample_script.py (max memory: 334.22MB, growth rate: 74%):

    Line │ Memory peak │ Copy (MB/s) │ sample_script.py
      13 │         77M │             │ X = np.random.rand(N, 100)
      22 │        157M │         141 │ res = df2.astype('float64').values

We see that by setting deep=False we skip one copy of the dataframe and thus reduce the peak memory usage by roughly the size of the dataframe. Taking the values attribute seems to temporarily hold two copies, though.
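
Presumably that's because astype has to materialize one consolidated float64 array from the mixed-dtype blocks, so for a moment the source columns and the converted array coexist; here is a small sketch of that effect (my assumption, not something measured above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.random.rand(1000),
    "c": np.random.randint(0, 3, size=1000).astype("int8"),
})

# Mixed dtypes force astype to allocate a new consolidated float64 array,
# even with copy=False, so the source blocks and the result briefly coexist.
values = df.astype("float64", copy=False).to_numpy()
print(np.shares_memory(values, df["x"].to_numpy()))  # expected: False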

If other maintainers agree I can make a PR adding the deep=False argument to that copy.

@StrikerRUS (Collaborator)

@jmoralez Thanks a lot for the profiling! I'm +1.

@jameslamb (Collaborator)

Very nice work @jmoralez ! I support this change, please put up a PR whenever you have time.

@StrikerRUS (Collaborator)

Isn't one more copy (deep) done here? If so, I think we can refactor the code to avoid making it.

guolinke pushed a commit that referenced this issue May 22, 2022
@jmoralez (Collaborator)

> Isn't one more copy (deep) done here? If so, I think we can refactor the code to avoid making it.

You're right, I should have profiled the whole function. We have to set copy=False there as well, otherwise a deep copy is still being made.
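
For reference, a sketch (not from the thread) of what that second deep copy looks like: in the pandas versions in use at the time, rename copies the underlying data by default, while copy=False only relabels the columns and keeps sharing the buffers (copy-on-write in newer pandas changes these details).

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3))

renamed_copy = df.rename(columns=str)               # copied the data by default in pre-CoW pandas
renamed_view = df.rename(columns=str, copy=False)   # only relabels, buffers stay shared

print(np.shares_memory(df.to_numpy(), renamed_copy.to_numpy()))  # expected: False (pre-CoW)
print(np.shares_memory(df.to_numpy(), renamed_view.to_numpy()))  # expected: True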

@StrikerRUS (Collaborator)

> You're right

Thanks for confirming my findings!

Reopening this issue.

> We have to set copy=False there as well, otherwise a deep copy is still being made.

Maybe we can avoid doing a copy completely? Even a shallow one. I think we can track feature names in a separate structure like a raw list to not alter a passed DataFrame.

StrikerRUS reopened this May 22, 2022
@jmoralez (Collaborator)

> I think we can track feature names in a separate structure like a raw list to not alter a passed DataFrame.

TBH the fact that column names can be integers is very annoying and we'd have to modify a couple of sections in the _data_from_pandas function. Setting copy=False in the rename seems to be easier and shouldn't be memory expensive. I can post profiles of that function here if you think that'd help.

@StrikerRUS (Collaborator)

> I can post profiles of that function here if you think that'd help.

That would be great; it would let us compare a shallow copy with the no-copy case.

@jmoralez (Collaborator)

I used memory_profiler this time because scalene wasn't measuring the allocations in the basic.py file, only those in my sample script. So, using this script as the one to profile:

import lightgbm as lgb
import numpy as np
import pandas as pd
from memory_profiler import profile


N = 100_000
X = np.random.rand(N, 100)
df = pd.DataFrame(X)
df['cat1'] = np.random.choice(['a', 'b'], N)
df['cat2'] = np.random.choice(['a', 'b', 'c'], N)
cat_cols = ['cat1', 'cat2']
df[cat_cols] = df[cat_cols].astype('category')
decorated = profile(lgb.basic._data_from_pandas)
data = decorated(df, 'auto', 'auto', None)[0]

and running python profile.py | grep -v '0.0 MiB' | egrep '(^Line|MiB)' I get these results:

  1. Current (f2e1ad6)
Line #    Mem usage    Increment  Occurrences   Line Contents
   535    216.8 MiB    216.8 MiB           1   def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
   540    293.2 MiB     76.3 MiB           1               data = data.rename(columns=str)
   553    293.3 MiB      0.1 MiB           5               data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
   567    448.9 MiB    155.6 MiB           1           data = data.astype(target_dtype, copy=False).values
  2. Setting copy=False in the rename function:
@@ -537,7 +537,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
         if len(data.shape) != 2 or data.shape[0] < 1:
             raise ValueError('Input data must be 2 dimensional and non empty.')
         if feature_name == 'auto' or feature_name is None:
-            data = data.rename(columns=str)
+            data = data.rename(columns=str, copy=False)
         cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, pd_CategoricalDtype)]
         cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
         if pandas_categorical is None:  # train dataset
Line #    Mem usage    Increment  Occurrences   Line Contents
   535    216.9 MiB    216.9 MiB           1   def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
   553    217.0 MiB      0.1 MiB           5               data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
   567    372.7 MiB    155.7 MiB           1           data = data.astype(target_dtype, copy=False).values
  3. Using a list for the feature names:
@@ -537,7 +537,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
         if len(data.shape) != 2 or data.shape[0] < 1:
             raise ValueError('Input data must be 2 dimensional and non empty.')
         if feature_name == 'auto' or feature_name is None:
-            data = data.rename(columns=str)
+            feature_name = [str(col) for col in data.columns]
         cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, pd_CategoricalDtype)]
         cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
         if pandas_categorical is None:  # train dataset
@@ -552,14 +552,10 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
             data = data.copy(deep=False)  # not alter origin DataFrame
             data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
         if categorical_feature is not None:
-            if feature_name is None:
-                feature_name = list(data.columns)
             if categorical_feature == 'auto':  # use cat cols from DataFrame
-                categorical_feature = cat_cols_not_ordered
+                categorical_feature = [str(col) for col in cat_cols_not_ordered]
             else:  # use cat cols specified by user
                 categorical_feature = list(categorical_feature)
-        if feature_name == 'auto':
-            feature_name = list(data.columns)
         _check_for_bad_pandas_dtypes(data.dtypes)
         df_dtypes = [dtype.type for dtype in data.dtypes]
         df_dtypes.append(np.float32)  # so that the target dtype considers floats
Line #    Mem usage    Increment  Occurrences   Line Contents
   535    216.9 MiB    216.9 MiB           1   def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
   553    217.0 MiB      0.1 MiB           5               data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
   563    372.7 MiB    155.6 MiB           1           data = data.astype(target_dtype, copy=False).values

So we see that currently we're still making a deep copy because of the rename, and that options 2 and 3 are pretty much the same. However, I'd prefer 2 because it's easier; in 3 we'd have to keep track of where the column names could be integers.

@StrikerRUS (Collaborator)

@jmoralez Thanks a lot for profiling again!

> However I'd prefer 2 because it's easier

Well, that's fine with me.

StrikerRUS pushed a commit that referenced this issue Jun 5, 2022 (#5254)

* dont copy dataframe on rename
* test with feature_name and 'auto'
@github-actions

This issue has been automatically locked because there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.

github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023