
Avoid dataframe deep copy while creating dataset #4596

Closed · Tracked by #5153
memeplex opened this issue Sep 5, 2021 · 15 comments · Fixed by #5225 or #5254

@memeplex commented Sep 5, 2021

Summary

The python API does the following while creating a dataset from a dataframe with cat cols:

        if len(cat_cols):  # cat_cols is list
            data = data.copy()  # not alter origin DataFrame
            data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})

By default, copy() performs a deep copy (copying all data and indices), but AFAICS there is no need for that here, and it can trigger a memory-expensive operation.

Motivation

Avoid extra memory usage. The sequence of steps here will end up with three copies of the data:

  • the original dataframe
  • the duplicated dataframe (transient)
  • the dataset

Description

Use copy(deep=False):

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    x=pd.Categorical(["a", "b", "b"]),
    y=[0, 1., 0],
))
df2 = df.copy(deep=False)
df2[["x"]] = df2[["x"]].apply(lambda x: x.cat.codes).replace({-1: np.nan})
print(df)
print(df2)
print(df2.values)
   x    y
0  a  0.0
1  b  1.0
2  b  0.0

   x    y
0  0  0.0
1  1  1.0
2  1  0.0

[[0. 0.]
 [1. 1.]
 [1. 0.]]
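
As a quick sanity check (a sketch added for illustration, not part of the original report), np.shares_memory confirms that the shallow copy reuses the untouched column's buffer while a deep copy duplicates it; exact results can differ across pandas versions, e.g. with copy-on-write enabled:

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    x=pd.Categorical(["a", "b", "b"]),
    y=[0, 1., 0],
))
shallow = df.copy(deep=False)
deep = df.copy()  # deep=True is the default

# The shallow copy reuses the existing float64 buffer of column "y",
# while the deep copy allocates a fresh one (pre copy-on-write pandas).
print(np.shares_memory(df["y"].to_numpy(), shallow["y"].to_numpy()))  # expected: True
print(np.shares_memory(df["y"].to_numpy(), deep["y"].to_numpy()))     # expected: False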
@jameslamb (Collaborator)

Thanks very much! I think it will take a bit of investigation to test whether such a change can be made safely. Are you interested in working on this and submitting a pull request?

@memeplex (Author) commented Sep 6, 2021

Hi @jameslamb, yes, I'm rather busy right now but I've already set a reminder to start working on this. It's a very small change anyway, but I understand we have to ensure it's safe. AFAICS it is: if the code that follows modified the dataframe in an undesirable way, then a fortiori it would be a problem for the code path that doesn't take the len(cat_cols) branch; and this branch only replaces some columns, which is exactly the kind of change a shallow copy prevents from propagating to the original dataframe.
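
A tiny sketch of that argument (illustrative only, not from the thread): whole-column assignment on a shallow copy rebinds the column inside the copy and leaves the original frame's data untouched.

import pandas as pd

df = pd.DataFrame({"cat": pd.Categorical(["a", "b", "a"]), "num": [1.0, 2.0, 3.0]})
shallow = df.copy(deep=False)

# Replacing the whole column installs a new array in the copy only.
shallow["cat"] = shallow["cat"].cat.codes

print(df["cat"].tolist())       # ['a', 'b', 'a'] -> original untouched
print(shallow["cat"].tolist())  # [0, 1, 0]       -> codes live only in the copy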

@jameslamb (Collaborator)

Sounds great, thanks very much!

@guolinke (Collaborator) commented Mar 2, 2022

Thank you @memeplex, indeed, this copy could be safely removed. PR is very welcome.

jameslamb mentioned this issue Apr 14, 2022
@jmoralez (Collaborator)

I believe it's safe to add deep=False in that copy since we don't modify the original arrays when replacing the categorical features with their codes, and all the tests pass.

In terms of memory savings, I ran the following script, which mimics what we do in _data_from_pandas; it defines 100 numerical features and 2 categorical features.

sample_script.py
import argparse

import numpy as np
import pandas as pd


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--deep', action='store_true')
    args = parser.parse_args()

    N = 100_000
    X = np.random.rand(N, 100)
    df = pd.DataFrame(X)
    cat1 = np.random.choice(['a', 'b'], N)
    df['cat1'] = cat1
    df['cat2'] = np.random.choice(['a', 'b', 'c'], N)
    cat_cols = ['cat1', 'cat2']
    df[cat_cols] = df[cat_cols].astype('category')
    df2 = df.copy(deep=args.deep)
    df2[cat_cols] = df2[cat_cols].apply(lambda c: c.cat.codes)
    res = df2.astype('float64').values
    np.testing.assert_equal(df['cat1'], cat1)
    np.testing.assert_equal(df2['cat1'], df['cat1'].cat.codes.values)

I profiled this script with scalene and got the following results (I only show the relevant lines):

  • scalene --reduced-profile --cli sample_script.py --deep (max memory: 410.37MB, growth rate: 78%):

    Line │ Memory peak │ Copy (MB/s) │ sample_script.py
      13 │         77M │             │ X = np.random.rand(N, 100)
      20 │         75M │             │ df2 = df.copy(deep=args.deep)
      22 │        157M │         224 │ res = df2.astype('float64').values

  • scalene --reduced-profile --cli sample_script.py (max memory: 334.22MB, growth rate: 74%):

    Line │ Memory peak │ Copy (MB/s) │ sample_script.py
      13 │         77M │             │ X = np.random.rand(N, 100)
      22 │        157M │         141 │ res = df2.astype('float64').values

We see that by setting deep=False we skip one copy of the dataframe and thus reduce the peak memory usage by roughly the size of the dataframe. Taking the values attribute seems to temporarily hold two copies, though.
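
Presumably that's because astype has to materialize one consolidated float64 array from the mixed-dtype blocks, so for a moment the source columns and the converted array coexist; here is a small sketch of that effect (my assumption, not something measured above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.random.rand(1000),
    "c": np.random.randint(0, 3, size=1000).astype("int8"),
})

# Mixed dtypes force astype to allocate a new consolidated float64 array,
# even with copy=False, so the source blocks and the result briefly coexist.
values = df.astype("float64", copy=False).to_numpy()
print(np.shares_memory(values, df["x"].to_numpy()))  # expected: False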

If other maintainers agree I can make a PR adding the deep=False argument to that copy.

@StrikerRUS (Collaborator)

@jmoralez Thanks a lot for the profiling! I'm +1.

@jameslamb (Collaborator)

Very nice work @jmoralez ! I support this change, please put up a PR whenever you have time.

@StrikerRUS (Collaborator)

Isn't one more copy (deep) done here? If so, I think we can refactor the code to avoid making it.

guolinke pushed a commit that referenced this issue May 22, 2022
@jmoralez (Collaborator)

> Isn't one more copy (deep) done here? If so, I think we can refactor the code to avoid making it.

You're right, I should have profiled the whole function. We have to set copy=False there as well, otherwise a deep copy is still being made.
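
For reference, a sketch (not from the thread) of what that second deep copy looks like: in the pandas versions in use at the time, rename copies the underlying data by default, while copy=False only relabels the columns and keeps sharing the buffers (copy-on-write in newer pandas changes these details).

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3))

renamed_copy = df.rename(columns=str)               # copied the data by default in pre-CoW pandas
renamed_view = df.rename(columns=str, copy=False)   # only relabels, buffers stay shared

print(np.shares_memory(df.to_numpy(), renamed_copy.to_numpy()))  # expected: False (pre-CoW)
print(np.shares_memory(df.to_numpy(), renamed_view.to_numpy()))  # expected: True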

@StrikerRUS (Collaborator)

> You're right

Thanks for confirming my findings!

Reopening this issue.

> We have to set copy=False there as well, otherwise a deep copy is still being made.

Maybe we can avoid doing a copy completely? Even a shallow one. I think we can track feature names in a separate structure like a raw list to not alter a passed DataFrame.

StrikerRUS reopened this May 22, 2022
@jmoralez (Collaborator)

> I think we can track feature names in a separate structure like a raw list to not alter a passed DataFrame.

TBH the fact that column names can be integers is very annoying and we'd have to modify a couple of sections in the _data_from_pandas function. Setting copy=False in the rename seems to be easier and shouldn't be memory expensive. I can post profiles of that function here if you think that'd help.

@StrikerRUS (Collaborator)

> I can post profiles of that function here if you think that'd help.

That would be great; it would let us compare a shallow copy with the no-copy case.

@jmoralez (Collaborator)

I used memory_profiler this time because scalene wasn't measuring the allocations in the basic.py file, only those in my sample script. So, using this script as the one to profile:

import lightgbm as lgb
import numpy as np
import pandas as pd
from memory_profiler import profile


N = 100_000
X = np.random.rand(N, 100)
df = pd.DataFrame(X)
df['cat1'] = np.random.choice(['a', 'b'], N)
df['cat2'] = np.random.choice(['a', 'b', 'c'], N)
cat_cols = ['cat1', 'cat2']
df[cat_cols] = df[cat_cols].astype('category')
decorated = profile(lgb.basic._data_from_pandas)
data = decorated(df, 'auto', 'auto', None)[0]

and running python profile.py | grep -v '0.0 MiB' | egrep '(^Line|MiB)' I get these results:

  1. Current (f2e1ad6)
Line #    Mem usage    Increment  Occurrences   Line Contents
   535    216.8 MiB    216.8 MiB           1   def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
   540    293.2 MiB     76.3 MiB           1               data = data.rename(columns=str)
   553    293.3 MiB      0.1 MiB           5               data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
   567    448.9 MiB    155.6 MiB           1           data = data.astype(target_dtype, copy=False).values
  2. Setting copy=False in the rename function:
@@ -537,7 +537,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
         if len(data.shape) != 2 or data.shape[0] < 1:
             raise ValueError('Input data must be 2 dimensional and non empty.')
         if feature_name == 'auto' or feature_name is None:
-            data = data.rename(columns=str)
+            data = data.rename(columns=str, copy=False)
         cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, pd_CategoricalDtype)]
         cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
         if pandas_categorical is None:  # train dataset
Line #    Mem usage    Increment  Occurrences   Line Contents
   535    216.9 MiB    216.9 MiB           1   def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
   553    217.0 MiB      0.1 MiB           5               data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
   567    372.7 MiB    155.7 MiB           1           data = data.astype(target_dtype, copy=False).values
  3. Using a list for the feature names:
@@ -537,7 +537,7 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
         if len(data.shape) != 2 or data.shape[0] < 1:
             raise ValueError('Input data must be 2 dimensional and non empty.')
         if feature_name == 'auto' or feature_name is None:
-            data = data.rename(columns=str)
+            feature_name = [str(col) for col in data.columns]
         cat_cols = [col for col, dtype in zip(data.columns, data.dtypes) if isinstance(dtype, pd_CategoricalDtype)]
         cat_cols_not_ordered = [col for col in cat_cols if not data[col].cat.ordered]
         if pandas_categorical is None:  # train dataset
@@ -552,14 +552,10 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
             data = data.copy(deep=False)  # not alter origin DataFrame
             data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
         if categorical_feature is not None:
-            if feature_name is None:
-                feature_name = list(data.columns)
             if categorical_feature == 'auto':  # use cat cols from DataFrame
-                categorical_feature = cat_cols_not_ordered
+                categorical_feature = [str(col) for col in cat_cols_not_ordered]
             else:  # use cat cols specified by user
                 categorical_feature = list(categorical_feature)
-        if feature_name == 'auto':
-            feature_name = list(data.columns)
         _check_for_bad_pandas_dtypes(data.dtypes)
         df_dtypes = [dtype.type for dtype in data.dtypes]
         df_dtypes.append(np.float32)  # so that the target dtype considers floats
Line #    Mem usage    Increment  Occurrences   Line Contents
   535    216.9 MiB    216.9 MiB           1   def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical):
   553    217.0 MiB      0.1 MiB           5               data[cat_cols] = data[cat_cols].apply(lambda x: x.cat.codes).replace({-1: np.nan})
   563    372.7 MiB    155.6 MiB           1           data = data.astype(target_dtype, copy=False).values

So we see that currently we're still making a deep copy because of the rename, and that options 2 and 3 are pretty much the same. However, I'd prefer 2 because it's easier; in 3 we'd have to keep track of where the column names could be integers.

@StrikerRUS (Collaborator)

@jmoralez Thanks a lot for profiling again!

> However I'd prefer 2 because it's easier

Well, that's fine with me.

StrikerRUS pushed a commit that referenced this issue Jun 5, 2022 (#5254)

* dont copy dataframe on rename
* test with feature_name and 'auto'
@github-actions

This issue has been automatically locked because there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.

github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023