It does not seem trivial to support target encoding for sequence data. After discussing with @shiyu1994, I think the following items need to be done:
- Support accumulating from batches in `CategoryEncodingProvider`
- Write an API for creating `CategoryEncodingProvider` with batch data in Python
- For sequence data:
  - first create the `CategoryEncodingProvider`
  - use `CategoryEncodingProvider` in `LGBM_DatasetCreateFromSampledColumn`
  - use `CategoryEncodingProvider` in `LGBM_DatasetPushRows`
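The first item above (accumulating from batches) could be sketched roughly as follows. This is only an illustration in plain Python, not LightGBM's `CategoryEncodingProvider`: the class name `BatchCategoryStats`, its methods, the `prior_weight` parameter, and the smoothing formula are all assumptions made for the example, and the encoding actually used in #3234 may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


class BatchCategoryStats:
    """Hypothetical accumulator; NOT LightGBM's CategoryEncodingProvider."""

    def __init__(self, prior_weight: float = 1.0) -> None:
        # Per-category row count and label sum, updated batch by batch.
        self.count: Dict[Hashable, int] = defaultdict(int)
        self.label_sum: Dict[Hashable, float] = defaultdict(float)
        self.total_count = 0
        self.total_label_sum = 0.0
        self.prior_weight = prior_weight  # assumed smoothing strength

    def accumulate(self, categories: Sequence[Hashable], labels: Sequence[float]) -> None:
        """Fold one batch of (category, label) pairs into the running statistics."""
        if len(categories) != len(labels):
            raise ValueError("categories and labels must have the same length")
        for cat, y in zip(categories, labels):
            self.count[cat] += 1
            self.label_sum[cat] += y
            self.total_count += 1
            self.total_label_sum += y

    def count_encoding(self, cat: Hashable) -> int:
        """Number of rows seen so far with this category (0 if unseen)."""
        return self.count.get(cat, 0)

    def target_encoding(self, cat: Hashable) -> float:
        """Mean label of the category, smoothed toward the global mean (assumed formula)."""
        if self.total_count == 0:
            raise ValueError("no data accumulated yet")
        prior = self.total_label_sum / self.total_count
        n = self.count.get(cat, 0)
        s = self.label_sum.get(cat, 0.0)
        return (s + self.prior_weight * prior) / (n + self.prior_weight)


# Usage: statistics are built up over two batches, then queried.
stats = BatchCategoryStats(prior_weight=1.0)
stats.accumulate([0, 1, 0], [1.0, 0.0, 0.0])
stats.accumulate([1, 1], [1.0, 1.0])
```

Because only counts and label sums are kept, the result does not depend on how the rows are split into batches, which is what makes this style of accumulation compatible with reading a `Sequence` in ranges.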
Summary
In #4089, a new input format, `lightgbm.Sequence`, is allowed as raw data input to dataset construction. A `Sequence` class should provide random and range data access. Users can customize their own `Sequence` class to support their own input format, as long as random access and range access are implemented.
LightGBM/python-package/lightgbm/basic.py
Lines 653 to 716 in b1facf5
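To make "random and range data access" concrete, here is a minimal sketch of a sequence-like class. It is written in plain Python so that it runs without LightGBM installed. In real use the class would subclass `lightgbm.Sequence`, which, as far as I understand, expects `__getitem__`, `__len__`, and a `batch_size` attribute. The names `ListSequence` and `iter_batches` are hypothetical and exist only for this example.

```python
from typing import Iterator, List, Union

Row = List[float]


class ListSequence:
    """Hypothetical in-memory sequence; a real one would subclass lightgbm.Sequence."""

    batch_size = 2  # rows fetched per range access (kept tiny for illustration)

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows

    def __getitem__(self, idx: Union[int, slice]) -> Union[Row, List[Row]]:
        if isinstance(idx, int):
            return self._rows[idx]  # random access: one row
        if isinstance(idx, slice):
            return self._rows[idx]  # range access: a block of rows
        raise TypeError(f"index must be int or slice, got {type(idx).__name__}")

    def __len__(self) -> int:
        return len(self._rows)


def iter_batches(seq: ListSequence) -> Iterator[List[Row]]:
    """Walk the sequence in ranges of batch_size rows, as a batch reader might."""
    for start in range(0, len(seq), seq.batch_size):
        yield seq[start:start + seq.batch_size]


seq = ListSequence([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
batches = list(iter_batches(seq))
```

Random access is what sampling needs, and range access is what batch-wise pushing of rows needs. Any encoding support for `Sequence` would have to work through these two access patterns.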
#3234 implements target and count encoding for categorical features and supports most input formats except `Sequence`. We want to add support for these encoding methods for sequence input.
Description
To implement this, we need access to the label in the method where the dataset is constructed from `Sequence`,
LightGBM/python-package/lightgbm/basic.py
Lines 1620 to 1647 in b1facf5
as we have done for other data formats in #3234, for example
https://github.com/shiyu1994/LightGBM/blob/e139fd1cf50c43a3ceced66e67ce148a23704e54/python-package/lightgbm/basic.py#L1724-L1747
Then we should call the `CategoryEncodingProvider` class in the internal C++ method `LGBM_DatasetCreateFromSampledColumn`, which constructs the dataset from `Sequence`,
LightGBM/src/c_api.cpp
Lines 966 to 984 in b1facf5
Contributors may see
https://github.com/shiyu1994/LightGBM/blob/e139fd1cf50c43a3ceced66e67ce148a23704e54/src/c_api.cpp#L1086-L1184
for reference.