Support Target and Count Encoding for Sequence Input #4780

Open
Tracked by #5153
shiyu1994 opened this issue Nov 8, 2021 · 1 comment

@shiyu1994
Collaborator

Summary

#4089 added lightgbm.Sequence as a raw data input format for dataset construction. A Sequence class must provide random access (by row index) and range access (by slice). Users can write their own Sequence subclass to support a custom input format, as long as both access modes are implemented.

class Sequence(abc.ABC):
    """
    Generic data access interface.

    Object should support the following operations:

    .. code-block::

        # Get total row number.
        >>> len(seq)
        # Random access by row index. Used for data sampling.
        >>> seq[10]
        # Range data access. Used to read data in batch when constructing Dataset.
        >>> seq[0:100]
        # Optionally specify batch_size to control range data read size.
        >>> seq.batch_size

    - With random access, **data sampling does not need to go through all data**.
    - With range data access, there's **no need to read all data into memory thus reduce memory usage**.

    .. versionadded:: 3.3.0

    Attributes
    ----------
    batch_size : int
        Default size of a batch.
    """

    batch_size = 4096  # Defaults to read 4K rows in each batch.

    @abc.abstractmethod
    def __getitem__(self, idx: Union[int, slice, List[int]]) -> np.ndarray:
        """Return data for given row index.

        A basic implementation should look like this:

        .. code-block:: python

            if isinstance(idx, numbers.Integral):
                return self._get_one_line(idx)
            elif isinstance(idx, slice):
                return np.stack([self._get_one_line(i) for i in range(idx.start, idx.stop)])
            elif isinstance(idx, list):
                # Only required if using ``Dataset.subset()``.
                return np.array([self._get_one_line(i) for i in idx])
            else:
                raise TypeError(f"Sequence index must be integer, slice or list, got {type(idx).__name__}")

        Parameters
        ----------
        idx : int, slice[int], list[int]
            Item index.

        Returns
        -------
        result : numpy 1-D array, numpy 2-D array
            1-D array if idx is int, 2-D array if idx is slice or list.
        """
        raise NotImplementedError("Sub-classes of lightgbm.Sequence must implement __getitem__()")

    @abc.abstractmethod
    def __len__(self) -> int:
        """Return row count of this sequence."""
        raise NotImplementedError("Sub-classes of lightgbm.Sequence must implement __len__()")
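
For illustration, here is a minimal sketch of the access contract above, using a hypothetical in-memory class named ListBackedSequence. To stay dependency-free it does not subclass lightgbm.Sequence and it returns plain lists rather than numpy arrays; a real implementation would do both. It resolves slices with slice.indices(), so open-ended slices such as seq[:100] also work, which the idx.start / idx.stop form in the docstring does not handle.

```python
import numbers
from typing import List, Union


class ListBackedSequence:
    """Hypothetical stand-in for a lightgbm.Sequence subclass (illustration only)."""

    batch_size = 2  # Tiny batch size so the batching is visible in the example.

    def __init__(self, rows: List[List[float]]):
        self._rows = rows

    def _get_one_line(self, i: int) -> List[float]:
        return self._rows[i]

    def __getitem__(self, idx: Union[int, slice, List[int]]):
        if isinstance(idx, numbers.Integral):
            return self._get_one_line(idx)
        elif isinstance(idx, slice):
            # slice.indices() resolves None bounds and clamps to the length.
            return [self._get_one_line(i) for i in range(*idx.indices(len(self)))]
        elif isinstance(idx, list):
            # Only needed for list-based access such as Dataset.subset().
            return [self._get_one_line(i) for i in idx]
        raise TypeError(f"Sequence index must be integer, slice or list, got {type(idx).__name__}")

    def __len__(self) -> int:
        return len(self._rows)


seq = ListBackedSequence([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
first_batch = seq[0:seq.batch_size]  # range access, as used when pushing rows
picked = seq[[0, 2]]                 # list access
```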

#3234 implements target and count encoding for categorical features and supports most input formats, but not Sequence.

We want to add support for these encoding methods for Sequence input.

Description

To implement this, the label must be accessible in the method that constructs the dataset from Sequence objects:

def __init_from_seqs(self, seqs: List[Sequence], ref_dataset: Optional['Dataset'] = None):
    """
    Initialize data from list of Sequence objects.

    Sequence: Generic Data Access Object
        Supports random access and access by batch if properly defined by user

    Data scheme uniformity are trusted, not checked
    """
    total_nrow = sum(len(seq) for seq in seqs)

    # create validation dataset from ref_dataset
    if ref_dataset is not None:
        self._init_from_ref_dataset(total_nrow, ref_dataset)
    else:
        param_str = param_dict_to_str(self.get_params())
        sample_cnt = _get_sample_count(total_nrow, param_str)

        sample_data, col_indices = self.__sample(seqs, total_nrow)
        self._init_from_sample(sample_data, col_indices, sample_cnt, total_nrow)

    for seq in seqs:
        nrow = len(seq)
        batch_size = getattr(seq, 'batch_size', None) or Sequence.batch_size
        for start in range(0, nrow, batch_size):
            end = min(start + batch_size, nrow)
            self._push_rows(seq[start:end])
    return self
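
One possible way to make the label available inside this loop is sketched below. It assumes the label is held as a separate in-memory list aligned row by row with the concatenated sequences. The helper name iter_batches_with_labels is hypothetical and not part of LightGBM.

```python
from typing import Iterator, List, Sequence as Seq, Tuple

DEFAULT_BATCH_SIZE = 4096  # mirrors Sequence.batch_size in the class above


def iter_batches_with_labels(seqs: List[Seq], labels: Seq) -> Iterator[Tuple[Seq, Seq]]:
    """Yield (rows, labels) pairs using the same batching as __init_from_seqs."""
    total_nrow = sum(len(seq) for seq in seqs)
    if len(labels) != total_nrow:
        raise ValueError(f"expected {total_nrow} labels, got {len(labels)}")
    offset = 0  # global row index of the first row of the current sequence
    for seq in seqs:
        nrow = len(seq)
        batch_size = getattr(seq, 'batch_size', None) or DEFAULT_BATCH_SIZE
        for start in range(0, nrow, batch_size):
            end = min(start + batch_size, nrow)
            yield seq[start:end], labels[offset + start:offset + end]
        offset += nrow


# Plain lists support len() and slicing, so they serve as stand-in sequences here.
seqs = [[["a"], ["b"], ["a"]], [["b"], ["b"]]]
labels = [1.0, 0.0, 1.0, 0.0, 1.0]
batches = list(iter_batches_with_labels(seqs, labels))
```

Each yielded pair could then be passed to whatever component accumulates the encoding statistics before the rows are pushed.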

This is the same approach taken for the other data formats in #3234, for example
https://github.com/shiyu1994/LightGBM/blob/e139fd1cf50c43a3ceced66e67ce148a23704e54/python-package/lightgbm/basic.py#L1724-L1747

Then the CategoryEncodingProvider class should be used in the internal C++ function LGBM_DatasetCreateFromSampledColumn, which constructs the dataset from Sequence input:

LightGBM/src/c_api.cpp

Lines 966 to 984 in b1facf5

int LGBM_DatasetCreateFromSampledColumn(double** sample_data,
                                        int** sample_indices,
                                        int32_t ncol,
                                        const int* num_per_col,
                                        int32_t num_sample_row,
                                        int32_t num_total_row,
                                        const char* parameters,
                                        DatasetHandle* out) {
  API_BEGIN();
  auto param = Config::Str2Map(parameters);
  Config config;
  config.Set(param);
  OMP_SET_NUM_THREADS(config.num_threads);
  DatasetLoader loader(config, nullptr, 1, nullptr);
  *out = loader.ConstructFromSampleData(sample_data, sample_indices, ncol, num_per_col,
                                        num_sample_row,
                                        static_cast<data_size_t>(num_total_row));
  API_END();
}

Contributors can see
https://github.com/shiyu1994/LightGBM/blob/e139fd1cf50c43a3ceced66e67ce148a23704e54/src/c_api.cpp#L1086-L1184
for reference.

@hzy46
Contributor

hzy46 commented Nov 15, 2021

It seems non-trivial to support target encoding for sequence data. After discussing with @shiyu1994, I think the following items need to be done:

  • Support accumulating from batches in CategoryEncodingProvider
  • Write a Python API for creating a CategoryEncodingProvider from batch data
  • For sequence data:
    • First create the CategoryEncodingProvider
    • Use the CategoryEncodingProvider in LGBM_DatasetCreateFromSampledColumn
    • Use the CategoryEncodingProvider in LGBM_DatasetPushRows
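
The first item, accumulating from batches, can be sketched in Python roughly as follows. This is an illustration only: BatchCategoryStats is a hypothetical name, not the C++ CategoryEncodingProvider, and the smoothed target-encoding formula (label_sum + prior * prior_weight) / (count + prior_weight) is an assumed common form, not necessarily what #3234 computes.

```python
from collections import defaultdict
from typing import Dict, Iterable


class BatchCategoryStats:
    """Hypothetical accumulator of per-category count and label sum (one feature)."""

    def __init__(self) -> None:
        self.count: Dict[int, int] = defaultdict(int)
        self.label_sum: Dict[int, float] = defaultdict(float)
        self.total_count = 0
        self.total_label_sum = 0.0

    def accumulate(self, categories: Iterable[int], labels: Iterable[float]) -> None:
        """Fold one batch into the running statistics."""
        for cat, label in zip(categories, labels):
            self.count[cat] += 1
            self.label_sum[cat] += label
            self.total_count += 1
            self.total_label_sum += label

    def count_encoding(self, cat: int) -> int:
        return self.count.get(cat, 0)

    def target_encoding(self, cat: int, prior_weight: float = 1.0) -> float:
        """Smoothed mean label; the global mean serves as the prior (assumed form)."""
        if self.total_count == 0:
            raise ValueError("no data accumulated yet")
        prior = self.total_label_sum / self.total_count
        n = self.count.get(cat, 0)
        s = self.label_sum.get(cat, 0.0)
        return (s + prior * prior_weight) / (n + prior_weight)


stats = BatchCategoryStats()
stats.accumulate([0, 1, 0], [1.0, 0.0, 1.0])  # first batch
stats.accumulate([1, 1], [0.0, 1.0])          # second batch
```

Because the statistics are plain sums, the result does not depend on how the rows are split into batches. That suggests a two-pass flow for sequence data: one pass over the batches to accumulate, then a second pass to push the encoded rows.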

@jameslamb jameslamb mentioned this issue Apr 14, 2022
60 tasks