Support Target and Count Encoding for Sequence Input #4780

shiyu1994 · 2021-11-08T05:55:43Z

Summary

#4089 added lightgbm.Sequence as a raw data input format for Dataset construction. A Sequence class must provide random access (by row index) and range access (by slice). Users can define their own Sequence subclass to support a custom input format, as long as both access modes are implemented.

LightGBM/python-package/lightgbm/basic.py

Lines 653 to 716 in b1facf5

    
    class Sequence(abc.ABC):
        """
        Generic data access interface.

        Object should support the following operations:

        .. code-block::

            # Get total row number.
            >>> len(seq)
            # Random access by row index. Used for data sampling.
            >>> seq[10]
            # Range data access. Used to read data in batch when constructing Dataset.
            >>> seq[0:100]
            # Optionally specify batch_size to control range data read size.
            >>> seq.batch_size

        - With random access, **data sampling does not need to go through all data**.
        - With range data access, there's **no need to read all data into memory thus reduce memory usage**.

        .. versionadded:: 3.3.0

        Attributes
        ----------
        batch_size : int
            Default size of a batch.
        """

        batch_size = 4096  # Defaults to read 4K rows in each batch.

        @abc.abstractmethod
        def __getitem__(self, idx: Union[int, slice, List[int]]) -> np.ndarray:
            """Return data for given row index.

            A basic implementation should look like this:

            .. code-block:: python

                if isinstance(idx, numbers.Integral):
                    return self._get_one_line(idx)
                elif isinstance(idx, slice):
                    return np.stack([self._get_one_line(i) for i in range(idx.start, idx.stop)])
                elif isinstance(idx, list):
                    # Only required if using ``Dataset.subset()``.
                    return np.array([self._get_one_line(i) for i in idx])
                else:
                    raise TypeError(f"Sequence index must be integer, slice or list, got {type(idx).__name__}")

            Parameters
            ----------
            idx : int, slice[int], list[int]
                Item index.

            Returns
            -------
            result : numpy 1-D array, numpy 2-D array
                1-D array if idx is int, 2-D array if idx is slice or list.
            """
            raise NotImplementedError("Sub-classes of lightgbm.Sequence must implement __getitem__()")

        @abc.abstractmethod
        def __len__(self) -> int:
            """Return row count of this sequence."""
            raise NotImplementedError("Sub-classes of lightgbm.Sequence must implement __len__()")
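
For illustration, a minimal in-memory class that satisfies this interface might look like the sketch below. `ListSequence` is a hypothetical name. To stay dependency-free it does not subclass `lightgbm.Sequence` and returns plain lists rather than NumPy arrays, both of which a real implementation would need to change. It uses `slice.indices()` so that open-ended slices such as `seq[:2]` also work, which the `range(idx.start, idx.stop)` form in the docstring does not handle.

```python
import numbers
from typing import List, Sequence as Seq, Union


class ListSequence:
    """Hypothetical in-memory stand-in for a lightgbm.Sequence subclass."""

    batch_size = 2  # tiny on purpose, so batching is visible in the example

    def __init__(self, rows: Seq[Seq[float]]):
        self._rows = [list(r) for r in rows]

    def _get_one_line(self, i: int) -> List[float]:
        return self._rows[i]

    def __getitem__(self, idx: Union[int, slice, List[int]]):
        if isinstance(idx, numbers.Integral):
            return self._get_one_line(idx)
        if isinstance(idx, slice):
            # slice.indices() resolves None and negative bounds against len(self).
            return [self._get_one_line(i) for i in range(*idx.indices(len(self)))]
        if isinstance(idx, list):
            return [self._get_one_line(i) for i in idx]
        raise TypeError(f"Sequence index must be integer, slice or list, got {type(idx).__name__}")

    def __len__(self) -> int:
        return len(self._rows)


seq = ListSequence([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
first_batch = seq[0:seq.batch_size]  # range access, as used during Dataset construction
```

Integer access serves sampling, slice access serves batched construction, and list access is only needed for `Dataset.subset()`, as the docstring notes.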

#3234 implements target and count encoding for categorical features and supports most input formats, but not Sequence.

We want to add support for these encoding methods for sequence input.

Description

To implement this, the method that constructs the dataset from Sequence objects needs access to the label.

LightGBM/python-package/lightgbm/basic.py

Lines 1620 to 1647 in b1facf5

    
    def __init_from_seqs(self, seqs: List[Sequence], ref_dataset: Optional['Dataset'] = None):
        """
        Initialize data from list of Sequence objects.

        Sequence: Generic Data Access Object
            Supports random access and access by batch if properly defined by user

        Data scheme uniformity are trusted, not checked
        """
        total_nrow = sum(len(seq) for seq in seqs)

        # create validation dataset from ref_dataset
        if ref_dataset is not None:
            self._init_from_ref_dataset(total_nrow, ref_dataset)
        else:
            param_str = param_dict_to_str(self.get_params())
            sample_cnt = _get_sample_count(total_nrow, param_str)

            sample_data, col_indices = self.__sample(seqs, total_nrow)
            self._init_from_sample(sample_data, col_indices, sample_cnt, total_nrow)

        for seq in seqs:
            nrow = len(seq)
            batch_size = getattr(seq, 'batch_size', None) or Sequence.batch_size
            for start in range(0, nrow, batch_size):
                end = min(start + batch_size, nrow)
                self._push_rows(seq[start:end])
        return self

This mirrors what was done for the other data formats in #3234, for example:
https://github.com/shiyu1994/LightGBM/blob/e139fd1cf50c43a3ceced66e67ce148a23704e54/python-package/lightgbm/basic.py#L1724-L1747
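
As a rough sketch of what "access to the label" could mean for the batched loop in `__init_from_seqs`, the generator below pairs each row batch with the matching slice of a label array. `iter_batches_with_labels` is a hypothetical helper, not part of LightGBM, and it assumes the labels are ordered like the concatenation of all sequences.

```python
from typing import Iterator, List, Sequence as Seq, Tuple

DEFAULT_BATCH_SIZE = 4096  # same value as Sequence.batch_size in the issue's snippet


def iter_batches_with_labels(seqs: List, labels: Seq[float]) -> Iterator[Tuple[list, list]]:
    """Yield (rows, labels) per batch; hypothetical helper, not a LightGBM API."""
    total_nrow = sum(len(seq) for seq in seqs)
    if len(labels) != total_nrow:
        raise ValueError(f"expected {total_nrow} labels, got {len(labels)}")
    offset = 0  # global row index of the first row of the current sequence
    for seq in seqs:
        nrow = len(seq)
        batch_size = getattr(seq, "batch_size", None) or DEFAULT_BATCH_SIZE
        for start in range(0, nrow, batch_size):
            end = min(start + batch_size, nrow)
            yield seq[start:end], list(labels[offset + start:offset + end])
        offset += nrow


class SmallBatchList(list):
    """Plain list with a batch_size attribute, standing in for a Sequence."""
    batch_size = 2


seqs = [SmallBatchList([[1], [2], [3]]), [[4], [5]]]
batches = list(iter_batches_with_labels(seqs, [0, 1, 0, 1, 1]))
```

The running `offset` is what keeps labels aligned when several sequences are passed, since each sequence indexes its own rows from zero.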

Then the internal C++ function LGBM_DatasetCreateFromSampledColumn, which constructs the dataset from Sequence input, should use the CategoryEncodingProvider class:

LightGBM/src/c_api.cpp

Lines 966 to 984 in b1facf5

    
    int LGBM_DatasetCreateFromSampledColumn(double** sample_data,
                                            int** sample_indices,
                                            int32_t ncol,
                                            const int* num_per_col,
                                            int32_t num_sample_row,
                                            int32_t num_total_row,
                                            const char* parameters,
                                            DatasetHandle* out) {
      API_BEGIN();
      auto param = Config::Str2Map(parameters);
      Config config;
      config.Set(param);
      OMP_SET_NUM_THREADS(config.num_threads);
      DatasetLoader loader(config, nullptr, 1, nullptr);
      *out = loader.ConstructFromSampleData(sample_data, sample_indices, ncol, num_per_col,
                                            num_sample_row,
                                            static_cast<data_size_t>(num_total_row));
      API_END();
    }

For reference, contributors can see
https://github.com/shiyu1994/LightGBM/blob/e139fd1cf50c43a3ceced66e67ce148a23704e54/src/c_api.cpp#L1086-L1184


hzy46 · 2021-11-15T06:59:41Z

It seems non-trivial to support target encoding for sequence data. After discussing with @shiyu1994, I think the following items need to be done:

- Support accumulating from batches in CategoryEncodingProvider.
- Write an API for creating CategoryEncodingProvider with batch data in Python.
- For sequence data:
  - first create CategoryEncodingProvider
  - use CategoryEncodingProvider in LGBM_DatasetCreateFromSampledColumn
  - use CategoryEncodingProvider in LGBM_DatasetPushRows
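
To make the "accumulating from batches" item concrete, here is a small Python sketch of per-category statistics that can be updated one batch at a time. `BatchCategoryStats` is a hypothetical class, not LightGBM's `CategoryEncodingProvider`, and the smoothed-mean formula with `prior_weight` is an assumed, commonly used form of target encoding rather than the exact one from #3234.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable


class BatchCategoryStats:
    """Hypothetical accumulator; not LightGBM's CategoryEncodingProvider."""

    def __init__(self) -> None:
        self.count: Dict[Hashable, int] = defaultdict(int)
        self.label_sum: Dict[Hashable, float] = defaultdict(float)
        self.total_count = 0
        self.total_label_sum = 0.0

    def accumulate(self, categories: Iterable[Hashable], labels: Iterable[float]) -> None:
        """Fold one batch of (category, label) pairs into the running statistics."""
        for cat, label in zip(categories, labels):
            self.count[cat] += 1
            self.label_sum[cat] += label
            self.total_count += 1
            self.total_label_sum += label

    def count_encode(self, cat: Hashable) -> int:
        # Count encoding: how often the category was seen (0 if unseen).
        return self.count.get(cat, 0)

    def target_encode(self, cat: Hashable, prior_weight: float = 1.0) -> float:
        # Smoothed mean: (label_sum + prior * w) / (count + w), prior = global label mean.
        # This smoothing form is an assumption, not necessarily the one used in #3234.
        prior = self.total_label_sum / self.total_count if self.total_count else 0.0
        n = self.count.get(cat, 0)
        s = self.label_sum.get(cat, 0.0)
        return (s + prior * prior_weight) / (n + prior_weight)


stats = BatchCategoryStats()
stats.accumulate(["a", "b", "a"], [1.0, 0.0, 1.0])  # first batch
stats.accumulate(["b", "a"], [1.0, 0.0])            # second batch
encoded = [(stats.count_encode(c), stats.target_encode(c)) for c in ["a", "b", "c"]]
```

Because the statistics are only final once every batch has been seen, the encoder has to be built in a first pass before any rows are encoded and pushed, which is presumably why the list above creates the provider first.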

shiyu1994 added the feature request label Nov 8, 2021

shiyu1994 mentioned this issue Nov 8, 2021

[Draft] Oct~Nov iteration Plan #4677

Closed

16 tasks

hzy46 self-assigned this Nov 9, 2021

jameslamb mentioned this issue Apr 14, 2022

[RFC] 4.0.0 Release #5153

Closed

60 tasks
