Trees with linear models at leaves #3299

Merged (159 commits) on Dec 24, 2020
Commits
3370bc5
Add Eigen library.
btrotta Jun 26, 2020
08cb3d1
Working for simple test.
btrotta Jun 29, 2020
22040c7
Apply changes to config params.
btrotta Jun 29, 2020
1001383
Handle nan data.
btrotta Jun 29, 2020
aa77951
Update docs.
btrotta Jun 29, 2020
8261f7b
Add test.
btrotta Jun 29, 2020
3e43722
Only load raw data if boosting=gbdt_linear
btrotta Jun 30, 2020
0b00394
Remove unneeded code.
btrotta Jun 30, 2020
8f9d69e
Minor updates.
btrotta Jul 1, 2020
566895b
Update to work with sk-learn interface.
btrotta Jul 2, 2020
e507f72
Update to work with chunked datasets.
btrotta Jul 2, 2020
3a37ba7
Throw error if we try to create a Booster with an already-constructed…
btrotta Jul 4, 2020
414c028
Save raw data in binary dataset file.
btrotta Jul 5, 2020
2fdc4ab
Update docs and fix parameter checking.
btrotta Jul 6, 2020
a56fb45
Fix dataset loading.
btrotta Jul 6, 2020
90477db
Add test for regularization.
btrotta Jul 6, 2020
64da46a
Fix bugs when saving and loading tree.
btrotta Jul 7, 2020
30fc91f
Add test for load/save linear model.
btrotta Jul 7, 2020
f690536
Remove unneeded code.
btrotta Jul 7, 2020
c1ee624
Fix case where not enough leaf data for linear model.
btrotta Jul 7, 2020
24930f0
Simplify code.
btrotta Jul 8, 2020
f41c5e7
Speed up code.
btrotta Jul 8, 2020
f6cdc7d
Speed up code.
btrotta Jul 8, 2020
874c1c8
Simplify code.
btrotta Jul 8, 2020
7d9bdad
Speed up code.
btrotta Jul 8, 2020
7d15fa5
Fix bugs.
btrotta Jul 9, 2020
7460d53
Working version.
btrotta Jul 14, 2020
4fc8121
Store feature data column-wise (not fully working yet).
btrotta Jul 15, 2020
48e6d5b
Fix bugs.
btrotta Jul 16, 2020
bc643c9
Speed up.
btrotta Jul 16, 2020
a6fa69e
Speed up.
btrotta Jul 16, 2020
f42f280
Remove unneeded code.
btrotta Jul 16, 2020
0e7ac0f
Small speedup.
btrotta Jul 22, 2020
dfd08da
Speed up.
btrotta Jul 23, 2020
cda0940
Minor updates.
btrotta Jul 23, 2020
4aa9cc1
Remove unneeded code.
btrotta Jul 23, 2020
d236cb6
Fix bug.
btrotta Jul 23, 2020
692d4b5
Fix bug.
btrotta Jul 25, 2020
3629450
Speed up.
btrotta Jul 25, 2020
8893e60
Speed up.
btrotta Jul 25, 2020
35e3ac3
Simplify code.
btrotta Jul 25, 2020
1d34306
Remove unneeded code.
btrotta Jul 26, 2020
26876ad
Fix bug, add more tests.
btrotta Jul 26, 2020
09e2f34
Fix bug and add test.
btrotta Jul 26, 2020
90c1958
Only store numerical features
btrotta Jul 27, 2020
ef0e3dc
Fix bug and speed up using templates.
btrotta Jul 28, 2020
7d5f529
Speed up prediction.
btrotta Jul 28, 2020
7065e7b
Fix bug with regularisation
btrotta Jul 28, 2020
b06f825
Visual studio files.
btrotta Jul 29, 2020
f8b6ecd
Working version
btrotta Jul 30, 2020
b6f45aa
Only check nans if necessary
btrotta Jul 30, 2020
694b67e
Store coeff matrix as an array.
btrotta Jul 31, 2020
c1e680e
Align cache lines
btrotta Jul 31, 2020
158b984
Align cache lines
btrotta Jul 31, 2020
9fd4129
Preallocation coefficient calculation matrices
btrotta Jul 31, 2020
eb4241c
Small speedups
btrotta Jul 31, 2020
0ca31dd
Small speedup
btrotta Aug 1, 2020
d284083
Reverse cache alignment changes
btrotta Aug 1, 2020
6dc7c09
Change to dynamic schedule
btrotta Aug 1, 2020
5f168a6
Update docs.
btrotta Aug 1, 2020
882b397
Refactor so that linear tree learner is not a separate class.
btrotta Aug 3, 2020
9259f67
Add refit capability.
btrotta Aug 6, 2020
d32d72f
Speed up
btrotta Aug 7, 2020
75f6177
Small speedups.
btrotta Aug 9, 2020
a6d0047
Speed up add prediction to score.
btrotta Aug 9, 2020
fe487a1
Fix bug
btrotta Aug 9, 2020
f250c0c
Fix bug and speed up.
btrotta Aug 9, 2020
7dd7ac8
Speed up dataload.
btrotta Aug 9, 2020
94c6912
Speed up dataload
btrotta Aug 10, 2020
a02dfc3
Use vectors instead of pointers
btrotta Aug 10, 2020
919cc35
Fix bug
btrotta Aug 10, 2020
9ebf454
Add OMP exception handling.
btrotta Aug 10, 2020
1f51158
Merge
btrotta Aug 10, 2020
75846fc
Change return type of LGBM_BoosterGetLinear to bool
btrotta Aug 10, 2020
5a0dd32
Change return type of LGBM_BoosterGetLinear back to int, only paramet…
btrotta Aug 10, 2020
cf40fdb
Remove unused internal_parent_ property of tree
btrotta Aug 10, 2020
c92933c
Remove unused parameter to CreateTreeLearner
btrotta Aug 10, 2020
95cd2ff
Remove reference to LinearTreeLearner
btrotta Aug 10, 2020
9648da8
Minor style issues
btrotta Aug 10, 2020
03be117
Remove unneeded check
btrotta Aug 10, 2020
a097bc1
Reverse temporary testing change
btrotta Aug 10, 2020
dbab78b
Fix Visual Studio project files
btrotta Aug 10, 2020
c11786c
Restore LightGBM.vcxproj.filters
btrotta Aug 10, 2020
a8ea87a
Speed up
btrotta Aug 11, 2020
b8c0cdd
Speed up
btrotta Aug 11, 2020
eab9022
Simplify code
btrotta Aug 11, 2020
a48f4be
Update docs
btrotta Aug 11, 2020
d8d2717
Simplify code
btrotta Aug 11, 2020
d02d84d
Initialise storage space for max num threads
btrotta Aug 11, 2020
34b481d
Merge remote-tracking branch 'private/in-order' into linear-leaf
btrotta Aug 11, 2020
6ec8097
Move Eigen to include directory and delete unused files
btrotta Aug 12, 2020
a8f208d
Merge remote-tracking branch 'private/in-order' into linear-leaf
btrotta Aug 12, 2020
cb79a9d
Remove old files.
btrotta Aug 12, 2020
1e67f59
Fix so it compiles with mingw
btrotta Aug 12, 2020
360b707
Fix gpu tree learner
btrotta Aug 12, 2020
a9c0f96
Change AddPredictionToScore back to const
btrotta Aug 12, 2020
625db0b
Fix python lint error
btrotta Aug 12, 2020
9c75cda
Fix C++ lint errors
btrotta Aug 12, 2020
d950f21
Change eigen to a submodule
btrotta Aug 13, 2020
dd8fa52
Update comment
btrotta Aug 13, 2020
c9cd0a2
Add the eigen folder
btrotta Aug 13, 2020
8ef65db
Try to fix build issues with eigen
btrotta Aug 13, 2020
8c5d08e
Remove eigen files
btrotta Aug 14, 2020
5499875
Add eigen as submodule
btrotta Aug 14, 2020
b50b59c
Fix include paths
btrotta Aug 14, 2020
ea42ec9
Exclude eigen files from Python linter
btrotta Aug 14, 2020
d483ed4
Ignore eigen folders for pydocstyle
btrotta Aug 14, 2020
f5db131
Fix C++ linting errors
btrotta Aug 14, 2020
e740d9b
Fix docs
btrotta Aug 14, 2020
b4e5da8
Fix docs
btrotta Aug 14, 2020
5641fc7
Exclude eigen directories from doxygen
btrotta Aug 15, 2020
68a747c
Update manifest to include eigen
btrotta Aug 15, 2020
08d7816
Update build_r to include eigen files
btrotta Aug 15, 2020
4475197
Fix compiler warnings
btrotta Aug 16, 2020
9c5169b
Store raw feature data as float
btrotta Aug 16, 2020
2e00ef0
Use float for calculating linear coefficients
btrotta Aug 16, 2020
70188bd
Remove eigen directory from GLOB
btrotta Aug 17, 2020
3513bba
Don't compile linear model code when building R package
btrotta Aug 17, 2020
18ccb7f
Fix doxygen issue
btrotta Aug 18, 2020
d00ee09
Fix lint issue
btrotta Aug 18, 2020
4410c9c
Fix lint issue
btrotta Aug 18, 2020
3b2f4f9
Remove uneeded code
btrotta Aug 19, 2020
9c9d515
Restore delected lines
btrotta Sep 2, 2020
a9201e9
Restore delected lines
btrotta Sep 2, 2020
84603fe
Change return type of has_raw to bool
btrotta Sep 3, 2020
0ef2728
Update docs
btrotta Sep 3, 2020
548a28d
Rename some variables and functions for readability
btrotta Sep 3, 2020
4dd716c
Make tree_learner parameter const in AddScore
btrotta Sep 3, 2020
55a6c8f
Fix style issues
btrotta Sep 3, 2020
047fe3e
Pass vectors as const reference when setting tree properties
btrotta Sep 3, 2020
0c65173
Make temporary storage of serial_tree_learner mutable so we can make …
btrotta Sep 4, 2020
da3a52a
Remove get_raw_size, use num_numeric_features instead
btrotta Sep 4, 2020
f31f75e
Fix typo
btrotta Sep 5, 2020
0a70d6f
Make contains_nan_ and any_nan_ properties immutable again
btrotta Sep 5, 2020
71c87a6
Remove data_has_nan_ property of tree
btrotta Sep 5, 2020
d9d1dc2
Remove temporary test code
btrotta Sep 5, 2020
5062f1a
Make linear_tree a dataset param
btrotta Sep 5, 2020
73e4a25
Fix lint error
btrotta Sep 5, 2020
fb92d10
Make LinearTreeLearner a separate class
btrotta Sep 13, 2020
961ee82
Fix lint errors
btrotta Sep 13, 2020
59af8d5
Fix lint error
btrotta Sep 13, 2020
2d1556d
Add linear_tree_learner.o
btrotta Sep 13, 2020
4af6065
Merge
btrotta Oct 17, 2020
576404c
Simulate omp_get_max_threads if openmp is not available
btrotta Oct 17, 2020
fcbedb3
Update PushOneData to also store raw data.
btrotta Oct 22, 2020
5398642
Cast size to int
btrotta Oct 22, 2020
5474c89
Fix bug in ReshapeRaw
btrotta Oct 22, 2020
6120244
Speed up code with multithreading
btrotta Oct 22, 2020
8160b2e
Use OMP_NUM_THREADS
btrotta Oct 22, 2020
7ea85ce
Speed up with multithreading
btrotta Oct 22, 2020
4e6f172
Merge
btrotta Oct 22, 2020
089686b
Merge remote-tracking branch 'upstream/master' into linear-leaf
btrotta Oct 23, 2020
c07c3a8
Merge
btrotta Dec 22, 2020
ed1e870
Update to use ArrayToString
btrotta Dec 22, 2020
171bafc
Fix tests
btrotta Dec 22, 2020
9eb082f
Fix test
btrotta Dec 22, 2020
5775593
Fix bug introduced in merge
btrotta Dec 22, 2020
c8a4def
Minor updates
btrotta Dec 22, 2020
24c9dbe
Update docs
btrotta Dec 23, 2020
4 changes: 2 additions & 2 deletions .ci/test.sh
@@ -52,8 +52,8 @@ if [[ $TRAVIS == "true" ]] && [[ $TASK == "lint" ]]; then
"r-lintr>=2.0"
pip install --user cpplint
echo "Linting Python code"
pycodestyle --ignore=E501,W503 --exclude=./compute,./.nuget,./external_libs . || exit -1
pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^compute|external_libs|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1
pycodestyle --ignore=E501,W503 --exclude=./compute,./eigen,./.nuget,./external_libs . || exit -1
pydocstyle --convention=numpy --add-ignore=D105 --match-dir="^(?!^compute|^eigen|external_libs|test|example).*" --match="(?!^test_|setup).*\.py" . || exit -1
echo "Linting R code"
Rscript ${BUILD_DIRECTORY}/.ci/lint_r_code.R ${BUILD_DIRECTORY} || exit -1
echo "Linting C++ code"
3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,6 +1,9 @@
[submodule "include/boost/compute"]
path = compute
url = https://github.com/boostorg/compute
[submodule "eigen"]
path = eigen
url = https://gitlab.com/libeigen/eigen.git
[submodule "external_libs/fmt"]
path = external_libs/fmt
url = https://github.com/fmtlib/fmt.git
3 changes: 3 additions & 0 deletions CMakeLists.txt
@@ -90,6 +90,9 @@ if(USE_SWIG)
endif()
endif(USE_SWIG)

SET(EIGEN_DIR "${PROJECT_SOURCE_DIR}/eigen")
include_directories(${EIGEN_DIR})

if(__BUILD_FOR_R)
list(APPEND CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake/modules")
find_package(LibR REQUIRED)
1 change: 1 addition & 0 deletions R-package/src/Makevars.in
@@ -48,6 +48,7 @@ OBJECTS = \
treelearner/data_parallel_tree_learner.o \
treelearner/feature_parallel_tree_learner.o \
treelearner/gpu_tree_learner.o \
treelearner/linear_tree_learner.o \
treelearner/serial_tree_learner.o \
treelearner/tree_learner.o \
treelearner/voting_parallel_tree_learner.o \
1 change: 1 addition & 0 deletions R-package/src/Makevars.win.in
@@ -49,6 +49,7 @@ OBJECTS = \
treelearner/data_parallel_tree_learner.o \
treelearner/feature_parallel_tree_learner.o \
treelearner/gpu_tree_learner.o \
treelearner/linear_tree_learner.o \
treelearner/serial_tree_learner.o \
treelearner/tree_learner.o \
treelearner/voting_parallel_tree_learner.o \
24 changes: 24 additions & 0 deletions docs/Parameters.rst
@@ -117,6 +117,26 @@ Core Parameters

- **Note**: internally, LightGBM uses ``gbdt`` mode for the first ``1 / learning_rate`` iterations

- ``linear_tree`` :raw-html:`<a id="linear_tree" title="Permalink to this parameter" href="#linear_tree">&#x1F517;&#xFE0E;</a>`, default = ``false``, type = bool

- fit piecewise linear gradient boosting tree, only works with cpu and serial tree learner

- tree splits are chosen in the usual way, but the model at each leaf is linear instead of constant

- the linear model at each leaf includes all the numerical features in that leaf's branch

- categorical features are used for splits as normal but are not used in the linear models

- missing values must be encoded as ``np.nan`` (Python) or ``NA`` (cli), not ``0``

- it is recommended to rescale data before training so that features have similar mean and standard deviation

- not yet supported in R-package
Collaborator: what is blocking this from being implemented in R?

Collaborator: oh sorry, I see this in the PR description (missed it the first time):

"I have only implemented this for Python. I think to get it working for R it would require some changes to the data loading interface, but I'm not very familiar with R, so maybe someone else would like to take that on."

Could you please open an issue with tag r-package describing the changes you think are needed in R to support this feature? Then we can implement it later without blocking this PR.

Collaborator (Author): Done in #3319


- ``regression_l1`` objective is not supported with linear tree boosting

- setting ``linear_tree = True`` significantly increases the memory use of LightGBM
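The difference between constant and linear leaves described above can be illustrated with a small NumPy sketch (this is an illustration of the piecewise-linear-leaf idea, not LightGBM's implementation; all names are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

# One split at x = 1, as an ordinary tree might choose.
left, right = x < 1.0, x >= 1.0

def constant_leaf_sse(mask):
    # Constant-leaf model: predict the mean of the leaf.
    resid = y[mask] - y[mask].mean()
    return float(resid @ resid)

def linear_leaf_sse(mask):
    # Linear-leaf model: least-squares fit of y on [1, x] within the leaf.
    A = np.column_stack([np.ones(mask.sum()), x[mask]])
    coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
    resid = y[mask] - A @ coef
    return float(resid @ resid)

const_err = constant_leaf_sse(left) + constant_leaf_sse(right)
linear_err = linear_leaf_sse(left) + linear_leaf_sse(right)
assert linear_err < const_err  # linear leaves capture the within-leaf trend
```

With the same two splits, the linear leaves fit the within-leaf slope that constant leaves can only approximate with more splits.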

- ``data`` :raw-html:`<a id="data" title="Permalink to this parameter" href="#data">&#x1F517;&#xFE0E;</a>`, default = ``""``, type = string, aliases: ``train``, ``train_data``, ``train_data_file``, ``data_filename``

- path of training data, LightGBM will train from this data
@@ -384,6 +404,10 @@ Learning Control Parameters

- L2 regularization

- ``linear_lambda`` :raw-html:`<a id="linear_lambda" title="Permalink to this parameter" href="#linear_lambda">&#x1F517;&#xFE0E;</a>`, default = ``0.0``, type = double, constraints: ``linear_lambda >= 0.0``

- Linear tree regularisation, the parameter `lambda` in Eq 3 of <https://arxiv.org/pdf/1802.05640.pdf>
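Assuming the standard ridge-regularised least-squares form used in the cited paper, the role of ``linear_lambda`` can be sketched as follows (a NumPy illustration under that assumption, not LightGBM's code):

```python
import numpy as np

def leaf_coefficients(X, y, linear_lambda):
    # Ridge-regularised least squares for one leaf:
    #   beta = (X^T X + lambda * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + linear_lambda * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta0 = leaf_coefficients(X, y, 0.0)
beta1 = leaf_coefficients(X, y, 10.0)
assert np.linalg.norm(beta1) < np.linalg.norm(beta0)  # larger lambda shrinks coefficients
```

A larger ``linear_lambda`` pulls the per-leaf coefficients toward zero, which guards against overfitting in leaves with few samples.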

- ``min_gain_to_split`` :raw-html:`<a id="min_gain_to_split" title="Permalink to this parameter" href="#min_gain_to_split">&#x1F517;&#xFE0E;</a>`, default = ``0.0``, type = double, aliases: ``min_split_gain``, constraints: ``min_gain_to_split >= 0.0``

- the minimal gain to perform split
1 change: 1 addition & 0 deletions docs/conf.py
@@ -202,6 +202,7 @@ def generate_doxygen_xml(app):
"SKIP_FUNCTION_MACROS=NO",
"SORT_BRIEF_DOCS=YES",
"WARN_AS_ERROR=YES",
"EXCLUDE_PATTERNS=*/eigen/*"
]
doxygen_input = '\n'.join(doxygen_args)
doxygen_input = bytes(doxygen_input, "utf-8")
1 change: 1 addition & 0 deletions eigen
Submodule eigen added at 8ba1b0
113 changes: 113 additions & 0 deletions examples/binary_classification/train_linear.conf
@@ -0,0 +1,113 @@
# task type, support train and predict
task = train

# boosting type, support gbdt for now, alias: boosting, boost
boosting_type = gbdt

# application type, support following application
# regression , regression task
# binary , binary classification task
# lambdarank , lambdarank task
# alias: application, app
objective = binary

linear_tree = true

# eval metrics, support multi metric, delimite by ',' , support following metrics
# l1
# l2 , default metric for regression
# ndcg , default metric for lambdarank
# auc
# binary_logloss , default metric for binary
# binary_error
metric = binary_logloss,auc

# frequence for metric output
metric_freq = 1

# true if need output metric for training data, alias: tranining_metric, train_metric
is_training_metric = true

# number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy.
max_bin = 255

# training data
# if exsting weight file, should name to "binary.train.weight"
# alias: train_data, train
data = binary.train

# validation data, support multi validation data, separated by ','
# if exsting weight file, should name to "binary.test.weight"
# alias: valid, test, test_data,
valid_data = binary.test

# number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds
num_trees = 100

# shrinkage rate , alias: shrinkage_rate
learning_rate = 0.1

# number of leaves for one tree, alias: num_leaf
num_leaves = 63

# type of tree learner, support following types:
# serial , single machine version
# feature , use feature parallel to train
# data , use data parallel to train
# voting , use voting based parallel to train
# alias: tree
tree_learner = serial

# number of threads for multi-threading. One thread will use one CPU, defalut is setted to #cpu.
# num_threads = 8

# feature sub-sample, will random select 80% feature to train on each iteration
# alias: sub_feature
feature_fraction = 0.8

# Support bagging (data sub-sample), will perform bagging every 5 iterations
bagging_freq = 5

# Bagging farction, will random select 80% data on bagging
# alias: sub_row
bagging_fraction = 0.8

# minimal number data for one leaf, use this to deal with over-fit
# alias : min_data_per_leaf, min_data
min_data_in_leaf = 50

# minimal sum hessians for one leaf, use this to deal with over-fit
min_sum_hessian_in_leaf = 5.0

# save memory and faster speed for sparse feature, alias: is_sparse
is_enable_sparse = true

# when data is bigger than memory size, set this to true. otherwise set false will have faster speed
# alias: two_round_loading, two_round
use_two_round_loading = false

# true if need to save data to binary file and application will auto load data from binary file next time
# alias: is_save_binary, save_binary
is_save_binary_file = false

# output model file
output_model = LightGBM_model.txt

# support continuous train from trained gbdt model
# input_model= trained_model.txt

# output prediction file for predict task
# output_result= prediction.txt


# number of machines in parallel training, alias: num_machine
num_machines = 1

# local listening port in parallel training, alias: local_port
local_listen_port = 12400

# machines list file for parallel training, alias: mlist
machine_list_file = mlist.txt

# force splits
# forced_splits = forced_splits.json
2 changes: 2 additions & 0 deletions include/LightGBM/boosting.h
@@ -312,6 +312,8 @@ class LIGHTGBM_EXPORT Boosting {
* \return The boosting object
*/
static Boosting* CreateBoosting(const std::string& type, const char* filename);

virtual bool IsLinear() const { return false; }
Collaborator: can you explain why this interface is needed?

Collaborator (Author): It's used in the refit method of the python interface. When we construct the new dataset for refitting, we need to know if the model is linear, so that we can save the full feature data not just the binned data.
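The reason refit needs the raw (unbinned) feature values can be sketched in NumPy (a conceptual illustration with made-up names, not LightGBM's code): refit keeps the split structure fixed and re-estimates each leaf's linear model on the new targets, which requires the actual feature values, not just bin indices.

```python
import numpy as np

def refit_leaf_models(leaf_idx, x_raw, y_new, num_leaves):
    """Keep the tree's leaf assignments fixed; re-fit each leaf's linear model."""
    coefs = []
    for leaf in range(num_leaves):
        mask = leaf_idx == leaf
        A = np.column_stack([np.ones(mask.sum()), x_raw[mask]])
        beta, *_ = np.linalg.lstsq(A, y_new[mask], rcond=None)
        coefs.append(beta)  # (intercept, slope) for this leaf
    return coefs

rng = np.random.default_rng(2)
x_raw = rng.normal(size=100)
leaf_idx = (x_raw > 0).astype(int)               # fixed split structure: 2 leaves
y_new = np.where(leaf_idx == 0, -1.0, 2.0) * x_raw  # new targets to refit against

coefs = refit_leaf_models(leaf_idx, x_raw, y_new, num_leaves=2)
assert len(coefs) == 2  # one (intercept, slope) pair per leaf
```

The binned representation used for split finding is lossy, so without the stored raw data these per-leaf least-squares fits could not be reproduced.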

};

class GBDTBase : public Boosting {
8 changes: 8 additions & 0 deletions include/LightGBM/c_api.h
@@ -389,6 +389,14 @@ LIGHTGBM_C_EXPORT int LGBM_DatasetGetNumData(DatasetHandle handle,
LIGHTGBM_C_EXPORT int LGBM_DatasetGetNumFeature(DatasetHandle handle,
int* out);

/*!
* \brief Get boolean representing whether booster is fitting linear trees.
* \param handle Handle of dataset
* \param[out] out The address to hold linear indicator
* \return 0 when succeed, -1 when failure happens
*/
LIGHTGBM_C_EXPORT int LGBM_BoosterGetLinear(BoosterHandle handle, bool* out);
Collaborator: the same as above


/*!
* \brief Add features from ``source`` to ``target``.
* \param target The handle of the dataset to add features to
15 changes: 15 additions & 0 deletions include/LightGBM/config.h
@@ -148,6 +148,17 @@ struct Config {
// descl2 = **Note**: internally, LightGBM uses ``gbdt`` mode for the first ``1 / learning_rate`` iterations
std::string boosting = "gbdt";

// desc = fit piecewise linear gradient boosting tree, only works with cpu and serial tree learner
// descl2 = tree splits are chosen in the usual way, but the model at each leaf is linear instead of constant
// descl2 = the linear model at each leaf includes all the numerical features in that leaf's branch
// descl2 = categorical features are used for splits as normal but are not used in the linear models
// descl2 = missing values must be encoded as ``np.nan`` (Python) or ``NA`` (cli), not ``0``
// descl2 = it is recommended to rescale data before training so that features have similar mean and standard deviation
// descl2 = not yet supported in R-package
// descl2 = ``regression_l1`` objective is not supported with linear tree boosting
Collaborator: It seems the linear tree will increase the memory usage. I think we should note this in doc

// descl2 = setting ``linear_tree = True`` significantly increases the memory use of LightGBM
bool linear_tree = false;

// alias = train, train_data, train_data_file, data_filename
// desc = path of training data, LightGBM will train from this data
// desc = **Note**: can be used only in CLI version
@@ -366,6 +377,10 @@
// desc = L2 regularization
double lambda_l2 = 0.0;

// check = >=0.0
// desc = Linear tree regularisation, the parameter `lambda` in Eq 3 of <https://arxiv.org/pdf/1802.05640.pdf>
double linear_lambda = 0.0;

// alias = min_split_gain
// check = >=0.0
// desc = the minimal gain to perform split
53 changes: 52 additions & 1 deletion include/LightGBM/dataset.h
@@ -337,6 +337,12 @@ class Dataset {
const int group = feature2group_[feature_idx];
const int sub_feature = feature2subfeature_[feature_idx];
feature_groups_[group]->PushData(tid, sub_feature, row_idx, feature_values[i]);
if (has_raw_) {
int feat_ind = numeric_feature_map_[feature_idx];
if (feat_ind >= 0) {
raw_data_[feat_ind][row_idx] = feature_values[i];
Collaborator: why not use row-first raw data? I guess linear model is faster with row-first data?

Collaborator (Author): See previous comment: #3299 (comment)

Collaborator (@guolinke, Dec 23, 2020): got it, sorry for missing it.

}
}
}
}
}
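The diff above stores raw values column-first (``raw_data_[feature][row]``). A small NumPy sketch of why that layout suits this feature (an illustration only, not LightGBM's code): building a leaf's design matrix gathers whole feature columns, and with column-first storage each gather reads one contiguous array.

```python
import numpy as np

rng = np.random.default_rng(3)
n_rows, n_feats = 6, 3
row_major = rng.normal(size=(n_rows, n_feats))                 # raw_data[row][feature]
col_major = [row_major[:, j].copy() for j in range(n_feats)]   # raw_data_[feature][row]

# Building the design matrix for one leaf gathers whole feature columns;
# with column-first storage each gather is a slice of one contiguous array.
leaf_rows = np.array([0, 2, 5])
design = np.column_stack([col_major[j][leaf_rows] for j in range(n_feats)])
assert np.array_equal(design, row_major[leaf_rows])  # both layouts hold the same data
```

Both layouts represent the same data; the choice only affects memory locality when the per-leaf coefficient solves gather many rows of each feature.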
@@ -352,13 +358,25 @@
const int group = feature2group_[feature_idx];
const int sub_feature = feature2subfeature_[feature_idx];
feature_groups_[group]->PushData(tid, sub_feature, row_idx, inner_data.second);
if (has_raw_) {
int feat_ind = numeric_feature_map_[feature_idx];
if (feat_ind >= 0) {
raw_data_[feat_ind][row_idx] = inner_data.second;
}
}
}
}
FinishOneRow(tid, row_idx, is_feature_added);
}

inline void PushOneData(int tid, data_size_t row_idx, int group, int sub_feature, double value) {
inline void PushOneData(int tid, data_size_t row_idx, int group, int feature_idx, int sub_feature, double value) {
feature_groups_[group]->PushData(tid, sub_feature, row_idx, value);
if (has_raw_) {
int feat_ind = numeric_feature_map_[feature_idx];
if (feat_ind >= 0) {
raw_data_[feat_ind][row_idx] = value;
}
}
}

inline int RealFeatureIndex(int fidx) const {
@@ -569,6 +587,9 @@ class Dataset {
/*! \brief Get Number of used features */
inline int num_features() const { return num_features_; }

/*! \brief Get number of numeric features */
inline int num_numeric_features() const { return num_numeric_features_; }

/*! \brief Get Number of feature groups */
inline int num_feature_groups() const { return num_groups_;}

@@ -632,6 +653,31 @@

void AddFeaturesFrom(Dataset* other);

/*! \brief Get has_raw_ */
inline bool has_raw() const { return has_raw_; }

/*! \brief Set has_raw_ */
inline void SetHasRaw(bool has_raw) { has_raw_ = has_raw; }

/*! \brief Resize raw_data_ */
inline void ResizeRaw(int num_rows) {
if (static_cast<int>(raw_data_.size()) > num_numeric_features_) {
raw_data_.resize(num_numeric_features_);
}
for (size_t i = 0; i < raw_data_.size(); ++i) {
raw_data_[i].resize(num_rows);
}
int curr_size = static_cast<int>(raw_data_.size());
for (int i = curr_size; i < num_numeric_features_; ++i) {
raw_data_.push_back(std::vector<float>(num_rows, 0));
}
}

/*! \brief Get pointer to raw_data_ feature */
inline const float* raw_index(int feat_ind) const {
return raw_data_[numeric_feature_map_[feat_ind]].data();
}

private:
std::string data_filename_;
/*! \brief Store used features */
@@ -668,6 +714,11 @@
bool use_missing_;
bool zero_as_missing_;
std::vector<int> feature_need_push_zeros_;
std::vector<std::vector<float>> raw_data_;
bool has_raw_;
/*! map feature (inner index) to its index in the list of numeric (non-categorical) features */
std::vector<int> numeric_feature_map_;
int num_numeric_features_;
};

} // namespace LightGBM
2 changes: 2 additions & 0 deletions include/LightGBM/dataset_loader.h
@@ -82,6 +82,8 @@ class DatasetLoader {
std::vector<std::string> feature_names_;
/*! \brief Mapper from real feature index to used index*/
std::unordered_set<int> categorical_features_;
/*! \brief Whether to store raw feature values */
bool store_raw_;
};

} // namespace LightGBM