From 6bd1725c4273395638e6734e6c7399aa42c971f0 Mon Sep 17 00:00:00 2001 From: motefly Date: Tue, 12 Mar 2019 07:58:14 +0000 Subject: [PATCH 1/2] Fix the number and download link of used CRITEO samples (1M -> 100K) in the description --- .../00_quick_start/lightgbm_tinycriteo.ipynb | 62 +++++++++---------- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/notebooks/00_quick_start/lightgbm_tinycriteo.ipynb b/notebooks/00_quick_start/lightgbm_tinycriteo.ipynb index b1add332fa..ed2ca7a328 100644 --- a/notebooks/00_quick_start/lightgbm_tinycriteo.ipynb +++ b/notebooks/00_quick_start/lightgbm_tinycriteo.ipynb @@ -115,8 +115,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Data Preparation\n", - "Here we use CSV format as the example data input. Our example data is a sample (about 1 million samples) from Criteo dataset [2]. The Criteo dataset is a well-known industry benchmarking dataset for developing CTR prediction models, and it's frequently adopted as evaluation dataset by research papers. The original dataset is too large for a lightweight demo, so we sample a small portion from it as a demo dataset.
\n", + "# Data Preparation\n", + "Here we use CSV format as the example data input. Our example data is a sample (about 100 thousand samples) from Criteo dataset [2]. The Criteo dataset is a well-known industry benchmarking dataset for developing CTR prediction models, and it's frequently adopted as evaluation dataset by research papers. The original dataset is too large for a lightweight demo, so we sample a small portion from it as a demo dataset.
\n", "Specifically, there are 39 columns of features in Criteo, where 13 columns are numerical features (I1-I13) and the other 26 columns are categorical features (C1-C26)." ] }, @@ -137,7 +137,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Extracting component files from /tmp/tmp_it899gq/dac_sample.tar.gz.\n" + "Extracting component files from /tmp/tmpz0rodvbn/dac_sample.tar.gz.\n" ] }, { @@ -552,33 +552,33 @@ "name": "stderr", "output_type": "stream", "text": [ - "2019-03-08 04:32:23,741 [INFO] Filtering and fillna features\n", - "100%|██████████████████████████████████████████████████████████████| 26/26 [00:01<00:00, 14.19it/s]\n", - "100%|█████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 664.07it/s]\n", - "2019-03-08 04:32:25,690 [INFO] Ordinal encoding cate features\n", - "2019-03-08 04:32:27,060 [INFO] Target encoding cate features\n", - "100%|██████████████████████████████████████████████████████████████| 26/26 [00:03<00:00, 6.66it/s]\n", - "2019-03-08 04:32:30,974 [INFO] Start manual binary encoding\n", - "100%|██████████████████████████████████████████████████████████████| 65/65 [00:03<00:00, 16.17it/s]\n", - "100%|██████████████████████████████████████████████████████████████| 26/26 [00:02<00:00, 8.43it/s]\n", - "2019-03-08 04:32:37,119 [INFO] Filtering and fillna features\n", - "100%|█████████████████████████████████████████████████████████████| 26/26 [00:00<00:00, 166.30it/s]\n", - "100%|████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 1813.00it/s]\n", - "2019-03-08 04:32:37,286 [INFO] Ordinal encoding cate features\n", - "2019-03-08 04:32:37,451 [INFO] Target encoding cate features\n", - "100%|██████████████████████████████████████████████████████████████| 26/26 [00:00<00:00, 52.52it/s]\n", - "2019-03-08 04:32:37,948 [INFO] Start manual binary encoding\n", - "100%|██████████████████████████████████████████████████████████████| 65/65 [00:02<00:00, 26.73it/s]\n", - 
"100%|██████████████████████████████████████████████████████████████| 26/26 [00:01<00:00, 22.91it/s]\n", - "2019-03-08 04:32:41,597 [INFO] Filtering and fillna features\n", - "100%|█████████████████████████████████████████████████████████████| 26/26 [00:00<00:00, 168.82it/s]\n", - "100%|████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 2312.77it/s]\n", - "2019-03-08 04:32:41,761 [INFO] Ordinal encoding cate features\n", - "2019-03-08 04:32:41,922 [INFO] Target encoding cate features\n", - "100%|██████████████████████████████████████████████████████████████| 26/26 [00:00<00:00, 52.25it/s]\n", - "2019-03-08 04:32:42,422 [INFO] Start manual binary encoding\n", - "100%|██████████████████████████████████████████████████████████████| 65/65 [00:02<00:00, 26.44it/s]\n", - "100%|██████████████████████████████████████████████████████████████| 26/26 [00:01<00:00, 23.01it/s]" + "2019-03-12 07:55:56,415 [INFO] Filtering and fillna features\n", + "100%|██████████| 26/26 [00:02<00:00, 12.67it/s]\n", + "100%|██████████| 13/13 [00:00<00:00, 604.84it/s]\n", + "2019-03-12 07:55:58,494 [INFO] Ordinal encoding cate features\n", + "2019-03-12 07:55:59,943 [INFO] Target encoding cate features\n", + "100%|██████████| 26/26 [00:03<00:00, 6.48it/s]\n", + "2019-03-12 07:56:03,878 [INFO] Start manual binary encoding\n", + "100%|██████████| 65/65 [00:03<00:00, 16.50it/s]\n", + "100%|██████████| 26/26 [00:02<00:00, 7.86it/s]\n", + "2019-03-12 07:56:10,790 [INFO] Filtering and fillna features\n", + "100%|██████████| 26/26 [00:00<00:00, 167.89it/s]\n", + "100%|██████████| 13/13 [00:00<00:00, 1874.97it/s]\n", + "2019-03-12 07:56:10,956 [INFO] Ordinal encoding cate features\n", + "2019-03-12 07:56:11,120 [INFO] Target encoding cate features\n", + "100%|██████████| 26/26 [00:00<00:00, 52.55it/s]\n", + "2019-03-12 07:56:11,618 [INFO] Start manual binary encoding\n", + "100%|██████████| 65/65 [00:03<00:00, 21.43it/s]\n", + "100%|██████████| 26/26 [00:01<00:00, 
18.60it/s]\n", + "2019-03-12 07:56:16,094 [INFO] Filtering and fillna features\n", + "100%|██████████| 26/26 [00:00<00:00, 151.00it/s]\n", + "100%|██████████| 13/13 [00:00<00:00, 2127.93it/s]\n", + "2019-03-12 07:56:16,288 [INFO] Ordinal encoding cate features\n", + "2019-03-12 07:56:16,453 [INFO] Target encoding cate features\n", + "100%|██████████| 26/26 [00:00<00:00, 52.20it/s]\n", + "2019-03-12 07:56:16,953 [INFO] Start manual binary encoding\n", + "100%|██████████| 65/65 [00:03<00:00, 21.40it/s]\n", + "100%|██████████| 26/26 [00:01<00:00, 18.42it/s]" ] }, { @@ -777,7 +777,7 @@ "source": [ "## Reference\n", "\\[1\\] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3146–3154.
\n", - "\\[2\\] The Criteo datasets: http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/ .
\n", + "\\[2\\] The Criteo datasets: http://labs.criteo.com/wp-content/uploads/2015/04/dac_sample.tar.gz .
\n", "\\[3\\] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018).
\n", "\\[4\\] Scikit-learn. 2018. categorical_encoding. https://github.com/scikit-learn-contrib/categorical-encoding .
\n", "\\[5\\] The parameters of LightGBM: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst ." From b58d023c4d9184496ade20a8dfe18434c81f6fa8 Mon Sep 17 00:00:00 2001 From: motefly Date: Tue, 12 Mar 2019 08:07:00 +0000 Subject: [PATCH 2/2] Fix the number and download link of used CRITEO samples (1M -> 100K) in the description --- notebooks/00_quick_start/lightgbm_tinycriteo.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/00_quick_start/lightgbm_tinycriteo.ipynb b/notebooks/00_quick_start/lightgbm_tinycriteo.ipynb index ed2ca7a328..fa85a29490 100644 --- a/notebooks/00_quick_start/lightgbm_tinycriteo.ipynb +++ b/notebooks/00_quick_start/lightgbm_tinycriteo.ipynb @@ -115,7 +115,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Data Preparation\n", + "## Data Preparation\n", "Here we use CSV format as the example data input. Our example data is a sample (about 100 thousand samples) from Criteo dataset [2]. The Criteo dataset is a well-known industry benchmarking dataset for developing CTR prediction models, and it's frequently adopted as evaluation dataset by research papers. The original dataset is too large for a lightweight demo, so we sample a small portion from it as a demo dataset.
\n", "Specifically, there are 39 columns of features in Criteo, where 13 columns are numerical features (I1-I13) and the other 26 columns are categorical features (C1-C26)." ]
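
Review note: the Data Preparation cell touched by this series describes 39 feature columns, 13 numerical (I1-I13) and 26 categorical (C1-C26). A minimal sketch of that layout follows. It is not code from the notebook. It assumes a leading `Label` column and tab-separated fields; the helper name `read_criteo_rows` and the sample values are hypothetical.

```python
import csv
import io

# Column layout described in the notebook text: 13 numerical + 26 categorical features.
NUMERIC_COLS = [f"I{i}" for i in range(1, 14)]
CATEGORICAL_COLS = [f"C{i}" for i in range(1, 27)]
# Assumption: a leading "Label" column precedes the 39 feature columns.
ALL_COLS = ["Label"] + NUMERIC_COLS + CATEGORICAL_COLS


def read_criteo_rows(handle, delimiter="\t"):
    """Yield one dict per row, keyed by the column names above (hypothetical helper)."""
    reader = csv.reader(handle, delimiter=delimiter)
    for fields in reader:
        if len(fields) != len(ALL_COLS):
            raise ValueError(f"expected {len(ALL_COLS)} fields, got {len(fields)}")
        yield dict(zip(ALL_COLS, fields))


# Synthetic row standing in for the real sample file (values are made up).
fake_row = "\t".join(["0"] + ["1"] * 13 + ["a1b2c3d4"] * 26)
rows = list(read_criteo_rows(io.StringIO(fake_row + "\n")))
print(len(ALL_COLS), rows[0]["I13"], rows[0]["C26"])
```

Keeping the numerical and categorical names in separate lists makes it easy to pass only the categorical columns to an encoding step later on.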