Add sorting by timestamp before the fit in catboost models #1337

Merged
merged 2 commits into from
Jul 31, 2023

@Mr-Geekman Mr-Geekman commented Jul 27, 2023

Before submitting (must do checklist)

  • Did you read the contribution guide?
  • Did you update the docs? We use Numpy format for all the methods and classes.
  • Did you write any new necessary tests?
  • Did you update the CHANGELOG?

Proposed Changes

Add sorting by timestamp before the fit in catboost models.

Closing issues

Closes #792.

@Mr-Geekman Mr-Geekman self-assigned this Jul 27, 2023
@Mr-Geekman
Contributor Author

I checked that sorting doesn't affect the result of predict, so it is done only in fit.
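
A minimal sketch of the idea, not the actual code from etna/models/catboost.py: with has_time=True CatBoost uses the order of rows in the input data instead of a random permutation, so the training rows should be in chronological order before fit. The row layout and the helper name `sort_rows_by_timestamp` are assumptions made for illustration, and the sketch uses only the standard library.

```python
from datetime import date
from operator import itemgetter

# Hypothetical flattened training rows in segment-major order:
# (timestamp, segment, lag_feature, target)
rows = [
    (date(2023, 1, 1), "segment_a", 10.0, 11.0),
    (date(2023, 1, 2), "segment_a", 11.0, 12.0),
    (date(2023, 1, 1), "segment_b", 20.0, 21.0),
    (date(2023, 1, 2), "segment_b", 21.0, 22.0),
]


def sort_rows_by_timestamp(rows):
    """Return rows ordered by timestamp; ties keep their original order."""
    # sorted() is stable, so segments keep their original relative
    # order within each timestamp.
    return sorted(rows, key=itemgetter(0))


sorted_rows = sort_rows_by_timestamp(rows)
timestamps = [row[0] for row in sorted_rows]
segments = [row[1] for row in sorted_rows]
```

After sorting, rows from all segments are interleaved by date, which is the order a time-aware fit expects. A tree model scores each row independently of its position, which is consistent with the observation that sorting is not needed in predict.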

@Mr-Geekman
Contributor Author

Backtest results with and without sorting in fit:

default:

  • SMAPE: 8.113868
  • MAE: 33.117068

default + has_time=True:

  • SMAPE: 8.445682
  • MAE: 33.865089

sort(train):

  • SMAPE: 8.096917
  • MAE: 33.602366

sort(train) + has_time=True:

  • SMAPE: 8.318793
  • MAE: 34.054797

Script:

import pandas as pd

from etna.models import CatBoostMultiSegmentModel
from etna.datasets import TSDataset
from etna.transforms import LagTransform, SegmentEncoderTransform, DateFlagsTransform
from etna.pipeline import Pipeline
from etna.metrics import SMAPE, MAE

HORIZON = 14


def main():
    df = pd.read_csv("examples/data/example_dataset.csv")
    df_wide = TSDataset.to_dataset(df)
    ts = TSDataset(df=df_wide, freq="D")

    # has_time=True makes CatBoost respect the row order; omit it for the runs without has_time
    model = CatBoostMultiSegmentModel(has_time=True)
    transforms = [
        LagTransform(in_column="target", lags=list(range(HORIZON, 50)), out_column="lags"),
        SegmentEncoderTransform(),
        DateFlagsTransform(),
    ]
    pipeline = Pipeline(model=model, transforms=transforms, horizon=HORIZON)

    metrics, _, _ = pipeline.backtest(ts=ts, metrics=[SMAPE(), MAE()], n_folds=5)

    print(metrics.mean())


if __name__ == "__main__":
    main()
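
The four sets of metrics reported above can be compared against the default run with a short calculation. This is only an illustrative sketch: the numbers are copied from this comment, while the dictionary layout and the helper name `relative_change` are assumptions. Lower is better for both SMAPE and MAE.

```python
# Mean backtest metrics copied from the results listed in this comment.
results = {
    "default": {"SMAPE": 8.113868, "MAE": 33.117068},
    "default + has_time=True": {"SMAPE": 8.445682, "MAE": 33.865089},
    "sort(train)": {"SMAPE": 8.096917, "MAE": 33.602366},
    "sort(train) + has_time=True": {"SMAPE": 8.318793, "MAE": 34.054797},
}

baseline = results["default"]


def relative_change(value, reference):
    """Relative change of value against reference, in percent."""
    return 100.0 * (value - reference) / reference


# Percent change of every metric against the default configuration.
changes = {
    name: {
        metric: relative_change(value, baseline[metric])
        for metric, value in metrics.items()
    }
    for name, metrics in results.items()
}

for name, metric_changes in changes.items():
    formatted = {metric: f"{delta:+.2f}%" for metric, delta in metric_changes.items()}
    print(name, formatted)
```

On these numbers, sorting alone lowers SMAPE slightly and raises MAE slightly, while both has_time=True runs score worse than the default on both metrics.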

@codecov-commenter

Codecov Report

Merging #1337 (b7af7d3) into master (aac0fe1) will increase coverage by 0.13%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1337      +/-   ##
==========================================
+ Coverage   88.95%   89.09%   +0.13%     
==========================================
  Files         204      204              
  Lines       12641    12636       -5     
==========================================
+ Hits        11245    11258      +13     
+ Misses       1396     1378      -18     
Files Changed Coverage Δ
etna/models/catboost.py 100.00% <100.00%> (ø)

... and 8 files with indirect coverage changes


Collaborator

@alex-hse-repository alex-hse-repository left a comment

👍

@alex-hse-repository alex-hse-repository merged commit e4c121c into master Jul 31, 2023
12 checks passed

Successfully merging this pull request may close these issues.

Fix catboost to work with has_time
3 participants