
Unable to reproduce results from training with xgboost #7631

Closed
zahs123 opened this issue Feb 4, 2022 · 6 comments

zahs123 commented Feb 4, 2022

Hi,

I run the steps below to train an xgboost classifier.

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from HyperclassifierSearch import HyperclassifierSearch

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, stratify=y_train, test_size=0.1, random_state=42)

cv = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

params = {
    'xgb': {'clf__n_estimators': [100, 200, 400], 'clf__max_depth': [3, 5, 7],
            'clf__learning_rate': [0.01, 0.1], 'clf__min_child_weight': [3, 10]}
}
model = XGBClassifier(objective='binary:logistic', n_jobs=15, use_label_encoder=False, random_state=42)
sel = SelectKBest(k='all')
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0))])

preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])

pipe = Pipeline(steps=[('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', sel), ('clf', model)])

search = HyperclassifierSearch(pipe, params)
best_model = search.train_model(X_train, y_train, cv=cv, scoring='accuracy')

Each time I run the above, I get a different set of best params after the grid search, and slightly different accuracies and predictions. How is this happening when I set random_state in each of the following places:

  1. the data splits
  2. my cross-validation function
  3. my XGBClassifier model

For SelectKBest I also keep 'all' features for now, but I am confused as to how I got different results on each run. FYI, HyperclassifierSearch is just a wrapper around GridSearchCV.

Any ideas on why this could be happening based on the above?

I have tried setting tree_method='exact' and I still get different results. I have also tried setting 'seed'. Why does this still exist when it was deprecated?
How can I get reproducible results?
XGBoost version: 1.5.0.
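One way to narrow this down is to isolate the estimator from the rest of the pipeline: fit the same model twice on byte-identical arrays and compare the outputs. The sketch below uses sklearn's DecisionTreeClassifier as a stand-in so it runs anywhere; swapping in XGBClassifier with the same parameters would test the setup from this issue.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Fixed synthetic data so both fits see byte-identical inputs.
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = rng.randint(0, 2, size=200)

def fit_and_predict(seed):
    # Stand-in estimator; substitute XGBClassifier(random_state=seed, ...)
    # to check the actual model from this issue.
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X, y)
    return model.predict(X)

p1 = fit_and_predict(42)
p2 = fit_and_predict(42)
print(np.array_equal(p1, p2))  # True => this stage is deterministic
```

If the two prediction arrays match, the non-determinism lives elsewhere (data loading, column order, the search wrapper), not in the estimator.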

trivialfis (Member) commented Feb 4, 2022

Are you sure that it's xgboost that's not being deterministic? From the snippet, you are using a mix of libraries.

Any ideas on why this could be happening based on above?

Could you please provide a reproducible example that we can run?

Why does this still exist when it was deprecated?

XGBoost has the seed parameter that's used across many interfaces, including Python, R, Java, and Scala. The sklearn interface accepts **kwargs, which are passed down to xgboost.
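The pass-through pattern described above can be sketched as follows. This class is purely illustrative (it is not xgboost's actual wrapper code): named constructor arguments become sklearn-style parameters, and anything extra, such as seed, is forwarded to the underlying booster configuration unchanged.

```python
# Illustrative sketch of a **kwargs pass-through wrapper, NOT xgboost's
# real implementation. Extra keyword arguments are stored and forwarded
# to the native booster parameters.
class WrapperSketch:
    def __init__(self, random_state=None, **kwargs):
        self.random_state = random_state
        self.kwargs = kwargs  # e.g. {'seed': 42} is carried through as-is

    def booster_params(self):
        params = dict(self.kwargs)
        # A hypothetical rule: random_state is translated to seed only
        # when seed was not given explicitly.
        if self.random_state is not None and 'seed' not in params:
            params['seed'] = self.random_state
        return params

m1 = WrapperSketch(random_state=42)
print(m1.booster_params())  # {'seed': 42}
m2 = WrapperSketch(seed=7)
print(m2.booster_params())  # {'seed': 7}
```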

zahs123 (Author) commented Feb 4, 2022

What is seed used for, though? I have set both seed and random_state, but I am wondering what the difference is.

trivialfis (Member) commented:

random_state can be either an int or a np.random.RandomState:

if isinstance(params['random_state'], np.random.RandomState):

seed must be an int. Inside XGBoost (the C++ libxgboost.so) they are exactly the same.

trivialfis (Member) commented:
Since you are not using any sampling, it's not relevant to the non-deterministic issue here.

trivialfis (Member) commented:
It would be great if you could try our nightly build: https://xgboost.readthedocs.io/en/stable/install.html#id1. We avoided some non-deterministic behaviour in evaluation caused by floating-point summation: #7303

zahs123 (Author) commented Feb 7, 2022

Hi, thanks for this. I realised the problem isn't with xgboost but with how my data is being read in a random order.
Will close this.
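For anyone hitting the same root cause: if the data source returns rows in a nondeterministic order, every downstream random_state is seeded against different inputs. A minimal hedged fix is to impose a stable row order before splitting; the 'id' column below is an assumed unique key, substitute whatever stable key your data has.

```python
import pandas as pd

# Sketch: sort by a stable, unique key so the row order is identical on
# every run, making train_test_split(..., random_state=42) reproducible.
# The 'id' column is an assumed unique key for illustration.
df = pd.DataFrame({'id': [3, 1, 2], 'value': [30.0, 10.0, 20.0]})
df = df.sort_values('id', kind='stable').reset_index(drop=True)
print(df['id'].tolist())  # [1, 2, 3]
```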

@zahs123 zahs123 closed this as completed Feb 7, 2022