
Unable to reproduce results from training with xgboost #7631

Closed
zahs123 opened this issue Feb 4, 2022 · 6 comments

zahs123 commented Feb 4, 2022

Hi,

I run the steps below to train an xgboost classifier.

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from HyperclassifierSearch import HyperclassifierSearch

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1, random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, stratify=y_train, test_size=0.1, random_state=42)

cv = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

params = {
    'xgb': {'clf__n_estimators': [100, 200, 400], 'clf__max_depth': [3, 5, 7],
            'clf__learning_rate': [0.01, 0.1], 'clf__min_child_weight': [3, 10]}
}
model = XGBClassifier(objective='binary:logistic', n_jobs=15, use_label_encoder=False, random_state=42)
sel = SelectKBest(k='all')
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0))])

preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_cols)])

pipe = Pipeline(steps=[('preprocessor', preprocessor), ('var', VarianceThreshold()), ('sel', sel), ('clf', model)])

search = HyperclassifierSearch(pipe, params)
best_model = search.train_model(X_train, y_train, cv=cv, scoring='accuracy')

Each time I run the above, I get a different set of best params after the grid search, and slightly different accuracies and predictions. How is this happening when I set random_state in each of the following places:

  1. the data splits
  2. my cross-validation function
  3. my XGBClassifier model

For SelectKBest I also keep 'all' features for now, but I am confused as to how I got different results on each run. FYI, HyperclassifierSearch is just a wrapper around GridSearchCV.

Any ideas on why this could be happening based on the above?

I have tried setting tree_method='exact' and I still get different results. I have also tried setting 'seed'. Why does this still exist when it was deprecated?
How can I get reproducible results?
XGBoost version: 1.5.0.
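One way to narrow this down is to isolate the estimator from the rest of the pipeline: fit the same model twice on byte-identical arrays and compare the outputs. The sketch below uses sklearn's DecisionTreeClassifier as a stand-in so it runs anywhere; swapping in XGBClassifier with the same parameters would test the setup from this issue.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Fixed synthetic data so both fits see byte-identical inputs.
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = rng.randint(0, 2, size=200)

def fit_and_predict(seed):
    # Stand-in estimator; substitute XGBClassifier(random_state=seed, ...)
    # to check the actual model from this issue.
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X, y)
    return model.predict(X)

p1 = fit_and_predict(42)
p2 = fit_and_predict(42)
print(np.array_equal(p1, p2))  # True => this stage is deterministic
```

If the two prediction arrays match, the non-determinism lives elsewhere (data loading, column order, the search wrapper), not in the estimator.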

trivialfis (Member) commented Feb 4, 2022

Are you sure that it's xgboost that's not being deterministic? From the snippet, you are using a mix of libraries.

Any ideas on why this could be happening based on above?

Could you please provide a reproducible example that we can run?

Why does this still exist when it was deprecated?

XGBoost has the seed parameter that's used across many interfaces, including Python, R, Java, and Scala. The sklearn interface accepts **kwargs, which are passed down to xgboost.
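The pass-through pattern described above can be sketched as follows. This class is purely illustrative (it is not xgboost's actual wrapper code): named constructor arguments become sklearn-style parameters, and anything extra, such as seed, is forwarded to the underlying booster configuration unchanged.

```python
# Illustrative sketch of a **kwargs pass-through wrapper, NOT xgboost's
# real implementation. Extra keyword arguments are stored and forwarded
# to the native booster parameters.
class WrapperSketch:
    def __init__(self, random_state=None, **kwargs):
        self.random_state = random_state
        self.kwargs = kwargs  # e.g. {'seed': 42} is carried through as-is

    def booster_params(self):
        params = dict(self.kwargs)
        # A hypothetical rule: random_state is translated to seed only
        # when seed was not given explicitly.
        if self.random_state is not None and 'seed' not in params:
            params['seed'] = self.random_state
        return params

m1 = WrapperSketch(random_state=42)
print(m1.booster_params())  # {'seed': 42}
m2 = WrapperSketch(seed=7)
print(m2.booster_params())  # {'seed': 7}
```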

zahs123 (Author) commented Feb 4, 2022

What is seed used for, though? I have set both seed and random_state, but I am wondering what the difference is.

trivialfis (Member) commented:

random_state can be either an int or a np.random.RandomState:

if isinstance(params['random_state'], np.random.RandomState):

seed must be an int. Inside XGBoost (the C++ libxgboost.so) they are exactly the same.

trivialfis (Member) commented:
Since you are not using any sampling, it's not relevant to the non-deterministic issue here.

trivialfis (Member) commented:
It would be great if you could try our nightly build: https://xgboost.readthedocs.io/en/stable/install.html#id1. We avoided some non-deterministic behaviour in evaluation caused by floating-point summation: #7303

zahs123 (Author) commented Feb 7, 2022

Hi, thanks for this. I realised the problem isn't with xgboost but with how my data is being read in a random order.
Will close this.
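For anyone hitting the same root cause: if the data source returns rows in a nondeterministic order, every downstream random_state is seeded against different inputs. A minimal hedged fix is to impose a stable row order before splitting; the 'id' column below is an assumed unique key, substitute whatever stable key your data has.

```python
import pandas as pd

# Sketch: sort by a stable, unique key so the row order is identical on
# every run, making train_test_split(..., random_state=42) reproducible.
# The 'id' column is an assumed unique key for illustration.
df = pd.DataFrame({'id': [3, 1, 2], 'value': [30.0, 10.0, 20.0]})
df = df.sort_values('id', kind='stable').reset_index(drop=True)
print(df['id'].tolist())  # [1, 2, 3]
```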

@zahs123 zahs123 closed this as completed Feb 7, 2022