Model wrapper for sklearn cross_val_score

xan

This is a minimal example using XGBClassifier, but I am interested in how this would work in general. I am trying to wrap the model class in order to use it in cross-validation. In this case I am only weighting the imbalanced classes, but my ultimate goal is a somewhat broader change to the pipeline.

My first try was to simply override the fit function:

from sklearn import metrics
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.base import BaseEstimator, ClassifierMixin

class WeightedXGBClassifier(XGBClassifier, BaseEstimator, ClassifierMixin):
    
    @staticmethod
    def get_weights(y):
        sample_weights = compute_sample_weight(class_weight='balanced', y=y)
        return sample_weights
    
    def fit(self, X, y, **kwargs):
        weights = self.get_weights(y)
        super(XGBClassifier, self).fit(X, y, sample_weight=weights, **kwargs)

which works fine when I fit the model, make predictions, etc. But using this in sklearn's cross_val_score

xgb_model_cv = WeightedXGBClassifier(n_estimators=100, max_depth=4, alpha=100, use_label_encoder=False)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
auc_scorer = metrics.make_scorer(metrics.roc_auc_score, needs_proba=True)
scores = cross_val_score(xgb_model_cv, X, y, scoring=auc_scorer, cv=cv, n_jobs=-1, verbose=1)

throws an error

File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 106, in __call__
    score = scorer._score(cached_call, estimator, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyTF/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 306, in _score
    y_pred = self._select_proba_binary(y_pred, clf.classes_)
AttributeError: 'WeightedXGBClassifier' object has no attribute 'classes_'

Now, it is my understanding that the classes_ attribute is created when the model is fitted, but I am not sure how to properly wrap the model so that it is captured. Note that running

model = XGBClassifier(use_label_encoder=False, scale_pos_weight=(~y).sum()/y.sum())
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)

works fine. My second try was:

class XGBClassifierWrapper(BaseEstimator, ClassifierMixin):
    def __init__(self, **kwargs):
#         super(BaseEstimator).__init__()
#         super(ClassifierMixin).__init__()
        self.xgb_classifier_obj = XGBClassifier(**kwargs)
    
    @staticmethod
    def get_weights(y):
        sample_weights = compute_sample_weight(class_weight='balanced', y=y)
        return sample_weights
    
    def fit(self, X, y, **kwargs):
        weights = self.get_weights(y)
        self.xgb_classifier_obj.fit(X, y, sample_weight=weights, **kwargs)
        return self
    
    def predict(self, X, **kwargs):
        return self.xgb_classifier_obj.predict(X, **kwargs)
    
    def predict_proba(self, X, **kwargs):
        return self.xgb_classifier_obj.predict_proba(X, **kwargs)

which again resulted in the same error as above, i.e., the missing classes_ attribute.

Ben Reiniger

(I don't actually get an error when I run your code; instead I get scores consisting only of nan, and adding error_score='raise' reproduces your error message.)

In the first approach, I believe the only real problem is the super call in fit. super(XGBClassifier, self) looks up the parent class of XGBClassifier, not XGBClassifier itself, as I assume you want. Replace it with the vanilla super() and everything works.

You should also add return self to the end of fit in your first attempt, but it's not important here. You can probably safely drop BaseEstimator and ClassifierMixin from the inheritance, since XGBClassifier already inherits from them.
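
Putting those pieces together, a minimal sketch of the fixed first approach (the calling code stays exactly as in the question):

    from sklearn.utils.class_weight import compute_sample_weight
    from xgboost import XGBClassifier

    class WeightedXGBClassifier(XGBClassifier):

        @staticmethod
        def get_weights(y):
            return compute_sample_weight(class_weight='balanced', y=y)

        def fit(self, X, y, **kwargs):
            weights = self.get_weights(y)
            # vanilla super() delegates to XGBClassifier.fit, which sets
            # classes_ and the other fitted attributes on self
            super().fit(X, y, sample_weight=weights, **kwargs)
            return self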

Your second (wrapper) approach fails because the wrapped xgb_classifier_obj has all the fitted attributes, including classes_, but your wrapper doesn't expose them directly. You can simply set self.classes_ = self.xgb_classifier_obj.classes_ in fit, or perhaps define a @property that delegates to it.
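
A minimal sketch of that fix, addressing only the classes_ problem (the cloning caveat below still applies); the @property alternative would instead return self.xgb_classifier_obj.classes_ rather than copying it in fit:

    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.utils.class_weight import compute_sample_weight
    from xgboost import XGBClassifier

    class XGBClassifierWrapper(BaseEstimator, ClassifierMixin):
        def __init__(self, **kwargs):
            self.xgb_classifier_obj = XGBClassifier(**kwargs)

        def fit(self, X, y, **kwargs):
            weights = compute_sample_weight(class_weight='balanced', y=y)
            self.xgb_classifier_obj.fit(X, y, sample_weight=weights, **kwargs)
            # expose the fitted attribute the scorer looks for
            self.classes_ = self.xgb_classifier_obj.classes_
            return self

        def predict(self, X, **kwargs):
            return self.xgb_classifier_obj.predict(X, **kwargs)

        def predict_proba(self, X, **kwargs):
            return self.xgb_classifier_obj.predict_proba(X, **kwargs)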

You should also consider that your __init__ in this second approach doesn't meet the sklearn API (parameters aren't stored as same-named attributes), so cloning won't work correctly. I'd advise using the first approach for this reason; fixing the wrapper properly requires rather more tedious work, in my opinion.
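
For what it's worth, that tedious fix would look roughly like the sketch below: every hyperparameter becomes an explicit __init__ argument stored unchanged under the same name, and the inner estimator is only built inside fit. The class name CloneSafeXGBWrapper is just illustrative, and only the parameters used in the question are covered.

    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.utils.class_weight import compute_sample_weight
    from xgboost import XGBClassifier

    class CloneSafeXGBWrapper(BaseEstimator, ClassifierMixin):
        # sklearn convention: __init__ only stores its arguments, under the
        # same names, so get_params / set_params / clone behave correctly
        def __init__(self, n_estimators=100, max_depth=4, alpha=100,
                     use_label_encoder=False):
            self.n_estimators = n_estimators
            self.max_depth = max_depth
            self.alpha = alpha
            self.use_label_encoder = use_label_encoder

        def fit(self, X, y):
            weights = compute_sample_weight(class_weight='balanced', y=y)
            # build the inner estimator here rather than in __init__
            self.model_ = XGBClassifier(
                n_estimators=self.n_estimators,
                max_depth=self.max_depth,
                alpha=self.alpha,
                use_label_encoder=self.use_label_encoder,
            )
            self.model_.fit(X, y, sample_weight=weights)
            self.classes_ = self.model_.classes_
            return self

        def predict(self, X):
            return self.model_.predict(X)

        def predict_proba(self, X):
            return self.model_.predict_proba(X)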
