Shape mismatch when One-Hot-Encoding Train and Test data. Train_Data has more Dummy Columns than Test_data while using get_dummies with pipeline

keshav N

I'm trying to create a get_dummies class for my data that I want to use in a Pipeline later:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Dummies(BaseEstimator, TransformerMixin):
    def transform(self, df):
        dummies = pd.get_dummies(df[self.cat_cols], drop_first=True)  # getting dummy cols
        df = pd.concat([df, dummies], axis=1)  # concatenating our dummies
        return df.drop(self.cat_cols, axis=1)  # dropping our original cat_cols

    def fit(self, df, y=None):
        # My data has categorical cols that start with 'c'.
        # Storing all my categorical columns for dummies
        # (cat_cols is a placeholder name for the stored column list).
        self.cat_cols = [i for i in df.columns.tolist() if i[0] == 'c']
        return self

When I call fit_transform on X_train and then transform on X_test, the two results have different numbers of columns:

(10983, 1797) (3661, 1529)

There are more dummy columns in X_train than in X_test, because X_test has fewer categories than X_train. How do I write the logic in my class so that the categories in X_test are expanded to the same columns as X_train? I want X_test to have the same number of dummy variables as X_train.


What you want to use here (I think) is scikit-learn's OneHotEncoder, which learns the categories during fit and reuses them in transform:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(categories="auto")
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

This keeps the fit_transform syntax and ensures X_test_encoded has the same number of columns as X_train_encoded. It can also be used in a pipeline, as you mentioned, instead of Dummies(). Example:

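# Dense output (sparse_output needs scikit-learn >= 1.2; earlier: sparse=False) because StandardScaler cannot center sparse input.
pipe1 = make_pipeline(OneHotEncoder(categories="auto", sparse_output=False), StandardScaler(), PCA(n_components=7), LogisticRegression())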
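
Since the question only wants the columns whose names start with 'c' encoded, one way to sketch that is to wrap the encoder in a ColumnTransformer so the remaining columns pass through untouched. The toy frames and the column names c_color and x1 below are made up for illustration. handle_unknown="ignore" is an added assumption, not part of the answer above: it makes categories that appear only in X_test encode as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical): the test set is missing the category "green".
X_train = pd.DataFrame({
    "c_color": ["red", "green", "blue", "red", "blue", "green"],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
y_train = [0, 1, 0, 1, 0, 1]
X_test = pd.DataFrame({"c_color": ["red", "blue"], "x1": [1.5, 2.5]})

# Categorical columns follow the question's convention: names start with 'c'.
cat_cols = [col for col in X_train.columns if col.startswith("c")]

preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",  # keep the non-categorical columns as they are
    sparse_threshold=0,       # always return a dense array
)

train_encoded = preprocess.fit_transform(X_train)
test_encoded = preprocess.transform(X_test)
print(train_encoded.shape, test_encoded.shape)  # (6, 4) (2, 4)

# The same preprocessing step dropped into a pipeline.
pipe = make_pipeline(preprocess, StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```

Both outputs have four columns (three dummies plus x1) even though "green" never occurs in the test frame, because the categories are fixed when the encoder is fitted.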

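If you would rather keep pd.get_dummies inside a custom class, a minimal sketch is to remember the dummy column names seen during fit and reindex to them in transform. The class name DummiesAligned and the attributes cat_cols_ and columns_ are hypothetical, not taken from the question's code, and the toy frames are made up.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DummiesAligned(BaseEstimator, TransformerMixin):
    """get_dummies wrapper that emits the same columns at fit and transform time."""

    def _encode(self, df):
        # One-hot encode the stored categorical columns; other columns are kept as-is.
        return pd.get_dummies(df, columns=self.cat_cols_)

    def fit(self, df, y=None):
        # Categorical columns start with 'c', as in the question.
        self.cat_cols_ = [col for col in df.columns if col.startswith("c")]
        # Remember the full set of output columns produced by the training data.
        self.columns_ = self._encode(df).columns.tolist()
        return self

    def transform(self, df):
        # Dummies missing from df are added as zeros; dummies unseen in fit are dropped.
        return self._encode(df).reindex(columns=self.columns_, fill_value=0)


# Toy data: the test set lacks "green" and "blue" and has an unseen "yellow".
X_train = pd.DataFrame({"c_color": ["red", "green", "blue"], "x1": [1.0, 2.0, 3.0]})
X_test = pd.DataFrame({"c_color": ["red", "yellow"], "x1": [4.0, 5.0]})

enc = DummiesAligned()
train_out = enc.fit_transform(X_train)
test_out = enc.transform(X_test)
print(train_out.shape, test_out.shape)  # (3, 4) (2, 4)
```

drop_first=True is left out on purpose in this sketch: get_dummies picks the "first" category separately for each frame, so the dropped level could differ between X_train and X_test and the reindexed columns would then be wrong.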
