Shape mismatch when One-Hot-Encoding Train and Test data. Train_Data has more Dummy Columns than Test_data while using get_dummies with pipeline

keshav N

I'm trying to create a get_dummies class for my data, which I want to use in a Pipeline later:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Dummies(BaseEstimator, TransformerMixin):
    def transform(self, df):
        dummies = pd.get_dummies(df[self.cat], drop_first=True)  # getting dummy cols
        df = pd.concat([df, dummies], axis=1)  # concatenating our dummies
        return df.drop(self.cat, axis=1)  # dropping our original cat_cols

    def fit(self, df, y=None):
        # My data has categorical cols that start with 'c';
        # store all of them so transform can build the dummies
        self.cat = [col for col in df.columns if col[0] == 'c']
        return self

Now I call fit_transform on X_train and then transform on X_test:

z=Dummies()
X_train=z.fit_transform(X_train)
X_test=z.transform(X_test)

The number of columns in X_train and X_test is different:

X_train.shape
X_test.shape

Output:

(10983, 1797) (3661, 1529)

There are more dummy columns in X_train than in X_test, so X_test evidently contains fewer categories than X_train. How do I write the logic in my class so that X_test is encoded with the same dummy columns as X_train? I want X_test to have the same number of dummy variables as X_train.
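The mismatch can be reproduced on a toy example. This is only a sketch with made-up data (the column name `c_color` and its values are assumptions, not taken from the real dataset); it shows that get_dummies builds its columns from whatever categories are present in the frame it is given:

```python
import pandas as pd

# Made-up data: the test split happens to lack category "c".
train = pd.DataFrame({"c_color": ["a", "b", "c", "a"]})
test = pd.DataFrame({"c_color": ["a", "b"]})

# Each call only sees the categories present in its own input.
train_dummies = pd.get_dummies(train, drop_first=True)
test_dummies = pd.get_dummies(test, drop_first=True)

print(train_dummies.columns.tolist())  # ['c_color_b', 'c_color_c']
print(test_dummies.columns.tolist())   # ['c_color_b']
```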

MaximeKan

What you want to use here (I think) is scikit-learn's OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto")
X_train_encoded = encoder.fit_transform(X_train)  # learn the categories from the training data
X_test_encoded = encoder.transform(X_test)  # reuse those categories for the test data
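As a small illustration on made-up data (the frames and the `c_color` column are hypothetical), the sketch below fits the encoder on a training frame and applies it to a test frame that lacks some categories and contains one unseen category. Passing `handle_unknown="ignore"` is an addition not present in the snippet above; without it, an unseen category in the test data raises an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "d" never appears in the training frame.
train = pd.DataFrame({"c_color": ["a", "b", "c", "a"]})
test = pd.DataFrame({"c_color": ["a", "d"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row.
encoder = OneHotEncoder(categories="auto", handle_unknown="ignore")
train_encoded = encoder.fit_transform(train)
test_encoded = encoder.transform(test)

# Both outputs have one column per category learned during fit.
print(train_encoded.shape, test_encoded.shape)
```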

This keeps the fit_transform syntax and ensures X_test_encoded has the same number of columns as X_train_encoded, because the categories are learned once from the training data. It can also be used in a pipeline, as you mentioned, in place of Dummies(). Example:

pipe1=make_pipeline(OneHotEncoder(categories = "auto"), StandardScaler(), PCA(n_components=7), LogisticRegression())
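If you would rather keep a get_dummies-based class, one possible approach is to record the dummy columns during fit and reindex to them in transform. The sketch below is an assumption-laden illustration, not a drop-in replacement: the class name `AlignedDummies`, the toy frames, and the `n_size` column are all hypothetical. It also returns a dense DataFrame, which StandardScaler accepts directly. PCA is left out only because the toy data has too few features for n_components=7.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class AlignedDummies(BaseEstimator, TransformerMixin):
    """get_dummies wrapper that reuses the training columns (hypothetical name)."""

    def fit(self, df, y=None):
        # Same convention as the question: categorical columns start with 'c'.
        self.cat_ = [col for col in df.columns if str(col).startswith("c")]
        # Remember the output layout (with drop_first) seen on the training data.
        self.columns_ = self._encode(df, drop_first=True).columns
        return self

    def transform(self, df):
        # Encode every category present, then align to the training layout:
        # missing dummy columns are filled with 0, extra ones are dropped.
        encoded = self._encode(df, drop_first=False)
        return encoded.reindex(columns=self.columns_, fill_value=0.0)

    def _encode(self, df, drop_first):
        dummies = pd.get_dummies(df[self.cat_], drop_first=drop_first, dtype=float)
        return pd.concat([df.drop(columns=self.cat_), dummies], axis=1)


# Hypothetical toy data: the test frame lacks category "c".
X_train = pd.DataFrame({"c_color": ["a", "b", "c", "a"], "n_size": [1.0, 2.0, 3.0, 4.0]})
X_test = pd.DataFrame({"c_color": ["b", "a"], "n_size": [2.5, 0.5]})
y_train = [0, 1, 1, 0]

z = AlignedDummies()
train_out = z.fit_transform(X_train)
test_out = z.transform(X_test)
print(train_out.shape, test_out.shape)  # same number of columns

pipe = make_pipeline(AlignedDummies(), StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```

Note that transform deliberately does not pass drop_first=True: the "first" category of the test data may differ from that of the training data, so the dropped reference level is fixed by the columns stored in fit instead.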


Edited at 2020-12-3
