Shape mismatch when One-Hot-Encoding Train and Test data. Train_Data has more Dummy Columns than Test_data while using get_dummies with pipeline

keshav N

I'm trying to create a get_dummies class for my data, which I want to use in a Pipeline later:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Dummies(BaseEstimator, TransformerMixin):
    def transform(self, df):
        dummies = pd.get_dummies(df[self.cat], drop_first=True)  ## getting dummy cols
        df = pd.concat([df, dummies], axis=1)  ## concatenating our dummies
        return df.drop(self.cat, axis=1)  ## dropping our original cat_cols

    def fit(self, df, y=None):
        ## My data has categorical cols that start with 'c'; store them for the dummies
        self.cat = [col for col in df.columns if col.startswith('c')]
        return self  ## fit must return self so that fit_transform works

Now when I call fit_transform on X_train and then transform on X_test:

z=Dummies()
X_train=z.fit_transform(X_train)
X_test=z.transform(X_test)

The number of columns in X_train and X_test is different:

X_train.shape
X_test.shape

Output:

(10983, 1797) (3661, 1529)

There are more dummy columns in X_train than in X_test. Clearly, X_test contains fewer categories than X_train. How do I write the logic in my class so that the dummy columns of X_test match those of X_train? I want X_test to have the same number of dummy variables as X_train.

MaximeKan

What you want to use here (I think) is scikit-learn's OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto")
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)
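To see the effect on something small, here is a minimal sketch with made-up toy data. The column names `c_color` and `n_size`, the use of `ColumnTransformer` to restrict the encoding to the columns starting with 'c', and `handle_unknown="ignore"` are all my assumptions, not part of the question. Without `handle_unknown="ignore"`, a category that appears only in the test set makes `transform` raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): 'c_color' is categorical, 'n_size' is numeric.
X_train = pd.DataFrame({"c_color": ["red", "green", "blue", "red"],
                        "n_size": [1.0, 2.0, 3.0, 4.0]})
# The test set lacks 'green' and 'blue' and contains an unseen 'pink'.
X_test = pd.DataFrame({"c_color": ["red", "pink"],
                       "n_size": [5.0, 6.0]})

# Same convention as in the question: categorical columns start with 'c'.
cat_cols = [c for c in X_train.columns if c.startswith("c")]

encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",  # keep the non-categorical columns unchanged
)

X_train_encoded = encoder.fit_transform(X_train)  # categories are learned here
X_test_encoded = encoder.transform(X_test)        # and reused here

print(X_train_encoded.shape, X_test_encoded.shape)  # (4, 4) (2, 4)
```

Both outputs have four columns (three colours seen in training plus `n_size`), and the unseen 'pink' row is encoded as all zeros in the colour columns.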

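OneHotEncoder keeps the fit_transform syntax and ensures that X_test_encoded has the same number of columns as X_train_encoded, because the categories are learned once, during fit, and reused in transform. It can also be used in a pipeline, as you mentioned, instead of Dummies(). Example: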

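# sparse_output=False (scikit-learn >= 1.2) returns a dense array; StandardScaler cannot center a sparse matrix
pipe1 = make_pipeline(OneHotEncoder(categories="auto", sparse_output=False), StandardScaler(), PCA(n_components=7), LogisticRegression())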
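If you would rather keep get_dummies inside your own class, one possible approach is to remember the dummy columns produced on the training data and reindex every later frame to them. This is only a sketch of that alternative, not the OneHotEncoder method above: the class name `AlignedDummies`, the attributes `cat_` and `columns_`, the helper `_encode` and the toy data are all hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class AlignedDummies(BaseEstimator, TransformerMixin):
    """get_dummies wrapper that reuses the dummy columns seen during fit."""

    def fit(self, df, y=None):
        # Categorical columns are assumed to start with 'c', as in the question.
        self.cat_ = [col for col in df.columns if col.startswith("c")]
        # Remember the output columns produced on the training data.
        self.columns_ = self._encode(df, drop_first=True).columns
        return self

    def _encode(self, df, drop_first):
        dummies = pd.get_dummies(df[self.cat_], drop_first=drop_first, dtype=int)
        return pd.concat([df.drop(columns=self.cat_), dummies], axis=1)

    def transform(self, df):
        # Encode every category present, then align to the training columns:
        # missing dummies are added as 0, unseen or baseline ones are dropped.
        full = self._encode(df, drop_first=False)
        return full.reindex(columns=self.columns_, fill_value=0)


# Toy data (made up) to check that the columns line up.
X_train = pd.DataFrame({"c_color": ["red", "green", "blue", "red"],
                        "n_size": [1.0, 2.0, 3.0, 4.0]})
X_test = pd.DataFrame({"c_color": ["green", "pink"],
                       "n_size": [5.0, 6.0]})

z = AlignedDummies()
X_train_d = z.fit_transform(X_train)
X_test_d = z.transform(X_test)
print(X_train_d.shape, X_test_d.shape)  # (4, 3) (2, 3)
```

Note that transform deliberately does not pass drop_first=True: on a test set with fewer categories, get_dummies would drop a different "first" level than it did on the training set. One limitation of this sketch is that an unseen category such as 'pink' becomes all zeros, which is indistinguishable from the dropped baseline category.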
