I'm trying to create a get_dummies class for my data, which I want to use in a Pipeline later:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class Dummies(BaseEstimator, TransformerMixin):
    def transform(self, df):
        dummies = pd.get_dummies(df[self.cat], drop_first=True)  ## getting dummy cols
        df = pd.concat([df, dummies], axis=1)  ## concatenating our dummies
        df.drop(self.cat, axis=1, inplace=True)  ## dropping our original cat_cols
        return df  ## transform must return the new frame

    def fit(self, df, y=None):
        self.cat = []
        for i in df.columns.tolist():
            if i[0] == 'c':  ## My data has categorical cols start with 'c'
                self.cat.append(i)  ## Storing all my categorical_columns for dummies
            else:
                continue
        return self  ## fit must return self so fit_transform can chain
Now when I call fit_transform on X_train and then transform on X_test:
z=Dummies()
X_train=z.fit_transform(X_train)
X_test=z.transform(X_test)
The number of columns in X_train and X_test is different:
X_train.shape
X_test.shape
Output:
(10983, 1797) (3661, 1529)
X_train has more dummy columns than X_test, clearly because X_test contains fewer categories than X_train. How do I write the logic in my class so that X_test is expanded to the same columns as X_train? I want X_test to have the same number of dummy variables as X_train.
What you want to use here (I think) is scikit-learn's OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(categories="auto")
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)
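To make the shape guarantee concrete, here is a minimal sketch on made-up data. The toy frames, the column names, and the use of ColumnTransformer to restrict encoding to the columns starting with 'c' are my assumptions, not part of the question. handle_unknown="ignore" is also an addition: by default the encoder raises an error if the test set contains a category it never saw during fit.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames: categorical columns start with 'c', as in the question.
X_train = pd.DataFrame({
    "c_color": ["red", "green", "blue", "green"],
    "c_size": ["S", "M", "L", "M"],
    "n_price": [1.0, 2.5, 3.0, 4.5],
})
# The test split lacks "blue" and "L", and contains an unseen "purple".
X_test = pd.DataFrame({
    "c_color": ["red", "purple"],
    "c_size": ["S", "M"],
    "n_price": [2.0, 3.5],
})

cat_cols = [c for c in X_train.columns if c.startswith("c")]

# Encode only the categorical columns; pass the numeric ones through untouched.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",
)

X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

# 3 colors + 3 sizes + 1 numeric column = 7 columns in both splits.
print(X_train_encoded.shape, X_test_encoded.shape)  # (4, 7) (2, 7)
```

The categories are learned once in fit, so transform always emits the same columns no matter which categories happen to appear in the test split.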
This keeps the fit_transform syntax and ensures X_test_encoded has the same shape as X_train_encoded. It can also be used in a pipeline, as you mentioned, instead of Dummies(). Example:
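# sparse_output=False (scikit-learn >= 1.2; older releases use sparse=False) hands StandardScaler and PCA a dense array
pipe1 = make_pipeline(OneHotEncoder(categories="auto", sparse_output=False), StandardScaler(), PCA(n_components=7), LogisticRegression())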
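If you would rather keep a pandas-based transformer in the pipeline, one possible sketch is to remember the training columns in fit and reindex to them in transform. AlignedDummies, its attributes, and the toy data below are hypothetical names of mine, not from the question. Note that transform deliberately encodes without drop_first and lets reindex discard the baseline column; calling drop_first=True on the test split could drop a different category than it did on the training split.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class AlignedDummies(TransformerMixin, BaseEstimator):
    """Hypothetical variant of Dummies that pins the output columns at fit time."""

    def fit(self, df, y=None):
        # Categorical columns start with 'c', as in the question.
        self.cat_ = [c for c in df.columns if c.startswith("c")]
        # Record the columns produced on the training data (baseline dropped).
        self.columns_ = self._encode(df, drop_first=True).columns
        return self

    def transform(self, df):
        # Encode every category present, then align to the training columns:
        # missing dummies are filled with 0, unseen or baseline ones are dropped.
        encoded = self._encode(df, drop_first=False)
        return encoded.reindex(columns=self.columns_, fill_value=0)

    def _encode(self, df, drop_first):
        dummies = pd.get_dummies(df[self.cat_], drop_first=drop_first, dtype=float)
        return pd.concat([df.drop(columns=self.cat_), dummies], axis=1)


# Made-up data: the test split only contains one of the three training colors.
X_train = pd.DataFrame({
    "c_color": ["red", "green", "blue", "green", "red", "blue"],
    "n_price": [1.0, 2.5, 3.0, 4.5, 0.5, 3.5],
})
y_train = np.array([0, 1, 0, 1, 0, 1])
X_test = pd.DataFrame({"c_color": ["green", "green"], "n_price": [2.0, 4.0]})

z = AlignedDummies()
train_out = z.fit_transform(X_train)
test_out = z.transform(X_test)
print(train_out.shape, test_out.shape)  # (6, 3) (2, 3)

# The same transformer dropped into a pipeline like the one above.
pipe = make_pipeline(AlignedDummies(), StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```

Because the output is a dense DataFrame, StandardScaler and PCA accept it directly. For wide data like the 1797 columns in the question, the sparse output of OneHotEncoder may be the more memory-friendly choice.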