I am working through the Hands-On Machine Learning book and writing a transformation pipeline to clean my data. I found that the output of the same pipeline's transform method differs depending on the size of the DataFrame I feed in. Here is the code:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer, Imputer, StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),  # defined earlier in the book
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])

housing_prepared = full_pipeline.fit_transform(housing)
data_prepared = full_pipeline.transform(housing.iloc[:5])
data_prepared1 = full_pipeline.transform(housing.iloc[:1000])
data_prepared2 = full_pipeline.transform(housing.iloc[:10000])
print(data_prepared.shape)
print(data_prepared1.shape)
print(data_prepared2.shape)
The output of these three prints is (5, 14), (1000, 15), and (10000, 16). Can someone explain this to me?
That is because in CustomLabelBinarizer you fit a new LabelBinarizer on every call to transform(). It therefore learns a different set of labels each time, so the number of output columns differs between runs.
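You can see this directly with LabelBinarizer on its own: the width of its output equals the number of distinct labels it was fitted on, so a small slice that happens to contain fewer categories yields fewer columns. A minimal sketch (the label list here is made up for illustration, using category names like those in ocean_proximity):

```python
from sklearn.preprocessing import LabelBinarizer

labels = ['INLAND', 'NEAR BAY', 'NEAR OCEAN', 'ISLAND', 'INLAND']

# Fitted on all five rows, the binarizer sees 4 distinct labels -> 4 columns
print(LabelBinarizer().fit_transform(labels).shape)      # (5, 4)

# Fitted on only the first three rows, it sees 3 distinct labels -> 3 columns
print(LabelBinarizer().fit_transform(labels[:3]).shape)  # (3, 3)
```

This is exactly what happens inside the pipeline: each slice of housing is binarized against only the categories it happens to contain.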
Change it to this:
class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        self.enc = LabelBinarizer(sparse_output=self.sparse_output)
        self.enc.fit(X)
        return self
    def transform(self, X, y=None):
        return self.enc.transform(X)
Now I get the correct shapes with your code:
(5, 14)
(1000, 14)
(10000, 14)
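A quick standalone check (again with made-up labels rather than the actual housing data) confirms that once the binarizer is fitted only in fit(), transform() keeps the same column count regardless of the slice size:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        # Learn the full label set once, at fit time
        self.enc = LabelBinarizer(sparse_output=self.sparse_output)
        self.enc.fit(X)
        return self
    def transform(self, X, y=None):
        # Reuse the fitted encoder, so the column count stays fixed
        return self.enc.transform(X)

labels = ['INLAND', 'NEAR BAY', 'NEAR OCEAN', 'ISLAND', 'INLAND', 'NEAR BAY']
binarizer = CustomLabelBinarizer().fit(labels)
print(binarizer.transform(labels).shape)      # (6, 4)
print(binarizer.transform(labels[:3]).shape)  # (3, 4) -- width unchanged
```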
Note: the same question has been asked before. I assume you are using the code from the book's repository; if you are using another site, that code may be an older version of it. Try the code from the repository for the updated, bug-free version.