Creating a "features" column in PySpark using numerical and categorical variables

agrajag_42

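I am trying to create a "features" column in Spark using Python, for use by the machine learning libraries. However, I am having trouble including both numerical and categorical variables in the VectorAssembler that generates the "features" column.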

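from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

cat_cols = ["cat_1", "cat_2", "cat_3"]
num_cols = ["num_1", "num_2", "num_3", "num_4"]

# Index each categorical column into "<col>_indexed"
indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c)) for c in cat_cols]

# Second pass over each indexed column; output goes to "<col>_indexed_encoded"
encoders = [
    StringIndexer(inputCol=indexer.getOutputCol(),
                  outputCol="{0}_encoded".format(indexer.getOutputCol()))
    for indexer in indexers
]

# Assemble only the encoded categorical columns; num_cols is not used yet
assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders], outputCol="features")

pipeline = Pipeline(stages=indexers + encoders + [assembler])
df = pipeline.fit(df).transform(df)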

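The pipeline built so far creates a "features" column that contains only the categorical variables, but I don't know how to extend it so that the "features" column contains both the categorical and the numerical variables.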

Note that I am using Spark 2.3 and Python 3.

agrajag_42

I found a way to do it, but I'm not sure it is the most efficient way to achieve what I want.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

cat_cols = ["cat_1", "cat_2", "cat_3"]
num_cols = ["num_1", "num_2", "num_3", "num_4"]

# Index each categorical column into "<col>_indexed"
indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c)) for c in cat_cols]

# Second pass over each indexed column; output goes to "<col>_indexed_encoded"
encoders = [
    StringIndexer(inputCol=indexer.getOutputCol(),
                  outputCol="{0}_encoded".format(indexer.getOutputCol()))
    for indexer in indexers
]

# Step 1: assemble the categorical columns into a "cat" vector column
assemblerCat = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders], outputCol="cat")

pipelineCat = Pipeline(stages=indexers + encoders + [assemblerCat])
df = pipelineCat.fit(df).transform(df)

# Step 2: assemble the numerical columns into a "num" vector column
assemblerNum = VectorAssembler(inputCols=num_cols, outputCol="num")

pipelineNum = Pipeline(stages=[assemblerNum])
df = pipelineNum.fit(df).transform(df)

# Step 3: merge the "cat" and "num" vectors into a single "features" column
assembler = VectorAssembler(inputCols=["cat", "num"], outputCol="features")

pipeline = Pipeline(stages=[assembler])
df = pipeline.fit(df).transform(df)

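Essentially, I create one pipeline for the categorical variables and one for the numerical variables, then merge their outputs into a single "features" column that contains both.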
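
The three pipelines above can probably be collapsed into one, because VectorAssembler accepts numeric columns and vector columns side by side in inputCols. The following is only a minimal sketch of that idea, not a tested replacement for the code above. The helper names encoded_col_names, assembler_inputs and build_feature_pipeline are hypothetical, and the use of OneHotEncoder for the encoding step is an assumption (the code above uses a second StringIndexer there). It also assumes the installed Spark version lets OneHotEncoder be constructed with inputCol and outputCol.

```python
def encoded_col_names(cat_cols):
    """Return the indexed and encoded column names, using the same
    "<col>_indexed" / "<col>_indexed_encoded" naming as the code above."""
    indexed = ["{0}_indexed".format(c) for c in cat_cols]
    encoded = ["{0}_encoded".format(c) for c in indexed]
    return indexed, encoded


def assembler_inputs(cat_cols, num_cols):
    """Columns handed to the final VectorAssembler: encoded categoricals
    first, then the raw numerical columns."""
    _, encoded = encoded_col_names(cat_cols)
    return encoded + list(num_cols)


def build_feature_pipeline(cat_cols, num_cols, output_col="features"):
    """Hypothetical helper: one Pipeline that indexes, encodes and assembles."""
    # Imported here so the naming helpers above work without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    indexed, encoded = encoded_col_names(cat_cols)

    # One StringIndexer per categorical column
    indexers = [
        StringIndexer(inputCol=c, outputCol=i) for c, i in zip(cat_cols, indexed)
    ]
    # Assumption: one-hot encode each indexed column
    encoders = [
        OneHotEncoder(inputCol=i, outputCol=e) for i, e in zip(indexed, encoded)
    ]
    # A single assembler over the encoded vectors and the numerical columns
    assembler = VectorAssembler(
        inputCols=assembler_inputs(cat_cols, num_cols), outputCol=output_col
    )
    return Pipeline(stages=indexers + encoders + [assembler])
```

With an active SparkSession and the df, cat_cols and num_cols from above, this would be used as df = build_feature_pipeline(cat_cols, num_cols).fit(df).transform(df). Keeping everything in one Pipeline also means the fitted model can be reused on new data in a single transform call.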


Edited on 2021-06-26
