PySpark multi-label text classification

凯蒂斯·范·凯蒂斯

I am trying to predict labels for unseen text. My data looks like this:

+-----------------+-----------+
|      label      |   text    |
+-----------------+-----------+
| [0, 1, 0, 1, 0] | blah blah |
| [1, 1, 0, 0, 0] | foo bar   |
+-----------------+-----------+

The first column is encoded with a multi-label binarizer. My pipeline:

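from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC, OneVsRest
from pyspark.ml.feature import HashingTF, Tokenizer

# Split the text into words, then hash the words into a feature vector
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
# One-vs-rest wrapper around a linear SVM
lsvc = LinearSVC(maxIter=10, regParam=0.1)
ovr = OneVsRest(classifier=lsvc)

pipeline = Pipeline(stages=[tokenizer, hashingTF, ovr])

model = pipeline.fit(result)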

Running this code produces the following error:

ValueError: invalid literal for int() with base 10: '[1, 0, 1, 0, 1, 1, 1, 0, 0]'

Any idea what is going wrong?

ido堂

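Looking at the error

invalid literal for int()

we see that the problem is the type of the label: the classifier expects a single value corresponding to the sample's class, not an array. In other words, you need to convert the labels from a multi-label binary encoding to a single number.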
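
The failure can be reproduced outside Spark. This is a minimal sketch, assuming that somewhere during fitting the label value is passed to Python's int(); the error message in the question suggests this, but the exact call site is not shown there.

```python
# The array-valued label, as it appears in the error message from the question.
raw_label = "[1, 0, 1, 0, 1, 1, 1, 0, 0]"

try:
    int(raw_label)  # a single numeric class value is expected here
    message = None
except ValueError as exc:
    message = str(exc)

print(message)

# A label that is already a single number converts without problems.
single_class = int("3")
```

The same conversion succeeds once each row carries one class number instead of an array.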

One way to do this is to first convert the array to a string and then use StringIndexer:

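from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

to_string_udf = udf(lambda x: ''.join(str(e) for e in x), StringType())
# Drop the array column so that StringIndexer can write its output to "label"
df = df.withColumn("labelstring", to_string_udf(df.label)).drop("label")

indexer = StringIndexer(inputCol="labelstring", outputCol="label")
indexed = indexer.fit(df).transform(df)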

This creates a separate category (class label) for each unique array.
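
To make the effect of this mapping concrete, here is a plain-Python sketch of the same idea, without Spark. The helper names to_label_string and build_index are hypothetical, and the ordering (most frequent string first, ties broken alphabetically) is an assumption of this sketch rather than a statement about what StringIndexer guarantees.

```python
from collections import Counter


def to_label_string(bits):
    """Join a binary label array into one string, e.g. [0, 1, 0] -> '010'."""
    return "".join(str(b) for b in bits)


def build_index(label_arrays):
    """Map each distinct label string to a class index.

    Assumption of this sketch: most frequent string first, ties alphabetical.
    """
    counts = Counter(to_label_string(a) for a in label_arrays)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {s: float(i) for i, s in enumerate(ordered)}


labels = [[0, 1, 0, 1, 0], [1, 1, 0, 0, 0], [0, 1, 0, 1, 0]]
index = build_index(labels)
encoded = [index[to_label_string(a)] for a in labels]

# Invert the mapping to turn a class index back into the original label array.
reverse = {i: [int(c) for c in s] for s, i in index.items()}
decoded = [reverse[i] for i in encoded]
```

Because every distinct array becomes its own class, a model trained this way can only predict label combinations that appeared in the training data, and the number of classes grows with the number of distinct combinations.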
