PySpark, decision trees (Spark 2.0.0)

Posted by Ruslan in Dev

Ruslan

I'm new to Spark (using PySpark). I tried to run the decision tree tutorial from here (link). I ran this code:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.util import MLUtils

# Load and parse the data file, converting it to a DataFrame.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)

# Now this line fails
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

I get the error: IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.'

When Googling this error, I found an answer that says:

use from pyspark.ml.linalg import Vectors, VectorUDT 
instead of 
from pyspark.mllib.linalg import Vectors, VectorUDT

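This is odd, since I haven't used that import. Adding it to my code doesn't solve anything either, and I still get the same error.

I'm not sure how to debug this. When I look at the raw data, I see: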

data.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(692,[127,128,129...|  0.0|
|(692,[158,159,160...|  1.0|
|(692,[124,125,126...|  1.0|
|(692,[152,153,154...|  1.0|

It looks like a list, starting with '('.
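For readers unfamiliar with this output: each `features` value is printed in Spark's sparse-vector notation, `(size,[indices],[values])`, where only the non-zero positions are stored. The plain-Python sketch below expands that notation into a dense list. The function name and the sample numbers are made up for illustration and do not come from the data file.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


# A toy vector of length 6 with non-zeros at positions 1 and 4.
print(sparse_to_dense(6, [1, 4], [51.0, 159.0]))
```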

I'm not sure how to solve this, or even how to debug it... Any suggestions on what I'm doing wrong?

Thanks

Yaron

The source of the problem seems to be that you are running a Spark 1.5.2 example on Spark 2.0.0 (see the reference to the Spark 2.0 example below).
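One way to confirm the mismatch is to check which Python module defines the type of the `features` column. The sketch below is an illustration rather than part of the original answer: the helper name `vector_type_module` is made up, and it assumes `df` is a PySpark DataFrame whose `schema.fields` entries expose `name` and `dataType`.

```python
def vector_type_module(df, column="features"):
    """Return the module that defines the data type of `column`.

    Hypothetical helper: for a vector column the result is expected to be
    'pyspark.mllib.linalg' (RDD-based API) or 'pyspark.ml.linalg'
    (DataFrame-based API).
    """
    for field in df.schema.fields:
        if field.name == column:
            return type(field.dataType).__module__
    raise KeyError("no column named %r" % column)
```

Called on the DataFrame from the question, this would be expected to return `'pyspark.mllib.linalg'`, the type that `VectorIndexer` rejects in the error message.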

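The difference between spark.ml and spark.mllib

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.

More details can be found here: http://spark.apache.org/docs/latest/ml-guide.html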
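To stay on the DataFrame-based API, the data can be loaded through the DataFrame reader instead of `MLUtils.loadLibSVMFile`, which is how the Spark 2.0.0 `spark.ml` examples load this file. A minimal sketch, assuming `spark` is an existing `SparkSession` (the question only shows `sc`); the function name is made up for illustration.

```python
def load_libsvm_as_dataframe(spark, path="data/mllib/sample_libsvm_data.txt"):
    """Load a LIBSVM file with the DataFrame reader.

    Assumes `spark` is a SparkSession. The resulting `features` column
    should hold pyspark.ml.linalg vectors, the type that spark.ml
    transformers such as VectorIndexer require.
    """
    return spark.read.format("libsvm").load(path)
```

For a DataFrame that was already built from the RDD-based API, `MLUtils.convertVectorColumnsToML(data, "features")` (available since Spark 2.0.0) is another option for converting the column in place of reloading the file.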

With Spark 2.0, try the Spark 2.0.0 example (https://spark.apache.org/docs/2.0.0/mllib-decision-tree.html):

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")

Find the full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo.
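The test error in the example is the number of (label, prediction) pairs that disagree, divided by the total number of test instances. The same arithmetic can be sketched on a plain Python list. The function name and the sample pairs below are made up for illustration and are not output from the model.

```python
def error_rate(labels_and_predictions):
    """Fraction of (label, prediction) pairs that disagree."""
    pairs = list(labels_and_predictions)
    if not pairs:
        raise ValueError("need at least one pair")
    wrong = sum(1 for label, prediction in pairs if label != prediction)
    return wrong / float(len(pairs))


# Made-up values: one of the four predictions is wrong.
pairs = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
print("Test Error = " + str(error_rate(pairs)))  # Test Error = 0.25
```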


Edited on 2020-11-3
