Mllib支持常见的回归模型,如线性回归,广义线性回归,决策树回归,随机森林回归,梯度提升树回归,生存回归,保序回归。
下面仅以线性回归和决策树回归为例。
from pyspark.ml.regression import LinearRegression
# 载入数据
dfdata = spark.read.format("libsvm")\
.load("data/sample_linear_regression_data.txt")
# 定义模型
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# 训练模型
lrModel = lr.fit(dfdata)
# 模型参数
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))
# 评估模型
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
Coefficients: [0.0,0.32292516677405936,-0.3438548034562218,1.9156017023458414,0.05288058680386263,0.765962720459771,0.0,-0.15105392669186682,-0.21587930360904642,0.22025369188813426]
Intercept: 0.1598936844239736
numIterations: 7
objectiveHistory: [0.49999999999999994, 0.4967620357443381, 0.4936361664340463, 0.4936351537897608, 0.4936351214177871, 0.49363512062528014, 0.4936351206216114]
+--------------------+
| residuals|
+--------------------+
| -9.889232683103197|
| 0.5533794340053554|
| -5.204019455758823|
| -20.566686715507508|
| -9.4497405180564|
| -6.909112502719486|
| -10.00431602969873|
| 2.062397807050484|
| 3.1117508432954772|
| -15.893608229419382|
| -5.036284254673026|
| 6.483215876994333|
| 12.429497299109002|
| -20.32003219007654|
| -2.0049838218725005|
| -17.867901734183793|
| 7.646455887420495|
| -2.2653482182417406|
|-0.10308920436195645|
| -1.380034070385301|
+--------------------+
only showing top 20 rows
RMSE: 10.189077
r2: 0.022861
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
# 载入数据
dfdata = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")
(dftrain, dftest) = dfdata.randomSplit([0.7, 0.3])
# 处理类别特征
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(dfdata)
# 使用决策树模型
dt = DecisionTreeRegressor(featuresCol="indexedFeatures")
# 构建流水线
pipeline = Pipeline(stages=[featureIndexer, dt])
# 训练流水线
model = pipeline.fit(dftrain)
# 进行预测
dfpredictions = model.transform(dftest)
dfpredictions.select("prediction", "label", "features").show(5)
# 评估模型
evaluator = RegressionEvaluator(
labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(dfpredictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
treeModel = model.stages[1]
print(treeModel)
+----------+-----+--------------------+
|prediction|label| features|
+----------+-----+--------------------+
| 0.0| 0.0|(692,[123,124,125...|
| 0.0| 0.0|(692,[124,125,126...|
| 0.0| 0.0|(692,[126,127,128...|
| 0.0| 0.0|(692,[126,127,128...|
| 0.0| 0.0|(692,[126,127,128...|
+----------+-----+--------------------+
only showing top 5 rows
Root Mean Squared Error (RMSE) on test data = 0
DecisionTreeRegressionModel: uid=DecisionTreeRegressor_06213a3aaeb0, depth=2, numNodes=5, numFeatures=692
该代码样列数据集下载链接