The previous post gave us a basic introduction to Spark MLlib; starting with this one, we move into practice. Since the focus here is on applying Spark MLlib, I won't go through the derivations of the algorithms; I'll organize those into dedicated posts later. This post mainly covers the supervised learning algorithms in Spark MLlib: Logistic Regression, Naive Bayes, SVM (Support Vector Machine), Decision Tree, and Linear Regression.
It is worth mentioning that although Spark MLlib already provides interfaces for the common algorithms, after reading its source code you may find that, for reasons of performance, stability, or anything else, it can still make sense to implement an algorithm yourself.
First, let's look at the definition of supervised learning. Here is the definition given by Wikipedia:
"Supervised learning is the machine learning task of inferring a function from labeled training data."
Put simply, supervised learning is about finding the pattern (i.e., the function) hidden in data. Finding the rule behind a sequence of numbers is something we already did in middle school: given 1, 2, 4, 8, ..., the rule is 2^x, where x runs over the non-negative integers.
Among the supervised learning algorithms covered in this post, Logistic Regression, Naive Bayes, SVM (Support Vector Machine), and Decision Tree are classification algorithms. Stated formally, a classification algorithm assigns an item to one of a set of existing categories based on its features or attributes; stated informally, it simply sorts things into bins.
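All of the MLlib interfaces used below consume their training data as an RDD of LabeledPoint, i.e., a numeric label paired with a feature vector. As a minimal sketch (e.g., in spark-shell; the values are purely illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// One training example: label 1.0 with a three-dimensional feature vector.
val point = LabeledPoint(1.0, Vectors.dense(0.5, 1.2, -0.3))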
Linear Regression is a statistical method for determining the quantitative dependence between two or more variables. Spark MLlib provides two interfaces for it: LinearRegressionWithSGD and LassoWithSGD. LassoWithSGD can be seen as an enhanced version of plain linear regression: it adds L1 regularization, which shrinks the coefficients and helps when there are more features than samples, i.e., when the input matrix X is not full rank. Here we take LinearRegressionWithSGD as the example; the code is as follows:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object LinearRegression {
  def main(args: Array[String]): Unit = {
    val length = args.length
    if (length != 2 && length != 3) {
      System.err.println("Usage: <input file> <iteration number> <step size(optional)>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Iteration number
    val iteration = args(1).toInt
    // Step size, default value is 0.01
    val stepSize = if (length == 3) args(2).toDouble else 0.01
    // Parse each line ("label:f1 f2 ...") into a LabeledPoint
    val parseData = data.map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // Train the model
    val model = LinearRegressionWithSGD.train(parseData, iteration, stepSize)
    // Check its coefficients
    val weight = model.weights
    println(weight)
    sc.stop()
  }
}
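Each input line is expected in the form label:f1 f2 f3, for example 5.0:1.0 2.0 3.0. If you want the L1-regularized variant mentioned above, the call is almost identical; the sketch below assumes the same parseData, iteration, and stepSize as in the program above, and the regularization parameter 0.1 is only illustrative, not a recommended value:

import org.apache.spark.mllib.regression.LassoWithSGD

// Same data, iteration count, and step size; the last argument is the
// L1 regularization strength (illustrative value).
val lassoModel = LassoWithSGD.train(parseData, iteration, stepSize, 0.1)
println(lassoModel.weights)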
Logistic Regression is mainly used for binary classification, e.g., deciding whether an email is spam; its y values are 0 and 1. For a detailed introduction to the algorithm, see my other post, the Logistic Regression notes. Spark MLlib also provides two interfaces for it: LogisticRegressionWithSGD and LogisticRegressionWithLBFGS. LogisticRegressionWithLBFGS is the recommended one, since L-BFGS removes the need to hand-tune a step size. Although the two interfaces are used in much the same way, to make the step size explicit we use LogisticRegressionWithSGD here:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LogisticRegression {
  def main(args: Array[String]): Unit = {
    val length = args.length
    if (length != 2 && length != 3) {
      System.err.println("Usage: <input file> <iteration number> <step size(optional)>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Iteration number
    val iteration = args(1).toInt
    // Step size, default value is 0.01
    val stepSize = if (length == 3) args(2).toDouble else 0.01
    // Parse each line ("label:f1 f2 ...") into a LabeledPoint
    val parseData = data.map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // Train a model
    val model = LogisticRegressionWithSGD.train(parseData, iteration, stepSize)
    // Check its coefficients
    val weight = model.weights
    println(weight)
    sc.stop()
  }
}
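Since LogisticRegressionWithLBFGS is the recommended interface, switching to it only changes the training call. A minimal sketch, assuming the same parseData RDD as in the program above:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// No step size to tune; setNumClasses(2) makes the binary case explicit.
val lbfgsModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(parseData)
println(lbfgsModel.weights)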
Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent. For a detailed introduction to the algorithm, see my other post, the Naive Bayes notes. In Spark MLlib the interface is simply called NaiveBayes; the code is as follows:
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object NaiveBayesDemo {
  def main(args: Array[String]): Unit = {
    val length = args.length
    if (length != 1 && length != 2) {
      System.err.println("Usage: <input file> <lambda(optional)>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Smoothing parameter lambda, default value is 1.0
    val lambda = if (length == 2) args(1).toDouble else 1.0
    // Parse each line ("label:f1 f2 ...") into a LabeledPoint
    val parseData = data.map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // Split the data half and half into training and test datasets
    val splits = parseData.randomSplit(Array(0.5, 0.5), seed = 11L)
    val training = splits(0)
    val test = splits(1)
    // Train a model with the training dataset
    val model = NaiveBayes.train(training, lambda)
    // Predict the labels of the test dataset and measure accuracy
    val prediction = test.map(p => (model.predict(p.features), p.label))
    val accuracy = 1.0 * prediction.filter(x => x._1 == x._2).count() / test.count()
    println(s"Test accuracy = $accuracy")
    sc.stop()
  }
}
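The trained model can also score individual feature vectors directly. A small sketch, continuing from the variables above (Vectors and model in scope, before sc.stop()); the feature values are made up and their dimension must match your training data:

// Predict the class of a single new point (illustrative values).
val newPoint = Vectors.dense(1.0, 0.0, 3.0)
println(s"Predicted label: ${model.predict(newPoint)}")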
An SVM finds a hyperplane that separates the data into two classes, +1 and -1; the points closest to the separating hyperplane are called support vectors. In Spark MLlib the implementation interface is SVMWithSGD; the code is as follows:
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object SVM {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <input file> <iteration number>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Iteration number
    val iteration = args(1).toInt
    // Parse each line ("label:f1 f2 ...") into a LabeledPoint
    val parseData = data.map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // Train a model
    val model = SVMWithSGD.train(parseData, iteration)
    // Check its coefficients
    val weight = model.weights
    println(weight)
    sc.stop()
  }
}
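By default the trained SVMModel applies a threshold of 0 and returns 0/1 labels. If you want the raw margin instead, for example to compute the area under the ROC curve, you can clear the threshold. A sketch under the assumption that test is a held-out RDD[LabeledPoint] (not part of the program above):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Return raw distances to the separating hyperplane instead of 0/1 labels.
model.clearThreshold()

// Pair each raw score with the true label, then compute AUC.
val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
println(s"Area under ROC = $auc")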