AdaBoost is short for adaptive boosting.
AdaBoost is an ensemble learning method of the Boosting family. Its weak learners are strongly dependent and sequential: the learners are obtained by serial training, and each new classifier is trained according to the performance of the classifiers already trained. Boosting obtains each new classifier by concentrating on the data that the existing classifiers have misclassified.
The AdaBoost procedure works as follows: after the weight vector D has been computed, AdaBoost enters the next round of iteration. It keeps repeating the cycle of training a weak classifier and adjusting the weights until the training error rate reaches 0 or the number of weak classifiers reaches the user-specified limit.
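Written out explicitly, the standard AdaBoost updates that this loop implements are (with h_t the weak classifier of round t, y_i in {-1, +1} the labels, and D_t the sample weights):

```latex
% weighted error of the round-t weak classifier under the weights D_t
\varepsilon_t = \sum_{i=1}^{m} D_t(i)\,\mathbb{1}\!\left[h_t(x_i) \neq y_i\right]

% the classifier's vote weight alpha
\alpha_t = \tfrac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}

% weight update: misclassified samples get heavier, correct ones lighter
D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t\, y_i\, h_t(x_i)}}{Z_t},
\qquad Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t\, y_i\, h_t(x_i)}
```

These are exactly the quantities computed in the training code below: the weighted error, the alpha, and the renormalized weight vector D.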
A single-level decision tree, also called a decision stump, is the simplest kind of decision tree.
The pseudocode for building one looks like this:

    Set minError to +∞
    For every feature in the dataset:
        For every threshold step:
            For each inequality direction ('lt' and 'gt'):
                Build a decision stump and test it on the weighted dataset
                If its weighted error is lower than minError: keep it as the best stump
    Return the best stump

The Python code is as follows:
from numpy import *    # the code below uses NumPy's mat, ones, zeros, inf, log, etc.

def buildStump(dataArray, classLabels, D):
    """Find the best decision stump under the sample weight vector D."""
    dataMat = mat(dataArray)
    labelsMat = mat(classLabels).T
    m, n = shape(dataMat)
    numSteps = 10.0                          # number of threshold candidates per feature
    bestStump = {}
    minError = inf
    bestClasEst = mat(zeros((m, 1)))
    for i in range(n):                       # loop over every feature
        rangeMin = dataMat[:, i].min()
        rangeMax = dataMat[:, i].max()
        stepSize = (rangeMax - rangeMin) / numSteps
        for j in range(-1, int(numSteps) + 1):       # loop over threshold candidates
            for inequal in ['lt', 'gt']:             # try both inequality directions
                threshVal = rangeMin + float(j) * stepSize
                predictedVal = ones((m, 1))
                if inequal == 'lt':
                    predictedVal[dataMat[:, i] <= threshVal] = -1.0
                else:
                    predictedVal[dataMat[:, i] > threshVal] = -1.0
                errorArr = ones((m, 1))
                errorArr[predictedVal == labelsMat] = 0
                weightedError = float(D.T * errorArr)    # error weighted by D
                print("dimen: %d, thresh: %.2f, thresh inequal: %s, weighted error: %.3f"
                      % (i, threshVal, inequal, weightedError))
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVal.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump, minError, bestClasEst
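As a quick sanity check, here is a compact, self-contained NumPy re-implementation of the same exhaustive search (every feature, every threshold step, both inequality directions), run on a small made-up five-sample dataset; the data and names here are illustrative, not from the original text.

```python
import numpy as np

def build_stump(data, labels, D, num_steps=10):
    """Compact stump search: same logic as buildStump, with plain arrays."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels, dtype=float)
    m, n = data.shape
    best, min_err, best_est = {}, np.inf, None
    for dim in range(n):                         # every feature
        lo, hi = data[:, dim].min(), data[:, dim].max()
        step = (hi - lo) / num_steps
        for j in range(-1, num_steps + 1):       # every threshold candidate
            thresh = lo + j * step
            for ineq in ('lt', 'gt'):            # both inequality directions
                pred = np.ones(m)
                mask = data[:, dim] <= thresh if ineq == 'lt' else data[:, dim] > thresh
                pred[mask] = -1.0
                err = D[pred != labels].sum()    # weighted error under D
                if err < min_err:
                    min_err, best_est = err, pred.copy()
                    best = {'dim': dim, 'thresh': thresh, 'ineq': ineq}
    return best, min_err, best_est

# a made-up toy dataset: two features, labels in {-1, +1}
X = [[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 1.0, -1.0, -1.0, 1.0]
D = np.ones(5) / 5.0                             # uniform initial sample weights
stump, err, est = build_stump(X, y, D)
print(stump, err)   # best split: feature 0 at 1.3, 'lt', weighted error 0.2
```

With uniform weights, no stump can do better than one mistake here (samples 2, 3, 4 share the same value on feature 1, and samples 0 and 3 share the same value on feature 0 with opposite labels), so a weighted error of 0.2 is the best achievable.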
A decision stump only needs to store dim (the feature index), thresh (the threshold value), ineq (the inequality direction), and alpha.
The stump-generating function is a simplified version of a decision tree; it is the so-called weak classifier. Next, multiple weak classifiers will be combined to build the AdaBoost algorithm.
The Python code is as follows:
def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    """Train an AdaBoost ensemble of decision stumps (DS)."""
    weakClassArr = []
    dataMat = mat(dataArr)
    m = shape(dataMat)[0]
    D = mat(ones((m, 1)) / m)                # start with uniform sample weights
    aggClassEst = mat(zeros((m, 1)))         # running weighted vote of all stumps
    for i in range(numIt):
        bestStump, error, classEst = buildStump(dataMat, classLabels, D)
        # alpha is this stump's vote weight; max(error, 1e-16) avoids division by zero
        alpha = 0.5 * log((1.0 - error) / max(error, 1e-16))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)
        # reweight: misclassified samples get heavier, correct ones lighter
        expon = multiply(-alpha * mat(classEst), mat(classLabels).T)
        D = multiply(D, exp(expon))
        D = D / D.sum()                      # renormalize so the weights sum to 1
        aggClassEst += alpha * classEst
        aggError = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        errorRate = aggError.sum() / m
        print('total error:', errorRate)
        if errorRate == 0.0:                 # stop early once the training error is 0
            break
    return weakClassArr
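To watch the loop converge, here is a condensed, self-contained sketch of the same training procedure (an exhaustive stump search plus the alpha and D updates above), run on a small made-up dataset; the data and names are illustrative, not from the original text.

```python
import numpy as np

def best_stump(X, y, D, num_steps=10):
    # exhaustive stump search: every feature, every threshold, both directions
    m, n = X.shape
    best, min_err, best_pred = None, np.inf, None
    for dim in range(n):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        step = (hi - lo) / num_steps
        for j in range(-1, num_steps + 1):
            thresh = lo + j * step
            for ineq in ('lt', 'gt'):
                pred = np.ones(m)
                mask = X[:, dim] <= thresh if ineq == 'lt' else X[:, dim] > thresh
                pred[mask] = -1.0
                err = D[pred != y].sum()
                if err < min_err:
                    best, min_err, best_pred = (dim, thresh, ineq), err, pred.copy()
    return best, min_err, best_pred

def ada_train(X, y, num_it=40):
    m = X.shape[0]
    D = np.ones(m) / m                       # uniform initial sample weights
    agg = np.zeros(m)                        # running weighted vote
    classifiers = []
    for _ in range(num_it):
        stump, err, pred = best_stump(X, y, D)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-16))
        classifiers.append((stump, alpha))
        D = D * np.exp(-alpha * y * pred)    # reweight the samples
        D /= D.sum()
        agg += alpha * pred
        if np.all(np.sign(agg) == y):        # training error reached 0
            break
    return classifiers

X = np.array([[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
clf = ada_train(X, y)
print(len(clf))        # this toy set is fit perfectly with 3 stumps
```

The first stump misclassifies one sample (weighted error 0.2, alpha = 0.5·ln 4 ≈ 0.693); boosting then drives that sample's weight up, so the next stumps concentrate on it, and the combined vote classifies all five samples correctly after three rounds.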
The user needs to specify the number of training iterations, numIt.
The Python code for classification is as follows:
def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    """Classify samples with a single decision stump."""
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:, dimen] > threshVal] = -1.0
    return retArray

def adaClassify(datToClass, classifierArr):
    """Classify samples with the trained ensemble of weak classifiers."""
    dataMat = mat(datToClass)
    m = shape(dataMat)[0]
    aggClassEst = mat(zeros((m, 1)))
    for i in range(len(classifierArr)):      # accumulate each stump's weighted vote
        classEst = stumpClassify(dataMat, classifierArr[i]['dim'],
                                 classifierArr[i]['thresh'], classifierArr[i]['ineq'])
        aggClassEst += classEst * classifierArr[i]['alpha']
        print(aggClassEst)
    return sign(aggClassEst)                 # the sign of the total vote is the class
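For example, with a hypothetical two-stump ensemble (the dims, thresholds, and alphas below are made up for illustration, not a trained model), the weighted votes combine like this:

```python
import numpy as np

def stump_classify(X, dim, thresh, ineq):
    # one stump's vote: -1 on one side of the threshold, +1 on the other
    pred = np.ones(X.shape[0])
    mask = X[:, dim] <= thresh if ineq == 'lt' else X[:, dim] > thresh
    pred[mask] = -1.0
    return pred

# a hypothetical ensemble of two stumps with hand-chosen alphas
classifiers = [
    {'dim': 0, 'thresh': 1.0, 'ineq': 'lt', 'alpha': 0.9},
    {'dim': 1, 'thresh': 1.5, 'ineq': 'gt', 'alpha': 0.5},
]

X = np.array([[0.5, 2.0], [2.0, 1.0]])       # two points to classify
agg = np.zeros(X.shape[0])
for c in classifiers:                        # accumulate each stump's weighted vote
    agg += c['alpha'] * stump_classify(X, c['dim'], c['thresh'], c['ineq'])
print(np.sign(agg))                          # [-1.  1.]
```

The first point gets votes -0.9 and -0.5 (total -1.4, class -1); the second gets +0.9 and +0.5 (total +1.4, class +1). Note that the magnitude of the aggregate also indicates how confident the ensemble is.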
The code for reading the data from a file:
def loadDataSet(filename):
    """Load a tab-separated data file; the last column is the class label."""
    numFeat = len(open(filename).readline().split('\t')) - 1
    fr = open(filename)
    dataArr = []
    classLabels = []
    for line in fr.readlines():
        lineArr = line.strip().split('\t')
        featureArr = []
        for j in range(numFeat):
            featureArr.append(float(lineArr[j]))
        dataArr.append(featureArr)
        classLabels.append(float(lineArr[-1]))
    fr.close()
    dataMat = mat(dataArr)
    return dataMat, classLabels
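A quick way to check the expected file format is to write a tiny tab-separated file and read it back. The sketch below uses a `with`-based variant of the loader above; the file contents and names are made up for illustration.

```python
import os
import tempfile
import numpy as np

def load_dataset(filename):
    # same format as above: tab-separated features, last column is the label
    data, labels = [], []
    with open(filename) as fr:
        for line in fr:
            parts = line.strip().split('\t')
            data.append([float(v) for v in parts[:-1]])
            labels.append(float(parts[-1]))
    return np.array(data), labels

# write a tiny made-up two-feature dataset to a temporary file
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('1.0\t2.1\t1.0\n2.0\t1.1\t1.0\n1.3\t1.0\t-1.0\n')

X, y = load_dataset(path)
os.remove(path)                              # clean up the temporary file
print(X.shape, y)                            # (3, 2) [1.0, 1.0, -1.0]
```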
The process of making predictions with AdaBoost is as follows:
>>> datArr,labelArr = adaboost.loadDataSet('horseColicTraining2.txt')
>>> classifierArray = adaboost.adaBoostTrainDS(datArr,labelArr,10)
total error: 0.284280936455
total error: 0.284280936455
...
total error: 0.230769230769
>>> testArr,testLabelArr = adaboost.loadDataSet('horseColicTest2.txt')
>>> prediction10 = adaboost.adaClassify(testArr,classifierArray)
To get the number of misclassified examples, type in:
>>> errArr=mat(ones((67,1)))
>>> errArr[prediction10!=mat(testLabelArr).T].sum()
16.0
>>> 16/67
0.23880597014925373