This example is based on the end-of-chapter exercise of Chapter 2, "End-to-End Machine Learning Project", of *Hands-On Machine Learning with Scikit-Learn and TensorFlow*. In the book the exercise solution is interleaved with the main text, so I reorganized it and added a few things of my own (hence I cannot guarantee it is free of mistakes). It is offered for critical reference by anyone who needs it, and doubles as a note to myself.
The dataset is master/datasets/housing from https://github.com/ageron/handson-ml ; in short, the task is to predict a house's price from its attributes. The data file has 10 fields. Specifically, the features are
X = {'longitude', 'latitude', 'housing_median_age',
'total_rooms', 'total_bedrooms', 'population',
'households', 'median_income', 'ocean_proximity'},
where 'ocean_proximity' is a text field describing how close the house is to the ocean, with only 5 possible values: {'NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'}.
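Since 'ocean_proximity' is text, it has to be converted to numbers before modeling. A minimal sketch of the one-hot encoding used later in the pipeline, applied directly to the five category values:

```python
from sklearn.preprocessing import LabelBinarizer

# the five possible values of 'ocean_proximity'
categories = ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']
encoder = LabelBinarizer()
onehot = encoder.fit_transform(categories)
print(onehot.shape)  # (5, 5): one binary column per category
```

Each row contains a single 1 marking its category, so the text column becomes 5 numeric columns.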
The label is
Y = {'median_house_value'},
which is continuous, so this is a multivariate regression problem. A rough schematic of the problem is shown below:
The middle of the figure above is the flow chart of this end-to-end machine-learning model (the pipeline); the 6 boxes on either side are the pipeline's components, i.e., the operations of this model. A few points worth noting:
(1) the data are assumed to contain no outliers, since the pipeline has no outlier-handling component;
(2) Imputer and StandardScaler are classes provided by scikit-learn, while the remaining classes have to be written by hand;
(3) using the LabelBinarizer class directly inside a Pipeline raises an error (see https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize );
the workaround is to wrap it in a new class, MyLabelBinarizer, built on top of it (see the code below).
[For more detail, please refer to the original book or the code below.]
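The cause of the LabelBinarizer error can be seen from its method signature: Pipeline calls `fit_transform(X, y)` with three positional arguments (including `self`), while `LabelBinarizer.fit_transform` accepts only two. A quick check:

```python
import inspect
from sklearn.preprocessing import LabelBinarizer

# LabelBinarizer.fit_transform takes only (self, y); Pipeline passes (self, X, y),
# hence "fit_transform takes 2 positional arguments but 3 were given"
params = list(inspect.signature(LabelBinarizer.fit_transform).parameters)
print(params)
```

The wrapper class defined later simply restores the `(X, y)` calling convention that Pipeline expects.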
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np
import time
#### Read the data and split it into training and test sets
data=pd.read_csv('../handsOn/datasets/housing/housing.csv')
housing = data.drop("median_house_value", axis=1)
housing_labels = data["median_house_value"].copy()
L=int(0.8*len(housing))
# .ix was removed in recent pandas; iloc also avoids the off-by-one overlap
# of label-based slicing (ix[0:L] includes row L)
train_x=housing.iloc[:L]
train_y=housing_labels.iloc[:L]
test_x=housing.iloc[L:]
test_y=housing_labels.iloc[L:]
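A sequential 80/20 split like the one above is only sound if the rows are already in random order; scikit-learn's `train_test_split` shuffles before splitting. A sketch on a toy frame standing in for the housing data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame standing in for the housing DataFrame
df = pd.DataFrame({'x': range(10), 'y': range(10)})
train, test = train_test_split(df, test_size=0.2, random_state=42)
print(len(train), len(test))  # 8 2
```

Fixing `random_state` keeps the split reproducible across runs, so the test set stays untouched while you iterate.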
cat_attribute=['ocean_proximity']
housing_num=housing.drop(cat_attribute,axis=1)
num_attribute=list(housing_num)
start=time.perf_counter()  # time.clock() was removed in Python 3.8
#### Build the pipeline
from sklearn.preprocessing import StandardScaler
# Imputer was renamed SimpleImputer and moved to sklearn.impute in scikit-learn 0.20+
from sklearn.impute import SimpleImputer as Imputer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelBinarizer
## Transformer that builds the new combined features
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributeAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        '''X is an array'''
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
## Transformer that selects columns by name
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        '''X is a DataFrame'''
        return X[self.attribute_names].values
## One-hot encode the categorical attribute
class MyLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = LabelBinarizer()
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)
## Pipeline for the numerical features
# Every step in a Pipeline except the last must be a transformer;
# the final estimator can be of any type
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribute)),
    ('imputer', Imputer(strategy='median')),
    ('attribute_adder', CombinedAttributeAdder()),
    ('std_scaler', StandardScaler())
])
## Pipeline for the categorical feature
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribute)),
    ('label_binarizer', MyLabelBinarizer())
])
## Join the two feature groups by combining the pipelines with FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])
## Feature selection: rank features by random-forest importance and keep the top K
class TopFeaturesSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importance_k=5):
        self.top_k_attributes = None
        self.k = feature_importance_k
    def fit(self, X, y):
        reg = RandomForestRegressor()
        reg.fit(X, y)
        feature_importance = reg.feature_importances_
        self.top_k_attributes = np.argsort(-feature_importance)[0:self.k]
        return self
    def transform(self, X, **fit_params):
        return X[:, self.top_k_attributes]
## Pipeline that preprocesses the data and keeps the K most important features
prepare_and_top_feature_pipeline = Pipeline([
    ('full_pipeline', full_pipeline),
    ('feature_selector', TopFeaturesSelector(feature_importance_k=5))
])
## Use GridSearchCV to find the best random-forest parameters
train_x_=full_pipeline.fit_transform(train_x)
# Tree model: select the best parameters with GridSearchCV
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [8, 10]
}
reg=RandomForestRegressor()
grid_search=GridSearchCV(reg,param_grid=param_grid,cv=5)
grid_search.fit(train_x_,train_y)
## Build the final preprocessing-and-prediction pipeline
prepare_and_predict_pipeline = Pipeline([
    ('prepare', prepare_and_top_feature_pipeline),
    ('random_forest', RandomForestRegressor(**grid_search.best_params_))
])
#### Run GridSearchCV on the full pipeline to pick the best pipeline parameters
param_grid2={'prepare__feature_selector__feature_importance_k': [1, 3, 5, 10],
             'prepare__full_pipeline__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent']}
grid_search2=GridSearchCV(prepare_and_predict_pipeline,param_grid=param_grid2,cv=2,
scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search2.fit(train_x,train_y)
pred=grid_search2.predict(test_x)
end=time.perf_counter()
print('RMSE on test set={}'.format(np.sqrt(mean_squared_error(test_y,pred))))
print('cost time={}'.format(end-start))
print('grid_search2.best_params_=\n',grid_search2.best_params_)
The accuracy of the output above is not high: to keep the runtime short, both GridSearchCV runs use very small parameter grids. Widening the parameter ranges improves the accuracy.
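One way to widen the search without a combinatorial blow-up is RandomizedSearchCV, which samples a fixed number of candidates from parameter distributions instead of enumerating a full grid. The ranges below are illustrative assumptions, not tuned values:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# sample 20 parameter combinations from the distributions;
# the ranges are illustrative, not tuned values
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(5, 30),
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                            param_distributions=param_dist,
                            n_iter=20, cv=5,
                            scoring='neg_mean_squared_error',
                            random_state=42)
# search.fit(train_x_, train_y) would then evaluate the 20 sampled candidates
```

The search cost is controlled by `n_iter` rather than by the product of all grid sizes, so the ranges can be made much wider at the same budget.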