sklearn Pipeline Usage Example: Housing Price Prediction

Contents

    • 1. Problem Description
    • 2. Pipeline Flow Diagram
    • 3. Example Code

1. Problem Description

This example is based on the exercise for Chapter 2, "End-to-End Machine Learning Project", of Hands-On Machine Learning with Scikit-Learn and TensorFlow. In the book the solution is interleaved with the main text, so I reorganized it and added a few things of my own (hence no guarantee that it is free of mistakes). Use it as a critical reference if you need it; it also serves as a note to myself.
  The dataset is master/datasets/housing from https://github.com/ageron/handson-ml. In short, the task is to predict a house's price from its attributes. The data file has 10 fields. The features are

     X = {'longitude', 'latitude', 'housing_median_age',
          'total_rooms', 'total_bedrooms', 'population',
          'households', 'median_income', 'ocean_proximity'},

where 'ocean_proximity' is a text field describing how close the house is to the ocean, taking only 5 values: {'NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'}.
The label is

     Y = {'median_house_value'},

which is continuous, so this is a multivariate regression problem. A rough diagram of the problem is shown below:
[Figure 1: overview of the housing-price prediction problem]

2. Pipeline Flow Diagram

[Figure 2: pipeline flow diagram with its six components]
The middle of the figure above is the flow diagram (pipeline) of this end-to-end machine-learning model; the six boxes on the two sides are the pipeline's components, i.e. the operations of this model. A few points worth noting:
(1) The categorical column in this data contains no missing values, which is why the categorical branch has no imputation component;
(2) Imputer and StandardScaler are classes that ship with scikit-learn; the other classes have to be written by hand;
(3) Using the LabelBinarizer class directly inside a Pipeline raises an error (see https://stackoverflow.com/questions/46162855/fit-transform-takes-2-positional-arguments-but-3-were-given-with-labelbinarize for why); the fix is to wrap it in a small class of our own, MyLabelBinarizer (see the code below).
[For more detail, please refer to the original book or to the code below.]
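As a quick illustration of point (3): the wrapper is only needed because LabelBinarizer's fit/transform methods do not accept the (X, y) calling convention that Pipeline uses. In scikit-learn 0.20 and later, OneHotEncoder accepts string columns directly, which is a simpler alternative (this is a note on the newer API, not part of the book's code):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# the five category values from the dataset description above
cats = pd.DataFrame({'ocean_proximity':
                     ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']})

encoder = OneHotEncoder(handle_unknown='ignore')
onehot = encoder.fit_transform(cats).toarray()  # one column per category
print(onehot.shape)  # (5, 5)
```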

3. Example Code

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np
import time

####  Read the data and split into training and test sets
data=pd.read_csv('../handsOn/datasets/housing/housing.csv')
housing = data.drop("median_house_value", axis=1)
housing_labels = data["median_house_value"].copy()

L=int(0.8*len(housing))
# .ix has been removed from pandas; use positional indexing with .iloc instead
train_x=housing.iloc[:L,:]
train_y=housing_labels.iloc[:L]
test_x=housing.iloc[L:,:]
test_y=housing_labels.iloc[L:]

cat_attribute=['ocean_proximity']
housing_num=housing.drop(cat_attribute,axis=1)
num_attribute=list(housing_num)

start=time.perf_counter()  # time.clock() was removed in Python 3.8
####   Build the pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer as Imputer  # Imputer moved here in sklearn >= 0.20
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelBinarizer

## Transformer that builds the new combined features
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributeAdder(BaseEstimator,TransformerMixin):
    def __init__(self,add_bedrooms_per_room=True):
        self.add_bedrooms_per_room=add_bedrooms_per_room
    def fit(self, X, y=None):
        return(self)
    def transform(self,X):
        '''X is an array'''
        rooms_per_household=X[:,rooms_ix]/X[:,household_ix]
        population_per_household=X[:,population_ix]/X[:,household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room=X[:,bedrooms_ix]/X[:,rooms_ix]
            return(np.c_[X,rooms_per_household,population_per_household,bedrooms_per_room])
        else:
            return(np.c_[X,rooms_per_household,population_per_household])
## Transformer that selects columns by name from a DataFrame
class DataFrameSelector(BaseEstimator,TransformerMixin):
    def __init__(self,attribute_names):
        self.attribute_names=attribute_names
    def fit(self, X, y=None):
        return(self)
    def transform(self,X):
        '''X is a DataFrame'''
        return(X[self.attribute_names].values)
## One-hot encode the categorical attribute
class MyLabelBinarizer(BaseEstimator,TransformerMixin):
    def __init__(self):
        self.encoder = LabelBinarizer()
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)

## Pipeline for the numerical features
# Every step in a Pipeline except the last must be a transformer; the final estimator can be of any type
num_pipeline=Pipeline([
    ('selector',DataFrameSelector(num_attribute)),
    ('imputer',Imputer(strategy='median')),
    ('attribute_adder',CombinedAttributeAdder()),
    ('std_scaler',StandardScaler())
])

## Pipeline for the categorical feature
cat_pipeline=Pipeline([
    ('selector',DataFrameSelector(cat_attribute)),
    ('label_binarizer',MyLabelBinarizer())
])

## Combine the two pipelines with FeatureUnion, concatenating both feature sets
full_pipeline=FeatureUnion(transformer_list=[
    ('num_pipeline',num_pipeline),
    ('cat_pipeline',cat_pipeline)
])
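For readers on scikit-learn 0.20 or later: ColumnTransformer now covers the select-then-transform pattern that DataFrameSelector plus FeatureUnion implement here, so the custom selector class is no longer strictly necessary. A minimal sketch under that assumption (the toy column values are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame with one numeric column (with a missing value) and one categorical column
df = pd.DataFrame({'total_rooms': [880.0, np.nan, 1274.0],
                   'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY']})

# ColumnTransformer routes each column list to its own sub-pipeline
full = ColumnTransformer([
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('std_scaler', StandardScaler())]), ['total_rooms']),
    ('cat', OneHotEncoder(), ['ocean_proximity']),
])
out = full.fit_transform(df)
print(out.shape)  # 1 scaled numeric column + 2 one-hot columns -> (3, 3)
```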

## Feature selector: rank features by random-forest importance and keep the top k
class TopFeaturesSelector(BaseEstimator,TransformerMixin):
    def __init__(self,feature_importance_k=5):
        self.top_k_attributes=None
        self.k=feature_importance_k
    def fit(self,X,y):
        reg=RandomForestRegressor()
        reg.fit(X,y)
        feature_importance=reg.feature_importances_
        top_k_attributes=np.argsort(-feature_importance)[0:self.k]
        self.top_k_attributes=top_k_attributes
        return(self)
    def transform(self, X,**fit_params):
        return(X[:,self.top_k_attributes])
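The ranking inside TopFeaturesSelector.fit is just a descending argsort; a standalone check of that one-liner, with toy importance values:

```python
import numpy as np

importances = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
k = 2
# negate so that argsort (ascending) yields indices of the largest values first
top_k = np.argsort(-importances)[:k]
print(top_k)  # [1 3]
```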
## Pipeline: preprocessing plus selection of the k most important features
prepare_and_top_feature_pipeline=Pipeline([
    ('full_pipeline',full_pipeline),
    ('feature_selector',TopFeaturesSelector(feature_importance_k=5))
])

## Find the best random-forest parameters with GridSearchCV
train_x_=full_pipeline.fit_transform(train_x)
# Tree model,select best parameter with GridSearchCV
param_grid={
    'n_estimators':[10,50],
    'max_depth':[8,10]
}
reg=RandomForestRegressor()
grid_search=GridSearchCV(reg,param_grid=param_grid,cv=5)
grid_search.fit(train_x_,train_y)

## Final pipeline: preprocessing + prediction
prepare_and_predict_pipeline=Pipeline([
    ('prepare',prepare_and_top_feature_pipeline),
    ('random_forest',RandomForestRegressor(**grid_search.best_params_))
])

####   Use GridSearchCV to pick the best parameters of the full pipeline
param_grid2={'prepare__feature_selector__feature_importance_k':[1,3,5,10],
             'prepare__full_pipeline__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent']}
grid_search2=GridSearchCV(prepare_and_predict_pipeline,param_grid=param_grid2,cv=2,
                          scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search2.fit(train_x,train_y)
pred=grid_search2.predict(test_x)

end=time.perf_counter()  # time.clock() was removed in Python 3.8
print('RMSE on test set={}'.format(np.sqrt(mean_squared_error(test_y,pred))))
print('cost time={}'.format(end-start))
print('grid_search2.best_params_=\n',grid_search2.best_params_)
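The keys in param_grid2 above follow scikit-learn's step__parameter convention for addressing parameters nested inside a Pipeline; a minimal demonstration with a toy pipeline (the steps here are chosen only for illustration):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()), ('reg', LinearRegression())])
# nested parameters are addressed as <step name>__<parameter name>,
# and deeper nesting simply chains more double underscores
pipe.set_params(reg__fit_intercept=False)
print(pipe.get_params()['reg__fit_intercept'])  # False
```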

The accuracy of the script above is not high: to keep the runtime short, GridSearchCV was only given very small parameter grids. Widening the grids will improve the accuracy at the cost of a longer search.
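One way to widen the search for the random-forest step, at the cost of a much longer run (the values below are illustrative, not tuned recommendations):

```python
from itertools import product

# a broader grid for the random forest; 3 * 4 * 3 = 36 combinations,
# each refitted cv times by GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [8, 12, 16, None],
    'max_features': ['sqrt', 0.5, 1.0],
}
n_combos = len(list(product(*param_grid.values())))
print(n_combos)  # 36
```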
