Understanding the Machine Learning Workflow by Analyzing House Prices --Adam Studio

Machine Learning Workflow for House Prices

1- Introduction

This is a comprehensive ML workflow for the House Prices data set. Most people in this community are familiar with the House Prices dataset, but if you need to review information about it, please visit this link.

I have tried to show fans of machine learning on Kaggle how to approach machine learning problems, and I think this is a great opportunity for anyone who wants to learn the complete machine learning workflow with Python.

I want to cover most of the methods that were applied to house prices up to 2018. You can start to learn and review your knowledge of ML with a simple dataset, and try to learn and memorize the workflow for your journey in the data science world.

Before we get into the notebook, let me introduce some helpful resources.

2- Machine Learning Workflow

If you have already read some machine learning books, you have noticed that there are different ways to stream data into machine learning.

Most of these books share the following steps:

  • Define Problem
  • Specify Inputs & Outputs
  • Exploratory data analysis
  • Data Collection
  • Data Preprocessing
  • Data Cleaning
  • Visualization
  • Model Design, Training, and Offline Evaluation
  • Model Deployment, Online Evaluation, and Monitoring
  • Model Maintenance, Diagnosis, and Retraining

Of course, the same solution cannot be provided for all problems, so the best way is to create a general framework and adapt it to each new problem.

You can see my workflow in the image below:

[Image: workflow diagram]

Data Science has so many techniques and procedures that can confuse anyone.

2-2 Real-World Applications vs Competitions

We all know that there are differences between real-world problems and competition problems. The following figure, taken from one of the courses on Coursera, partly makes this comparison:

[Image: real-world applications vs competitions]

As you can see, there are many more steps to work through in real-world problems.

3- Problem Definition

I think one of the most important things when you start a new machine learning project is defining your problem. That means you should understand the business problem (problem formalization).

Problem definition has four steps, which are illustrated in the picture below:

[Image: the four steps of problem definition]

3-1 Problem Feature

We will use the House Prices data set. This dataset contains information about house prices, and the target value is:

  • SalePrice

Why am I using the House Prices dataset?

  • This is a good project because it is so well understood.
  • Attributes are numeric and categorical, so you have to figure out how to load and handle the data.
  • It is a regression problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
  • This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.
  • It leaves room for creative feature engineering.
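
As a small illustration of the "numeric and categorical" point above, here is a minimal sketch on a made-up three-row frame (the frame `toy` and its values are assumptions, not rows from the real data). It separates the two kinds of column and one-hot encodes the categorical one with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical miniature of the data: one numeric and one categorical column.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Split the column names by kind.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode the categorical columns so that every feature is numeric.
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(numeric_cols)      # ['LotArea']
print(categorical_cols)  # ['Street']
print(list(encoded.columns))
```

Linear models in scikit-learn expect numeric input, which is why some encoding step like this is usually needed before fitting.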

3-1-1 Metric

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
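
To make the remark about logarithms concrete, here is a minimal sketch of the metric in plain Python. The function name `rmse_of_logs` and the prices are made up for illustration. A prediction that is 10% too high costs about the same whether the house is cheap or expensive.

```python
import math

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices."""
    squared_errors = [
        (math.log(p) - math.log(t)) ** 2
        for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices: each prediction is 10% above the observed price.
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([1_000_000], [1_100_000])

# Both values are approximately log(1.1), about 0.0953.
print(cheap, expensive)
```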

3-2 Aim

It is our job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
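
The expected result can be pictured as a two-column table. This is a minimal sketch with made-up Ids and predictions. In the notebook the Ids would come from test['Id'] and the prices from a fitted model, and the output file name is an assumption.

```python
import pandas as pd

# Hypothetical Ids and predicted prices, for illustration only.
test_ids = pd.Series([1461, 1462, 1463], name="Id")
predicted_prices = [169000.0, 187500.0, 183000.0]

# One row per house in the test set: its Id and the predicted SalePrice.
submission = pd.DataFrame({"Id": test_ids, "SalePrice": predicted_prices})
print(submission)

# Writing the file is left commented out; the file name is an assumption.
# submission.to_csv("submission.csv", index=False)
```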

3-3 Variables

The variables are:

  • SalePrice - the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • BedroomAbvGr: Number of bedrooms above basement level
  • KitchenAbvGr: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale

4- Inputs & Outputs

For every machine learning problem, you should ask yourself, what are inputs and outputs for the model?

4-1 Inputs

  • train.csv - the training set
  • test.csv - the test set

4-2 Outputs

  • sale prices for every record in test.csv

5- Loading Packages

In this kernel we are using the following packages:

5-1 Import

from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.linear_model import Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import confusion_matrix
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from scipy.stats import skew
import scipy.stats as stats
import lightgbm as lgb
import seaborn as sns
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import scipy
import json
import sys
import csv
import os

5-2 Version

print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

5-3 Setup

A few tiny adjustments for better code readability

pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline

6- Exploratory Data Analysis (EDA)

In this section, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.

  • Which variables suggest interesting relationships?
  • Which observations are unusual?

By the end of the section, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:

  • Data Collection
  • Visualization
  • Data Cleaning
  • Data Preprocessing

6-1 Data Collection

Data collection is the process of gathering and measuring data, information, or any variables of interest in a standardized and established manner that enables the collector to answer or test hypotheses and evaluate outcomes of the particular collection. [techopedia]

# import Dataset to play with it
train = pd.read_csv('../input/train.csv')
test= pd.read_csv('../input/test.csv')

The concat function does all of the heavy lifting of performing concatenation operations along an axis. Let us create all_data.

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))
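
Here is a minimal sketch of what this concatenation does, using two made-up frames (`toy_train` and `toy_test` are assumptions, not the real data). It also shows a detail that is easy to miss: the original row labels are kept, so they repeat unless `ignore_index=True` is passed.

```python
import pandas as pd

# Hypothetical miniature train and test frames.
toy_train = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
toy_test = pd.DataFrame({"LotArea": [11622, 14267, 13830]})

# Keep only the feature columns of each frame and stack them vertically.
toy_all = pd.concat((toy_train.loc[:, "LotArea":"LotArea"],
                     toy_test.loc[:, "LotArea":"LotArea"]))

print(toy_all.shape)           # (5, 1)
print(toy_all.index.tolist())  # [0, 1, 0, 1, 2]

# With ignore_index=True the result gets fresh labels 0..4 instead.
toy_all_reset = pd.concat((toy_train[["LotArea"]], toy_test), ignore_index=True)
print(toy_all_reset.index.tolist())  # [0, 1, 2, 3, 4]
```

Because the labels repeat, splitting the combined frame back into train and test rows is safer by position than by label.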

  • Each row is an observation (also known as: sample, example, instance, record).
  • Each column is a feature (also known as: predictor, attribute, independent variable, input, regressor, covariate).

After loading the data via pandas, we should check its type, content, and description with the following commands:

type(train),type(test)

6-1-1 Statistical Summary

  • 1- Dimensions of the dataset.

  • 2- Peek at the data itself.

  • 3- Statistical summary of all attributes.

  • 4- Breakdown of the data by the class variable.[7]

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

# shape
print(train.shape)

(1460, 81)

Train has one more column than test. Why? Because train also contains the target value, SalePrice.

# shape
print(test.shape)

(1459, 80)

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

You should see 1460 instances and 81 attributes for train, and 1459 instances and 80 attributes for test.

To get some information about the dataset, you can use the info() command.

print(train.info())


RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None

If you want to see the data type and the unique values of a column, you can use the following script:

train['Fence'].unique()

array([nan, 'MnPrv', 'GdWo', 'GdPrv', 'MnWw'], dtype=object)

train["Fence"].value_counts()

MnPrv 157
GdPrv 59
GdWo 54
MnWw 11
Name: Fence, dtype: int64
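
Note that the counts above leave out the missing entries, although unique() does report nan. A minimal sketch on a made-up Series (the values in `fence` are assumptions) shows how `dropna=False` brings the missing entries into the count.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three missing entries out of six.
fence = pd.Series(["MnPrv", np.nan, "GdWo", np.nan, "MnPrv", np.nan])

counts = fence.value_counts()                       # NaN is skipped by default
counts_with_nan = fence.value_counts(dropna=False)  # NaN gets its own row
n_missing = int(fence.isnull().sum())

print(counts)
print(counts_with_nan)
print(n_missing)  # 3
```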

Copy the Id column of the train and test data sets:

train_id=train['Id'].copy()
test_id=test['Id'].copy()

To check the first 5 rows of the data set, we can use head(5).

train.head(5)

To check out the last 5 rows of the data set, we use the tail() function:

train.tail() 

To display 5 random rows from the data set, we can use the sample(5) function:

train.sample(5) 

To give a statistical summary of the dataset, we can use describe():

train.describe() 

To check how many null values are in the dataset, we can use isnull().sum():

train.isnull().sum().head(2)

Id 0
MSSubClass 0
dtype: int64
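
The first two rows of that result are not very informative on their own. A minimal sketch, on a made-up frame (`toy` and its values are assumptions), of how the same call can be turned into a sorted summary of only the columns that actually have missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two columns with gaps, one complete column.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column, keep the affected columns, largest first.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Add the share of rows that are missing in each column.
summary = pd.DataFrame({"count": missing, "share": missing / len(toy)})
print(summary)
```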

train.groupby('SaleType').count()

To print the dataset columns, we can use the columns attribute:

train.columns

type((train.columns))

pandas.core.indexes.base.Index

<< Note 2 >> In a pandas data frame you can perform queries such as "where":

train[train['SalePrice']>700000]
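
Conditions can also be combined. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions) showing two equivalent spellings, a boolean mask and `DataFrame.query`.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SalePrice": [208500, 745000, 181500, 755000],
    "OverallQual": [7, 10, 6, 10],
})

# Boolean mask: wrap each condition in parentheses and combine with &.
by_mask = toy[(toy["SalePrice"] > 700000) & (toy["OverallQual"] == 10)]

# The same filter written as a query string.
by_query = toy.query("SalePrice > 700000 and OverallQual == 10")

print(by_mask["Id"].tolist())  # [2, 4]
```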

6-1-2 Target Value Analysis

As you know, SalePrice is the target value that we should predict, so let us take a look at it.

train['SalePrice'].describe()

distplot flexibly plots a univariate distribution of observations:

sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(train['SalePrice']);

6-1-3 Skewness vs Kurtosis

Skewness

It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in data distribution. It differentiates extreme values in one versus the other tail. A symmetrical distribution will have a skewness of 0.

Kurtosis

Kurtosis is all about the tails of the distribution, not the peakedness or flatness. It describes how heavy the tails are, so it acts as a measure of the outliers present in the distribution.

#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
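
To get a feel for these two numbers, here is a minimal sketch on synthetic, right-skewed "prices" (the lognormal sample and its parameters are assumptions, not the real SalePrice column). It uses scipy.stats, whose default estimators differ slightly from the bias-corrected ones behind the pandas skew() and kurt() methods, so the values will not match pandas exactly.

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Synthetic right-skewed sample standing in for house prices.
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=2000)

raw_skew = skew(prices)            # clearly positive: long right tail
raw_kurt = kurtosis(prices)        # excess kurtosis, 0 for a normal distribution
log_skew = skew(np.log1p(prices))  # close to 0 after the log transform

print("Skewness (raw): %f" % raw_skew)
print("Kurtosis (raw): %f" % raw_kurt)
print("Skewness (log): %f" % log_skew)
```

The drop in skewness after the log transform is the reason the target is log-transformed later in the workflow.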

6-2 Visualization

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.[SAS]

In this section I show you 11 plots made with matplotlib and seaborn, which are listed in the picture below:

[Image: list of the 11 plot types]

6-2-1 Scatter plot

The purpose of a scatter plot is to identify the type of relationship (if any) between two quantitative variables.

# Scatter plot of SalePrice against OverallQual, with one color per quality level.
# Note: seaborn >= 0.9 names the size argument "height" (it was "size" before).
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
g=sns.FacetGrid(train[columns], hue="OverallQual", height=5)
g=g.map(plt.scatter, "OverallQual", "SalePrice",edgecolor="w").add_legend()
plt.show()

6-2-2 Box

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.[wikipedia]

data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)
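
The numbers behind a box plot can also be computed directly. This is a minimal sketch on made-up prices (in thousands), assuming the common 1.5 × IQR convention for the whiskers, which is also the default in matplotlib and seaborn.

```python
import numpy as np

# Hypothetical prices in thousands; the last value is deliberately extreme.
prices = np.array([100, 120, 130, 150, 160, 180, 200, 500], dtype=float)

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(q1, median, q3)  # 127.5 155.0 185.0
print(outliers)        # [500.]
```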

ax= sns.boxplot(x="OverallQual", y="SalePrice", data=train[columns])
ax= sns.stripplot(x="OverallQual", y="SalePrice", data=train[columns], jitter=True, edgecolor="gray")
plt.show()

6-2-3 Histogram

We can also create a histogram of each input variable to get an idea of the distribution.

# histograms
train.hist(figsize=(15,20))
plt.show()

mini_train=train[columns]
f,ax=plt.subplots(1,2,figsize=(20,10))
# tick positions covering the observed GarageArea range (square feet)
xticks=list(range(0,int(mini_train.GarageArea.max())+200,200))
mini_train[mini_train['SalePrice']>100000].GarageArea.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('SalePrice>100000')
ax[0].set_xticks(xticks)
mini_train[mini_train['SalePrice']<100000].GarageArea.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('SalePrice<100000')
ax[1].set_xticks(xticks)
plt.show()

mini_train[['SalePrice','OverallQual']].groupby(['OverallQual']).mean().plot.bar()

train['OverallQual'].value_counts().plot(kind="bar");

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

6-2-4 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

# scatter plot matrix
pd.plotting.scatter_matrix(train[columns],figsize=(10,10))
plt.show()

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

6-2-5 violinplots

# violin plot of SalePrice for each Functional rating
sns.violinplot(data=train,x="Functional", y="SalePrice")

6-2-6 pairplot

# Using seaborn pairplot to see the bivariate relation between each pair of features
sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.pairplot(train[columns],height = 2 ,kind ='scatter')  # "height" replaced "size" in seaborn 0.9
plt.show()

6-2-7 kdeplot

# seaborn's kdeplot plots univariate or bivariate density estimates.
# The size can be changed by tweaking the height value ("size" before seaborn 0.9).
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.FacetGrid(train[columns], hue="OverallQual", height=5).map(sns.kdeplot, "YearBuilt").add_legend()
plt.show()

6-2-8 jointplot

# Use seaborn's jointplot to make a hexagonal bin plot
#Set desired size and ratio and choose a color.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="OverallQual", y="SalePrice", data=train[columns], height=10,ratio=10, kind='hex',color='green')
plt.show()

# seaborn's jointplot shows a bivariate plot and univariate marginal plots,
# here both drawn with kernel density estimation, in the same figure
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="SalePrice", y="YearBuilt", data=train[columns], height=6, kind='kde', color='#800000', space=0)

6-2-9 Heatmap

plt.figure(figsize=(7,4)) 
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.heatmap(train[columns].corr(),annot=True,cmap='cubehelix_r') # draws a heatmap of the correlation matrix calculated by train[columns].corr()
plt.show()
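
The same correlation matrix can be read as a ranking instead of a picture. This is a minimal sketch on synthetic data (the frame `toy`, its column `Unrelated`, and the linear price formula are assumptions made up for illustration) that sorts the features by the absolute value of their correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on area, and "Unrelated" is pure noise.
rng = np.random.default_rng(1)
area = rng.uniform(500, 3000, size=200)
toy = pd.DataFrame({
    "GrLivArea": area,
    "Unrelated": rng.normal(0, 1, size=200),
    "SalePrice": 100 * area + rng.normal(0, 20000, size=200),
})

# Correlation of every feature with the target, strongest first.
ranking = (toy.corr()["SalePrice"]
           .drop("SalePrice")
           .abs()
           .sort_values(ascending=False))
print(ranking)
```

Keep in mind that this coefficient only captures linear relationships, so a low value does not prove that a feature is useless.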

6-2-10 radviz

from pandas.plotting import radviz
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
radviz(train[columns], "OverallQual")

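6-2-11 Factorplot

# factorplot was renamed catplot in seaborn 0.9; kind='point' keeps the old default plot type
sns.catplot(x='OverallQual',y='SalePrice',hue='Functional',kind='point',data=train)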
plt.show()

6-3 Data Preprocessing

Data preprocessing refers to the transformations applied to our data before feeding it to the algorithm.

Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not feasible for analysis. There are plenty of steps in data preprocessing, and we list just some of them:

  • Removing the Id column
  • Sampling (without replacement)
  • Making part of the data set unbalanced and rebalancing it (with undersampling and SMOTE)
  • Introducing missing values and treating them (replacing by average values)
  • Noise filtering
  • Data discretization
  • Normalization and standardization
  • PCA analysis
  • Feature selection (filter, embedded, wrapper)

6-3-1 Noise filtering (Outliers)

In statistics, an outlier is an observation point that is distant from other observations. Put more simply, an outlier is an abnormal observation that lies among the normal observations in a sample of a population.

# Looking for outliers, as indicated in https://ww2.amstat.org/publications/jse/v19n3/decock.pdf
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

train = train[train.GrLivArea < 4000]
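
One detail of this kind of filter is worth seeing on a small example. In this minimal sketch (the frame `toy` and its values are illustrative assumptions), the surviving rows keep their original index labels, so anything that was built from the unfiltered frame has to be realigned by label rather than by row count.

```python
import pandas as pd

# Hypothetical rows: two ordinary houses and two very large ones.
toy = pd.DataFrame({
    "GrLivArea": [1710, 4676, 1262, 5642],
    "SalePrice": [208500, 184750, 181500, 160000],
})

kept = toy[toy.GrLivArea < 4000]
removed = toy[toy.GrLivArea >= 4000]

print(kept.index.tolist())     # [0, 2]
print(removed.index.tolist())  # [1, 3]

# A frame built earlier from all rows can be realigned with the kept labels.
features_before_filter = toy[["GrLivArea"]]
features_aligned = features_before_filter.loc[kept.index]
print(len(features_aligned) == len(kept))  # True
```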

Note the 2 extreme outliers on the bottom right (large GrLivArea, low SalePrice).

#deleting points
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
# The skewed numeric features were log-transformed above; now log-transform
# the target as well, because the official score compares logarithms of prices.
train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice

Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.
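
Since the model is then trained on log1p(SalePrice), its predictions are on the log scale too. Here is a minimal sketch, with made-up prices, of the round trip back to dollars using np.expm1, the inverse of np.log1p.

```python
import numpy as np

# Hypothetical prices in dollars.
prices = np.array([34900.0, 208500.0, 755000.0])

log_prices = np.log1p(prices)     # log(1 + x), the scale the model works on
recovered = np.expm1(log_prices)  # exp(x) - 1, back to dollars

print(log_prices)
print(recovered)
```

Forgetting this inverse step before writing out predictions is an easy mistake to make, because the log-scale values still look like plausible numbers.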

plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

6-4 Data Cleaning

When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.

6-4-1 Handle missing values

First, understand that there is NO good way to deal with missing data.

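#one-hot encode the categorical columns first (an added step: the models below need numeric input):
all_data = pd.get_dummies(all_data)
#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())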
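
Mean imputation only makes sense for numeric columns. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions) that restricts the fill to the numeric columns and leaves a missing category untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric gap and one categorical gap.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "Street": ["Pave", "Pave", None],
})

# Fill only the numeric columns with their column means.
numeric_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(filled)
```

The missing LotFrontage becomes 70.0, the mean of 60 and 80, while the missing Street value still needs its own treatment, for example a separate category.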

7- Model Deployment

In this section, plenty of learning algorithms are applied; they play an important role in your experience and improve your knowledge of ML techniques.

<< Note 3 >> : The results shown here may be slightly different for your analysis because, for example, the neural network algorithms use random number generators for fixing the initial value of the weights (starting points) of the neural networks, which often result in obtaining slightly different (local minima) solutions each time you run the analysis. Also note that changing the seed for the random number generator used to create the train, test, and validation samples can change your results.
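
The effect of the seed can be seen without any model at all. In this minimal sketch, the helper `shuffled_indices` is a hypothetical name introduced for illustration: the same seed always reproduces the same shuffle of row indices, which is what makes a train/validation split repeatable.

```python
import numpy as np

def shuffled_indices(n_rows, seed):
    """Return the row indices 0..n_rows-1 in a seeded random order."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_rows)

first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)  # same seed, same order
other = shuffled_indices(10, seed=7)    # a different seed gives its own order

print(first)
print(second)
print(other)
```

In scikit-learn the same idea appears as the random_state argument of functions such as train_test_split and KFold.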

7-1 Families of ML algorithms

There are several categories of machine learning algorithms. Below are some of them:

  • Linear
    • Linear Regression
    • Logistic Regression
    • Support Vector Machines
  • Tree-Based
    • Decision Tree
    • Random Forest
    • GBDT
  • KNN
  • Neural Networks

And if we want to categorize ML algorithms by the type of learning, there are the following types:

  • Classification
    • k-Nearest Neighbors
    • Linear Regression
    • SVM
    • DT
    • NN
  • Clustering
    • K-means
    • HCA
    • Expectation Maximization
  • Visualization and dimensionality reduction
    • Principal Component Analysis (PCA)
    • Kernel PCA
    • Locally-Linear Embedding (LLE)
    • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • Association rule learning
    • Apriori
    • Eclat
  • Semisupervised learning
  • Reinforcement Learning
    • Q-learning
  • Batch learning & Online learning
  • Ensemble Learning

<< Note >> There is no method which outperforms all others for all tasks.

7-2 Accuracy and precision

One of the most important questions a machine learning engineer must ask when evaluating a model is how to judge it. Each machine learning model tries to solve a problem with a different objective using a different dataset, so it is important to understand the context before choosing a metric.

7-2-1 RMSE

The metric used here is the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses affect the result equally.)
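
To see why taking logs treats cheap and expensive houses alike, here is a minimal sketch of the metric. The helper name `rmse_log` and the prices are hypothetical, chosen only for illustration: both predictions are 10% too high, so both give the same error.

```python
import math

def rmse_log(y_true, y_pred):
    """RMSE between log(prediction) and log(observation)."""
    squared = [(math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical prices: each prediction overshoots by 10%.
cheap = rmse_log([100_000], [110_000])
expensive = rmse_log([1_000_000], [1_100_000])

print(cheap, expensive)
```

Without the logarithm, the expensive house would contribute an error ten times larger than the cheap one, and the model would be judged mostly on the top of the market.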

#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
X_train.info()

7-3 Ridge

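import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    # 5-fold cross-validated RMSE; sklearn returns negated MSE, so flip the sign first
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
model_ridge = Ridge()
# Mean cross-validated RMSE for each candidate regularization strength
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha=alpha)).mean() for alpha in alphas]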

7-3-1 Root Mean Squared Error

cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation")
plt.xlabel("alpha")
plt.ylabel("rmse")

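import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline steps: standardize the features, then fit a ridge regression
steps = [('scaler', StandardScaler()),
         ('ridge', Ridge())]

# Create the pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'ridge__alpha': np.logspace(-4, 0, 50)}

# Create the GridSearchCV object (3-fold cross-validation)
cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y)

# Predict on the training set
y_pred_train = cv.predict(X_train)

# Predict on the test set
y_pred_test = cv.predict(X_test)

# RMSE on the training set
rmse = np.sqrt(mean_squared_error(y, y_pred_train))
print("Root Mean Squared Error: {}".format(rmse))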
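
When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is R² for `Ridge`. The sketch below is one possible variation, not the notebook's method: it ranks by mean squared error so the search lines up with the RMSE metric, then reads back the winning `alpha`. The synthetic data from `make_regression` and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the house price features (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": np.logspace(-4, 0, 20)}

# Rank candidates by (negated) mean squared error instead of the default R^2.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
# Square root of the mean cross-validated MSE of the best candidate.
cv_rmse = np.sqrt(-search.best_score_)

print(best_alpha, cv_rmse)
```

The cross-validated figure is a more honest estimate than an RMSE computed on the same rows the model was fitted on.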

7-4 RandomForestRegressor

A random forest is a meta estimator that fits a number of decision trees (regression trees in this case, since the target is a price) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. By default each sub-sample has the same size as the original input sample, and the samples are drawn with replacement when bootstrap=True (the default).

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hold out 30% of the labelled rows; note that X_train and X_test are overwritten here
num_test = 0.3
X_train, X_test, y_train, y_test = train_test_split(X_train, y, test_size=num_test, random_state=100)

# Fit Random Forest on the training set
regressor = RandomForestRegressor(n_estimators=300, random_state=0)
regressor.fit(X_train, y_train)

# Score model (R^2 on the training set)
regressor.score(X_train, y_train)
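
The averaging that a random forest performs can be checked directly. This is a minimal sketch on synthetic data (the `make_regression` data and the names `X_demo`, `y_demo`, and `forest` are illustrative assumptions): the forest's prediction should equal the mean of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the house price features (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the predictions of the individual trees by hand.
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

# Compare with what the forest itself predicts.
matches = np.allclose(manual_average, forest.predict(X_demo))

print(len(forest.estimators_), matches)
```

Each tree sees a different bootstrap sample, so the individual trees disagree; averaging them is what smooths out the over-fitting of any single tree.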

7-5 XGBoost

XGBoost is one of the most popular machine learning algorithms these days, regardless of the type of prediction task at hand: regression or classification.

7-5-1 But what makes XGBoost so popular?

  • Speed and performance : Originally written in C++, it is comparatively faster than other ensemble methods.

  • Core algorithm is parallelizable : Because the core XGBoost algorithm is parallelizable, it can harness the power of multi-core computers. It can also be parallelized onto GPUs and across networks of computers, which makes training on very large datasets feasible.

  • Strong benchmark performance : It has shown better performance than many alternatives on a variety of machine learning benchmark datasets.

  • Wide variety of tuning parameters : XGBoost exposes parameters for cross-validation, regularization, user-defined objective functions, missing values, and tree construction, and it provides a scikit-learn compatible API. [10]

XGBoost (Extreme Gradient Boosting) belongs to the family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized distributed gradient boosting library. But wait, what is boosting? Well, keep on reading.

# Initialize model
from xgboost.sklearn import XGBRegressor
XGB_Regressor = XGBRegressor()                  

# Fit the model on our data
XGB_Regressor.fit(X_train, y_train)

# Score model
XGB_Regressor.score(X_train, y_train)

7-6 LassoCV

LassoCV is a Lasso linear model with iterative fitting along a regularization path. The best model is selected by cross-validation.

from sklearn.linear_model import LassoCV

lasso = LassoCV()
# Fit the model on our data
lasso.fit(X_train, y_train)

# Score model
lasso.score(X_train, y_train)
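
After fitting, `LassoCV` exposes the regularization strength it selected (`alpha_`), the candidates it tried (`alphas_`), and the fitted coefficients (`coef_`). The sketch below inspects them on synthetic data; the `make_regression` data, where only 5 of 20 features are informative, and the names `X_demo`, `y_demo`, and `lasso_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data: 20 features, only 5 of which actually drive the target (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

lasso_demo = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)

chosen_alpha = lasso_demo.alpha_               # strength picked by cross-validation
n_candidates = len(lasso_demo.alphas_)         # length of the regularization path
n_nonzero = int(np.sum(lasso_demo.coef_ != 0)) # features the model kept

print(chosen_alpha, n_candidates, n_nonzero)
```

Because the L1 penalty drives some coefficients exactly to zero, counting the non-zero entries of `coef_` gives a quick view of which features the model considers useful.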

7-7 GradientBoostingRegressor

Gradient boosting (GB) builds an additive model in a forward stage-wise fashion, which allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

from sklearn.ensemble import GradientBoostingRegressor

boostingregressor = GradientBoostingRegressor()
# Fit the model on our data
boostingregressor.fit(X_train, y_train)

# Score model
boostingregressor.score(X_train, y_train)
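
The stage-wise idea behind gradient boosting can be sketched by hand for squared-error loss, where the negative gradient is simply the residual `y - prediction`. This is a simplified illustration, not how `GradientBoostingRegressor` or XGBoost is implemented internally; the synthetic data, the learning rate, the tree depth, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the house price features (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

learning_rate = 0.1
# Stage 0: a constant model that predicts the mean of the target.
prediction = np.full(y_demo.shape, y_demo.mean())
mse_per_stage = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    # For the loss 0.5 * (y - f)^2, the negative gradient with respect to f is y - f.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, residuals)
    # Add a damped copy of the new tree to the running additive model.
    prediction += learning_rate * tree.predict(X_demo)
    mse_per_stage.append(np.mean((y_demo - prediction) ** 2))

print(mse_per_stage[0], mse_per_stage[-1])
```

Each small tree corrects part of what the previous stages got wrong, which is why the training error shrinks as stages are added; the learning rate controls how cautiously each correction is applied.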

7-8 DecisionTree

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
dt = DecisionTreeRegressor(random_state=1)
# Fit model
dt.fit(X_train, y_train)

# Score model (R^2 on the training set)
dt.score(X_train, y_train)
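
A score computed on the training set can be misleading for tree models, because an unpruned decision tree can memorize its training rows. The sketch below compares the training R² with the R² on held-out rows. The synthetic data and the names `X_tr`, `X_te`, `y_tr`, `y_te`, and `tree_demo` are assumptions for illustration, not the notebook's variables.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data standing in for the house price features (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=100)

# A fully grown tree, as in the section above (no depth limit).
tree_demo = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

train_r2 = tree_demo.score(X_tr, y_tr)  # rows the tree has already seen
test_r2 = tree_demo.score(X_te, y_te)   # rows the tree has never seen

print(train_r2, test_r2)
```

The gap between the two numbers is a rough measure of over-fitting, so the held-out score is the one to compare across models.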

7-9 ExtraTreeRegressor

from sklearn.tree import ExtraTreeRegressor

dtr = ExtraTreeRegressor()
# Fit model
dtr.fit(X_train, y_train)

# Score model
dtr.score(X_train, y_train)

8- Conclusion

This kernel is not complete yet. I will try to cover all the parts of the ML process with a variety of Python packages. I know there are still some problems, so I hope to get your feedback to improve it.

9- References

[1] https://skymind.ai/wiki/machine-learning-workflow
[2] Problem-define
[3] Sklearn
[4] Machine-learning-in-python-step-by-step
[5] Data Cleaning
[6] Kaggle kernel
[7] Choosing-the-right-metric-for-machine-learning-models-part
[8] Unboxing outliers in machine learning
[9] How to handle missing data
[10] Datacamp
