This is A Comprehensive ML Workflow for the House Prices dataset. It is clear that everyone in this community is familiar with the House Prices dataset, but if you need to refresh your knowledge of it, please visit this link.
I have tried to help fans of machine learning on Kaggle learn how to approach machine learning problems, and I think this is a great opportunity for anyone who wants to learn the machine learning workflow with Python from end to end.
I want to cover most of the methods that have been applied to House Prices up to 2018, so you can start learning and reviewing your ML knowledge on a simple dataset and memorize the workflow for your journey in the data science world.
Before we get into the notebook, let me introduce some helpful resources.
If you have already read some machine learning books, you have noticed that there are different ways to stream data into a machine learning pipeline.
Most of these books share the following steps:
Of course, the same solution cannot be provided for all problems, so the best way is to create a general framework and adapt it to each new problem.
You can see my workflow in the image below:
Data Science has so many techniques and procedures that can confuse anyone.
We all know that there are differences between real-world problems and competition problems. The following figure, taken from one of the courses on Coursera, partly illustrates this comparison.
As you can see, there are a lot more steps to solve in real problems.
I think one of the important things when you start a new machine learning project is defining your problem; that means you should understand the business problem (problem formalization).
Problem definition has four steps, which are illustrated in the picture below:
We will use the house prices data set. This dataset contains information about house prices and the target value is:
SalePrice
Why am I using House price dataset:
This is a good project because it is so well understood.
Attributes are numeric and categorical, so you have to figure out how to load and handle the data.
It is a Regression problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.
Creative feature engineering.
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
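As a quick, hedged illustration (this is just a sketch, not the official scoring code), the metric can be written directly with numpy, assuming y_true and y_pred are arrays of positive sale prices:
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between the logs of the predictions and of the observed prices."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# a 10% overestimate contributes the same error for a cheap and an expensive house
print(rmsle(np.array([100000.0]), np.array([110000.0])))
print(rmsle(np.array([1000000.0]), np.array([1100000.0])))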
It is our job to predict the sale price for each house: for each Id in the test set, you must predict the value of the SalePrice variable.
The variables are:
For every machine learning problem, you should ask yourself, what are inputs and outputs for the model?
In this kernel we are using the following packages:
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import confusion_matrix
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from scipy.stats import skew
import scipy.stats as stats
import lightgbm as lgb
import seaborn as sns
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import scipy
import json
import sys
import csv
import os
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))
A few tiny adjustments for better output readability:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline
In this section, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.
Which variables suggest interesting relationships?
Which observations are unusual?
By the end of the section, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:
Data Collection
Visualization
Data Cleaning
Data Preprocessing
Data collection is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypotheses and evaluate outcomes of the particular collection. [techopedia]
# import Dataset to play with it
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
The concat function does all of the heavy lifting of performing concatenation operations along an axis. Let us create all_data.
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
test.loc[:,'MSSubClass':'SaleCondition']))
After loading the data via pandas, we should check out what the content is and get a description of it via the following:
type(train),type(test)
1- Dimensions of the dataset.
2- Peek at the data itself.
3- Statistical summary of all attributes.
4- Breakdown of the data by the class variable.[7]
Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.
# shape
print(train.shape)
Train has one more column than test. Why? (Yes: the target value, SalePrice.)
# shape
print(test.shape)
(1459, 80)
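To confirm which column that is, the two column sets can be compared directly; just a tiny sketch:
# which column is in train but not in test? (it should be the target, SalePrice)
print(set(train.columns) - set(test.columns))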
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
You should see 1460 instances and 81 attributes for train, and 1459 instances and 80 attributes for test.
To get some information about the dataset you can use the info() command:
print(train.info())
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None
If you want to see the type of a column and its unique values, you can use the following script:
train['Fence'].unique()
array([nan, 'MnPrv', 'GdWo', 'GdPrv', 'MnWw'], dtype=object)
train["Fence"].value_counts()
MnPrv 157
GdPrv 59
GdWo 54
MnWw 11
Name: Fence, dtype: int64
Copy the Id column for the train and test data sets:
train_id=train['Id'].copy()
test_id=test['Id'].copy()
To check the first 5 rows of the dataset, we can use head(5):
train.head(5)
To check out the last 5 rows of the dataset, we use the tail() function:
train.tail()
To pop up 5 random rows from the dataset, we can use the sample(5) function:
train.sample(5)
To give a statistical summary of the dataset, we can use describe():
train.describe()
To check out how many null values are in the dataset, we can use isnull().sum():
train.isnull().sum().head(2)
Id 0
MSSubClass 0
dtype: int64
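head(2) only shows the first two columns; as a small optional sketch, the same call can be sorted to rank the columns with the most missing values (showing ten rows here is an arbitrary choice):
# rank the columns by how many values are missing, as counts and as a share of the rows
missing = train.isnull().sum().sort_values(ascending=False)
missing_pct = 100 * missing / len(train)
print(pd.concat([missing, missing_pct], axis=1, keys=['Total', 'Percent']).head(10))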
train.groupby('SaleType').count()
To print the dataset columns, we can use the columns attribute:
train.columns
type((train.columns))
pandas.core.indexes.base.Index
<< Note 2 >> In a pandas DataFrame you can perform queries such as "where":
train[train['SalePrice']>700000]
As you know, SalePrice is the target value that we should predict, so now let's take a look at it:
train['SalePrice'].describe()
distplot flexibly plots a univariate distribution of observations:
sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(train['SalePrice']);
It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in data distribution. It differentiates extreme values in one versus the other tail. A symmetrical distribution will have a skewness of 0.
Kurtosis is all about the tails of the distribution — not the peakedness or flatness. It is used to describe the extreme values in one versus the other tail. It is actually the measure of outliers present in the distribution.
#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.
With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.[SAS]
In this section I show you 11 plots with matplotlib and seaborn, which are listed in the picture below:
Scatter plot: its purpose is to identify the type of relationship (if any) between two quantitative variables.
# Color the scatter points by OverallQual
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
g = sns.FacetGrid(train[columns], hue="OverallQual", size=5)
g = g.map(plt.scatter, "OverallQual", "SalePrice", edgecolor="w").add_legend()
plt.show()
In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.[wikipedia]
data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
ax = sns.boxplot(x="OverallQual", y="SalePrice", data=data)
ax = sns.stripplot(x="OverallQual", y="SalePrice", data=data, jitter=True, edgecolor="gray")
plt.show()
We can also create a histogram of each input variable to get an idea of the distribution.
# histograms
train.hist(figsize=(15,20))
plt.figure()
mini_train = train[columns]
f, ax = plt.subplots(1, 2, figsize=(20, 10))
mini_train[mini_train['SalePrice'] > 100000].GarageArea.plot.hist(ax=ax[0], bins=20, edgecolor='black', color='red')
ax[0].set_title('SalePrice>100000')
mini_train[mini_train['SalePrice'] < 100000].GarageArea.plot.hist(ax=ax[1], color='green', bins=20, edgecolor='black')
ax[1].set_title('SalePrice<100000')
plt.show()
mini_train[['SalePrice','OverallQual']].groupby(['OverallQual']).mean().plot.bar()
train['OverallQual'].value_counts().plot(kind="bar");
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.
Now we can look at the interactions between the variables.
First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
# scatter plot matrix
pd.plotting.scatter_matrix(train[columns],figsize=(10,10))
plt.figure()
Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.
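To back the visual impression with numbers, the numeric features can be ranked by their correlation with SalePrice; a minimal sketch (showing ten features is an arbitrary choice):
# ten numeric features most correlated with SalePrice (to back up the visual impression)
numeric_cols = train.select_dtypes(include=[np.number])
print(numeric_cols.corr()['SalePrice'].sort_values(ascending=False).head(10))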
# violin plot of SalePrice for each Functional category
sns.violinplot(data=train,x="Functional", y="SalePrice")
# Using seaborn pairplot to see the bivariate relation between each pair of features
sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.pairplot(train[columns],size = 2 ,kind ='scatter')
plt.show()
# seaborn's kdeplot, plots univariate or bivariate density estimates.
#Size can be changed by tweaking the value used
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.FacetGrid(train[columns], hue="OverallQual", size=5).map(sns.kdeplot, "YearBuilt").add_legend()
plt.show()
# Use seaborn's jointplot to make a hexagonal bin plot
#Set desired size and ratio and choose a color.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="OverallQual", y="SalePrice", data=train[columns], size=10,ratio=10, kind='hex',color='green')
plt.show()
# we will use seaborn jointplot shows bivariate scatterplots and univariate histograms with Kernel density
# estimation in the same figure
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="SalePrice", y="YearBuilt", data=train[columns], size=6, kind='kde', color='#800000', space=0)
plt.figure(figsize=(7,4))
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.heatmap(train[columns].corr(),annot=True,cmap='cubehelix_r') # draws a heatmap with the correlation matrix calculated by train[columns].corr() as input
plt.show()
from pandas.plotting import radviz
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
radviz(train[columns], "OverallQual")
sns.factorplot('OverallQual','SalePrice',hue='Functional',data=train)
plt.show()
Data preprocessing refers to the transformations applied to our data before feeding it to the algorithm.
Data preprocessing is a technique that is used to convert raw data into a clean dataset. In other words, data gathered from different sources is collected in a raw format that is not feasible for analysis. There are plenty of steps for data preprocessing, and we have just listed some of them:
An outlier is a data point that is distant from other similar points. Put more simply, an outlier is an observation that lies abnormally far from the other observations in a sample of a population.
In statistics, an outlier is an observation point that is distant from other observations.
# Looking for outliers, as indicated in https://ww2.amstat.org/publications/jse/v19n3/decock.pdf
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()
train = train[train.GrLivArea < 4000]
2 extreme outliers on the bottom right
#deleting points
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)
#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
# Log transform the target for official scoring
#The key point is to log-transform the numeric variables since most of them are skewed.
train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice
Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.
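As an optional sanity check (a sketch, not a required step), we can recompute the skewness of the transformed target and re-plot its distribution; it should now look much closer to a normal curve:
# SalePrice has already been replaced by its log1p above; its skewness should now be much smaller
print("Skewness after log1p: %f" % train['SalePrice'].skew())
sns.distplot(train['SalePrice'])
plt.show()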
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()
When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.
Firstly, understand that there is NO good way to deal with missing data.
#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())
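One thing worth noting: all_data.mean() only produces values for the numeric columns, so the categorical columns keep their NaNs and still need to be encoded before they reach scikit-learn. Here is a minimal, hedged sketch of one common way to handle this (one-hot encoding with pd.get_dummies; this is an assumption about the preprocessing, not the only option):
# one-hot encode the categorical columns so scikit-learn can consume them
# (pd.get_dummies leaves the numeric columns untouched)
all_data = pd.get_dummies(all_data)
# fill any remaining missing values with the column means
all_data = all_data.fillna(all_data.mean())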
In this section, plenty of learning algorithms are applied; they play an important role in building your experience and improving your knowledge of ML techniques.
<< Note 3 >> : The results shown here may be slightly different for your analysis because, for example, the neural network algorithms use random number generators for fixing the initial value of the weights (starting points) of the neural networks, which often result in obtaining slightly different (local minima) solutions each time you run the analysis. Also note that changing the seed for the random number generator used to create the train, test, and validation samples can change your results.
There are several categories for machine learning algorithms, below are some of these categories:
And if we want to categorize ML algorithms by the type of learning, the types are below:
Classification
Clustering
Visualization and dimensionality reduction
Association rule learning
Semisupervised learning
Reinforcement Learning
Batch learning & Online learning
Ensemble Learning
<< Note >>
There is no method that outperforms all others for all tasks.
One of the most important questions to ask as a machine learning engineer when evaluating our model is: how do we judge our own model? Each machine learning model is trying to solve a problem with a different objective using a different dataset, and hence it is important to understand the context before choosing a metric.
Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice
X_train.info()
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
model_ridge = Ridge()
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean() for alpha in alphas]
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation")
plt.xlabel("alpha")
plt.ylabel("rmse")
# steps
steps = [('scaler', StandardScaler()),
         ('ridge', Ridge())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'ridge__alpha':np.logspace(-4, 0, 50)}
# Create the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv=3)
# Fit to the training set
cv.fit(X_train, y)
#predict on train set
y_pred_train=cv.predict(X_train)
# Predict test set
y_pred_test=cv.predict(X_test)
# rmse on train set
rmse = np.sqrt(mean_squared_error(y, y_pred_train))
print("Root Mean Squared Error: {}".format(rmse))
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
num_test = 0.3
X_train, X_test, y_train, y_test = train_test_split(X_train, y, test_size=num_test, random_state=100)
# Fit Random Forest on Training Set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=300, random_state=0)
regressor.fit(X_train, y_train)
# Score model
regressor.score(X_train, y_train)
XGBoost is one of the most popular machine learning algorithms these days, regardless of the type of prediction task at hand: regression or classification.
Speed and performance : Originally written in C++, it is comparatively faster than other ensemble classifiers.
Core algorithm is parallelizable : Because the core XGBoost algorithm is parallelizable it can harness the power of multi-core computers. It is also parallelizable onto GPU’s and across networks of computers making it feasible to train on very large datasets as well.
Consistently outperforms other algorithm methods : It has shown better performance on a variety of machine learning benchmark datasets.
Wide variety of tuning parameters : XGBoost internally has parameters for cross-validation, regularization, user-defined objective functions, missing values, tree parameters, scikit-learn compatible API etc.[10]
XGBoost (Extreme Gradient Boosting) belongs to a family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized distributed gradient boosting library. But wait, what is boosting? Well, keep on reading.
# Initialize model
from xgboost.sklearn import XGBRegressor
XGB_Regressor = XGBRegressor()
# Fit the model on our data
XGB_Regressor.fit(X_train, y_train)
# Score model
XGB_Regressor.score(X_train, y_train)
Lasso linear model with iterative fitting along a regularization path. The best model is selected by cross-validation.
lasso=LassoCV()
# Fit the model on our data
lasso.fit(X_train, y_train)
# Score model
lasso.score(X_train, y_train)
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
boostingregressor=GradientBoostingRegressor()
# Fit the model on our data
boostingregressor.fit(X_train, y_train)
# Score model
boostingregressor.score(X_train, y_train)
from sklearn.tree import DecisionTreeRegressor
# Define model. Specify a number for random_state to ensure same results each run
dt = DecisionTreeRegressor(random_state=1)
# Fit model
dt.fit(X_train, y_train)
dt.score(X_train, y_train)
from sklearn.tree import ExtraTreeRegressor
dtr = ExtraTreeRegressor()
# Fit model
dtr.fit(X_train, y_train)
# Score model
dtr.score(X_train, y_train)
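Since all of the regressors above were fitted on the same 70% split, a natural way to compare them is the RMSE on the 30% hold-out (note that X_test here is the hold-out created by train_test_split, not the competition test set). A hedged sketch of such a comparison:
# RMSE of each fitted regressor on the 30% hold-out created by train_test_split
# (the target is on the log1p scale)
models = {'Random Forest': regressor, 'XGBoost': XGB_Regressor, 'LassoCV': lasso,
          'Gradient Boosting': boostingregressor, 'Decision Tree': dt, 'Extra Tree': dtr}
for name, model in models.items():
    rmse_holdout = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print('{}: {:.4f}'.format(name, rmse_holdout))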
This kernel is not completed yet. I will try to cover all the parts related to the ML process with a variety of Python packages, and I know that there are still some problems, so I hope to get your feedback to improve it.
https://skymind.ai/wiki/machine-learning-workflow
Problem-define
Sklearn
Machine-learning-in-python-step-by-step
Data Cleaning
Kaggle kernel
Choosing-the-right-metric-for-machine-learning-models-part
Unboxing outliers in machine learning
How to handle missing data
Datacamp