Adam坤

通过分析房屋价格理解机器学习流程 --Adam Studio

Machine Learning Workflow for House Prices

1- Introduction

This is a A Comprehensive ML Workflow for House Prices data set, it is clear that everyone in this community is familiar with house prices dataset but if you need to review your information about the dataset please visit this link.

I have tried to help Fans of Machine Learning in Kaggle how to face machine learning problems. and I think it is a great opportunity for who want to learn machine learning workflow with python completely.

I want to covere most of the methods that are implemented for house prices until 2018, you can start to learn and review your knowledge about ML with a simple dataset and try to learn and memorize the workflow for your journey in Data science world.

Before we get into the notebook, let me introduce some helpful resources.

这是房屋价格数据集的综合ML工作流程，很明显，该社区的每个人都熟悉房价数据集，但如果您需要查看有关数据集的信息，请访问此链接。

我试图帮助Kaggle的机器学习爱好者如何面对机器学习问题。我认为这对于想要完全使用python学习机器学习工作流程的人来说是一个很好的机会。

我想强调大部分直到2018年实施房价的方法，你可以用一个简单的数据集开始学习和回顾你对ML的了解，并尝试学习和记住你在数据科学世界中旅程的工作流程。

在我们进入笔记本之前，让我介绍一些有用的资源。

2- Machine Learning Workflow

If you have already read some machine learning books. You have noticed that there are different ways to stream data into machine learning.

Most of these books share the following steps:

Define Problem
Specify Inputs & Outputs
Exploratory data analysis
Data Collection
Data Preprocessing
Data Cleaning
Visualization
Model Design, Training, and Offline Evaluation
Model Deployment, Online Evaluation, and Monitoring
Model Maintenance, Diagnosis, and Retraining

Of course, the same solution can not be provided for all problems, so the best way is to create a general framework and adapt it to new problem.

如果您已经阅读过一些机器学习书籍。您已经注意到有不同的方法将数据流式传输到机器学习中。

这些书中的大多数共享以下步骤：

定义问题
指定输入和输出
探索性数据分析
数据采集
数据预处理
数据清理
可视化
模型设计，培训和离线评估
模型部署，在线评估和监控
模型维护，诊断和再培训

当然，不能为所有问题提供相同的解决方案，因此最好的方法是创建一个通用框架并使其适应新问题。

You can see my workflow in the below image :

Data Science has so many techniques and procedures that can confuse anyone.

数据科学有许多技术和程序可以让任何人感到困惑。

2-2 Real world Application Vs Competitions

We all know that there are differences between real world problem and competition problem. The following figure that is taken from one of the courses in coursera, has partly made this comparison
我们都知道现实世界问题和竞争问题之间存在差异。下图取自课程中的一个课程，部分进行了这种比较

As you can see, there are a lot more steps to solve in real problems.

3- Problem Definition

I think one of the important things when you start a new machine learning project is defining your problem.that means you should understand business problem.( Problem Formalization).

Problem definition has four steps that have illustrated in the picture below:

我认为，当你开始一个新的机器学习项目时，重要的事情之一是定义你的问题。这意味着你应该理解业务问题。（问题形式化）。

问题定义有四个步骤，如下图所示：

3-1 Problem Feature

We will use the house prices data set. This dataset contains information about house prices and the target value is:

SalePrice
Why am I using House price dataset:
This is a good project because it is so well understood.
Attributes are numeric and categurical so you have to figure out how to load and handle data.
It is a Regression problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.
Creative feature engineering .

我们将使用房价数据集。此数据集包含有关房价的信息，目标值为：

销售价格
为什么我使用房价数据集：
这是一个很好的项目，因为它很好理解。
属性是数字和分类，因此您必须弄清楚如何加载和处理数据。
这是一个回归问题，允许您练习一种更简单的监督学习算法。
对于已完成机器学习在线课程并希望在参加特色竞赛之前扩展其技能的数据科学学生来说，这是一场完美的比赛。
创意特色工程。

3-1-1 Metric

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

在预测值的对数与观察到的销售价格的对数之间的均方根误差（RMSE）上评估提交。（记录日志意味着预测昂贵房屋和廉价房屋的错误将同样影响结果。）

3-2 Aim

It is our job to predict the sales price for each house. for each Id in the test set, you must predict the value of the SalePrice variable.

我们的工作是预测每栋房屋的销售价格。对于测试集中的每个Id，您必须预测SalePrice变量的值。

3-3 Variables

The variables are :

SalePrice - the property’s sale price in dollars. This is the target variable that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

变量是：

SalePrice - 该物业的销售价格以美元计算。这是您尝试预测的目标变量。
MSSubClass：建筑类
MSZoning：一般分区分类
LotFrontage：与物业相连的街道的线性脚
LotArea：地块尺寸，平方英尺
街道：道路通行类型
胡同：胡同通道的类型
LotShape：一般的财产形状
LandContour：酒店的平整度
实用程序：可用的实用程序类型
LotConfig：批量配置
LandSlope：物业坡度
邻里：Ames市区内的物理位置
条件1：靠近主要道路或铁路
条件2：靠近主要道路或铁路（如果存在第二个）
BldgType：住宅类型
HouseStyle：住宅风格
OverallQual：整体材料和成品质量
OverallCond：总体状况评级
YearBuilt：原始施工日期
YearRemodAdd：改造日期
RoofStyle：屋顶类型
RoofMatl：屋顶材料
Exterior1st：房屋外墙
Exterior2nd：房屋外墙（如果有多种材料）
MasVnrType：砌体贴面类型
MasVnrArea：平方英尺的砌体饰面区域
ExterQual：外部材料质量
ExterCond：外部材料的现状
基础：基础类型
BsmtQual：地下室的高度
BsmtCond：地下室的一般状况
BsmtExposure：罢工或花园层地下室墙壁
BsmtFinType1：地下室成品区的质量
BsmtFinSF1：类型1完成平方英尺
BsmtFinType2：第二个完成区域的质量（如果存在）
BsmtFinSF2：2型成品平方英尺
BsmtUnfSF：未完工的地下室平方英尺
TotalBsmtSF：地下室总面积平方英尺
加热：加热类型
HeatingQC：加热质量和条件
CentralAir：中央空调
电气：电气系统
1stFlrSF：一楼平方英尺
2ndFlrSF：二楼平方英尺
LowQualFinSF：低质量的平方英尺（所有楼层）
GrLivArea：以上（地面）生活区平方英尺
BsmtFullBath：地下室齐全的浴室
BsmtHalfBath：地下室半浴室
FullBath：高档以上的完整浴室
HalfBath：高于等级的半浴
卧室：地下室以上的卧室数量
厨房：厨房数量
KitchenQual：厨房质量
TotRmsAbvGrd：以上客房总数（不包括浴室）
功能：家庭功能评级
壁炉：壁炉数量
FireplaceQu：壁炉质量
车库类型：车库位置
GarageYrBlt：建造了年车库
GarageFinish：车库的内部装饰
GarageCars：车库容量的车库大小
GarageArea：车库的面积，平方英尺
GarageQual：车库质量
GarageCond：车库状况
PavedDrive：铺好的车道
WoodDeckSF：平方英尺的木甲板面积
OpenPorchSF：平方英尺的开放式门廊区域
EnclosedPorch：封闭的门廊面积，平方英尺
3SsnPorch：三个季节的门廊面积，平方英尺
ScreenPorch：屏幕门廊面积，平方英尺
PoolArea：泳池面积，平方英尺
PoolQC：游泳池质量
围栏：围栏质量
MiscFeature：其他类别未涵盖的其他功能
MiscVal：杂项功能的价值
MoSold：已售出月份
YrSold：已售出年份
SaleType：销售类型
SaleCondition：销售条件

4- Inputs & Outputs

For every machine learning problem, you should ask yourself, what are inputs and outputs for the model?

对于每个机器学习问题，您应该问自己，模型的输入和输出是什么？

4-1 Inputs

train.csv - the training set
test.csv - the test set

4-2 Outputs

sale prices for every record in test.csv

5 Loading Packages

In this kernel we are using the following packages:

5-1 Import

from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import confusion_matrix
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from scipy.stats import skew
import scipy.stats as stats
import lightgbm as lgb
import seaborn as sns
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import scipy
import json
import sys
import csv
import os

5-2 Version

print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

5-5-3 Setup

A few tiny adjustments for better code readability

pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline

6- Exploratory Data Analysis(EDA)

In this section, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.

Which variables suggest interesting relationships?
Which observations are unusual?

By the end of the section, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. then We will review analytical and statistical operations:

Data Collection
Visualization
Data Cleaning
Data Preprocessing
在本节中，您将学习如何使用图形和数字技术来开始发现数据结构。
哪些变量表明有趣的关系？
哪些观察结果不寻常？

在本节结束时，您将能够回答这些问题以及更多问题，同时生成既富有洞察力又美观的图形。然后我们将审查分析和统计操作：

数据采集
可视化
数据清理
数据预处理

6-1 Data Collection

Data collection is the process of gathering and measuring data, information or any variables of interest in a standardized and established manner that enables the collector to answer or test hypothesis and evaluate outcomes of the particular collection.
数据收集是以标准化和既定方式收集和测量数据，信息或任何感兴趣变量的过程，使收集者能够回答或检验假设并评估特定收集的结果。[techopedia]

# import Dataset to play with it
train = pd.read_csv('../input/train.csv')
test= pd.read_csv('../input/test.csv')

The concat function does all of the heavy lifting of performing concatenation operations along an axis. Let us create all_data.

concat函数完成了沿轴执行连接操作的所有繁重工作。让我们创建all_data。

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))

Each row is an observation (also known as : sample, example, instance, record)
Each column is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate)

After loading the data via pandas, we should checkout what the content is, description and via the following:

每行都是观察（也称为：样本，示例，实例，记录）
每列都是一个特征（也称为：预测变量，属性，独立变量，输入，回归量，协变量）

通过pandas加载数据后，我们应该检查内容是什么，描述以及通过以下内容：

type(train),type(test)

6-1-1 Statistical Summary

1- Dimensions of the dataset.
2- Peek at the data itself.
3- Statistical summary of all attributes.
4- Breakdown of the data by the class variable.[7]

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

1-数据集的维度。
2-查看数据本身。
3-所有属性的统计摘要。
4类变量对数据的细分。[7]

别担心，每次查看数据都是一个命令。这些是有用的命令，您可以在将来的项目中反复使用这些命令。

# shape
print(train.shape)

Train has one column more than test why? (yes ==>> target value)

# shape
print(test.shape)

(1459, 80)

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

You should see 1460 instances and 81 attributes for train and 1459 instances and 80 attributes for test

For getting some information about the dataset you can use info() command

我们可以快速了解数据包含多少个实例（行）和多少属性（列）以及shape属性。

您应该看到列车的1460个实例和81个属性以及1459个实例和80个测试属性

要获取有关数据集的一些信息，可以使用info（）命令

print(train.info())

RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None

if you want see the type of data and unique value of it you use following script

train['Fence'].unique()

array([nan, ‘MnPrv’, ‘GdWo’, ‘GdPrv’, ‘MnWw’], dtype=object)

train["Fence"].value_counts()

MnPrv 157
GdPrv 59
GdWo 54
MnWw 11
Name: Fence, dtype: int64

Copy Id for test and train data set

train_id=train['Id'].copy()
test_id=test['Id'].copy()

to check the first 5 rows of the data set, we can use head(5).

train.head(5)

1to check out last 5 row of the data set, we use tail() function

train.tail()

to pop up 5 random rows from the data set, we can use sample(5) function

train.sample(5)

To give a statistical summary about the dataset, we can use **describe()

train.describe()

To check out how many null info are on the dataset, we can use **isnull().sum()

train.isnull().sum().head(2)

Id 0
MSSubClass 0
dtype: int64

train.groupby('SaleType').count()

to print dataset columns, we can use columns atribute

train.columns

type((train.columns))

pandas.core.indexes.base.Index

<< Note 2 >> in pandas’s data frame you can perform some query such as "where"

train[train['SalePrice']>700000]

6-1-2 Target Value Analysis

As you know SalePrice is our target value that we should predict it then now we take a look a

train['SalePrice'].describe()

Flexibly plot a univariate distribution of observations.

sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(train['SalePrice']);

6-1-3 Skewness vs Kurtosis

Skewness

It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in data distribution. It differentiates extreme values in one versus the other tail. A symmetrical distribution will have a skewness of 0.

Kurtosis

Kurtosis is all about the tails of the distribution — not the peakedness or flatness. It is used to describe the extreme values in one versus the other tail. It is actually the measure of outliers present in the distribution.

#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())

6-2 Visualization

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.[SAS]

In this section I show you 11 plots with matplotlib and seaborn that is listed in the blew picture:

6-2-1 Scatter plot

Scatter plot Purpose To identify the type of relationship (if any) between two quantitative variables

# Modify the graph above by assigning each species an individual color.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
g=sns.FacetGrid(train[columns], hue="OverallQual", size=5) \
   .map(plt.scatter, "OverallQual", "SalePrice") \
   .add_legend()
g=g.map(plt.scatter, "OverallQual", "SalePrice",edgecolor="w").add_legend();
plt.show()

6-2-2 Box

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.[wikipedia]

data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)

ax= sns.boxplot(x="OverallQual", y="SalePrice", data=train[columns])
ax= sns.stripplot(x="OverallQual", y="SalePrice", data=train[columns], jitter=True, edgecolor="gray")
plt.show()

6-2-3 Histogram

We can also create a histogram of each input variable to get an idea of the distribution.

# histograms
train.hist(figsize=(15,20))
plt.figure()

mini_train=train[columns]
f,ax=plt.subplots(1,2,figsize=(20,10))
mini_train[mini_train['SalePrice']>100000].GarageArea.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('SalePrice>100000')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
mini_train[mini_train['SalePrice']<100000].GarageArea.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('SalePrice<100000')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()

mini_train[['SalePrice','OverallQual']].groupby(['OverallQual']).mean().plot.bar()

train['OverallQual'].value_counts().plot(kind="bar");

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

6-2-4 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

# scatter plot matrix
pd.plotting.scatter_matrix(train[columns],figsize=(10,10))
plt.figure()

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

6-2-5 violinplots

# violinplots on petal-length for each species
sns.violinplot(data=train,x="Functional", y="SalePrice")

6-2-6 pairplot

# Using seaborn pairplot to see the bivariate relation between each pair of features
sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.pairplot(train[columns],size = 2 ,kind ='scatter')
plt.show()

6-2-7 kdeplot

# seaborn's kdeplot, plots univariate or bivariate density estimates.
#Size can be changed by tweeking the value used
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.FacetGrid(train[columns], hue="OverallQual", size=5).map(sns.kdeplot, "YearBuilt").add_legend()
plt.show()

6-2-8 jointplot

# Use seaborn's jointplot to make a hexagonal bin plot
#Set desired size and ratio and choose a color.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="OverallQual", y="SalePrice", data=train[columns], size=10,ratio=10, kind='hex',color='green')
plt.show()

# we will use seaborn jointplot shows bivariate scatterplots and univariate histograms with Kernel density 
# estimation in the same figure
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="SalePrice", y="YearBuilt", data=train[columns], size=6, kind='kde', color='#800000', space=0)

6-2-9 Heatmap

plt.figure(figsize=(7,4)) 
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.heatmap(train[columns].corr(),annot=True,cmap='cubehelix_r') #draws  heatmap with input as the correlation matrix calculted by(iris.corr())
plt.show()

6-2-10 radviz

from pandas.tools.plotting import radviz
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
radviz(train[columns], "OverallQual")

6-2-12 Factorplot

sns.factorplot('OverallQual','SalePrice',hue='Functional',data=train)
plt.show()

6-3 Data Preprocessing

Data preprocessing refers to the transformations applied to our data before feeding it to the algorithm.

Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other words, whenever the data is gathered from different sources it is collected in raw format which is not feasible for the analysis. there are plenty of steps for data preprocessing and we just listed some of them :

removing Target column (id)
Sampling (without replacement)
Making part of iris unbalanced and balancing (with undersampling and SMOTE)
Introducing missing values and treating them (replacing by average values)
Noise filtering
Data discretization
Normalization and standardization
PCA analysis
Feature selection (filter, embedded, wrapper)
数据预处理是指在将数据提供给算法之前应用于我们数据的转换。

数据预处理是一种用于将原始数据转换为干净数据集的技术。换句话说，每当从不同来源收集数据时，就以原始格式收集数据，这对于分析是不可行的。有很多步骤可以进行数据预处理，我们只列出了一些步骤：

删除目标列（id）
取样（无需更换）
使虹膜的一部分不平衡和平衡（使用欠采样和SMOTE）
引入缺失值并对其进行处理（替换为平均值）
噪音过滤
数据离散化
规范化和标准化
PCA分析
功能选择（过滤器，嵌入式，包装器）

6-3-1 Noise filtering (Outliers)

An outlier is a data point that is distant from other similar points. Further simplifying an outlier is an observation that lies on abnormal observation amongst the normal observations in a sample set of population.

异常值是远离其他类似点的数据点。进一步简化异常值是一种观察，其在于样本集合中的正常观察中的异常观察。

In statistics, an outlier is an observation point that is distant from other observations.

# Looking for outliers, as indicated in https://ww2.amstat.org/publications/jse/v19n3/decock.pdf
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

train = train[train.GrLivArea < 4000]

2 extreme outliers on the bottom right

#deleting points
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)

#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

# Log transform the target for official scoring
#The key point is to to log_transform the numeric variables since most of them are skewed.
train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice

Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.

plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

6-4 Data Cleaning

When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.

在处理实际数据时，脏数据是常态而不是异常。我们不断需要预测正确的值，估算缺失的值，并找到各种数据伪像（如模式和记录）之间的链接。我们需要停止将数据清理视为零碎的练习（孤立地解决不同类型的错误），而是利用所有信号和资源（例如约束，可用统计和词典）来准确预测纠正措施。

6-4-1 Handle missing values

Firstly, understand that there is NO good way to deal with missing data

首先，要了解没有好的方法来处理缺失的数据

#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())

7- Model Deployment

In this section have been applied plenty of learning algorithms that play an important rule in your experiences and improve your knowledge in case of ML technique.

<< Note 3 >> : The results shown here may be slightly different for your analysis because, for example, the neural network algorithms use random number generators for fixing the initial value of the weights (starting points) of the neural networks, which often result in obtaining slightly different (local minima) solutions each time you run the analysis. Also note that changing the seed for the random number generator used to create the train, test, and validation samples can change your results.

在本节中已经应用了大量的学习算法，这些算法在您的经历中起着重要作用，并在ML技术的情况下提高您的知识。

<<注3 >>：此处显示的结果可能与您的分析略有不同，因为，例如，神经网络算法使用随机数生成器来固定神经网络的权重（起始点）的初始值，这通常是每次运行分析时，都会获得稍微不同的（局部最小值）解决方案。另请注意，更改用于创建训练，测试和验证样本的随机数生成器的种子可能会更改结果。

7-1 Families of ML algorithms

There are several categories for machine learning algorithms, below are some of these categories:

Linear
- Linear Regression
- Logistic Regression
- Support Vector Machines
Tree-Based
- Decision Tree
- Random Forest
- GBDT
KNN
Neural Networks

And if we want to categorize ML algorithms with the type of learning, there are below type:

Classification
- k-Nearest Neighbors
- LinearRegression
- SVM
- DT
- NN
clustering
- K-means
- HCA
- Expectation Maximization
Visualization and dimensionality reduction:
- Principal Component Analysis(PCA)
- Kernel PCA
- Locally -Linear Embedding (LLE)
- t-distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning
- Apriori
- Eclat
Semisupervised learning
Reinforcement Learning
- Q-learning
Batch learning & Online learning
Ensemble Learning
<< Note >>

Here is no method which outperforms all others for all tasks

7-2 Accuracy and precision

One of the most important questions to ask as a machine learning engineer when evaluating our model is how to judge our own model? each machine learning model is trying to solve a problem with a different objective using a different dataset and hence, it is important to understand the context before choosing a metric.

在评估我们的模型时，作为机器学习工程师要求的最重要问题之一是如何判断我们自己的模型？每个机器学习模型都试图使用不同的数据集来解决具有不同目标的问题，因此，在选择度量之前理解上下文非常重要。

7-2-1 RMSE

Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

预测值的对数与观察到的销售价格的对数之间的均方根误差（RMSE）。（记录日志意味着预测昂贵房屋和廉价房屋的错误将同样影响结果。）

#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice

X_train.info()

7-3 Ridge

def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)

model_ridge = Ridge()

alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean() for alpha in alphas]

7-3-1 Root Mean Squared Error

cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation")
plt.xlabel("alpha")
plt.ylabel("rmse")

# steps
steps = [('scaler', StandardScaler()),
         ('ridge', Ridge())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'ridge__alpha':np.logspace(-4, 0, 50)}

# Create the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y)

#predict on train set
y_pred_train=cv.predict(X_train)

# Predict test set
y_pred_test=cv.predict(X_test)

# rmse on train set
rmse = np.sqrt(mean_squared_error(y, y_pred_train))
print("Root Mean Squared Error: {}".format(rmse))

7-4 RandomForestClassifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

随机森林是一种元估计器，它适用于数据集的各个子样本上的多个决策树分类器，并使用平均来提高预测精度和控制过拟合。子样本大小始终与原始输入样本大小相同，但如果bootstrap = True（默认），则使用替换绘制样本。

num_test = 0.3
X_train, X_test, y_train, y_test = train_test_split(X_train, y, test_size=num_test, random_state=100)

# Fit Random Forest on Training Set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=300, random_state=0)
regressor.fit(X_train, y_train)

# Score model
regressor.score(X_train, y_train)

7-5 XGBoost

XGBoost is one of the most popular machine learning algorithm these days. Regardless of the type of prediction task at hand; regression or classification.

XGBoost是目前最流行的机器学习算法之一。无论手头的预测任务类型如何; 回归或分类。

7-5-1 But what makes XGBoost so popular?

Speed and performance : Originally written in C++, it is comparatively faster than other ensemble classifiers.
Core algorithm is parallelizable : Because the core XGBoost algorithm is parallelizable it can harness the power of multi-core computers. It is also parallelizable onto GPU’s and across networks of computers making it feasible to train on very large datasets as well.
Consistently outperforms other algorithm methods : It has shown better performance on a variety of machine learning benchmark datasets.
Wide variety of tuning parameters : XGBoost internally has parameters for cross-validation, regularization, user-defined objective functions, missing values, tree parameters, scikit-learn compatible API etc.[10]

XGBoost (Extreme Gradient Boosting) belongs to a family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized distributed gradient boosting library. But wait, what is boosting? Well, keep on reading.

速度和性能：最初用C ++编写，比其他整体分类器要快。
核心算法是可并行化的：因为核心XGBoost算法是可并行化的，所以它可以利用多核计算机的强大功能。它还可以并行化到GPU和计算机网络上，因此也可以在非常大的数据集上进行训练。
始终如一地优于其他算法方法：它在各种机器学习基准数据集上表现出更好的性能。
各种调整参数：XGBoost内部具有交叉验证，正则化，用户定义的目标函数，缺失值，树参数，scikit-learn兼容API等参数。[10]

XGBoost（Extreme Gradient Boosting）属于一系列增强算法，并在其核心使用梯度增强（GBM）框架。它是一个优化的分布式梯度增强库。但等等，是什么促进？好吧，继续阅读。

# Initialize model
from xgboost.sklearn import XGBRegressor
XGB_Regressor = XGBRegressor()                  

# Fit the model on our data
XGB_Regressor.fit(X_train, y_train)

# Score model
XGB_Regressor.score(X_train, y_train)

7-6 LassoCV

Lasso linear model with iterative fitting along a regularization path. The best model is selected by cross-validation.

lasso=LassoCV()

# Fit the model on our data
lasso.fit(X_train, y_train)

# Score model
lasso.score(X_train, y_train)

7-7 GradientBoostingRegressor

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

boostingregressor=GradientBoostingRegressor()

# Fit the model on our data
boostingregressor.fit(X_train, y_train)

# Score model
boostingregressor.score(X_train, y_train)

7-8 DecisionTree

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
dt = DecisionTreeRegressor(random_state=1)

# Fit model
dt.fit(X_train, y_train)

dt.score(X_train, y_train)

7-9 ExtraTreeRegressor

from sklearn.tree import ExtraTreeRegressor

dtr = ExtraTreeRegressor()

# Fit model
dtr.fit(X_train, y_train)

# Fit model
dtr.score(X_train, y_train)

8- Conclusion

This kernel is not completed yet, I will try to cover all the parts related to the process of ML with a variety of Python packages and I know that there are still some problems then I hope to get your feedback to improve it.

9- References

Https://skymind.ai/wiki/machine-learning-workflow
Problem-define
Sklearn
Machine-learning-in-python-step-by-step
Data Cleaning
Kaggle kernel
Choosing-the-right-metric-for-machine-learning-models-part
Unboxing outliers in machine learning
How to handle missing data
Datacamp

你可能感兴趣的:(AI程序员,算法,机器学习,数据科学,Python,程序员,特征工程)

python 读excel每行替换_Python脚本操作Excel实现批量替换功能 weixin_39646695 python 读excel每行替换
Python脚本操作Excel实现批量替换功能大家好，给大家分享下如何使用Python脚本操作Excel实现批量替换。使用的工具Openpyxl，一个处理excel的python库，处理excel，其实针对的就是WorkBook，Sheet，Cell这三个最根本的元素~明确需求原始excel如下我们的目标是把下面excel工作表的sheet1表页A列的内容“替换我吧”批量替换为B列的“我用来替换的
day15｜前端框架学习和算法 universe_01 前端算法笔记
T22括号生成先把所有情况都画出来，然后（在满足什么情况下）把不符合条件的删除。T78子集要画树状图，把思路清晰。可以用暴力法、回溯法和DFS做这个题DFS深度搜索：每个边都走完，再回溯应用：二叉树搜索，图搜索回溯算法=DFS+剪枝T200岛屿数量（非常经典BFS宽度把树状转化成队列形式，lambda匿名函数“一次性的小函数，没有名字”setup语法糖：让代码更简洁好写的语法ref创建：基本类型的
C++ 计数排序、归并排序、快速排序每天搬一点点砖 c++数据结构算法
计数排序：是一种基于哈希的排序算法。他的基本思想是通过统计每个元素的出现次数，然后根据统计结果将元素依次放入排序后的序列中。这种排序算法适用于范围较小的情况，例如整数范围在0到k之间计数排序步骤：1初始化一个长度为最大元素值加1的计数数组，所有元素初始化为02遍历原始数组，将每个元素值作为索引，在计数数组中对应位置加13将数组清空4遍历计数器数组，按照数组中的元素个数放回到元数组中计数排序的优点和
英伟达靠什么支撑起了4万亿？AI泡沫还能撑多久？
英伟达市值突破4万亿美元，既是AI算力需求爆发的直接体现，也暗含市场对未来的狂热预期。其支撑逻辑与潜在风险并存，而AI泡沫的可持续性则取决于技术、商业与地缘政治的复杂博弈。⚙️一、英伟达4万亿市值的核心支撑因素技术垄断与生态壁垒硬件优势：英伟达GPU在AI训练市场占有率超87%，H100芯片的FP16算力达1979TFLOPS，领先竞品3-5倍。CUDA生态：400万开发者构建的软件护城河，成为A
Spring进阶 - SpringMVC实现原理之DispatcherServlet处理请求的过程倾听铃的声后端 spring java mvc 开发语言分布式
前文我们有了IOC的源码基础以及SpringMVC的基础，我们便可以进一步深入理解SpringMVC主要实现原理，包含DispatcherServlet的初始化过程和DispatcherServlet处理请求的过程的源码解析。本文是第二篇：DispatcherServlet处理请求的过程的源码解析。@pdaiSpring进阶-SpringMVC实现原理之DispatcherServlet处理请求的
【C++算法】76.优先级队列_前 K 个高频单词流星白龙优选算法C++c++算法开发语言
文章目录题目链接：题目描述：解法C++算法代码：题目链接：692.前K个高频单词题目描述：解法利用堆来解决TopK问题预处理一下原始的字符串数组，用一个哈希表统计一下每一个单词出现的频次。创建一个大小为k的堆频次：小根堆字典序（频次相同的时候）：大根堆循环让元素依次进堆判断提取结果C++算法代码：classSolution{//定义类型别名，PSI表示对typedefpairPSI;//自定义比较
企业级区块链平台Hyperchain核心原理剖析 boyedu 区块链区块链企业级区块链平台 Hyperchain
Hyperchain作为国产自主可控的企业级联盟区块链平台，其核心原理围绕高性能共识、隐私保护、智能合约引擎及可扩展架构展开，通过多模块协同实现企业级区块链网络的高效部署与安全运行。以下从核心架构、关键技术、性能优化、安全机制、应用场景五个维度展开剖析：一、核心架构：分层解耦与模块化设计Hyperchain采用分层架构，将区块链功能解耦为独立模块，支持灵活组合与扩展：P2P网络层由验证节点（VP）
力扣面试题07 - 旋转矩阵茶猫_ leetcode 矩阵算法 c语言
题目：给你一幅由N×N矩阵表示的图像，其中每个像素的大小为4字节。请你设计一种算法，将图像旋转90度。不占用额外内存空间能否做到？示例1:给定matrix=[[1,2,3],[4,5,6],[7,8,9]],原地旋转输入矩阵，使其变为:[[7,4,1],[8,5,2],[9,6,3]]示例2:给定matrix=[[5,1,9,11],[2,4,8,10],[13,3,6,7],[15,14,12,
通义万相2.2：开启高清视频生成新纪元 Liudef06小白特殊专栏 AIGC 人工智能人工智能通义万相2.2 图生视频
通义万相2.2：开启高清视频生成新纪元2025年7月28日，中国AI领域迎来里程碑时刻——通义万相团队正式开源其革命性视频生成模型Wan2.2的核心权重，这标志着开源社区首次获得支持720P高清视频生成的先进模型架构。一、架构革新：混合专家系统1.1MoE视频扩散架构通义万相2.2首次将混合专家（MoE）架构引入视频扩散模型，通过双专家系统实现计算效率与模型容量的平衡：classMoEVideoD
模拟退火(SA)：如何“故意走错路”，才能找到最优解？小瑞瑞acd 小瑞瑞学数模模拟退火算法 python 启发式算法算法
模拟退火(SA)：如何“故意走错路”，才能找到最优解？图示模拟退火算法如何通过接受较差解（橙色虚线标注）从局部最优（绿色点）逃逸，最终找到全局最优解（紫色点），展示其跳出局部极小值的能力。大家好，我是小瑞瑞！欢迎回到我的专栏！想象一下，你站在一座连绵不绝的山脉中，目标是找到海拔最低的那个山谷。你手上只有一个高度计，视野被浓雾笼罩，只能看清脚下的一小片区域。如果你是一个“贪心”的登山者，你的策略会非
2025年SDK游戏盾终极解析：重新定义手游安全的“隐形护甲” 上海云盾商务经理杨杨游戏安全
副标题：从客户端加密到AI反外挂，拆解全链路防护如何重塑游戏攻防天平引言：当传统高防在手游战场“失效”2025年全球手游市场规模突破$2000亿，黑客单次攻击成本却降至$30——某SLG游戏因协议层CC攻击单日流失37%玩家，某开放世界游戏遭低频DDoS瘫痪6小时损失千万。传统高防IP的致命短板暴露无遗：无法识别伪造客户端流量、难防协议篡改、误杀率超15%。而集成于游戏终端的SDK游戏盾，正以“源
LVS+Keepalived实现高可用和负载均衡 2401_84412895 程序员 lvs 负载均衡运维
2、开启网卡子接口配置VIP[root@a~]#cd/etc/sysconfig/network-scripts/[root@anetwork-scripts]#cp-aifcfg-ens32ifcfg-ens32:0[root@anetwork-scripts]#catifcfg-ens32:0BOOTPROTO=staticDEVICE=ens32:0ONBOOT=yesIPADDR=10.1
CodeFoeces-450B ss5smi
题目原题链接：B.JzzhuandSequences题意根据公式公式计算对应fn的值。参考了其他作者的代码和思路。找循环点。负数取余需要加取余数到>0为止才可取余。代码#includeusingnamespacestd;constintmod=1e9+7;intmain(){longlongf[10],x,y,n;cin>>x>>y>>n;x=(x+mod)%mod;y=(y+mod)%mod;f
编程算法：技术创新的引擎与业务增长的核心驱动力
在数字经济时代，算法已成为推动技术创新与业务增长的隐形引擎。从存内计算突破冯·诺依曼瓶颈，到动态规划优化万亿级金融交易，编程算法正在重塑产业竞争格局。一、存内计算：突破冯·诺依曼瓶颈的算法革命1.1存内计算的基本原理传统计算架构中90%的能耗消耗在数据搬运上。存内计算（Processing-in-Memory）通过直接在存储单元执行计算，实现能效10-100倍提升：#传统计算vs存内计算能耗模型i
图论算法经典题目解析：DFS、BFS与拓扑排序实战周童學数据结构与算法深度优先算法图论
图论算法经典题目解析：DFS、BFS与拓扑排序实战图论问题是算法面试中的高频考点，本博客将通过四道LeetCode经典题目（均来自"Top100Liked"题库），深入讲解图论的核心算法思想和实现技巧。涵盖DFS、BFS、拓扑排序和前缀树等知识点，每道题配有Java实现和易错点分析。1.岛屿数量(DFS遍历)问题描述给定一个由'1'(陆地)和'0'(水)组成的二维网格，计算岛屿的数量。岛屿由水平或
【异常】使用 LiteFlow 框架时，提示错误ChainDuplicateException: [chain name duplicate] chainName=categoryChallenge 本本本添哥 002 -进阶开发能力 java
一、报错内容Causedby:com.yomahub.liteflow.exception.ChainDuplicateException:[chainnameduplicate]chainName=categoryChallengeatcom.yomahub.liteflow.parser.helper.ParserHelper.lambda$null$0(ParserHelper.java:1
代码随想录算法训练营第三十五天
01背包问题二维题目链接01背包问题二维题解importjava.util.Scanner;publicclassMain{publicstaticvoidmain(String[]args){Scannersc=newScanner(System.in);intM=sc.nextInt();intN=sc.nextInt();int[]space=newint[M];int[]value=new
为了在未来的人工智能世界中取得成功，学生们必须学习人类写作的优点睿邸管家
澳大利亚各地的学生在新学年开始使用铅笔、钢笔和键盘学习写字。在工作场所，机器也在学习写作，如此有效，几年之内，它们可能会写得比人类更好。有时它们已经做到了，就像Grammarly这样的应用程序所展示的那样。当然，人类现在的日常写作可能很快就会由具有人工智能(AI)的机器来完成。手机和电子邮件软件常用的预测文本是无数人每天都在使用的一种人工智能写作形式。据AI行业研究机构称，到2022年，人工智能及
AI模型训练中过拟合和欠拟合的区别是什么？ workflower 人工智能算法人工智能数据分析
在AI模型训练中，过拟合和欠拟合是两种常见的模型性能问题，核心区别在于模型对数据的学习程度和泛化能力：欠拟合（Underfitting）-定义：模型未能充分学习到数据中的规律，对训练数据的拟合程度较差，在训练集和测试集上的表现都不好（如准确率低、损失值高）。-原因：-模型结构过于简单（如用线性模型解决非线性问题）；-训练数据量不足或特征信息不充分；-训练时间太短，模型尚未学到有效模式。-表现：训练
Selenium 特殊控件操作与 ActionChains 实践详解小馋喵知识杂货铺 selenium 测试工具
1.下拉框单选操作(a)使用SeleniumSelect类（标准HTML标签）Selenium提供了内置的Select类用于操作标准下拉框，这种方式简单且直观。fromselenium.webdriver.support.uiimportSelect#定位下拉框dropdown=Select(driver.find_element("id","dropdown_id"))#通过以下三种方式选择单个
python笔记14介绍几个魔法方法抢公主的大魔王 python python
python笔记14介绍几个魔法方法先声明一下各位大佬，这是我的笔记。如有错误，恳请指正。另外，感谢您的观看，谢谢啦！(1).__doc__输出对应的函数，类的说明文档print(print.__doc__)print(value,...,sep='',end='\n',file=sys.stdout,flush=False)Printsthevaluestoastream,ortosys.std
Anaconda 和 Miniconda：功能详解与选择建议古月฿ python入门 python conda
Anaconda和Miniconda详细介绍一、Anaconda的详细介绍1.什么是Anaconda？Anaconda是一个开源的包管理和环境管理工具，在数据科学、机器学习以及科学计算领域发挥着关键作用。它以Python和R语言为基础，为用户精心准备了大量预装库和工具，极大地缩短了搭建数据科学环境的时间。对于那些想要快速开展数据分析、模型训练等工作的人员来说，Anaconda就像是一个一站式的“数
环境搭建 | Python + Anaconda / Miniconda + PyCharm 的安装、配置与使用
本文将分别介绍Python、Anaconda/Miniconda、PyCharm的安装、配置与使用，详细介绍Python环境搭建的全过程，涵盖Python、Pip、PythonLauncher、Anaconda、Miniconda、Pycharm等内容，以官方文档为参照，使用经验为补充，内容全面而详实。由于图片太多，就先贴一个无图简化版吧，详情请查看Python+Anaconda/Minicond
你竟然还在用克隆删除？Conda最新版rename命令全攻略！曦紫沐 Python基础知识 conda 虚拟环境管理
文章摘要Conda虚拟环境管理终于迎来革命性升级！本文揭秘Conda4.9+版本新增的rename黑科技，彻底告别传统“克隆+删除”的繁琐操作。从命令解析到实战案例，手把手教你如何安全高效地重命名Python虚拟环境，附带版本检测、环境迁移、故障排查等进阶技巧，助你提升开发效率10倍！一、颠覆认知：Conda居然自带重命名功能？很多开发者仍停留在“Conda无法直接重命名环境”的认知阶段，实际上自
centos7安装配置 Anaconda3
Anaconda是一个用于科学计算的Python发行版,Anaconda于Python，相当于centos于linux。下载[root@testsrc]#mwgethttps://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.2.0-Linux-x86_64.shBegintodownload:Anaconda3-5.2.0-L
Pandas：数据科学的超级瑞士军刀科技林总 DeepSeek学AI 人工智能
**——从零基础到高效分析的进化指南**###**一、Pandas诞生：数据革命的救世主****2010年前的数据分析噩梦**：```python#传统Python处理表格数据data=[]forrowincsv_file:ifrow[3]>100androw[2]=="China":data.append(float(row[5])#代码冗长易错！```**核心痛点**：-Excel处理百万行崩
Zread.AI：一键将GitHub项目转化为结构化中文手册的AI代码维基工具
Zread.AI：一键将GitHub项目转化为结构化中文手册的AI代码维基工具文章来源：PoixeAI文章目录Zread.AI工具概述核心功能优势亮点典型应用场景上手指南注意事项官网地址Zread.AI由智谱Z.ai推出，是一款面向开发者的AI代码维基工具，可在几秒内把任何公开GitHub仓库转化为结构化中文手册，并通过独家Buzz面板聚合commits、issues与相关新闻，让项目脉搏一目了然
学C++的五大惊人好处
为什么要学c++学c++有什么用学习c++的好处有1.中考可以加分2.高考可能直接录取3.就业广且工资高4.在未来30--50年c++一定是一个很受欢迎的职业5.c++成功的例子deepsick等AI智能C++语言兼备编程效率和编译运行效率的语言C++语言是C语言功能增强版,在c语言的基础上添加了面向对象编程和泛型编程的支持既继承了C语言高效，简洁，快速和可移植的传统,又具备类似Java、Go等其
机器学习必备数学与编程指南：从入门到精通 a小胡哦机器学习基础机器学习人工智能
一、机器学习核心数学基础1.线性代数（神经网络的基础）必须掌握：矩阵运算（乘法、转置、逆）向量空间与线性变换特征值分解与奇异值分解(SVD)为什么重要：神经网络本质就是矩阵运算学习技巧：用NumPy实际操作矩阵运算2.概率与统计（模型评估的关键）核心概念：条件概率与贝叶斯定理概率分布（正态、泊松、伯努利）假设检验与p值应用场景：朴素贝叶斯、A/B测试3.微积分（优化算法的基础）重点掌握：导数与偏导
Android GreenDao介绍和Generator生成表对象代码
目录(?)[-]介绍创建工程转载请注明：http://blog.csdn.net/sinat_30276961/article/details/50052109最近无意中发现了GreenDao，然后查看了一些资料后，发现这个数据库框架很适合用，于是乎，查看了官网的api，并自己写了一个小应用总结一下它的使用方法。介绍按照国际惯例，在开篇，总要先介绍一下什么是GreenDao吧。首先需要说明的是Gr
分享100个最新免费的高匿HTTP代理IP mcj8089 代理IP 代理服务器匿名代理免费代理IP 最新代理IP
推荐两个代理IP网站： 1. 全网代理IP：http://proxy.goubanjia.com/ 2. 敲代码免费IP：http://ip.qiaodm.com/ 120.198.243.130:80,中国/广东省 58.251.78.71:8088,中国/广东省 183.207.228.22:83,中国/
mysql高级特性之数据分区 annan211 java 数据结构 mongodb 分区 mysql
mysql高级特性 1 以存储引擎的角度分析，分区表和物理表没有区别。是按照一定的规则将数据分别存储的逻辑设计。器底层是由多个物理字表组成。 2 分区的原理分区表由多个相关的底层表实现，这些底层表也是由句柄对象表示，所以我们可以直接访问各个分区。存储引擎管理分区的各个底层表和管理普通表一样(所有底层表都必须使用相同的存储引擎)，分区表的索引只是
JS采用正则表达式简单获取URL地址栏参数 chiangfai js 地址栏参数获取
GetUrlParam:function GetUrlParam(param){ var reg = new RegExp("(^|&)"+ param +"=([^&]*)(&|$)"); var r = window.location.search.substr(1).match(reg); if(r!=null
怎样将数据表拷贝到powerdesigner (本地数据库表) Array_06 powerDesigner
================================================== 1、打开PowerDesigner12，在菜单中按照如下方式进行操作 file->Reverse Engineer->DataBase 点击后，弹出 New Physical Data Model 的对话框 2、在General选项卡中 Model name:模板名字，自
logbackのhelloworld 飞翔的马甲日志 logback
一、概述 1.日志是啥？当我是个逗比的时候我是这么理解的：log.debug()代替了system.out.print(); 当我项目工作时，以为是一堆得.log文件。这两天项目发布新版本，比较轻松，决定好好地研究下日志以及logback。传送门1：日志的作用与方法： http://www.infoq.com/cn/articles/why-and-how-log 上面的作
新浪微博爬虫模拟登陆随意而生新浪微博
转载自：http://hi.baidu.com/erliang20088/item/251db4b040b8ce58ba0e1235 近来由于毕设需要，重新修改了新浪微博爬虫废了不少劲，希望下边的总结能够帮助后来的同学们。现行版的模拟登陆与以前相比，最大的改动在于cookie获取时候的模拟url的请求
synchronized 香水浓 java thread
Java语言的关键字，可用来给对象和方法或者代码块加锁，当它锁定一个方法或者一个代码块的时候，同一时刻最多只有一个线程执行这段代码。当两个并发线程访问同一个对象object中的这个加锁同步代码块时，一个时间内只能有一个线程得到执行。另一个线程必须等待当前线程执行完这个代码块以后才能执行该代码块。然而，当一个线程访问object的一个加锁代码块时，另一个线程仍然
maven 简单实用教程 AdyZhang maven
1. Maven介绍 1.1. 简介 java编写的用于构建系统的自动化工具。目前版本是2.0.9，注意maven2和maven1有很大区别，阅读第三方文档时需要区分版本。 1.2. Maven资源见官方网站；The 5 minute test，官方简易入门文档；Getting Started Tutorial，官方入门文档；Build Coo
Android 通过 intent传值获得null aijuans android
我在通过intent 获得传递兑现过的时候报错，空指针,我是getMap方法进行传值，代码如下 1 2 3 4 5 6 7 8 9 public void getMap(View view){ Intent i =
apache 做代理报如下错误：The proxy server received an invalid response from an upstream baalwolf response
网站配置是apache＋tomcat,tomcat没有报错，apache报错是： The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET /. Reason: Error reading fr
Tomcat6 内存和线程配置 BigBird2012 tomcat6
1、修改启动时内存参数、并指定JVM时区（在windows server 2008 下时间少了8个小时）在Tomcat上运行j2ee项目代码时，经常会出现内存溢出的情况，解决办法是在系统参数中增加系统参数： window下，在catalina.bat最前面 set JAVA_OPTS=-XX:PermSize=64M -XX:MaxPermSize=128m -Xms5
Karam与TDD bijian1013 Karam TDD
一.TDD 测试驱动开发（Test-Driven Development,TDD）是一种敏捷（AGILE）开发方法论，它把开发流程倒转了过来，在进行代码实现之前，首先保证编写测试用例，从而用测试来驱动开发（而不是把测试作为一项验证工具来使用）。 TDD的原则很简单： a.只有当某个
[Zookeeper学习笔记之七]Zookeeper源代码分析之Zookeeper.States bit1129 zookeeper
public enum States { CONNECTING, //Zookeeper服务器不可用，客户端处于尝试链接状态 ASSOCIATING, //？？？ CONNECTED, //链接建立，可以与Zookeeper服务器正常通信 CONNECTEDREADONLY, //处于只读状态的链接状态，只读模式可以在
【Scala十四】Scala核心八：闭包 bit1129 scala
Free variable A free variable of an expression is a variable that’s used inside the expression but not defined inside the expression. For instance, in the function literal expression (x: Int) => (x
android发送json并解析返回json ronin47 android
package com.http.test; import org.apache.http.HttpResponse; import org.apache.http.HttpStatus; import org.apache.http.client.HttpClient; import org.apache.http.client.methods.HttpGet; import
一份IT实习生的总结 brotherlamp PHP php资料 php教程 php培训 php视频
今天突然发现在不知不觉中自己已经实习了 3 个月了，现在可能不算是真正意义上的实习吧，因为现在自己才大三，在这边撸代码的同时还要考虑到学校的功课跟期末考试。让我震惊的是，我完全想不到在这 3 个月里我到底学到了什么，这是一件多么悲催的事情啊。同时我对我应该 get 到什么新技能也很迷茫。所以今晚还是总结下把，让自己在接下来的实习生活有更加明确的方向。最后感谢工作室给我们几个人这个机会让我们提前出来
据说是2012年10月人人网校招的一道笔试题-给出一个重物重量为X,另外提供的小砝码重量分别为1，3，9。。。3^N。将重物放到天平左侧，问在两边如何添加砝码 bylijinnan java
public class ScalesBalance { /** * 题目： * 给出一个重物重量为X,另外提供的小砝码重量分别为1，3，9。。。3^N。（假设N无限大，但一种重量的砝码只有一个） * 将重物放到天平左侧，问在两边如何添加砝码使两边平衡 * * 分析： * 三进制 * 我们约定括号表示里面的数是三进制，例如 47=(1202
dom4j最常用最简单的方法 chiangfai dom4j
要使用dom4j读写XML文档,需要先下载dom4j包,dom4j官方网站在 http://www.dom4j.org/目前最新dom4j包下载地址:http://nchc.dl.sourceforge.net/sourceforge/dom4j/dom4j-1.6.1.zip 解开后有两个包,仅操作XML文档的话把dom4j-1.6.1.jar加入工程就可以了,如果需要使用XPath的话还需要
简单HBase笔记 chenchao051 hbase
一、Client-side write buffer 客户端缓存请求描述：可以缓存客户端的请求，以此来减少RPC的次数，但是缓存只是被存在一个ArrayList中，所以多线程访问时不安全的。可以使用getWriteBuffer()方法来取得客户端缓存中的数据。默认关闭。二、Scan的Caching 描述： next( )方法请求一行就要使用一次RPC,即使
mysqldump导出时出现when doing LOCK TABLES daizj mysql mysqdump 导数据
　　执行　mysqldump -uxxx -pxxx -hxxx -Pxxxx database tablename > tablename.sql　导出表时，会报 mysqldump: Got error: 1044: Access denied for user 'xxx'@'xxx' to database 'xxx' when doing LOCK TABLES 解决
CSS渲染原理 dcj3sjt126com Web
从事Web前端开发的人都与CSS打交道很多，有的人也许不知道css是怎么去工作的，写出来的css浏览器是怎么样去解析的呢？当这个成为我们提高css水平的一个瓶颈时，是否应该多了解一下呢？一、浏览器的发展与CSS
《阿甘正传》台词 dcj3sjt126com
Part Ⅰ: 《阿甘正传》Forrest Gump经典中英文对白 Forrest: Hello! My names Forrest. Forrest Gump. You wanna Chocolate? I could eat about a million and a half othese. My momma always said life was like a box ochocol
Java处理JSON dyy_gusi json
Json在数据传输中很好用，原因是JSON 比 XML 更小、更快，更易解析。在Java程序中，如何使用处理JSON，现在有很多工具可以处理，比较流行常用的是google的gson和alibaba的fastjson，具体使用如下： 1、读取json然后处理 class ReadJSON { public static void main(String[] args)
win7下nginx和php的配置 geeksun nginx
1. 安装包准备 nginx : 从nginx.org下载nginx-1.8.0.zip php：从php.net下载php-5.6.10-Win32-VC11-x64.zip， php是免安装文件。 RunHiddenConsole: 用于隐藏命令行窗口 2. 配置 # java用8080端口做应用服务器，nginx反向代理到这个端口即可 p
基于2.8版本redis配置文件中文解释 hongtoushizi redis
转载自： http://wangwei007.blog.51cto.com/68019/1548167 在Redis中直接启动redis-server服务时, 采用的是默认的配置文件。采用redis-server xxx.conf 这样的方式可以按照指定的配置文件来运行Redis服务。下面是Redis2.8.9的配置文
第五章常用Lua开发库3-模板渲染 jinnianshilongnian nginx lua
动态web网页开发是Web开发中一个常见的场景，比如像京东商品详情页，其页面逻辑是非常复杂的，需要使用模板技术来实现。而Lua中也有许多模板引擎，如目前我在使用的lua-resty-template，可以渲染很复杂的页面，借助LuaJIT其性能也是可以接受的。如果学习过JavaEE中的servlet和JSP的话，应该知道JSP模板最终会被翻译成Servlet来执行；而lua-r
JZSearch大数据搜索引擎颠覆者 JavaScript
系统简介：大数据的特点有四个层面：第一，数据体量巨大。从TB级别，跃升到PB级别；第二，数据类型繁多。网络日志、视频、图片、地理位置信息等等。第三，价值密度低。以视频为例，连续不间断监控过程中，可能有用的数据仅仅有一两秒。第四，处理速度快。最后这一点也是和传统的数据挖掘技术有着本质的不同。业界将其归纳为4个“V”——Volume，Variety，Value，Velocity。大数据搜索引
10招让你成为杰出的Java程序员 pda158 java 编程框架
如果你是一个热衷于技术的 Java 程序员，那么下面的 10 个要点可以让你在众多 Java 开发人员中脱颖而出。　　 1. 拥有扎实的基础和深刻理解 OO 原则　　对于 Java 程序员，深刻理解 Object Oriented Programming（面向对象编程）这一概念是必须的。没有 OOPS 的坚实基础，就领会不了像 Java 这些面向对象编程语言
tomcat之oracle连接池配置小网客 oracle
tomcat版本7.0 配置oracle连接池方式：修改tomcat的server.xml配置文件： <GlobalNamingResources> <Resource name="utermdatasource" auth="Container" type="javax.sql.DataSou
Oracle 分页算法汇总 vipbooks oracle sql 算法 .net
这是我找到的一些关于Oracle分页的算法，大家那里还有没有其他好的算法没？我们大家一起分享一下！ -- Oracle 分页算法一 select * from ( select page.*,rownum rn from (select * from help) page -- 20 = (currentPag