Apply all the data-handling methods learned so far in one pass, so the processed data can be reused directly later.
Then start machine-learning modeling (simple baselines, no hyperparameter tuning) and evaluation.
import pandas as pd  # data manipulation and analysis for tabular data
import matplotlib.pyplot as plt  # plotting of all common chart types
import seaborn as sns  # statistical plotting library built on matplotlib
import numpy as np  # numerical computing with efficient array operations
# Configure a Chinese-capable font (fixes CJK rendering in plots)
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei, a common Windows font
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
# Load the data
dt = pd.read_csv('data.csv')
print('Basic dataset info:')
print(dt.info())
print('\nPreview of the first five rows:')
print(dt.head())
Output:
Basic dataset info:
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 7500 non-null int64
1 Home Ownership 7500 non-null object
2 Annual Income 5943 non-null float64
3 Years in current job 7129 non-null object
4 Tax Liens 7500 non-null float64
5 Number of Open Accounts 7500 non-null float64
6 Years of Credit History 7500 non-null float64
7 Maximum Open Credit 7500 non-null float64
8 Number of Credit Problems 7500 non-null float64
9 Months since last delinquent 3419 non-null float64
10 Bankruptcies 7486 non-null float64
11 Purpose 7500 non-null object
12 Term 7500 non-null object
13 Current Loan Amount 7500 non-null float64
14 Current Credit Balance 7500 non-null float64
15 Monthly Debt 7500 non-null float64
16 Credit Score 5943 non-null float64
17 Credit Default 7500 non-null int64
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB
None
Preview of the first five rows:
Id Home Ownership Annual Income Years in current job Tax Liens \
0 0 Own Home 482087.0 NaN 0.0
1 1 Own Home 1025487.0 10+ years 0.0
2 2 Home Mortgage 751412.0 8 years 0.0
3 3 Own Home 805068.0 6 years 0.0
4 4 Rent 776264.0 8 years 0.0
Number of Open Accounts Years of Credit History Maximum Open Credit \
0 11.0 26.3 685960.0
1 15.0 15.3 1181730.0
2 11.0 35.0 1182434.0
3 8.0 22.5 147400.0
4 13.0 13.6 385836.0
Number of Credit Problems Months since last delinquent Bankruptcies \
0 1.0 NaN 1.0
1 0.0 NaN 0.0
2 0.0 NaN 0.0
3 1.0 NaN 1.0
4 1.0 NaN 0.0
Purpose Term Current Loan Amount \
0 debt consolidation Short Term 99999999.0
1 debt consolidation Long Term 264968.0
2 debt consolidation Short Term 99999999.0
3 debt consolidation Short Term 121396.0
4 debt consolidation Short Term 125840.0
Current Credit Balance Monthly Debt Credit Score Credit Default
0 47386.0 7914.0 749.0 0
1 394972.0 18373.0 737.0 1
2 308389.0 13651.0 742.0 0
3 95855.0 11338.0 694.0 0
4 93309.0 7180.0 719.0 0
Setting aside the Id column and the Credit Default target, there are 16 features to handle:
- Annual Income: 5943 non-null values, float. Missing values can be filled with the median or mean, or predicted by regressing on related features. For example, if Annual Income correlates with Home Ownership, fill each gap with the mean income of samples sharing that home-ownership type (see the sketch after this list).
- Years in current job: 7129 non-null values, but stored as object. Convert it to a numeric type first, e.g. "10+ years" to 10 and "6 years" to 6, then fill the missing values with the mode or median. The conversion also makes downstream computation and modeling easier.
- Months since last delinquent: only 3419 non-null values, so a lot is missing. If the feature matters for the target, try a heavier method such as multiple imputation; if it matters little, the rows with missing values can simply be dropped, though that may cost a large share of the data.
- Credit Score: 5943 non-null values; handle it the same way as Annual Income.
- Home Ownership, Purpose, Term: label encoding or one-hot encoding. If a feature has few categories with no obvious order, one-hot encoding fits, e.g. Home Ownership and Purpose; a feature with an inherent order, such as Term's 'Short Term' vs 'Long Term', suits label encoding.
- For numeric features such as Annual Income and Maximum Open Credit, detect outliers with box plots. If outliers exist, decide case by case: values caused by data-entry errors can be corrected or deleted; genuine extreme values may need to be kept, though some models then call for special handling, e.g. robust statistics or a transformation.
- Scale the numeric features onto a common scale so that large-magnitude features do not dominate the model. Common methods are normalization (Min-Max scaling) and standardization (Z-score). For example, Annual Income and Years of Credit History span very different ranges; scaling maps them onto [0, 1] or onto a zero-mean, unit-variance distribution.
- Derive new features: building features from existing ones can lift model performance. For example, compute a Debt-to-Income Ratio, Monthly Debt divided by Annual Income, to reflect a customer's debt burden.
- Feature selection: use correlation analysis and similar methods to keep the features most related to the target Credit Default and drop weakly related or redundant ones, reducing model complexity and the risk of overfitting.
In practice the order is: fill missing values → convert data types → screen outliers → scale features → feature engineering. This order keeps the preprocessing consistent and valid and supplies high-quality data for the model training that follows.
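Before the actual, simpler pipeline below, here is a minimal sketch of the heavier options the list mentions: group-based imputation, numeric parsing of Years in current job, an IQR outlier screen, Z-score scaling, and the derived ratio. It is illustrative only; the specific choices (mean per group, the 1.5×IQR rule, which columns get scaled) are my assumptions, and the note's real pipeline below uses mode filling instead.
import pandas as pd

df = pd.read_csv('data.csv')

# Group-based imputation: fill Annual Income with the mean income of
# samples that share the same Home Ownership category.
df['Annual Income'] = df['Annual Income'].fillna(
    df.groupby('Home Ownership')['Annual Income'].transform('mean')
)

# Parse "Years in current job": "10+ years" -> 10, "6 years" -> 6, "< 1 year" -> 0.
df['Years in current job'] = (
    df['Years in current job']
    .str.extract(r'(\d+)', expand=False)  # pull out the leading number
    .astype(float)
    .where(df['Years in current job'] != '< 1 year', 0)  # special-case "< 1 year"
)

# IQR-based outlier screen on one numeric column (flag rather than drop).
q1, q3 = df['Maximum Open Credit'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['Maximum Open Credit'] < q1 - 1.5 * iqr) | (df['Maximum Open Credit'] > q3 + 1.5 * iqr)
print(f'{mask.sum()} potential outliers flagged in Maximum Open Credit')

# Derived feature (computed on raw values, before any scaling).
df['Debt-to-Income Ratio'] = df['Monthly Debt'] / df['Annual Income']

# Z-score standardization of one wide-ranging column.
col = df['Years of Credit History']
df['Years of Credit History'] = (col - col.mean()) / col.std()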
The concrete processing follows. Since the model ultimately takes numeric input, handle the string variables first.
# Select the string (object) columns
discrete_features = dt.select_dtypes(include=['object']).columns.to_list()
discrete_features
Output:
['Home Ownership', 'Years in current job', 'Purpose', 'Term']
# Inspect the categories of each discrete feature
for feature in discrete_features:
    print(f'\nCategories of {feature}:')
    print(dt[feature].value_counts())
Output:
Categories of Home Ownership:
Home Ownership
Home Mortgage 3637
Rent 3204
Own Home 647
Have Mortgage 12
Name: count, dtype: int64
Categories of Years in current job:
Years in current job
10+ years 2332
2 years 705
3 years 620
< 1 year 563
5 years 516
1 year 504
4 years 469
6 years 426
7 years 396
8 years 339
9 years 259
Name: count, dtype: int64
Categories of Purpose:
Purpose
debt consolidation 5944
other 665
home improvements 412
business loan 129
buy a car 96
medical bills 71
major purchase 40
take a trip 37
buy house 34
small business 26
wedding 15
moving 11
educational expenses 10
vacation 8
renewable energy 2
Name: count, dtype: int64
Categories of Term:
Term
Short Term 5556
Long Term 1944
Name: count, dtype: int64
1. Label-encode Home Ownership
- Ordered by loan-risk level (ability to absorb risk): Own Home < Rent < Have Mortgage < Home Mortgage
2. Label-encode Years in current job
3. One-hot encode Purpose
4. Term just needs a 0-1 mapping
# Nested mapping dictionary for the label encodings
mapping = {
'Home Ownership':{
'Home Mortgage': 3,
'Rent': 1,
'Own Home': 0,
'Have Mortgage': 2
},
'Years in current job':{
'< 1 year': 0,
'1 year': 1,
'2 years': 2,
'3 years': 3,
'4 years': 4,
'5 years': 5,
'6 years': 6,
'7 years': 7,
'8 years': 8,
'9 years': 9,
'10+ years': 10
},
'Term':{
'Short Term': 0,
'Long Term': 1
}
}
# Apply the mappings with map()
dt['Home Ownership'] = dt['Home Ownership'].map(mapping['Home Ownership'])
dt['Years in current job'] = dt['Years in current job'].map(mapping['Years in current job'])
# One-hot encode Purpose with get_dummies()
# pd.get_dummies(dataframe, columns=['column to encode'])
dt = pd.get_dummies(dt, columns=['Purpose'])
# Find the new columns created by one-hot encoding and cast bool -> int
data = pd.read_csv('data.csv')  # re-read the raw data to diff against the original columns
for i in dt.columns:
    if i not in data.columns:
        dt[i] = dt[i].astype(int)
# 0-1 mapping for Term
dt['Term'] = dt['Term'].map(mapping['Term'])
dt.rename(columns={'Term': 'Long Term'}, inplace=True)  # rename the column accordingly
# Inspect the encoded data
dt.head()
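As a side note, recent pandas versions offer a one-step alternative to the bool-to-int loop above: get_dummies accepts a dtype argument, so the dummies come out as 0/1 integers directly and there is no need to re-read the CSV and diff column sets.
# Equivalent in one step: integer dummies straight from get_dummies.
dt = pd.get_dummies(dt, columns=['Purpose'], dtype=int)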
At this point every feature in dt is of int or float type. Check the missing-value counts:
data.isnull().sum()  # counted on the raw frame, so Purpose and Term appear under their original names; counts for the shared columns match dt
Output:
Id 0
Home Ownership 0
Annual Income 1557
Years in current job 371
Tax Liens 0
Number of Open Accounts 0
Years of Credit History 0
Maximum Open Credit 0
Number of Credit Problems 0
Months since last delinquent 4081
Bankruptcies 14
Purpose 0
Term 0
Current Loan Amount 0
Current Credit Balance 0
Monthly Debt 0
Credit Score 1557
Credit Default 0
dtype: int64
Note: in general, fill missing values first and one-hot encode afterwards. If you encode first, a row whose category is missing becomes all zeros across the dummy columns, and mode-filling each dummy column (whose mode is 0) leaves it all zeros. Purpose, the only feature that needs one-hot encoding here, has no missing values, so encoding before filling is harmless in this case. A toy demonstration follows.
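A tiny demonstration of the issue on hypothetical toy data: pd.get_dummies silently turns a missing value into an all-zero row, which mode filling can no longer repair.
import pandas as pd
import numpy as np

s = pd.Series(['a', 'b', np.nan])
print(pd.get_dummies(s, dtype=int))
#    a  b
# 0  1  0
# 1  0  1
# 2  0  0   <- the NaN row encodes as all zeros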
# Fill missing values
for i in dt.columns:
    if dt[i].isnull().sum() > 0:
        mode = dt[i].mode()[0]  # mode of the column
        # median = dt[i].median()  # median, as an alternative
        dt[i] = dt[i].fillna(mode)  # fill the missing entries
Outliers are usually left alone, or handled via a controlled comparison that tries the pipeline both with and without the treatment; either way, describe this in the paper as part of the workload.
Plot figures as needed for the descriptive-statistics section.
First comes the data split. If you plan to tune hyperparameters you must split twice, yielding a validation set in addition to the test set.
No tuning here, so a single split suffices (a two-stage sketch follows for reference).
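For reference, a sketch of the two-stage split you would use when tuning. The inner 0.25 fraction and stratify=y are my illustrative choices, and X, y are as defined in the next block; this sketch is not part of the pipeline actually run below.
from sklearn.model_selection import train_test_split

# First carve off the test set, then split a validation set out of the rest;
# stratify keeps the class ratio stable across splits.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval)
# 0.25 of the remaining 80% = 20% of the full data for validation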
# Split into training and test sets
from sklearn.model_selection import train_test_split
X = dt.drop(['Credit Default'], axis=1)  # features; axis=1 drops a column
y = dt['Credit Default']  # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 20% held out as the test set; fixed seed 42 so the split is reproducible
# Shapes of the training and test sets
print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")
Output:
Training set shape: (6000, 31), test set shape: (1500, 31)
The classic three lines: instantiate, fit, predict.
from sklearn.svm import SVC  # support vector machine classifier
from sklearn.neighbors import KNeighborsClassifier  # k-nearest-neighbors classifier
from sklearn.linear_model import LogisticRegression  # logistic regression classifier
import xgboost as xgb  # XGBoost classifier
import lightgbm as lgb  # LightGBM classifier
from sklearn.ensemble import RandomForestClassifier  # random forest classifier
from catboost import CatBoostClassifier  # CatBoost classifier
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier
from sklearn.naive_bayes import GaussianNB  # Gaussian naive Bayes classifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score  # scalar metrics for classifier evaluation
from sklearn.metrics import classification_report, confusion_matrix  # classification report and confusion matrix
import warnings  # for suppressing warning output
warnings.filterwarnings("ignore")  # ignore all warnings
# SVM
# 1. Instantiate the model
svm_model = SVC(random_state=42)  # fixed seed random_state=42: same seed, same results on every run
# 2. Train the model (on the training set)
svm_model.fit(X_train, y_train)
# 3. Predict (on the test set)
svm_pred = svm_model.predict(X_test)
print('\nSVM classification report:')
print(classification_report(y_test, svm_pred))
print('SVM confusion matrix:')
print(confusion_matrix(y_test, svm_pred))
# Compute the SVM metrics; by default these score the positive class only
svm_accuracy = accuracy_score(y_test, svm_pred)
svm_precision = precision_score(y_test, svm_pred)
svm_recall = recall_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred)
print("\nSVM evaluation metrics:")
print(f"Accuracy: {svm_accuracy:.4f}")
print(f"Precision: {svm_precision:.4f}")
print(f"Recall: {svm_recall:.4f}")
print(f"F1 score: {svm_f1:.4f}")
Output:
SVM classification report:
precision recall f1-score support
0 0.71 1.00 0.83 1059
1 0.00 0.00 0.00 441
accuracy 0.71 1500
macro avg 0.35 0.50 0.41 1500
weighted avg 0.50 0.71 0.58 1500
SVM confusion matrix:
[[1059 0]
[ 441 0]]
SVM evaluation metrics:
Accuracy: 0.7060
Precision: 0.0000
Recall: 0.0000
F1 score: 0.0000
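The confusion matrix shows the SVM predicting class 0 for every sample, a common failure mode when an RBF-kernel SVM sees unscaled features of wildly different magnitudes (compare Annual Income with the 0/1 dummies). A hedged remedy, not run in this note: wrap SVC in a standardization pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize the features before the kernel sees them.
svm_scaled = make_pipeline(StandardScaler(), SVC(random_state=42))
svm_scaled.fit(X_train, y_train)
svm_scaled_pred = svm_scaled.predict(X_test)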
# KNN
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))
print("KNN confusion matrix:")
print(confusion_matrix(y_test, knn_pred))
knn_accuracy = accuracy_score(y_test, knn_pred)
knn_precision = precision_score(y_test, knn_pred)
knn_recall = recall_score(y_test, knn_pred)
knn_f1 = f1_score(y_test, knn_pred)
print("\nKNN evaluation metrics:")
print(f"Accuracy: {knn_accuracy:.4f}")
print(f"Precision: {knn_precision:.4f}")
print(f"Recall: {knn_recall:.4f}")
print(f"F1 score: {knn_f1:.4f}")
Output:
KNN classification report:
precision recall f1-score support
0 0.73 0.86 0.79 1059
1 0.41 0.24 0.30 441
accuracy 0.68 1500
macro avg 0.57 0.55 0.54 1500
weighted avg 0.64 0.68 0.65 1500
KNN confusion matrix:
[[908 151]
[336 105]]
KNN evaluation metrics:
Accuracy: 0.6753
Precision: 0.4102
Recall: 0.2381
F1 score: 0.3013
# Logistic regression
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
print("\nLogistic regression classification report:")
print(classification_report(y_test, logreg_pred))
print("Logistic regression confusion matrix:")
print(confusion_matrix(y_test, logreg_pred))
# Metrics
logreg_accuracy = accuracy_score(y_test, logreg_pred)
logreg_precision = precision_score(y_test, logreg_pred)
logreg_recall = recall_score(y_test, logreg_pred)
logreg_f1 = f1_score(y_test, logreg_pred)
print('\nLogistic regression evaluation metrics:')
print(f'Accuracy: {logreg_accuracy:.4f}')
print(f'Precision: {logreg_precision:.4f}')
print(f'Recall: {logreg_recall:.4f}')
print(f'F1 score: {logreg_f1:.4f}')
Output:
Logistic regression classification report:
precision recall f1-score support
0 0.75 0.99 0.85 1059
1 0.86 0.20 0.33 441
accuracy 0.76 1500
macro avg 0.80 0.59 0.59 1500
weighted avg 0.78 0.76 0.70 1500
Logistic regression confusion matrix:
[[1044 15]
[ 351 90]]
Logistic regression evaluation metrics:
Accuracy: 0.7560
Precision: 0.8571
Recall: 0.2041
F1 score: 0.3297
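Logistic regression scores high precision but very low recall on class 1, so most defaulters are missed. A hedged sketch of one common lever, lowering the decision threshold via predict_proba; the 0.3 cut-off is an arbitrary illustration, not a tuned value.
from sklearn.metrics import classification_report

# Probability scores for the positive class, then a custom cut-off instead of 0.5.
proba = logreg_model.predict_proba(X_test)[:, 1]
logreg_pred_030 = (proba >= 0.3).astype(int)  # trades precision for recall
print(classification_report(y_test, logreg_pred_030))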
# Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_pred = nb_model.predict(X_test)
print("\nNaive Bayes classification report:")
print(classification_report(y_test, nb_pred))
print("Naive Bayes confusion matrix:")
print(confusion_matrix(y_test, nb_pred))
nb_accuracy = accuracy_score(y_test, nb_pred)
nb_precision = precision_score(y_test, nb_pred)
nb_recall = recall_score(y_test, nb_pred)
nb_f1 = f1_score(y_test, nb_pred)
print("\nNaive Bayes evaluation metrics:")
print(f"Accuracy: {nb_accuracy:.4f}")
print(f"Precision: {nb_precision:.4f}")
print(f"Recall: {nb_recall:.4f}")
print(f"F1 score: {nb_f1:.4f}")
Output:
Naive Bayes classification report:
precision recall f1-score support
0 0.98 0.19 0.32 1059
1 0.34 0.99 0.50 441
accuracy 0.43 1500
macro avg 0.66 0.59 0.41 1500
weighted avg 0.79 0.43 0.38 1500
Naive Bayes confusion matrix:
[[204 855]
[ 5 436]]
Naive Bayes evaluation metrics:
Accuracy: 0.4267
Precision: 0.3377
Recall: 0.9887
F1 score: 0.5035
# Decision tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
dt_pred = dt_model.predict(X_test)
print("\nDecision tree classification report:")
print(classification_report(y_test, dt_pred))
print("Decision tree confusion matrix:")
print(confusion_matrix(y_test, dt_pred))
dt_accuracy = accuracy_score(y_test, dt_pred)
dt_precision = precision_score(y_test, dt_pred)
dt_recall = recall_score(y_test, dt_pred)
dt_f1 = f1_score(y_test, dt_pred)
print("\nDecision tree evaluation metrics:")
print(f"Accuracy: {dt_accuracy:.4f}")
print(f"Precision: {dt_precision:.4f}")
print(f"Recall: {dt_recall:.4f}")
print(f"F1 score: {dt_f1:.4f}")
Output:
Decision tree classification report:
precision recall f1-score support
0 0.79 0.75 0.77 1059
1 0.46 0.51 0.48 441
accuracy 0.68 1500
macro avg 0.62 0.63 0.62 1500
weighted avg 0.69 0.68 0.68 1500
Decision tree confusion matrix:
[[791 268]
[216 225]]
Decision tree evaluation metrics:
Accuracy: 0.6773
Precision: 0.4564
Recall: 0.5102
F1 score: 0.4818
# Random forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("\nRandom forest classification report:")
print(classification_report(y_test, rf_pred))
print("Random forest confusion matrix:")
print(confusion_matrix(y_test, rf_pred))
rf_accuracy = accuracy_score(y_test, rf_pred)
rf_precision = precision_score(y_test, rf_pred)
rf_recall = recall_score(y_test, rf_pred)
rf_f1 = f1_score(y_test, rf_pred)
print("\nRandom forest evaluation metrics:")
print(f"Accuracy: {rf_accuracy:.4f}")
print(f"Precision: {rf_precision:.4f}")
print(f"Recall: {rf_recall:.4f}")
print(f"F1 score: {rf_f1:.4f}")
Output:
Random forest classification report:
precision recall f1-score support
0 0.77 0.97 0.86 1059
1 0.79 0.30 0.43 441
accuracy 0.77 1500
macro avg 0.78 0.63 0.64 1500
weighted avg 0.77 0.77 0.73 1500
Random forest confusion matrix:
[[1023 36]
[ 309 132]]
Random forest evaluation metrics:
Accuracy: 0.7700
Precision: 0.7857
Recall: 0.2993
F1 score: 0.4335
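The random forest has the best accuracy so far but still misses roughly 70% of defaulters, which fits the 1059:441 class imbalance in the test set. A standard counter-measure, sketched here but not run in this note, is class weighting.
from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely proportional to their frequencies.
rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_balanced.fit(X_train, y_train)
rf_balanced_pred = rf_balanced.predict(X_test)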
# XGBoost
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)
print("\nXGBoost classification report:")
print(classification_report(y_test, xgb_pred))
print("XGBoost confusion matrix:")
print(confusion_matrix(y_test, xgb_pred))
xgb_accuracy = accuracy_score(y_test, xgb_pred)
xgb_precision = precision_score(y_test, xgb_pred)
xgb_recall = recall_score(y_test, xgb_pred)
xgb_f1 = f1_score(y_test, xgb_pred)
print("\nXGBoost evaluation metrics:")
print(f"Accuracy: {xgb_accuracy:.4f}")
print(f"Precision: {xgb_precision:.4f}")
print(f"Recall: {xgb_recall:.4f}")
print(f"F1 score: {xgb_f1:.4f}")
Output:
XGBoost classification report:
precision recall f1-score support
0 0.77 0.91 0.84 1059
1 0.62 0.37 0.46 441
accuracy 0.75 1500
macro avg 0.70 0.64 0.65 1500
weighted avg 0.73 0.75 0.72 1500
XGBoost confusion matrix:
[[960 99]
[280 161]]
XGBoost evaluation metrics:
Accuracy: 0.7473
Precision: 0.6192
Recall: 0.3651
F1 score: 0.4593
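XGBoost's analogue of class weighting is scale_pos_weight, commonly set near the negative-to-positive ratio (about 2.4 on this data). A sketch under that assumption, not run in this note:
import xgboost as xgb

# Ratio of negative to positive samples in the training labels.
ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb_weighted = xgb.XGBClassifier(scale_pos_weight=ratio, random_state=42)
xgb_weighted.fit(X_train, y_train)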
# LightGBM
lgb_model = lgb.LGBMClassifier(random_state=42)
lgb_model.fit(X_train, y_train)
lgb_pred = lgb_model.predict(X_test)
print("\nLightGBM classification report:")
print(classification_report(y_test, lgb_pred))
print("LightGBM confusion matrix:")
print(confusion_matrix(y_test, lgb_pred))
lgb_accuracy = accuracy_score(y_test, lgb_pred)
lgb_precision = precision_score(y_test, lgb_pred)
lgb_recall = recall_score(y_test, lgb_pred)
lgb_f1 = f1_score(y_test, lgb_pred)
print("\nLightGBM evaluation metrics:")
print(f"Accuracy: {lgb_accuracy:.4f}")
print(f"Precision: {lgb_precision:.4f}")
print(f"Recall: {lgb_recall:.4f}")
print(f"F1 score: {lgb_f1:.4f}")
@浙大疏锦行