时间:2018年10月9日
本次的分析数据来自Kaggle数据竞赛平台的“give me some credit”竞赛项目。下载地址为:https://www.kaggle.com/c/GiveMeSomeCredit/data
本次分析主要做了两件事:
一、用逻辑回归预测用户未来两年违约的概率
二、根据违约概率制作评分卡
分析步骤:
1.变量的描述
2.数据预处理
3.变量分析
3.模型建立
4.模型评估
5.制作评分卡
SeriousDlqin2yrs:超过90天或更糟的逾期拖欠
RevolvingUtilizationOfpercentage :无担保放款的循环利用:除了不动产和像车贷那样除以信用额度总和的无分期付款债务的信用卡和个人信用额度总额
UnsecuredLines age :借款人当时的年龄
NumberOfTime30-59DaysPastDueNotWorse :35-59天逾期但不糟糕次数
DebtRatio percentage:负债比率
MonthlyIncome real:月收入
NumberOf OpenCreditLinesAndLoans:开放式信贷和贷款数量,开放式贷款(分期付款如汽车贷款或抵押贷款)和信贷(如信用卡)的数量
NumberOfTimes90DaysLate:90天逾期次数:借款者有90天或更高逾期的次数
NumberRealEstateLoans :不动产贷款或额度数量:抵押贷款和不动产放款包括房屋净值信贷额度
OrLinesNumberOfTime60-89DaysPastDueNotWorse :60-89天逾期但不糟糕次数:借款人在在过去两年内有60-89天逾期还款但不糟糕的次数
NumberOfDependents:家属数量:不包括本人在内的家属数量
cstraining<-read.csv('D:\\A\\score-card\\cs-training.csv')
View(cstraining)
X
变量cstraining$X<-NULL
colnames(cstraining)<-c("y" ,"RUOUL","age","D30Past","DR","Income","OCAL" ,"D90Late" ,"RealEstate" ,"D60Past","Dependents" )
names(train)
> prop.table(table(cstraining$y))
0 1
0.93316 0.06684
违约客户样本量只占了6.7%,样本不均衡。这里,我采用的方法是把违约客户重复放进样本里:
bad<-cstraining[which(cstraining$y==1),]
all<-rbind(cstraining,bad)
> prop.table(table(all$y))
0 1
0.8746954 0.1253046
此时,违约客户占比达到了12.5%,在可接受范围内。
dat1<-all
library(mice)
md.pattern(dat1)
由图可以看出字段的缺失值个数Income
:31400 , Dependents
:4103
查看缺失比例:
library(VIM)
aggr_plot <- aggr(all, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(dat1), cex.axis=.7, gap=3, ylab=c("Histogram of missing all","Pattern"))
Variables sorted by number of missings:
Variable Count
Income 0.19621811
Dependents 0.02563958
y 0.00000000
RUOUL 0.00000000
age 0.00000000
D30.59Past 0.00000000
DR 0.00000000
OCAL 0.00000000
D90Late 0.00000000
RealEstate 0.00000000
D60.89Past 0.00000000
##盖帽法函数
block<-function(x,lower=T,upper=T){
if(lower){
q1<-quantile(x,0.01)
x[x<=q1]<-q1
}
if(upper){
q99<-quantile(x,0.99)
x[x>q99]<-q99
}
return(x)
}
(i). 对变量RUOUL
进行处理
dat1$RUOUL<-block(dat1$RUOUL)
boxplot(RUOUL~y,data=dat1,horizontal=T, frame=F,
col="lightgray",main="RUOUL")
> summary(dat1$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 41.0 51.0 51.9 62.0 109.0
> which(dat1$age==0)
[1] 65696
> dat1<-dat1[-which(dat1$age==0),]
年龄为0的肯定是异常值,而且只有一个数据是0,我采用的是直接删除这个记录。
boxplot(age~y,data=dat1,horizontal=T, frame=F,
col="lightgray",main="age")
QLa <- quantile(dat1$age, probs = 0.25)
QUa <- quantile(dat1$age, probs = 0.75)
QUa_QLa <- QUa-QLa
QLa;QUa;QUa_QLa
dat1$age[which(dat1$age> (QUa+1.5*QUa_QLa))]<-1.5*QUa_QLa
boxplot(dat1$age,col="lightgray",main="age")
(iii). 对变量D30Past
进行处理
> table(dat1$D30Past )
0 1 2 3 4 5 6 7 8 9
131059 18441 5817 2372 1065 496 214 82 33 16
10 11 12 13 96 98
7 2 3 2 9 407
> table(dat1$D90Late )
0 1 2 3 4 5 6 7 8 9
148215 7008 2331 1052 486 214 128 69 36 33
10 11 12 13 14 15 17 96 98
13 8 3 6 3 2 2 9 407
> table(dat1$D60Past )
0 1 2 3 4 5 6 7 8 9
149651 7508 1679 498 170 55 28 14 3 1
11 96 98
2 9 407
显然值为96,98 的是异常值。我采用的方法是直接删除。而且D30Past
这个变量的96,98的记录删除之后,后面D90Late
D60Past
两个变量的96,98的记录也同时被删除了。
> dat1<-dat1[-which(dat1$D30Past %in% c(96,98)),]
> table(dat1$D30Past )
0 1 2 3 4 5 6 7 8 9 10 11
131059 18441 5817 2372 1065 496 214 82 33 16 7 2
12 13
3 2
> table(dat1$D90Late )
0 1 2 3 4 5 6 7 8 9 10 11
148215 7008 2331 1052 486 214 128 69 36 33 13 8
12 13 14 15 17
3 6 3 2 2
> table(dat1$D60Past )
0 1 2 3 4 5 6 7 8 9 11
149651 7508 1679 498 170 55 28 14 3 1 2
盖帽法处理异常值
dat1$D30.59Past<-block(dat1$D30.59Past)
> table(dat1$D30.59Past )
0 1 2 3 4
131059 18441 5817 2372 1920
(iv).对变量 "DR"进行处理
> summary(dat1$DR)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.2 0.4 350.3 0.9 329664.0
盖帽法处理
dat1$DR<-block(dat1$DR)
(v). 对变量Income
进行处理
因为已经做过多次模型对比,对这个变量分别采用KNN填补和均值填补,发现采用均值填补,模型效果更好,因此我采用的是用均值填补缺失值,然而均值容易受到极端值的影响,所以先对少数过于极端的异常值作处理,再填补缺失值
存在大量0,1,2这样的异常值,当作缺失值处理
dat1$Income[which(dat1$Income %in% c(0,1,2))]<-NA
boxplot(dat1$Income,col="lightgray",main="Income",range =1.5)
红色框里的值对应的记录直接删除,调range参数,找出想删除的值
boxplot(dat1$Income,col="lightgray",main="Income",range =100)
> sort(boxplot.stats(dat1$Income,coef =100)$out)
[1] 562466 582369 629000 649587 699530 702500 730483 835040
[9] 1072500 1560100 1794060 3008750`
dat1<-dat1[-which(dat1$Income>=562466),]
均值填补缺失值
dat1$Income[is.na(dat1$Income)]<-mean(dat1$Income,na.rm=TRUE)
盖帽法处理异常值
dat1$Income<-block(dat1$Income)
> summary(dat1$Income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
820 4000 6630 6441 7300 22942
(vi).对变量OCAL
进行处理
> table(dat1$OCAL)
0 1 2 3 4 5 6 7 8 9 10 11
1959 5020 7275 9739 12352 13750 14367 13991 13166 12018 10197 8790
12 13 14 15 16 17 18 19 20 21 22 23
7445 5976 4846 3926 3206 2527 2007 1516 1266 924 731 575
24 25 26 27 28 29 30 31 32 33 34 35
447 354 261 209 157 122 98 78 54 51 36 28
36 37 38 39 40 41 42 43 44 45 46 47
21 8 13 10 11 5 8 9 2 9 3 2
48 49 50 51 52 53 54 56 57 58
8 5 2 2 4 1 4 2 3 1
> quantile(dat1$OCA,0.99)
99%
24
> quantile(dat1$OCA,0.01)
1%
0
采用盖帽法处理
> dat1$OCAL<-block(dat1$OCAL)
> table(dat1$OCAL)
0 1 2 3 4 5 6 7 8 9 10 11
1959 5020 7275 9739 12352 13750 14367 13991 13166 12018 10197 8790
12 13 14 15 16 17 18 19 20 21 22 23
7445 5976 4846 3926 3206 2527 2007 1516 1266 924 731 575
24
2028
(vii).对变量D90Late
进行处理
> prop.table(table(dat1$D90Late))
0 1 2 3 4
0.92860768060 0.04391059982 0.01460553770 0.00659160260 0.00304517002
5 6 7 8 9
0.00134087733 0.00080202009 0.00043233895 0.00022556815 0.00020677080
10 11 12 13 14
0.00008145517 0.00005012626 0.00001879735 0.00003759469 0.00001879735
15 17
0.00001253156 0.00001253156
单一指标占比大于90%的变量不适合进入模型,然而这个变量又比较重要,所以我把大于0次的都统一为1,然后0代表没有过超过90天逾期记录,1代表有超过90天逾期的记录。
dat1$D90Late<-as.numeric(as.logical(dat1$D90Late))
> prop.table(table(dat1$D90Late))
0 1
0.92860768 0.07139232
(viii).对变量RealEstate
进行处理
> table(dat1$RealEstate)
0 1 2 3 4 5 6 7 8 9 10 11
60439 55083 33284 6722 2362 772 368 202 116 92 42 25
12 13 14 15 16 17 18 19 20 21 23 25
24 20 7 10 5 5 2 2 3 1 2 4
26 29 32 54
1 2 1 1
> quantile(dat1$RealEstate,0.99)
99%
5
> quantile(dat1$RealEstate,0.01)
1%
0
采用盖帽法处理
> dat1$RealEstate<-block(dat1$RealEstate)
> table(dat1$RealEstate)
0 1 2 3 4 5
60439 55083 33284 6722 2362 1707
(ix).对变量D60Past
进行处理
> prop.table(table(dat1$D60Past))
0 1 2 3 4
0.937611609241 0.047037225010 0.010520247874 0.003120359405 0.001065182930
5 6 7 8 9
0.000344618007 0.000175441894 0.000087720947 0.000018797346 0.000006265782
11
0.000012531564
处理方法和理由与变量D90Late
一样
> dat1$D60.89Past<-as.numeric(as.logical(dat1$D60Past))
> prop.table(table(dat1$D60Past))
0 1
0.93761161 0.06238839
(x).对变量Dependents
进行处理
> summary(dat1$Dependents)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 0.00 0.00 0.77 1.00 20.00 4060
> table(dat1$Dependents)
0 1 2 3 4 5 6 7 8 9 10 13
91702 28210 21070 10310 3156 813 182 56 26 5 5 1
20
1
缺失值用众数0填补
dat1$Dependents[is.na(dat1$Dependents)]<-0
盖帽法处理异常值
dat1$Dependents<-block(dat1$Dependents)
> table(dat1$Dependents)
0 1 2 3 4
95762 28210 21070 10310 4245
y中0和1互换,互换后1是没有违约客户,0是违约客户
dat2<-dat1
dat2$y<-as.numeric(!as.logical(dat2$y))
查看相关性
library(corrplot)
cor1<-cor(dat2)
corrplot(cor1,method="number")
各变量间的相关系数不大,可以不作处理
> prop.table(table(dat2$y))
0 1
0.1237993 0.8762007
library(caret)
set.seed(1234)
splitIndex<-createDataPartition(dat2$y,time=1,p=0.7,list=FALSE)
train<-dat2[splitIndex,]
test<-dat2[-splitIndex,]
> prop.table(table(train$y))
0 1
0.1243041 0.8756959
> prop.table(table(test$y))
0 1
0.1226216 0.8773784
> nrow(train)
[1] 111718
> nrow(test)
[1] 47879
model<-glm(y~.,train,family = "binomial")
summary(model)
所有变量都显著
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.759486590 0.051636436 53.441 < 0.0000000000000002 ***
RUOUL -1.965393815 0.031089068 -63.218 < 0.0000000000000002 ***
age 0.018351654 0.000865627 21.200 < 0.0000000000000002 ***
D30.59Past -0.469131910 0.011938381 -39.296 < 0.0000000000000002 ***
DR 0.000054717 0.000012537 4.365 0.0000127 ***
Income 0.000046362 0.000003391 13.673 < 0.0000000000000002 ***
OCAL -0.039503078 0.002506984 -15.757 < 0.0000000000000002 ***
D90Late -1.598706728 0.029895157 -53.477 < 0.0000000000000002 ***
RealEstate -0.133127866 0.012048378 -11.049 < 0.0000000000000002 ***
D60.89Past -1.076664163 0.032420477 -33.209 < 0.0000000000000002 ***
Dependents -0.046652021 0.009639607 -4.840 0.0000013 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83881 on 111717 degrees of freedom
Residual deviance: 60808 on 111707 degrees of freedom
AIC: 60830
library(car)
> vif(model)
RUOUL age D30.59Past DR Income OCAL D90Late
1.216980 1.152183 1.110489 1.048800 1.220319 1.549682 1.123098
RealEstate D60.89Past Dependents
1.491353 1.081564 1.067081
pre <- predict(object = model, newdata= test, type = 'response')
library(pROC)
modelroc <- roc(test$y,pre)
plot(modelroc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),grid.col=c("green", "red"), max.auc.polygon=TRUE,auc.polygon.col="skyblue", print.thres=TRUE)
模型AUC值为0.855,效果还可以。(注,在此之前已经用过多种方法清洗数据,得到的AUC值分别为:0.858,0.831,0.796。不同的数据清洗,对模型结果影响也是蛮大的。)
定义KS函数:
myKS <- function(pre,label){
true <- sum(label)
false <- length(label)-true
tpr <- NULL
fpr <- NULL
o_pre <- pre[order(pre)] # let the threshold in an order from small to large
for (i in o_pre){
tp <- sum((pre >= i) & label)
tpr <- c(tpr,tp/true)
fp <- sum((pre >= i) & (1-label))
fpr <- c(fpr,fp/false)
}
plot(o_pre,tpr,type = "l",col= "green",xlab="threshold",ylab="tpr,fpr")
lines(o_pre,fpr,type="l", col = "red")
KSvalue <- max(tpr-fpr)
sub = paste("KS value =",KSvalue)
title(sub=sub)
cutpoint <- which(tpr-fpr==KSvalue)
thre <- o_pre[cutpoint]
lines(c(thre,thre),c(fpr[cutpoint],tpr[cutpoint]),col = "blue")
cat("KS-value:",KSvalue)
}
myKS(pre,test$y)
计算WOE的函数
totalbad = as.numeric(table(train$y))[1]
totalgood = as.numeric(table(train$y))[2]
getWOE <- function(a,p,q)
{
Bad <- as.numeric(table(train$y[a > p & a <= q]))[1]
Good <- as.numeric(table(train$y[a > p & a <= q]))[2]
WOE <- log((Good/totalgood)/(Bad/totalbad),base = exp(1))
return(WOE)
}
library(smbinning)
1.RUOUL
变量
train$RUOUL<-round(train$RUOUL,3)#保留3位小数
RUOULresult=smbinning(df=train,y="y",x="RUOUL",p=0.05)
> RUOULresult$iv
[1] 1.1077
查看分箱情况
RUOULresult$ivtable
smbinning.plot(RUOULresult,option="WoE",sub="RUOUL")
tmp.RUOUL <- 0
for(i in 1:nrow(train)) {
if(train$RUOUL[i] <= 0.114)
tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][1]
else if(train$RUOUL[i] <= 0.227)
tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][2]
else if(train$RUOUL[i] <= 0.3)
tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][3]
else if(train$RUOUL[i] <= 0.389)
tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][4]
else if(train$RUOUL[i] <= 0.493)
tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][5]
else if(train$RUOUL[i] <= 0.6791)
tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][6]
else if(train$RUOUL[i] <= 0.902)
tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][7]
else
tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][8]
}
> table(tmp.RUOUL)
tmp.RUOUL
-1.3896 -0.8822 -0.3639 0.0528 0.2664 0.6058 0.7948 1.3609
16901 8916 8219 5663 5638 5662 12631 48088
train_WOE<-train
train_WOE$w_RUOUL<-tmp.RUOUL
2.对age
变量进行分箱
ageresult=smbinning(df=train,y="y",x="age",p=0.05)
smbinning.plot(ageresult,option="WoE",sub="age")
> ageresult$iv
[1] 0.2582
tmp.age <- 0
for(i in 1:nrow(train)) {
if(train$age[i] <= 35)
tmp.age[i] <- ageresult$ivtable[[13]][1]
else if(train$age[i] <= 43)
tmp.age[i] <- ageresult$ivtable[[13]][2]
else if(train$age[i] <= 49)
tmp.age[i] <- ageresult$ivtable[[13]][3]
else if(train$age[i] <= 52)
tmp.age[i] <- ageresult$ivtable[[13]][4]
else if(train$age[i] <= 55)
tmp.age[i] <- ageresult$ivtable[[13]][5]
else if(train$age[i] <= 59)
tmp.age[i] <- ageresult$ivtable[[13]][6]
else if(train$age[i] <= 63)
tmp.age[i] <- ageresult$ivtable[[13]][7]
else if(train$age[i] <= 67)
tmp.age[i] <- ageresult$ivtable[[13]][8]
else
tmp.age[i] <- ageresult$ivtable[[13]][9]
}
train_WOE$w_age<-tmp.age
3.对D30Past
变量进行分箱
D30Pastresult=smbinning(df=train,y="y",x="D30Past",p=0.05)
smbinning.plot(D30Pastresult,option="WoE",sub="D30Past")
tmp.D30Past <- 0
for(i in 1:nrow(train)) {
if(train$D30Past[i] <= 0)
tmp.D30Past[i] <- D30Pastresult$ivtable[[13]][1]
else if(train$D30Past[i] <= 1)
tmp.D30Past[i] <- D30Pastresult$ivtable[[13]][2]
else
tmp.D30Past[i] <-D30Pastresult$ivtable[[13]][3]
}
table(tmp.D30Past)
train_WOE$w_D30Past<-tmp.D30Past
4对DR
变量进行分箱
train$DR<-round(train$DR,3)#保留3位小数
DRresult=smbinning(df=train,y="y",x="DR",p=0.05)
smbinning.plot(DRresult,option="WoE",sub="DR")
> getWOE(train$DR,-Inf,0.137)
[1] 0.1509092
第一第二个箱合并结果并没有很好,合并后的WOE值与第三个箱的WOE值几乎相等。尝试一下第二第三个箱合并,看一下结果
> getWOE(train$DR,0.016,0.381)
[1] 0.1043971
第二第三个箱合并后,woe值还不错
手动分箱提取woe:
woe.DR_1=getWOE(train$DR,-Inf,0.016)
woe.DR_2=getWOE(train$DR,0.016,0.381)
woe.DR_3=getWOE(train$DR,0.381,0.503)
woe.DR_4=getWOE(train$DR,0.503,0.666)
woe.DR_5=getWOE(train$DR,0.666,3.972)
woe.DR_6=getWOE(train$DR,3.972,Inf)
DR.WOE<-c(woe.DR_1,woe.DR_2,woe.DR_3,woe.DR_4,woe.DR_5,woe.DR_6)
DR.WOE<-round(DR.WOE,3)
画woe代码:
x<- barplot(DR.WOE,xlab="preference",ylab="frequency",main='Weight of Evidence')
lbls<-paste(" ",DR.WOE)
text(x,DR.WOE,labels=lbls,cex=1.5,pos=1)
tmp.DR <- 0
for(i in 1:nrow(train)) {
if(train$DR[i] <= 0.016)
tmp.DR[i] <- woe.DR_1
else if(train$DR[i] <= 0.381)
tmp.DR[i] <- woe.DR_2
else if(train$DR[i] <= 0.503)
tmp.DR[i] <- woe.DR_3
else if(train$DR[i] <= 0.666)
tmp.DR[i] <- woe.DR_4
else if(train$DR[i] <= 3.972)
tmp.DR[i] <- woe.DR_5
else
tmp.DR[i] <- woe.DR_6
}
train_WOE$w_DR<-tmp.DR
5.对Income
变量进行分箱
Incomeresult=smbinning(df=train,y="y",x="Income",p=0.05)
smbinning.plot(Incomeresult,option="WoE",sub="Income")
> getWOE(train$Income,-Inf,3332)
[1] -0.396827
> getWOE(train$Income,9830,Inf)
[1] 0.4653303
合并结果还不错,选择合并
提取woe:
woe.Income_1=getWOE(train$Income,-Inf,3332)
woe.Income_2=getWOE(train$Income,3332,4838)
woe.Income_3=getWOE(train$Income,4838,6620)
woe.Income_4=getWOE(train$Income,6620,9830)
woe.Income_5=getWOE(train$Income,9830,Inf)
Income.WOE=c(woe.Income_1,woe.Income_2,woe.Income_3,woe.Income_4,woe.Income_5)
画woe:
Income.WOE<-round(Income.WOE,3)
x<- barplot(Income.WOE,xlab="preference",ylab="frequency",main='Weight of Evidence')
lbls<-paste(" ",Income.WOE)
text(x,Income.WOE,labels=lbls,cex=1.5,pos=1)
tmp.Income <- 0
for(i in 1:nrow(train)) {
if(train$Income[i] <= 3332)
tmp.Income[i] <- woe.Income_1
else if(train$Income[i] <= 4838)
tmp.Income[i] <- woe.Income_2
else if(train$Income[i] <= 6620)
tmp.Income[i] <- woe.Income_3
else if(train$Income[i] <= 9830)
tmp.Income[i] <- woe.Income_4
else
tmp.Income[i] <- woe.Income_5
}
train_WOE$w_Income<-tmp.Income
6.对OCAL
变量进行分箱
OCALresult=smbinning(df=train,y="y",x="OCAL",p=0.05)
smbinning.plot(OCALresult,option="WoE",sub="OCAL")
> OCALresult$iv
[1] 0.0709
tmp.OCAL <- 0
for(i in 1:nrow(train)) {
if(train$OCAL[i] <= 2)
tmp.OCAL[i] <- OCALresult$ivtable[[13]][1]
else if(train$OCAL[i] <= 3)
tmp.OCAL[i] <- OCALresult$ivtable[[13]][2]
else if(train$OCAL[i] <= 5)
tmp.OCAL[i] <- OCALresult$ivtable[[13]][3]
else if(train$OCAL[i] <= 7)
tmp.OCAL[i] <- OCALresult$ivtable[[13]][4]
else if(train$OCAL[i] <= 8)
tmp.OCAL[i] <- OCALresult$ivtable[[13]][5]
else if(train$OCAL[i] <= 13)
tmp.OCAL[i] <- OCALresult$ivtable[[13]][6]
else
tmp.OCAL[i] <- OCALresult$ivtable[[13]][7]
}
train_WOE$w_OCAL<-tmp.OCAL
7.对D90Late
变量进行分箱
> table(train$D90Late)
0 1
103607 8111
手动分箱
woe.D90Late_1=getWOE(train$D90Late,-Inf,0)
woe.D90Late_2=getWOE(train$D90Late,0,Inf)
> woe.D90Late_1
[1] 0.3793484
> woe.D90Late_2
[1] -2.279932
#画WOE 代码:
D90Late.WOE<-c(woe.D90Late_1,woe.D90Late_2)
D90Late.WOE<-round(D90Late.WOE,3)
x<- barplot(D90Late.WOE,xlab="preference",ylab="frequency",main = 'D90Late')
lbls<-paste(" ",D90Late.WOE)
text(x,D90Late.WOE,labels=lbls,cex=1.5,pos=1)
tmp.D90Late<- 0
for(i in 1:nrow(train)) {
if(train$D90Late[i] <= 0)
tmp.D90Late[i] <- woe.D90Late_1
else
tmp.D90Late[i] <- woe.D90Late_2
}
train_WOE$w_D90Late<-tmp.D90Late
8.对RealEstate
变量进行分箱
RealEstateresult=smbinning(df=train,y="y",x="RealEstate",p=0.05)
smbinning.plot(RealEstateresult,option="WoE",sub="RealEstate")
> RealEstateresult$iv
[1] 0.0491
tmp.RealEstate <- 0
for(i in 1:nrow(train)) {
if(train$RealEstate[i] <= 0)
tmp.RealEstate[i] <- RealEstateresult$ivtable[[13]][1]
else if(train$RealEstate[i] <= 1)
tmp.RealEstate[i] <- RealEstateresult$ivtable[[13]][2]
else if(train$RealEstate[i] <= 2)
tmp.RealEstate[i] <- RealEstateresult$ivtable[[13]][3]
else
tmp.RealEstate[i] <- RealEstateresult$ivtable[[13]][4]
}
train_WOE$w_RealEstate<-tmp.RealEstate
9.对D60Past
变量进行分箱
> table(train$D60Past)
0 1
104720 6998
woe.D60Past_1 <- getWOE(train$D60Past,-Inf,0)
woe.D60Past_2 <- getWOE(train$D60Past,0,Inf)
> woe.D60Past_1
[1] 0.275222
> woe.D60Past_2
[1] -2.065585
tmp.D60Past<- 0
for(i in 1:nrow(train)) {
if(train$D60Past[i] <= 0)
tmp.D60Past[i] <- woe.D60Past_1
else
tmp.D60Past[i] <- woe.D60Past_2
}
train_WOE$w_D60Past<-tmp.D60Past
10.对Dependents
变量进行分箱:
Dependentsresult=smbinning(df=train,y="y",x="Dependents",p=0.05)
smbinning.plot(Dependentsresult,option="WoE",sub="Dependents")
> Dependentsresult$iv
[1] 0.0391
tmp.Dependents <- 0
for(i in 1:nrow(train)) {
if(train$Dependents[i] <= 0)
tmp.Dependents[i] <- Dependentsresult$ivtable[[13]][1]
else if(train$Dependents[i] <= 1)
tmp.Dependents[i] <- Dependentsresult$ivtable[[13]][2]
else if(train$Dependents[i] <= 2)
tmp.Dependents[i] <- Dependentsresult$ivtable[[13]][3]
else
tmp.Dependents[i] <- Dependentsresult$ivtable[[13]][4]
}
train_WOE$w_Dependents<-tmp.Dependents
train.woe<-train_WOE[,c(1,12:21)]
View(train.woe)
1、信用评分卡的评分公式为:
2、标准odds : 标准评分对应的odds
3、ln(odds)公式为:
ln(odds)=ln(p/(1-p))=β_0+β_1 x_1+β_2 x_2+β_3 x_3+⋯+β_n x_n
4、PDO:odds翻倍所需增加的分值
建模预测结果为“不发生违约的概率”,log(odds)即表示为“好坏比”。
由于逻辑回归模型的预测结果恰好为ln(odds),因此上式中的ln(odds)可替换为:intercept+∑评分权重*WOE
5、评分需要自己预设一个评分标准,比如:
如果标准评分=600,标准odds=10:1,PDO=40,那么,一个600分的客户所对应的odds就是10:1,一个640分的客户所对应的odds就是20:1,同样,560分则对应的odds就是5:1。
评分标准的设定需根据行业经验不断跟踪调整,下面的分数设定仅代表个人经验。
6、下面开始设立评分,假设按好坏比10为600分,每高40分好坏比翻一倍算出p,q。
即有方程组:
600 = q - p log(10)
600+40 = q - p log(210)
解方程组得:
p = -40/log(2)
q = 600 - 40log(10)/log(2)
把解得的p,q值代入公式:
即可算得每个变量的每个分箱的得分以及每个客户的得分
最终做出的评分卡样式如下(此表是在Excel做的,用的是R语言导出的评分卡CSV文件):
以下制作评分卡的过程也可以用现成的包直接生成,我是为了让自己加深理解,所以采用手动一步一步操作,过程比较繁琐,由于是初学者,代码也没有很简洁,大家就将就着看吧。 囧。。。
以下是制作评分卡的代码:
> glm.fit <-glm(y~., data=train.woe, family = binomial(link=logit))
> summary(glm.fit)
Call:
glm(formula = y ~ ., family = binomial(link = logit), data = train.woe)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.9660 0.2035 0.2762 0.4261 2.5143
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.93612 0.01131 171.218 < 0.0000000000000002 ***
w_RUOUL 0.60571 0.01125 53.838 < 0.0000000000000002 ***
w_age 0.43762 0.02414 18.132 < 0.0000000000000002 ***
w_D30Past 0.55782 0.01237 45.102 < 0.0000000000000002 ***
w_DR 0.87452 0.03992 21.907 < 0.0000000000000002 ***
w_Income 0.10930 0.03866 2.827 0.004693 **
w_OCAL 0.27058 0.04270 6.337 0.000000000235 ***
w_D90Late 0.56345 0.01117 50.423 < 0.0000000000000002 ***
w_RealEstate 0.66499 0.05284 12.585 < 0.0000000000000002 ***
w_D60Past 0.45478 0.01370 33.192 < 0.0000000000000002 ***
w_Dependents 0.21368 0.05549 3.851 0.000118 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Bata系数:
> coe =(glm.fit$coefficients)
> coe
(Intercept) w_RUOUL w_age w_D30Past w_DR w_Income
1.9361236 0.6057147 0.4376180 0.5578200 0.8745198 0.1093012
w_OCAL w_D90Late w_RealEstate w_D60Past w_Dependents
0.2705830 0.5634518 0.6649864 0.4547815 0.2136834
根据方程组:
600 = q - p *log(10)
600+40 = q - p log(210)
可算得p,q分别为:
p <- -40/log(2)
q <- 600 - 40*log(10)/log(2)
attach(train.woe)
根据评分公式:
可算出所有人的分数:
Score=q - p*(as.numeric(coe[1])+as.numeric(coe[2])*w_RUOUL
+as.numeric(coe[3])*w_age
+as.numeric(coe[4])*w_D30Past
+as.numeric(coe[5])*w_DR
+as.numeric(coe[6])*w_Income
+as.numeric(coe[7])*w_OCAL
+as.numeric(coe[8])*w_D90Late
+as.numeric(coe[9])*w_RealEstate
+as.numeric(coe[10])*w_D60Past
+as.numeric(coe[11])*w_Dependents)
> summary(Score)
Min. 1st Qu. Median Mean 3rd Qu. Max.
282.0 574.5 634.8 615.1 672.8 734.1
个人总评分=基础分+各部分得分
基础得分base为:
> base <- q - p*as.numeric(coe[1])
> base
[1] 578.8523
1、对w_RUOUL
变量打分
woe.RUOUL_1 <- RUOULresult$ivtable[[13]][1]
woe.RUOUL_2 <- RUOULresult$ivtable[[13]][2]
woe.RUOUL_3 <- RUOULresult$ivtable[[13]][3]
woe.RUOUL_4 <- RUOULresult$ivtable[[13]][4]
woe.RUOUL_5 <- RUOULresult$ivtable[[13]][5]
woe.RUOUL_6 <- RUOULresult$ivtable[[13]][6]
woe.RUOUL_7 <- RUOULresult$ivtable[[13]][7]
woe.RUOUL_8 <- RUOULresult$ivtable[[13]][8]
RUOUL_1.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_1
RUOUL_2.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_2
RUOUL_3.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_3
RUOUL_4.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_4
RUOUL_5.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_5
RUOUL_6.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_6
RUOUL_7.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_7
RUOUL_8.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_8
RUOUL.SCORE <- c(RUOUL_1.SCORE,RUOUL_2.SCORE,RUOUL_3.SCORE,RUOUL_4.SCORE,RUOUL_5.SCORE,RUOUL_6.SCORE,RUOUL_7.SCORE,RUOUL_8.SCORE)
2、对w_age
变量打分
woe.age_1 <- ageresult$ivtable[[13]][1]
woe.age_2 <- ageresult$ivtable[[13]][2]
woe.age_3 <- ageresult$ivtable[[13]][3]
woe.age_4 <- ageresult$ivtable[[13]][4]
woe.age_5 <- ageresult$ivtable[[13]][5]
woe.age_6 <- ageresult$ivtable[[13]][6]
woe.age_7 <- ageresult$ivtable[[13]][7]
woe.age_8 <- ageresult$ivtable[[13]][8]
woe.age_9 <- ageresult$ivtable[[13]][9]
age_1.SCORE <- -p*as.numeric(coe[3])*woe.age_1
age_2.SCORE <- -p*as.numeric(coe[3])*woe.age_2
age_3.SCORE <- -p*as.numeric(coe[3])*woe.age_3
age_4.SCORE <- -p*as.numeric(coe[3])*woe.age_4
age_5.SCORE <- -p*as.numeric(coe[3])*woe.age_5
age_6.SCORE <- -p*as.numeric(coe[3])*woe.age_6
age_7.SCORE <- -p*as.numeric(coe[3])*woe.age_7
age_8.SCORE <- -p*as.numeric(coe[3])*woe.age_8
age_9.SCORE <- -p*as.numeric(coe[3])*woe.age_9
age.SCORE <- c(age_1.SCORE,age_2.SCORE,age_3.SCORE,age_4.SCORE,age_5.SCORE,age_6.SCORE,age_7.SCORE,age_8.SCORE,age_9.SCORE)
3、对w_D30Past
变量打分
woe.D30Past_1 <- D30Pastresult$ivtable[[13]][1]
woe.D30Past_2 <- D30Pastresult$ivtable[[13]][2]
woe.D30Past_3 <- D30Pastresult$ivtable[[13]][3]
D30Past_1.SCORE <- -p*as.numeric(coe[4])*woe.D30Past_1
D30Past_2.SCORE <- -p*as.numeric(coe[4])*woe.D30Past_2
D30Past_3.SCORE <- -p*as.numeric(coe[4])*woe.D30Past_3
D30Past.SCORE <- c(D30Past_1.SCORE,D30Past_2.SCORE,D30Past_3.SCORE)
4、对w_DR
变量打分
woe.DR_1=getWOE(train$DR,-Inf,0.016)
woe.DR_2=getWOE(train$DR,0.016,0.381)
woe.DR_3=getWOE(train$DR,0.381,0.503)
woe.DR_4=getWOE(train$DR,0.503,0.666)
woe.DR_5=getWOE(train$DR,0.666,3.972)
woe.DR_6=getWOE(train$DR,3.972,Inf)
DR_1.SCORE <- -p*as.numeric(coe[5])*woe.DR_1
DR_2.SCORE <- -p*as.numeric(coe[5])*woe.DR_2
DR_3.SCORE <- -p*as.numeric(coe[5])*woe.DR_3
DR_4.SCORE <- -p*as.numeric(coe[5])*woe.DR_4
DR_5.SCORE <- -p*as.numeric(coe[5])*woe.DR_5
DR_6.SCORE <- -p*as.numeric(coe[5])*woe.DR_6
DR.SCORE <- c(DR_1.SCORE,DR_2.SCORE,DR_3.SCORE,DR_4.SCORE,DR_5.SCORE,DR_6.SCORE)
5、对 w_Income
变量打分
woe.Income_1=getWOE(train$Income,-Inf,3332)
woe.Income_2=getWOE(train$Income,3332,4838)
woe.Income_3=getWOE(train$Income,4838,6620)
woe.Income_4=getWOE(train$Income,6620,9830)
woe.Income_5=getWOE(train$Income,9830,Inf)
Income_1.SCORE <- -p*as.numeric(coe[6])*woe.Income_1
Income_2.SCORE <- -p*as.numeric(coe[6])*woe.Income_2
Income_3.SCORE <- -p*as.numeric(coe[6])*woe.Income_3
Income_4.SCORE <- -p*as.numeric(coe[6])*woe.Income_4
Income_5.SCORE <- -p*as.numeric(coe[6])*woe.Income_5
Income.SCORE <- c(Income_1.SCORE,Income_2.SCORE,Income_3.SCORE,Income_4.SCORE,Income_5.SCORE)
6、对 w_OCAL
变量打分
woe.OCAL_1 <- OCALresult$ivtable[[13]][1]
woe.OCAL_2 <- OCALresult$ivtable[[13]][2]
woe.OCAL_3 <- OCALresult$ivtable[[13]][3]
woe.OCAL_4 <- OCALresult$ivtable[[13]][4]
woe.OCAL_5 <- OCALresult$ivtable[[13]][5]
woe.OCAL_6 <- OCALresult$ivtable[[13]][6]
woe.OCAL_7 <- OCALresult$ivtable[[13]][7]
OCAL_1.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_1
OCAL_2.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_2
OCAL_3.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_3
OCAL_4.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_4
OCAL_5.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_5
OCAL_6.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_6
OCAL_7.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_7
OCAL.SCORE <- c(OCAL_1.SCORE,OCAL_2.SCORE,OCAL_3.SCORE,OCAL_4.SCORE,OCAL_5.SCORE,OCAL_6.SCORE,OCAL_7.SCORE)
7 、对w_D90Late
变量打分
woe.D90Late_1=getWOE(train$D90Late,-Inf,0)
woe.D90Late_2=getWOE(train$D90Late,0,Inf)
D90Late_1.SCORE <- -p*as.numeric(coe[8])*woe.D90Late_1
D90Late_2.SCORE <- -p*as.numeric(coe[8])*woe.D90Late_2
D90Late.SCORE <- c(D90Late_1.SCORE,D90Late_2.SCORE)
8、对w_RealEstate
变量打分
woe.RealEstate_1 <- RealEstateresult$ivtable[[13]][1]
woe.RealEstate_2 <- RealEstateresult$ivtable[[13]][2]
woe.RealEstate_3 <- RealEstateresult$ivtable[[13]][3]
woe.RealEstate_4 <- RealEstateresult$ivtable[[13]][4]
RealEstate_1.SCORE <- -p*as.numeric(coe[9])*woe.RealEstate_1
RealEstate_2.SCORE <- -p*as.numeric(coe[9])*woe.RealEstate_2
RealEstate_3.SCORE <- -p*as.numeric(coe[9])*woe.RealEstate_3
RealEstate_4.SCORE <- -p*as.numeric(coe[9])*woe.RealEstate_4
RealEstate.SCORE <- c(RealEstate_1.SCORE,RealEstate_2.SCORE,RealEstate_3.SCORE,RealEstate_4.SCORE)
9 、对w_D60Past
变量打分
woe.D60Past_1 <- getWOE(train$D60Past,-Inf,0)
woe.D60Past_2 <- getWOE(train$D60Past,0,Inf)
D60Past_1.SCORE <- -p*as.numeric(coe[10])*woe.D60Past_1
D60Past_2.SCORE <- -p*as.numeric(coe[10])*woe.D60Past_2
D60Past.SCORE <- c(D60Past_1.SCORE,D60Past_2.SCORE)
10、对w_Dependents
变量打分
woe.Dependents_1 <- Dependentsresult$ivtable[[13]][1]
woe.Dependents_2 <- Dependentsresult$ivtable[[13]][2]
woe.Dependents_3 <- Dependentsresult$ivtable[[13]][3]
woe.Dependents_4 <- Dependentsresult$ivtable[[13]][4]
Dependents_1.SCORE <- -p*as.numeric(coe[11])*woe.Dependents_1
Dependents_2.SCORE <- -p*as.numeric(coe[11])*woe.Dependents_2
Dependents_3.SCORE <- -p*as.numeric(coe[11])*woe.Dependents_3
Dependents_4.SCORE <- -p*as.numeric(coe[11])*woe.Dependents_4
Dependents.SCORE <- c(Dependents_1.SCORE,Dependents_2.SCORE,Dependents_3.SCORE,Dependents_4.SCORE)
然后制作一个这样的data.frame,就完成了评分卡的制作:
> score_card
Characteristic Attribute Points
1 无担保放款的循环利用 <= 0.114 48
2 无担保放款的循环利用 <= 0.227 28
3 无担保放款的循环利用 <= 0.3 21
4 无担保放款的循环利用 <= 0.389 9
5 无担保放款的循环利用 <= 0.493 2
6 无担保放款的循环利用 <= 0.6791 -13
7 无担保放款的循环利用 <= 0.902 -31
8 无担保放款的循环利用 > 0.902 -49
9 年龄 <= 35 -13
10 年龄 <= 43 -9
11 年龄 <= 49 -5
12 年龄 <= 52 -3
13 年龄 <= 55 0
14 年龄 <= 59 6
15 年龄 <= 63 11
16 年龄 <= 67 19
17 年龄 > 67 29
18 30-59天逾期次数 <= 0 17
19 30-59天逾期次数 <= 1 -29
20 30-59天逾期次数 > 1 -60
21 负债比率 <= 0.016 23
22 负债比率 <= 0.381 5
23 负债比率 <= 0.503 -3
24 负债比率 <= 0.666 -16
25 负债比率 <= 3.972 -32
26 负债比率 > 3.972 11
27 月收入 <= 3332 -3
28 月收入 <= 4838 -2
29 月收入 <= 6620 0
30 月收入 <= 9830 1
31 月收入 > 9830 3
32 开放式信贷和贷款数量 <= 2 -10
33 开放式信贷和贷款数量 <= 3 -2
34 开放式信贷和贷款数量 <= 5 0
35 开放式信贷和贷款数量 <= 7 3
36 开放式信贷和贷款数量 <= 8 5
37 开放式信贷和贷款数量 <= 13 2
38 开放式信贷和贷款数量 > 13 -1
39 90天逾期次数 <=0 12
40 90天逾期次数 <=1 -74
41 不动产贷款或额度数量 <= 0 -8
42 不动产贷款或额度数量 <= 1 9
43 不动产贷款或额度数量 <= 2 7
44 不动产贷款或额度数量 > 2 -10
45 60-89天逾期次数 <=0 7
46 60-89天逾期次数 <=1 -54
47 家属数量 <= 0 2
48 家属数量 <= 1 -1
49 家属数量 <= 2 -3
50 家属数量 > 2 -5