信用评分卡模型(R语言)

贷款风险预测-信用评分卡模型(R语言)

时间:2018年10月9日

本次的分析数据来自Kaggle数据竞赛平台的“give me some credit”竞赛项目。下载地址为:https://www.kaggle.com/c/GiveMeSomeCredit/data

本次分析主要做了两件事:
一、用逻辑回归预测用户未来两年违约的概率
二、根据违约概率制作评分卡

分析步骤:
1.变量的描述
2.数据预处理
3.变量分析
3.模型建立
4.模型评估
5.制作评分卡

一、变量的描述

SeriousDlqin2yrs:超过90天或更糟的逾期拖欠
RevolvingUtilizationOfpercentage :无担保放款的循环利用:除了不动产和像车贷那样除以信用额度总和的无分期付款债务的信用卡和个人信用额度总额
UnsecuredLines age :借款人当时的年龄
NumberOfTime30-59DaysPastDueNotWorse :35-59天逾期但不糟糕次数
DebtRatio percentage:负债比率
MonthlyIncome real:月收入
NumberOf OpenCreditLinesAndLoans:开放式信贷和贷款数量,开放式贷款(分期付款如汽车贷款或抵押贷款)和信贷(如信用卡)的数量
NumberOfTimes90DaysLate:90天逾期次数:借款者有90天或更高逾期的次数
NumberRealEstateLoans :不动产贷款或额度数量:抵押贷款和不动产放款包括房屋净值信贷额度
OrLinesNumberOfTime60-89DaysPastDueNotWorse :60-89天逾期但不糟糕次数:借款人在在过去两年内有60-89天逾期还款但不糟糕的次数
NumberOfDependents:家属数量:不包括本人在内的家属数量

二、数据预处理

1.导入数据并预览数据

cstraining<-read.csv('D:\\A\\score-card\\cs-training.csv')
View(cstraining)

信用评分卡模型(R语言)_第1张图片
信用评分卡模型(R语言)_第2张图片

2. 删除 X 变量

cstraining$X<-NULL

3. 更改变量名称

colnames(cstraining)<-c("y" ,"RUOUL","age","D30Past","DR","Income","OCAL" ,"D90Late"  ,"RealEstate" ,"D60Past","Dependents" )
names(train)

4. 处理样本均衡

> prop.table(table(cstraining$y))

      0       1 
0.93316 0.06684 

违约客户样本量只占了6.7%,样本不均衡。这里,我采用的方法是把违约客户重复放进样本里:

bad<-cstraining[which(cstraining$y==1),]
all<-rbind(cstraining,bad)
> prop.table(table(all$y))

        0         1 
0.8746954 0.1253046 

此时,违约客户占比达到了12.5%,在可接受范围内。

5. 异常值和缺失值的处理

dat1<-all
(1)查看数据缺失情况
library(mice)
md.pattern(dat1)

由图可以看出字段的缺失值个数Income:31400 , Dependents:4103
信用评分卡模型(R语言)_第3张图片
查看缺失比例:

library(VIM)
aggr_plot <- aggr(all, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(dat1), cex.axis=.7, gap=3, ylab=c("Histogram of missing all","Pattern"))
 Variables sorted by number of missings: 
   Variable      Count
     Income 0.19621811
 Dependents 0.02563958
          y 0.00000000
      RUOUL 0.00000000
        age 0.00000000
 D30.59Past 0.00000000
         DR 0.00000000
       OCAL 0.00000000
    D90Late 0.00000000
 RealEstate 0.00000000
 D60.89Past 0.00000000

信用评分卡模型(R语言)_第4张图片

(2)对各变量进行异常值与缺失值处理
##盖帽法函数
block<-function(x,lower=T,upper=T){
  if(lower){
    q1<-quantile(x,0.01)
    x[x<=q1]<-q1
  }
  if(upper){
    q99<-quantile(x,0.99)
    x[x>q99]<-q99
  }
  return(x)
}

(i). 对变量RUOUL进行处理

dat1$RUOUL<-block(dat1$RUOUL)
boxplot(RUOUL~y,data=dat1,horizontal=T, frame=F, 
        col="lightgray",main="RUOUL")

信用评分卡模型(R语言)_第5张图片
(ii). 对变量age进行处理

> summary(dat1$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0    41.0    51.0    51.9    62.0   109.0 
> which(dat1$age==0)
[1] 65696
> dat1<-dat1[-which(dat1$age==0),]

年龄为0的肯定是异常值,而且只有一个数据是0,我采用的是直接删除这个记录。

boxplot(age~y,data=dat1,horizontal=T, frame=F, 
        col="lightgray",main="age")

信用评分卡模型(R语言)_第6张图片
由图可看出,年龄这个变量,是大年龄段存在异常值

QLa <- quantile(dat1$age, probs = 0.25)
QUa <- quantile(dat1$age, probs = 0.75)
QUa_QLa <- QUa-QLa
QLa;QUa;QUa_QLa
dat1$age[which(dat1$age> (QUa+1.5*QUa_QLa))]<-1.5*QUa_QLa
boxplot(dat1$age,col="lightgray",main="age")

信用评分卡模型(R语言)_第7张图片

(iii). 对变量D30Past进行处理

> table(dat1$D30Past )

     0      1      2      3      4      5      6      7      8      9 
131059  18441   5817   2372   1065    496    214     82     33     16 
    10     11     12     13     96     98 
     7      2      3      2      9    407 
> table(dat1$D90Late )

     0      1      2      3      4      5      6      7      8      9 
148215   7008   2331   1052    486    214    128     69     36     33 
    10     11     12     13     14     15     17     96     98 
    13      8      3      6      3      2      2      9    407 
> table(dat1$D60Past )

     0      1      2      3      4      5      6      7      8      9 
149651   7508   1679    498    170     55     28     14      3      1 
    11     96     98 
     2      9    407 

显然值为96,98 的是异常值。我采用的方法是直接删除。而且D30Past这个变量的96,98的记录删除之后,后面D90Late D60Past两个变量的96,98的记录也同时被删除了。

> dat1<-dat1[-which(dat1$D30Past %in% c(96,98)),]
> table(dat1$D30Past )

     0      1      2      3      4      5      6      7      8      9     10     11 
131059  18441   5817   2372   1065    496    214     82     33     16      7      2 
    12     13 
     3      2 
> table(dat1$D90Late )

     0      1      2      3      4      5      6      7      8      9     10     11 
148215   7008   2331   1052    486    214    128     69     36     33     13      8 
    12     13     14     15     17 
     3      6      3      2      2 
> table(dat1$D60Past )

     0      1      2      3      4      5      6      7      8      9     11 
149651   7508   1679    498    170     55     28     14      3      1      2 

盖帽法处理异常值

dat1$D30.59Past<-block(dat1$D30.59Past)
> table(dat1$D30.59Past )

     0      1      2      3      4 
131059  18441   5817   2372   1920 

(iv).对变量 "DR"进行处理

> summary(dat1$DR)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.0      0.2      0.4    350.3      0.9 329664.0 

盖帽法处理

dat1$DR<-block(dat1$DR)

(v). 对变量Income进行处理
因为已经做过多次模型对比,对这个变量分别采用KNN填补和均值填补,发现采用均值填补,模型效果更好,因此我采用的是用均值填补缺失值,然而均值容易受到极端值的影响,所以先对少数过于极端的异常值作处理,再填补缺失值
信用评分卡模型(R语言)_第8张图片
存在大量0,1,2这样的异常值,当作缺失值处理

dat1$Income[which(dat1$Income %in% c(0,1,2))]<-NA
boxplot(dat1$Income,col="lightgray",main="Income",range =1.5)

信用评分卡模型(R语言)_第9张图片
红色框里的值对应的记录直接删除,调range参数,找出想删除的值

boxplot(dat1$Income,col="lightgray",main="Income",range =100)

信用评分卡模型(R语言)_第10张图片

> sort(boxplot.stats(dat1$Income,coef =100)$out)
 [1]  562466  582369  629000  649587  699530  702500  730483  835040
 [9] 1072500 1560100 1794060 3008750`
dat1<-dat1[-which(dat1$Income>=562466),]

均值填补缺失值

dat1$Income[is.na(dat1$Income)]<-mean(dat1$Income,na.rm=TRUE)

盖帽法处理异常值

dat1$Income<-block(dat1$Income)
> summary(dat1$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    820    4000    6630    6441    7300   22942 

(vi).对变量OCAL进行处理

> table(dat1$OCAL)

    0     1     2     3     4     5     6     7     8     9    10    11 
 1959  5020  7275  9739 12352 13750 14367 13991 13166 12018 10197  8790 
   12    13    14    15    16    17    18    19    20    21    22    23 
 7445  5976  4846  3926  3206  2527  2007  1516  1266   924   731   575 
   24    25    26    27    28    29    30    31    32    33    34    35 
  447   354   261   209   157   122    98    78    54    51    36    28 
   36    37    38    39    40    41    42    43    44    45    46    47 
   21     8    13    10    11     5     8     9     2     9     3     2 
   48    49    50    51    52    53    54    56    57    58 
    8     5     2     2     4     1     4     2     3     1
> quantile(dat1$OCA,0.99)
99% 
 24 
> quantile(dat1$OCA,0.01)
1% 
 0 

采用盖帽法处理

> dat1$OCAL<-block(dat1$OCAL)
> table(dat1$OCAL)

    0     1     2     3     4     5     6     7     8     9    10    11 
 1959  5020  7275  9739 12352 13750 14367 13991 13166 12018 10197  8790 
   12    13    14    15    16    17    18    19    20    21    22    23 
 7445  5976  4846  3926  3206  2527  2007  1516  1266   924   731   575 
   24 
 2028 

(vii).对变量D90Late进行处理

> prop.table(table(dat1$D90Late))

            0             1             2             3             4 
0.92860768060 0.04391059982 0.01460553770 0.00659160260 0.00304517002 
            5             6             7             8             9 
0.00134087733 0.00080202009 0.00043233895 0.00022556815 0.00020677080 
           10            11            12            13            14 
0.00008145517 0.00005012626 0.00001879735 0.00003759469 0.00001879735 
           15            17 
0.00001253156 0.00001253156

单一指标占比大于90%的变量不适合进入模型,然而这个变量又比较重要,所以我把大于0次的都统一为1,然后0代表没有过超过90天逾期记录,1代表有超过90天逾期的记录。

dat1$D90Late<-as.numeric(as.logical(dat1$D90Late))
> prop.table(table(dat1$D90Late))

         0          1 
0.92860768 0.07139232

(viii).对变量RealEstate进行处理

> table(dat1$RealEstate)

    0     1     2     3     4     5     6     7     8     9    10    11 
60439 55083 33284  6722  2362   772   368   202   116    92    42    25 
   12    13    14    15    16    17    18    19    20    21    23    25 
   24    20     7    10     5     5     2     2     3     1     2     4 
   26    29    32    54 
    1     2     1     1 
> quantile(dat1$RealEstate,0.99)
99% 
  5 
> quantile(dat1$RealEstate,0.01)
1% 
 0 

采用盖帽法处理

> dat1$RealEstate<-block(dat1$RealEstate)
> table(dat1$RealEstate)

    0     1     2     3     4     5 
60439 55083 33284  6722  2362  1707 

(ix).对变量D60Past进行处理

> prop.table(table(dat1$D60Past))

             0              1              2              3              4 
0.937611609241 0.047037225010 0.010520247874 0.003120359405 0.001065182930 
             5              6              7              8              9 
0.000344618007 0.000175441894 0.000087720947 0.000018797346 0.000006265782 
            11 
0.000012531564 

处理方法和理由与变量D90Late一样

> dat1$D60.89Past<-as.numeric(as.logical(dat1$D60Past))
> prop.table(table(dat1$D60Past))

         0          1 
0.93761161 0.06238839

(x).对变量Dependents进行处理

> summary(dat1$Dependents)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.00    0.00    0.77    1.00   20.00    4060
> table(dat1$Dependents)

    0     1     2     3     4     5     6     7     8     9    10    13 
91702 28210 21070 10310  3156   813   182    56    26     5     5     1 
   20 
    1 

缺失值用众数0填补

dat1$Dependents[is.na(dat1$Dependents)]<-0

盖帽法处理异常值

dat1$Dependents<-block(dat1$Dependents)
> table(dat1$Dependents)

    0     1     2     3     4 
95762 28210 21070 10310  4245

三、变量分析

y中0和1互换,互换后1是没有违约客户,0是违约客户

dat2<-dat1
dat2$y<-as.numeric(!as.logical(dat2$y))

查看相关性

library(corrplot)
cor1<-cor(dat2)
corrplot(cor1,method="number")

信用评分卡模型(R语言)_第11张图片

各变量间的相关系数不大,可以不作处理

四、建立模型

1、划分训练集与测试集

> prop.table(table(dat2$y))

        0         1 
0.1237993 0.8762007 
library(caret)
set.seed(1234) 
splitIndex<-createDataPartition(dat2$y,time=1,p=0.7,list=FALSE) 
train<-dat2[splitIndex,] 
test<-dat2[-splitIndex,]
> prop.table(table(train$y))

        0         1 
0.1243041 0.8756959 
> prop.table(table(test$y))

        0         1 
0.1226216 0.8773784
> nrow(train)
[1] 111718
> nrow(test)
[1] 47879

2、建logistics回归模型

model<-glm(y~.,train,family = "binomial")
summary(model)

所有变量都显著

Coefficients:
                Estimate   Std. Error z value             Pr(>|z|)    
(Intercept)  2.759486590  0.051636436  53.441 < 0.0000000000000002 ***
RUOUL       -1.965393815  0.031089068 -63.218 < 0.0000000000000002 ***
age          0.018351654  0.000865627  21.200 < 0.0000000000000002 ***
D30.59Past  -0.469131910  0.011938381 -39.296 < 0.0000000000000002 ***
DR           0.000054717  0.000012537   4.365            0.0000127 ***
Income       0.000046362  0.000003391  13.673 < 0.0000000000000002 ***
OCAL        -0.039503078  0.002506984 -15.757 < 0.0000000000000002 ***
D90Late     -1.598706728  0.029895157 -53.477 < 0.0000000000000002 ***
RealEstate  -0.133127866  0.012048378 -11.049 < 0.0000000000000002 ***
D60.89Past  -1.076664163  0.032420477 -33.209 < 0.0000000000000002 ***
Dependents  -0.046652021  0.009639607  -4.840            0.0000013 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 83881  on 111717  degrees of freedom
Residual deviance: 60808  on 111707  degrees of freedom
AIC: 60830

3、检验多重共线性

library(car)
> vif(model)
     RUOUL        age D30.59Past         DR     Income       OCAL    D90Late 
  1.216980   1.152183   1.110489   1.048800   1.220319   1.549682   1.123098 
RealEstate D60.89Past Dependents 
  1.491353   1.081564   1.067081 

五、模型评估

1、返回模型在测试集上的概率值

pre <- predict(object = model, newdata= test, type = 'response')

2、ROC检验

library(pROC)
modelroc <- roc(test$y,pre)
plot(modelroc, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),grid.col=c("green", "red"), max.auc.polygon=TRUE,auc.polygon.col="skyblue", print.thres=TRUE)

模型AUC值为0.855,效果还可以。(注,在此之前已经用过多种方法清洗数据,得到的AUC值分别为:0.858,0.831,0.796。不同的数据清洗,对模型结果影响也是蛮大的。)
信用评分卡模型(R语言)_第12张图片

3、KS检验

定义KS函数:

myKS <- function(pre,label){
  true <- sum(label)
  false <- length(label)-true
  tpr <- NULL
  fpr <- NULL
  o_pre <- pre[order(pre)] # let the threshold in an order from small to large
  for (i in o_pre){
    tp <- sum((pre >= i) & label)
    tpr <- c(tpr,tp/true)
    fp <- sum((pre >= i) & (1-label))
    fpr <- c(fpr,fp/false)
  }
  plot(o_pre,tpr,type = "l",col= "green",xlab="threshold",ylab="tpr,fpr")
  lines(o_pre,fpr,type="l", col = "red")
  KSvalue <- max(tpr-fpr)
  sub = paste("KS value =",KSvalue)
  title(sub=sub)
  cutpoint <- which(tpr-fpr==KSvalue)
  thre <- o_pre[cutpoint]
  lines(c(thre,thre),c(fpr[cutpoint],tpr[cutpoint]),col = "blue")
  cat("KS-value:",KSvalue)
}
myKS(pre,test$y)

信用评分卡模型(R语言)_第13张图片
KS值为0.55,模型效果还不错

六、制作评分卡

(一)WOE转换

计算WOE的函数

totalbad = as.numeric(table(train$y))[1]
totalgood = as.numeric(table(train$y))[2]
getWOE <- function(a,p,q)   
{      
  Bad <- as.numeric(table(train$y[a > p & a <= q]))[1]      
  Good <- as.numeric(table(train$y[a > p & a <= q]))[2]      
  WOE <- log((Good/totalgood)/(Bad/totalbad),base = exp(1))   
  return(WOE)  
}
library(smbinning)

(二)对各变量进行分箱

1.RUOUL变量

train$RUOUL<-round(train$RUOUL,3)#保留3位小数
RUOULresult=smbinning(df=train,y="y",x="RUOUL",p=0.05)
> RUOULresult$iv
[1] 1.1077

查看分箱情况

RUOULresult$ivtable
smbinning.plot(RUOULresult,option="WoE",sub="RUOUL")

分箱情况还不错,保留自动分箱结果
信用评分卡模型(R语言)_第14张图片
进行woe转换

tmp.RUOUL <- 0 
for(i in 1:nrow(train)) {      
  if(train$RUOUL[i] <= 0.114)        
    tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][1]      
  else if(train$RUOUL[i] <= 0.227)        
    tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][2]      
  else if(train$RUOUL[i] <= 0.3)        
    tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][3]      
  else if(train$RUOUL[i] <= 0.389)        
    tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][4]      
  else if(train$RUOUL[i] <= 0.493)        
    tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][5]      
  else if(train$RUOUL[i] <= 0.6791)        
    tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][6]      
  else if(train$RUOUL[i] <= 0.902)        
    tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][7]      
  else        
    tmp.RUOUL[i] <- RUOULresult$ivtable[[13]][8]    
} 
> table(tmp.RUOUL)
tmp.RUOUL
-1.3896 -0.8822 -0.3639  0.0528  0.2664  0.6058  0.7948  1.3609 
  16901    8916    8219    5663    5638    5662   12631   48088 
train_WOE<-train
train_WOE$w_RUOUL<-tmp.RUOUL

2.对age变量进行分箱

ageresult=smbinning(df=train,y="y",x="age",p=0.05)
smbinning.plot(ageresult,option="WoE",sub="age")
> ageresult$iv
[1] 0.2582

信用评分卡模型(R语言)_第15张图片
woe转换

tmp.age <- 0    
for(i in 1:nrow(train)) {      
  if(train$age[i] <= 35)        
    tmp.age[i] <- ageresult$ivtable[[13]][1]      
  else if(train$age[i] <= 43)        
    tmp.age[i] <- ageresult$ivtable[[13]][2]      
  else if(train$age[i] <= 49)        
    tmp.age[i] <- ageresult$ivtable[[13]][3]      
  else if(train$age[i] <= 52)        
    tmp.age[i] <- ageresult$ivtable[[13]][4]      
  else if(train$age[i] <= 55)        
    tmp.age[i] <- ageresult$ivtable[[13]][5]      
  else if(train$age[i] <= 59)        
    tmp.age[i] <- ageresult$ivtable[[13]][6]      
  else if(train$age[i] <= 63)        
    tmp.age[i] <- ageresult$ivtable[[13]][7]      
  else if(train$age[i] <= 67)        
    tmp.age[i] <- ageresult$ivtable[[13]][8]      
  else        
    tmp.age[i] <- ageresult$ivtable[[13]][9]    
}
train_WOE$w_age<-tmp.age

3.对D30Past 变量进行分箱

D30Pastresult=smbinning(df=train,y="y",x="D30Past",p=0.05)
smbinning.plot(D30Pastresult,option="WoE",sub="D30Past")

信用评分卡模型(R语言)_第16张图片
woe转换

tmp.D30Past <- 0    
for(i in 1:nrow(train)) {      
  if(train$D30Past[i] <= 0)        
    tmp.D30Past[i] <- D30Pastresult$ivtable[[13]][1]      
  else if(train$D30Past[i] <= 1)        
    tmp.D30Past[i] <- D30Pastresult$ivtable[[13]][2]      
  else        
    tmp.D30Past[i] <-D30Pastresult$ivtable[[13]][3]    
}
table(tmp.D30Past)
train_WOE$w_D30Past<-tmp.D30Past

4对DR变量进行分箱

train$DR<-round(train$DR,3)#保留3位小数
DRresult=smbinning(df=train,y="y",x="DR",p=0.05)
smbinning.plot(DRresult,option="WoE",sub="DR")

信用评分卡模型(R语言)_第17张图片
分箱结果并不是很好,决定把前面两个分箱合并,看一下woe值

> getWOE(train$DR,-Inf,0.137)
[1] 0.1509092

第一第二个箱合并结果并没有很好,合并后的WOE值与第三个箱的WOE值几乎相等。尝试一下第二第三个箱合并,看一下结果

> getWOE(train$DR,0.016,0.381)
[1] 0.1043971

第二第三个箱合并后,woe值还不错
手动分箱提取woe:

woe.DR_1=getWOE(train$DR,-Inf,0.016)
woe.DR_2=getWOE(train$DR,0.016,0.381)   
woe.DR_3=getWOE(train$DR,0.381,0.503)
woe.DR_4=getWOE(train$DR,0.503,0.666)
woe.DR_5=getWOE(train$DR,0.666,3.972)
woe.DR_6=getWOE(train$DR,3.972,Inf)
DR.WOE<-c(woe.DR_1,woe.DR_2,woe.DR_3,woe.DR_4,woe.DR_5,woe.DR_6)
DR.WOE<-round(DR.WOE,3)

画woe代码:

x<- barplot(DR.WOE,xlab="preference",ylab="frequency",main='Weight of Evidence')
lbls<-paste(" ",DR.WOE)
text(x,DR.WOE,labels=lbls,cex=1.5,pos=1)

信用评分卡模型(R语言)_第18张图片
进行woe转换

tmp.DR <- 0    
for(i in 1:nrow(train)) {      
  if(train$DR[i] <= 0.016)        
    tmp.DR[i] <- woe.DR_1      
  else if(train$DR[i] <= 0.381)       
    tmp.DR[i] <- woe.DR_2      
  else if(train$DR[i] <= 0.503)        
    tmp.DR[i] <- woe.DR_3      
  else if(train$DR[i] <= 0.666)        
    tmp.DR[i] <- woe.DR_4      
  else if(train$DR[i] <= 3.972)        
    tmp.DR[i] <- woe.DR_5      
  else        
    tmp.DR[i] <- woe.DR_6    
}

train_WOE$w_DR<-tmp.DR

5.对Income变量进行分箱

Incomeresult=smbinning(df=train,y="y",x="Income",p=0.05)
smbinning.plot(Incomeresult,option="WoE",sub="Income")

信用评分卡模型(R语言)_第19张图片
分箱结果并不是很好,尝试合并前面两个箱和后面两个箱

> getWOE(train$Income,-Inf,3332)
[1] -0.396827
> getWOE(train$Income,9830,Inf)
[1] 0.4653303

合并结果还不错,选择合并
提取woe:

woe.Income_1=getWOE(train$Income,-Inf,3332)
woe.Income_2=getWOE(train$Income,3332,4838)   
woe.Income_3=getWOE(train$Income,4838,6620)
woe.Income_4=getWOE(train$Income,6620,9830)
woe.Income_5=getWOE(train$Income,9830,Inf)
Income.WOE=c(woe.Income_1,woe.Income_2,woe.Income_3,woe.Income_4,woe.Income_5) 

画woe:

Income.WOE<-round(Income.WOE,3)
x<- barplot(Income.WOE,xlab="preference",ylab="frequency",main='Weight of Evidence')
lbls<-paste(" ",Income.WOE)
text(x,Income.WOE,labels=lbls,cex=1.5,pos=1)

信用评分卡模型(R语言)_第20张图片
Income变量进行WOE转换

tmp.Income <- 0    
for(i in 1:nrow(train)) {      
  if(train$Income[i] <= 3332)        
    tmp.Income[i] <- woe.Income_1      
  else if(train$Income[i] <= 4838)        
    tmp.Income[i] <- woe.Income_2      
  else if(train$Income[i] <= 6620)        
    tmp.Income[i] <- woe.Income_3      
  else if(train$Income[i] <= 9830)        
    tmp.Income[i] <- woe.Income_4      
  else        
    tmp.Income[i] <- woe.Income_5    
}
train_WOE$w_Income<-tmp.Income

6.对OCAL变量进行分箱

OCALresult=smbinning(df=train,y="y",x="OCAL",p=0.05)
smbinning.plot(OCALresult,option="WoE",sub="OCAL")
> OCALresult$iv
[1] 0.0709

信用评分卡模型(R语言)_第21张图片
woe转换:

tmp.OCAL <- 0    
for(i in 1:nrow(train)) {      
  if(train$OCAL[i] <= 2)        
    tmp.OCAL[i] <- OCALresult$ivtable[[13]][1]      
  else if(train$OCAL[i] <= 3)        
    tmp.OCAL[i] <- OCALresult$ivtable[[13]][2]      
  else if(train$OCAL[i] <= 5)        
    tmp.OCAL[i] <- OCALresult$ivtable[[13]][3]      
  else if(train$OCAL[i] <= 7)        
    tmp.OCAL[i] <- OCALresult$ivtable[[13]][4]      
  else if(train$OCAL[i] <= 8)        
    tmp.OCAL[i] <- OCALresult$ivtable[[13]][5]      
  else if(train$OCAL[i] <= 13)        
    tmp.OCAL[i] <- OCALresult$ivtable[[13]][6]      
  else        
    tmp.OCAL[i] <- OCALresult$ivtable[[13]][7]    
}


train_WOE$w_OCAL<-tmp.OCAL

7.对D90Late变量进行分箱

> table(train$D90Late)

     0      1 
103607   8111 

手动分箱

woe.D90Late_1=getWOE(train$D90Late,-Inf,0)
woe.D90Late_2=getWOE(train$D90Late,0,Inf) 
> woe.D90Late_1
[1] 0.3793484
> woe.D90Late_2
[1] -2.279932

#画WOE 代码:

D90Late.WOE<-c(woe.D90Late_1,woe.D90Late_2)
D90Late.WOE<-round(D90Late.WOE,3)
x<- barplot(D90Late.WOE,xlab="preference",ylab="frequency",main = 'D90Late')
lbls<-paste(" ",D90Late.WOE)
text(x,D90Late.WOE,labels=lbls,cex=1.5,pos=1)

信用评分卡模型(R语言)_第22张图片
#WOE转换

tmp.D90Late<- 0    
for(i in 1:nrow(train)) {      
  if(train$D90Late[i] <= 0)        
    tmp.D90Late[i] <- woe.D90Late_1
  else       
    tmp.D90Late[i] <- woe.D90Late_2      
}
train_WOE$w_D90Late<-tmp.D90Late

8.对RealEstate变量进行分箱

RealEstateresult=smbinning(df=train,y="y",x="RealEstate",p=0.05)
smbinning.plot(RealEstateresult,option="WoE",sub="RealEstate")
> RealEstateresult$iv
[1] 0.0491

信用评分卡模型(R语言)_第23张图片
woe转换:

tmp.RealEstate <- 0    
for(i in 1:nrow(train)) {      
  if(train$RealEstate[i] <= 0)        
    tmp.RealEstate[i] <- RealEstateresult$ivtable[[13]][1]      
  else if(train$RealEstate[i] <= 1)        
    tmp.RealEstate[i] <- RealEstateresult$ivtable[[13]][2]      
  else if(train$RealEstate[i] <= 2)        
    tmp.RealEstate[i] <- RealEstateresult$ivtable[[13]][3]      
  else        
    tmp.RealEstate[i] <- RealEstateresult$ivtable[[13]][4]    
}
train_WOE$w_RealEstate<-tmp.RealEstate

9.对D60Past变量进行分箱

> table(train$D60Past)

     0      1 
104720   6998 
woe.D60Past_1 <- getWOE(train$D60Past,-Inf,0)
woe.D60Past_2 <- getWOE(train$D60Past,0,Inf) 
> woe.D60Past_1
[1] 0.275222
> woe.D60Past_2
[1] -2.065585

信用评分卡模型(R语言)_第24张图片
#woe转换:

tmp.D60Past<- 0    
for(i in 1:nrow(train)) {      
  if(train$D60Past[i] <= 0)        
    tmp.D60Past[i] <- woe.D60Past_1
  else       
    tmp.D60Past[i] <- woe.D60Past_2      
}
train_WOE$w_D60Past<-tmp.D60Past

10.对Dependents变量进行分箱:

Dependentsresult=smbinning(df=train,y="y",x="Dependents",p=0.05)
smbinning.plot(Dependentsresult,option="WoE",sub="Dependents")
> Dependentsresult$iv
[1] 0.0391

信用评分卡模型(R语言)_第25张图片
woe转换:

tmp.Dependents <- 0    
for(i in 1:nrow(train)) {      
  if(train$Dependents[i] <= 0)        
    tmp.Dependents[i] <- Dependentsresult$ivtable[[13]][1]      
  else if(train$Dependents[i] <= 1)        
    tmp.Dependents[i] <- Dependentsresult$ivtable[[13]][2]      
  else if(train$Dependents[i] <= 2)        
    tmp.Dependents[i] <- Dependentsresult$ivtable[[13]][3]      
  else        
    tmp.Dependents[i] <- Dependentsresult$ivtable[[13]][4]    
}
train_WOE$w_Dependents<-tmp.Dependents

(三)用分箱后的woe值替换原来的值:

train.woe<-train_WOE[,c(1,12:21)]
View(train.woe)

信用评分卡模型(R语言)_第26张图片

(四)评分卡的创建和实施

1、信用评分卡的评分公式为:

Score = q - p *log(odds)

2、标准odds : 标准评分对应的odds
3、ln(odds)公式为:
ln⁡(odds)=ln⁡(p/(1-p))=β_0+β_1 x_1+β_2 x_2+β_3 x_3+⋯+β_n x_n
在这里插入图片描述
4、PDO:odds翻倍所需增加的分值
建模预测结果为“不发生违约的概率”,log(odds)即表示为“好坏比”。
由于逻辑回归模型的预测结果恰好为ln(odds),因此上式中的ln(odds)可替换为:intercept+∑评分权重*WOE
5、评分需要自己预设一个评分标准,比如:
如果标准评分=600,标准odds=10:1,PDO=40,那么,一个600分的客户所对应的odds就是10:1,一个640分的客户所对应的odds就是20:1,同样,560分则对应的odds就是5:1。
评分标准的设定需根据行业经验不断跟踪调整,下面的分数设定仅代表个人经验。
6、下面开始设立评分,假设按好坏比10为600分,每高40分好坏比翻一倍算出p,q。
即有方程组:
600 = q - p log(10)
600+40 = q - p log(210)
解方程组得:
p = -40/log(2)
q = 600 - 40
log(10)/log(2)
把解得的p,q值代入公式:

Score = q - p *log(odds)

即可算得每个变量的每个分箱的得分以及每个客户的得分
最终做出的评分卡样式如下(此表是在Excel做的,用的是R语言导出的评分卡CSV文件):

评分卡最终样式

信用评分卡模型(R语言)_第27张图片

制作评分卡代码

以下制作评分卡的过程也可以用现成的包直接生成,我是为了让自己加深理解,所以采用手动一步一步操作,过程比较繁琐,由于是初学者,代码也没有很简洁,大家就将就着看吧。 囧。。。
以下是制作评分卡的代码:

> glm.fit <-glm(y~., data=train.woe, family = binomial(link=logit))
> summary(glm.fit)

Call:
glm(formula = y ~ ., family = binomial(link = logit), data = train.woe)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9660   0.2035   0.2762   0.4261   2.5143  

Coefficients:
             Estimate Std. Error z value             Pr(>|z|)    
(Intercept)   1.93612    0.01131 171.218 < 0.0000000000000002 ***
w_RUOUL       0.60571    0.01125  53.838 < 0.0000000000000002 ***
w_age         0.43762    0.02414  18.132 < 0.0000000000000002 ***
w_D30Past     0.55782    0.01237  45.102 < 0.0000000000000002 ***
w_DR          0.87452    0.03992  21.907 < 0.0000000000000002 ***
w_Income      0.10930    0.03866   2.827             0.004693 ** 
w_OCAL        0.27058    0.04270   6.337       0.000000000235 ***
w_D90Late     0.56345    0.01117  50.423 < 0.0000000000000002 ***
w_RealEstate  0.66499    0.05284  12.585 < 0.0000000000000002 ***
w_D60Past     0.45478    0.01370  33.192 < 0.0000000000000002 ***
w_Dependents  0.21368    0.05549   3.851             0.000118 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Bata系数:

> coe =(glm.fit$coefficients)
> coe
 (Intercept)      w_RUOUL        w_age    w_D30Past         w_DR     w_Income 
   1.9361236    0.6057147    0.4376180    0.5578200    0.8745198    0.1093012 
      w_OCAL    w_D90Late w_RealEstate    w_D60Past w_Dependents 
   0.2705830    0.5634518    0.6649864    0.4547815    0.2136834 

根据方程组:
600 = q - p *log(10)
600+40 = q - p log(210)
可算得p,q分别为:

p <- -40/log(2)
q <- 600 - 40*log(10)/log(2)
attach(train.woe)

根据评分公式:

Score = q - p *log(odds)

可算出所有人的分数:

Score=q - p*(as.numeric(coe[1])+as.numeric(coe[2])*w_RUOUL
  +as.numeric(coe[3])*w_age
  +as.numeric(coe[4])*w_D30Past
  +as.numeric(coe[5])*w_DR
  +as.numeric(coe[6])*w_Income
  +as.numeric(coe[7])*w_OCAL
  +as.numeric(coe[8])*w_D90Late
  +as.numeric(coe[9])*w_RealEstate
  +as.numeric(coe[10])*w_D60Past
  +as.numeric(coe[11])*w_Dependents)
> summary(Score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  282.0   574.5   634.8   615.1   672.8   734.1 

个人总评分=基础分+各部分得分
基础得分base为:

> base <- q - p*as.numeric(coe[1])
> base
[1] 578.8523
对各变量进行打分

1、对w_RUOUL 变量打分

woe.RUOUL_1 <- RUOULresult$ivtable[[13]][1]
woe.RUOUL_2 <- RUOULresult$ivtable[[13]][2] 
woe.RUOUL_3 <- RUOULresult$ivtable[[13]][3]
woe.RUOUL_4 <- RUOULresult$ivtable[[13]][4]
woe.RUOUL_5 <- RUOULresult$ivtable[[13]][5]
woe.RUOUL_6 <- RUOULresult$ivtable[[13]][6]
woe.RUOUL_7 <- RUOULresult$ivtable[[13]][7]
woe.RUOUL_8 <- RUOULresult$ivtable[[13]][8]
RUOUL_1.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_1
RUOUL_2.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_2
RUOUL_3.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_3
RUOUL_4.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_4
RUOUL_5.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_5
RUOUL_6.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_6
RUOUL_7.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_7
RUOUL_8.SCORE <- -p*as.numeric(coe[2])*woe.RUOUL_8
RUOUL.SCORE <- c(RUOUL_1.SCORE,RUOUL_2.SCORE,RUOUL_3.SCORE,RUOUL_4.SCORE,RUOUL_5.SCORE,RUOUL_6.SCORE,RUOUL_7.SCORE,RUOUL_8.SCORE)

2、对w_age变量打分

woe.age_1 <- ageresult$ivtable[[13]][1]
woe.age_2 <- ageresult$ivtable[[13]][2] 
woe.age_3 <- ageresult$ivtable[[13]][3]
woe.age_4 <- ageresult$ivtable[[13]][4]
woe.age_5 <- ageresult$ivtable[[13]][5]
woe.age_6 <- ageresult$ivtable[[13]][6]
woe.age_7 <- ageresult$ivtable[[13]][7]
woe.age_8 <- ageresult$ivtable[[13]][8]
woe.age_9 <- ageresult$ivtable[[13]][9]
age_1.SCORE <- -p*as.numeric(coe[3])*woe.age_1
age_2.SCORE <- -p*as.numeric(coe[3])*woe.age_2
age_3.SCORE <- -p*as.numeric(coe[3])*woe.age_3
age_4.SCORE <- -p*as.numeric(coe[3])*woe.age_4
age_5.SCORE <- -p*as.numeric(coe[3])*woe.age_5
age_6.SCORE <- -p*as.numeric(coe[3])*woe.age_6
age_7.SCORE <- -p*as.numeric(coe[3])*woe.age_7
age_8.SCORE <- -p*as.numeric(coe[3])*woe.age_8
age_9.SCORE <- -p*as.numeric(coe[3])*woe.age_9
age.SCORE <- c(age_1.SCORE,age_2.SCORE,age_3.SCORE,age_4.SCORE,age_5.SCORE,age_6.SCORE,age_7.SCORE,age_8.SCORE,age_9.SCORE)

3、对w_D30Past变量打分

woe.D30Past_1 <- D30Pastresult$ivtable[[13]][1]
woe.D30Past_2 <- D30Pastresult$ivtable[[13]][2] 
woe.D30Past_3 <- D30Pastresult$ivtable[[13]][3]
D30Past_1.SCORE <- -p*as.numeric(coe[4])*woe.D30Past_1
D30Past_2.SCORE <- -p*as.numeric(coe[4])*woe.D30Past_2
D30Past_3.SCORE <- -p*as.numeric(coe[4])*woe.D30Past_3
D30Past.SCORE <- c(D30Past_1.SCORE,D30Past_2.SCORE,D30Past_3.SCORE)

4、对w_DR变量打分

woe.DR_1=getWOE(train$DR,-Inf,0.016)
woe.DR_2=getWOE(train$DR,0.016,0.381)   
woe.DR_3=getWOE(train$DR,0.381,0.503)
woe.DR_4=getWOE(train$DR,0.503,0.666)
woe.DR_5=getWOE(train$DR,0.666,3.972)
woe.DR_6=getWOE(train$DR,3.972,Inf)
DR_1.SCORE <- -p*as.numeric(coe[5])*woe.DR_1
DR_2.SCORE <- -p*as.numeric(coe[5])*woe.DR_2
DR_3.SCORE <- -p*as.numeric(coe[5])*woe.DR_3
DR_4.SCORE <- -p*as.numeric(coe[5])*woe.DR_4
DR_5.SCORE <- -p*as.numeric(coe[5])*woe.DR_5
DR_6.SCORE <- -p*as.numeric(coe[5])*woe.DR_6
DR.SCORE <- c(DR_1.SCORE,DR_2.SCORE,DR_3.SCORE,DR_4.SCORE,DR_5.SCORE,DR_6.SCORE)

5、对 w_Income变量打分

woe.Income_1=getWOE(train$Income,-Inf,3332)
woe.Income_2=getWOE(train$Income,3332,4838)   
woe.Income_3=getWOE(train$Income,4838,6620)
woe.Income_4=getWOE(train$Income,6620,9830)
woe.Income_5=getWOE(train$Income,9830,Inf)
Income_1.SCORE <- -p*as.numeric(coe[6])*woe.Income_1
Income_2.SCORE <- -p*as.numeric(coe[6])*woe.Income_2
Income_3.SCORE <- -p*as.numeric(coe[6])*woe.Income_3
Income_4.SCORE <- -p*as.numeric(coe[6])*woe.Income_4
Income_5.SCORE <- -p*as.numeric(coe[6])*woe.Income_5
Income.SCORE <- c(Income_1.SCORE,Income_2.SCORE,Income_3.SCORE,Income_4.SCORE,Income_5.SCORE)

6、对 w_OCAL变量打分

woe.OCAL_1 <- OCALresult$ivtable[[13]][1]
woe.OCAL_2 <- OCALresult$ivtable[[13]][2] 
woe.OCAL_3 <- OCALresult$ivtable[[13]][3]
woe.OCAL_4 <- OCALresult$ivtable[[13]][4]
woe.OCAL_5 <- OCALresult$ivtable[[13]][5] 
woe.OCAL_6 <- OCALresult$ivtable[[13]][6]
woe.OCAL_7 <- OCALresult$ivtable[[13]][7]
OCAL_1.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_1
OCAL_2.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_2
OCAL_3.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_3
OCAL_4.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_4
OCAL_5.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_5
OCAL_6.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_6
OCAL_7.SCORE <- -p*as.numeric(coe[7])*woe.OCAL_7
OCAL.SCORE <- c(OCAL_1.SCORE,OCAL_2.SCORE,OCAL_3.SCORE,OCAL_4.SCORE,OCAL_5.SCORE,OCAL_6.SCORE,OCAL_7.SCORE)

7 、对w_D90Late变量打分

woe.D90Late_1=getWOE(train$D90Late,-Inf,0)
woe.D90Late_2=getWOE(train$D90Late,0,Inf) 
D90Late_1.SCORE <- -p*as.numeric(coe[8])*woe.D90Late_1
D90Late_2.SCORE <- -p*as.numeric(coe[8])*woe.D90Late_2
D90Late.SCORE <- c(D90Late_1.SCORE,D90Late_2.SCORE)

8、对w_RealEstate变量打分

woe.RealEstate_1 <- RealEstateresult$ivtable[[13]][1]
woe.RealEstate_2 <- RealEstateresult$ivtable[[13]][2]
woe.RealEstate_3 <- RealEstateresult$ivtable[[13]][3]
woe.RealEstate_4 <- RealEstateresult$ivtable[[13]][4]
RealEstate_1.SCORE <- -p*as.numeric(coe[9])*woe.RealEstate_1
RealEstate_2.SCORE <- -p*as.numeric(coe[9])*woe.RealEstate_2
RealEstate_3.SCORE <- -p*as.numeric(coe[9])*woe.RealEstate_3
RealEstate_4.SCORE <- -p*as.numeric(coe[9])*woe.RealEstate_4
RealEstate.SCORE <- c(RealEstate_1.SCORE,RealEstate_2.SCORE,RealEstate_3.SCORE,RealEstate_4.SCORE)

9 、对w_D60Past变量打分

woe.D60Past_1 <- getWOE(train$D60Past,-Inf,0)
woe.D60Past_2 <- getWOE(train$D60Past,0,Inf) 
D60Past_1.SCORE <- -p*as.numeric(coe[10])*woe.D60Past_1
D60Past_2.SCORE <- -p*as.numeric(coe[10])*woe.D60Past_2
D60Past.SCORE <- c(D60Past_1.SCORE,D60Past_2.SCORE)

10、对w_Dependents变量打分

woe.Dependents_1 <- Dependentsresult$ivtable[[13]][1]
woe.Dependents_2 <- Dependentsresult$ivtable[[13]][2]
woe.Dependents_3 <- Dependentsresult$ivtable[[13]][3]
woe.Dependents_4 <- Dependentsresult$ivtable[[13]][4]
Dependents_1.SCORE <- -p*as.numeric(coe[11])*woe.Dependents_1
Dependents_2.SCORE <- -p*as.numeric(coe[11])*woe.Dependents_2
Dependents_3.SCORE <- -p*as.numeric(coe[11])*woe.Dependents_3
Dependents_4.SCORE <- -p*as.numeric(coe[11])*woe.Dependents_4
Dependents.SCORE <- c(Dependents_1.SCORE,Dependents_2.SCORE,Dependents_3.SCORE,Dependents_4.SCORE)

然后制作一个这样的data.frame,就完成了评分卡的制作:

> score_card
         Characteristic Attribute Points
1  无担保放款的循环利用  <= 0.114     48
2  无担保放款的循环利用  <= 0.227     28
3  无担保放款的循环利用    <= 0.3     21
4  无担保放款的循环利用  <= 0.389      9
5  无担保放款的循环利用  <= 0.493      2
6  无担保放款的循环利用 <= 0.6791    -13
7  无担保放款的循环利用  <= 0.902    -31
8  无担保放款的循环利用   > 0.902    -49
9                  年龄     <= 35    -13
10                 年龄     <= 43     -9
11                 年龄     <= 49     -5
12                 年龄     <= 52     -3
13                 年龄     <= 55      0
14                 年龄     <= 59      6
15                 年龄     <= 63     11
16                 年龄     <= 67     19
17                 年龄      > 67     29
18      30-59天逾期次数      <= 0     17
19      30-59天逾期次数      <= 1    -29
20      30-59天逾期次数       > 1    -60
21             负债比率  <= 0.016     23
22             负债比率  <= 0.381      5
23             负债比率  <= 0.503     -3
24             负债比率  <= 0.666    -16
25             负债比率  <= 3.972    -32
26             负债比率   > 3.972     11
27               月收入   <= 3332     -3
28               月收入   <= 4838     -2
29               月收入   <= 6620      0
30               月收入   <= 9830      1
31               月收入    > 9830      3
32 开放式信贷和贷款数量      <= 2    -10
33 开放式信贷和贷款数量      <= 3     -2
34 开放式信贷和贷款数量      <= 5      0
35 开放式信贷和贷款数量      <= 7      3
36 开放式信贷和贷款数量      <= 8      5
37 开放式信贷和贷款数量     <= 13      2
38 开放式信贷和贷款数量      > 13     -1
39         90天逾期次数       <=0     12
40         90天逾期次数       <=1    -74
41 不动产贷款或额度数量      <= 0     -8
42 不动产贷款或额度数量      <= 1      9
43 不动产贷款或额度数量      <= 2      7
44 不动产贷款或额度数量       > 2    -10
45      60-89天逾期次数       <=0      7
46      60-89天逾期次数       <=1    -54
47             家属数量      <= 0      2
48             家属数量      <= 1     -1
49             家属数量      <= 2     -3
50             家属数量       > 2     -5

over

感谢阅读,欢迎交流!

你可能感兴趣的:(R语言,金融模型,金融,信用评分卡,R语言,金融比赛项目,WOE,逻辑回归)