Python Data Analysis in Practice [11]: Building a Credit Risk Scorecard Model with scorecardpy [source code link at the end]

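Contents

Scorecard model
1. Data preprocessing
  - scorecardpy's built-in data
  - Check the number of rows and columns
  - Inspect the data (sample() shows a more varied set of rows than head())
  - Missing-value ratio of each variable
  - Data info
  - Number of categories per variable
  - Descriptive statistics
  - Correlation between variables
2. Variable filtering
  - sc.var_filter()
  - Splitting the data
3. Variable binning
  - woebin()
  - woebin_plot()
  - Bin adjustment
4. WOE transformation
5. Building the model
6. Model evaluation
7. Score stability
  - Score mapping
  - Computing the base points
  - How the credit.amount points are computed
  - Computing the points for all bins
The scorecard model is complete
Source code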

Scorecard model

The scorecardpy library

GitHub: https://github.com/ShichenXie/scorecardpy

1. Data preprocessing

import scorecardpy as sc
import pandas as pd
import numpy as np

scorecardpy's built-in data

dat = sc.germancredit()

Check the number of rows and columns

dat.shape
(1000, 21)

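The data consists of 1000 rows and 21 columns.

Inspect the data. sample() draws random rows, so it shows a more varied picture of the data than head().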

dat.sample(5)

	status.of.existing.checking.account	duration.in.month	credit.history	purpose	credit.amount	savings.account.and.bonds	present.employment.since	installment.rate.in.percentage.of.disposable.income	personal.status.and.sex	other.debtors.or.guarantors	...	property	age.in.years	other.installment.plans	housing	number.of.existing.credits.at.this.bank	job	number.of.people.being.liable.to.provide.maintenance.for	telephone	foreign.worker	creditability
547	no checking account	24	existing credits paid back duly till now	radio/television	1552	... < 100 DM	4 <= ... < 7 years	3	male : single	none	...	car or other, not in attribute Savings account...	32	bank	own	1	skilled employee / official	2	none	yes	good
617	... < 0 DM	6	critical account/ other credits existing (not ...	car (new)	3676	... < 100 DM	1 <= ... < 4 years	1	male : single	none	...	real estate	37	none	rent	3	skilled employee / official	2	none	yes	good
186	0 <= ... < 200 DM	9	all credits at this bank paid back duly	car (used)	5129	... < 100 DM	... >= 7 years	2	female : divorced/separated/married	none	...	unknown / no property	74	bank	for free	1	management/ self-employed/ highly qualified em...	2	yes, registered under the customers name	yes	bad
776	no checking account	36	critical account/ other credits existing (not ...	car (new)	3535	... < 100 DM	4 <= ... < 7 years	4	male : single	none	...	car or other, not in attribute Savings account...	37	none	own	2	skilled employee / official	1	yes, registered under the customers name	yes	good
243	no checking account	12	critical account/ other credits existing (not ...	business	1185	... < 100 DM	1 <= ... < 4 years	3	female : divorced/separated/married	none	...	real estate	27	none	own	2	skilled employee / official	1	none	yes	good

5 rows × 21 columns

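The value none appears in the data. It is treated here as a missing value and replaced with np.nan, which makes it easy to compute the missing ratio of each variable.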

dat = dat.replace('none',np.nan)

Missing-value ratio of each variable

(dat.isnull().sum()/dat.shape[0]).map(lambda x:"{:.2%}".format(x))

status.of.existing.checking.account                          0.00%
duration.in.month                                            0.00%
credit.history                                               0.00%
purpose                                                      0.00%
credit.amount                                                0.00%
savings.account.and.bonds                                    0.00%
present.employment.since                                     0.00%
installment.rate.in.percentage.of.disposable.income          0.00%
personal.status.and.sex                                      0.00%
other.debtors.or.guarantors                                 90.70%
present.residence.since                                      0.00%
property                                                     0.00%
age.in.years                                                 0.00%
other.installment.plans                                     81.40%
housing                                                      0.00%
number.of.existing.credits.at.this.bank                      0.00%
job                                                          0.00%
number.of.people.being.liable.to.provide.maintenance.for     0.00%
telephone                                                   59.60%
foreign.worker                                               0.00%
creditability                                                0.00%
dtype: object

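other.debtors.or.guarantors: more than 90% of this column is missing, so it can be dropped.

other.installment.plans: this column also has a high missing ratio and only two categories, so it can be dropped as well.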

dat["other.installment.plans"].value_counts()

bank      139
stores     47
Name: other.installment.plans, dtype: int64

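telephone carries little information for modelling, much like a name. Whether a telephone number was provided at all may be worth considering, but that is not discussed here, so the column is dropped too.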

dat = dat.drop(columns=["other.debtors.or.guarantors","other.installment.plans","telephone"])

Data info

dat.info()


RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                                                    Non-Null Count  Dtype   
---  ------                                                    --------------  -----   
 0   status.of.existing.checking.account                       1000 non-null   category
 1   duration.in.month                                         1000 non-null   int64   
 2   credit.history                                            1000 non-null   category
 3   purpose                                                   1000 non-null   object  
 4   credit.amount                                             1000 non-null   int64   
 5   savings.account.and.bonds                                 1000 non-null   category
 6   present.employment.since                                  1000 non-null   category
 7   installment.rate.in.percentage.of.disposable.income       1000 non-null   int64   
 8   personal.status.and.sex                                   1000 non-null   category
 9   present.residence.since                                   1000 non-null   int64   
 10  property                                                  1000 non-null   category
 11  age.in.years                                              1000 non-null   int64   
 12  housing                                                   1000 non-null   category
 13  number.of.existing.credits.at.this.bank                   1000 non-null   int64   
 14  job                                                       1000 non-null   category
 15  number.of.people.being.liable.to.provide.maintenance.for  1000 non-null   int64   
 16  foreign.worker                                            1000 non-null   category
 17  creditability                                             1000 non-null   object  
dtypes: category(9), int64(7), object(2)
memory usage: 80.8+ KB

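The data consists of int64, category, and object columns. The category dtype behaves differently from ordinary columns in pandas, so it is converted to object (strings) here.

Number of categories per variable

# Convert category columns to object along the way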
for c in dat.columns:
    if str(dat[c].dtype) == "category":
        dat[c] = dat[c].astype(str)
    print(c,"：",len(dat[c].unique()))

status.of.existing.checking.account ： 4
duration.in.month ： 33
credit.history ： 5
purpose ： 10
credit.amount ： 921
savings.account.and.bonds ： 5
present.employment.since ： 5
installment.rate.in.percentage.of.disposable.income ： 4
personal.status.and.sex ： 3
present.residence.since ： 4
property ： 4
age.in.years ： 53
housing ： 3
number.of.existing.credits.at.this.bank ： 4
job ： 4
number.of.people.being.liable.to.provide.maintenance.for ： 2
foreign.worker ： 2
creditability ： 2

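credit.amount has 921 distinct values and age.in.years has 53.
Variables with many distinct values need to be merged into bins; variables with few values are handled case by case.

Descriptive statistics

Check the mean, maximum, minimum, and quantiles of each variable.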

dat.describe()

	duration.in.month	credit.amount	installment.rate.in.percentage.of.disposable.income	present.residence.since	age.in.years	number.of.existing.credits.at.this.bank	number.of.people.being.liable.to.provide.maintenance.for
count	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000	1000.000000
mean	20.903000	3271.258000	2.973000	2.845000	35.546000	1.407000	1.155000
std	12.058814	2822.736876	1.118715	1.103718	11.375469	0.577654	0.362086
min	4.000000	250.000000	1.000000	1.000000	19.000000	1.000000	1.000000
25%	12.000000	1365.500000	2.000000	2.000000	27.000000	1.000000	1.000000
50%	18.000000	2319.500000	3.000000	3.000000	33.000000	1.000000	1.000000
75%	24.000000	3972.250000	4.000000	4.000000	42.000000	2.000000	1.000000
max	72.000000	18424.000000	4.000000	4.000000	75.000000	4.000000	2.000000

Correlation between variables

dat.corr()

	duration.in.month	credit.amount	installment.rate.in.percentage.of.disposable.income	present.residence.since	age.in.years	number.of.existing.credits.at.this.bank	number.of.people.being.liable.to.provide.maintenance.for
duration.in.month	1.000000	0.624984	0.074749	0.034067	-0.036136	-0.011284	-0.023834
credit.amount	0.624984	1.000000	-0.271316	0.028926	0.032716	0.020795	0.017142
installment.rate.in.percentage.of.disposable.income	0.074749	-0.271316	1.000000	0.049302	0.058266	0.021669	-0.071207
present.residence.since	0.034067	0.028926	0.049302	1.000000	0.266419	0.089625	0.042643
age.in.years	-0.036136	0.032716	0.058266	0.266419	1.000000	0.149254	0.118201
number.of.existing.credits.at.this.bank	-0.011284	0.020795	0.021669	0.089625	0.149254	1.000000	0.109667
number.of.people.being.liable.to.provide.maintenance.for	-0.023834	0.017142	-0.071207	0.042643	0.118201	0.109667	1.000000

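The correlation between credit.amount and duration.in.month is 0.624984. Depending on the business context, only one of a pair of highly correlated variables needs to be kept. (The output covers the numeric columns only; on pandas 2.0 or later, call dat.corr(numeric_only=True), because non-numeric columns are no longer dropped automatically.)

2. Variable filtering

Reference: https://zhuanlan.zhihu.com/p/80134853

Scorecard modelling commonly uses WOE and IV to filter variables, usually keeping variables with IV > 0.02. The larger the IV, the stronger the variable's predictive power for y, and the more it deserves a place in the model.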

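WOE (Weight of Evidence): how strongly one bin of a variable affects y.

Formula:
$WOE_i=ln(\frac{R_{0i}}{R_{0T}})-ln(\frac{R_{1i}}{R_{1T}})$
$R_{0i}$: number of samples with y=0 in the i-th bin of the variable.
$R_{0T}$: total number of samples with y=0.
$R_{1i}$: number of samples with y=1 in the i-th bin of the variable.
$R_{1T}$: total number of samples with y=1.
Example:
Split age.in.years into the four bins [-inf,26.0), [26.0,35.0), [35.0,40.0), [40.0,inf), count the y=0 (good) and y=1 (bad) samples in each bin, and compute the WOE.
For instance, the WOE of age.in.years in the bin [26,35):
$WOE_i=ln(\frac{R_{0i}}{R_{0T}})-ln(\frac{R_{1i}}{R_{1T}})=ln(\frac{246}{700})-ln(\frac{112}{300})=-0.060465$
The WOE values of the other bins are computed in the same way.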
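The calculation above can be reproduced in a few lines. This is a minimal sketch that applies the article's WOE formula (good share over bad share) to the good/bad counts of the four age.in.years bins listed in the binning table of this article; the names age_bins and woe_table are made up for the illustration and are not part of scorecardpy.

```python
import math

# (good, bad) counts per age.in.years bin, taken from the binning table in this article
age_bins = {
    "[-inf,26.0)": (110, 80),
    "[26.0,35.0)": (246, 112),
    "[35.0,40.0)": (123, 30),
    "[40.0,inf)": (221, 78),
}

def woe_table(bins):
    """WOE_i = ln(good_i / good_total) - ln(bad_i / bad_total) for every bin."""
    good_total = sum(g for g, _ in bins.values())
    bad_total = sum(b for _, b in bins.values())
    return {
        name: math.log(g / good_total) - math.log(b / bad_total)
        for name, (g, b) in bins.items()
    }

woe = woe_table(age_bins)
for name, value in woe.items():
    print(f"{name:>12}  {value: .6f}")
```

With this sign convention a bin with a higher share of bad customers gets a negative WOE.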

IV (Information Value): the value of the information a variable carries.

Formula:
$IV=\sum_{i=1}^n(\frac{R_{0i}}{R_{0T}}-\frac{R_{1i}}{R_{1T}})*WOE_i$
Example (each $WOE_i$ follows the WOE formula above, i.e. good share over bad share):
$IV=\sum_{i=1}^n(\frac{R_{0i}}{R_{0T}}-\frac{R_{1i}}{R_{1T}})*WOE_i\\ =(\frac{110}{700}-\frac{80}{300})*(-0.528844)\\ +(\frac{246}{700}-\frac{112}{300})*(-0.060465)\\ +(\frac{123}{700}-\frac{30}{300})*0.563689\\ +(\frac{221}{700}-\frac{78}{300})*0.194156\\ =0.112742$
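As a cross-check of the IV example, here is a small sketch that sums the per-bin terms for age.in.years from the good/bad counts. The function name information_value is an assumption made for this illustration; scorecardpy computes the same quantity internally and reports it as total_iv.

```python
import math

# (good, bad) counts of the four age.in.years bins
age_bins = [(110, 80), (246, 112), (123, 30), (221, 78)]

def information_value(bins):
    """IV = sum over bins of (good_share - bad_share) * ln(good_share / bad_share)."""
    good_total = sum(g for g, _ in bins)
    bad_total = sum(b for _, b in bins)
    iv = 0.0
    for g, b in bins:
        good_share = g / good_total
        bad_share = b / bad_total
        woe = math.log(good_share / bad_share)
        iv += (good_share - bad_share) * woe
    return iv

iv = information_value(age_bins)
print(round(iv, 6))
```

The share difference and the WOE always have the same sign, so every term is non-negative. Flipping the WOE convention (bad over good) flips both factors and leaves the IV unchanged.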

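The formulas look complicated, but the underlying ideas are not difficult. All of this is already implemented in scorecardpy, so in practice you only need to call the functions with the right arguments.

WOE of age.in.years as computed by scorecardpy. Note that scorecardpy reports WOE as ln(bad share) - ln(good share), so the signs in its woe column are the opposite of the formula above; the IV is the same either way: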

# bins_adj_df[bins_adj_df.variable=="age.in.years"]

	level_1	variable	bin	count	count_distr	good	bad	badprob	woe	bin_iv	total_iv	breaks	is_special_values
4	0	age.in.years	[-inf,26.0)	190	0.190	110	80	0.421053	0.528844	0.057921	0.112742	26.0	False
5	1	age.in.years	[26.0,35.0)	358	0.358	246	112	0.312849	0.060465	0.001324	0.112742	35.0	False
6	2	age.in.years	[35.0,40.0)	153	0.153	123	30	0.196078	-0.563689	0.042679	0.112742	40.0	False
7	3	age.in.years	[40.0,inf)	299	0.299	221	78	0.260870	-0.194156	0.010817	0.112742	inf	False

sc.var_filter()

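dt: the input DataFrame
y: name of the y variable
iv_limit: variables with an IV below this value are removed (default 0.02)
missing_limit: variables with a missing rate above this value are removed (default 0.95)
identical_limit: variables whose most frequent value exceeds this share are removed (default 0.95)
positive: label of the bad samples
var_rm: names of variables to force-remove
var_kp: names of variables to force-keep
return_rm_reason: whether to return the reason each variable was removed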

dt_s = sc.var_filter(dat,y="creditability",iv_limit=0.02)

dat.shape

(1000, 18)

dt_s.shape

(1000, 13)

var_filter() reduces the data from 18 columns to 13.

Splitting the data

sc.split_df(dt, y=None, ratio=0.7, seed=186)

train,test = sc.split_df(dt=dt_s,y="creditability").values()

Distribution of y in the training data:

train.creditability.value_counts()

0    490
1    210
Name: creditability, dtype: int64

Distribution of y in the test data:

test.creditability.value_counts()

0    210
1     90
Name: creditability, dtype: int64

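3. Variable binning

Common binning methods include chi-square binning and decision-tree binning, among others. Chi-square binning is briefly introduced here.

Why bin?
After binning, small fluctuations in a variable do not affect the stability of the model. Take income as an example: 10000 and 11000 probably affect y in the same way, so putting them in the same bin is a reasonable choice.

Binning requirements

Ideally 5 to 7 bins per variable
Ordered, monotonic, and balanced

Chi-square binning:

Reference: https://zhuanlan.zhihu.com/p/115267395

The idea behind chi-square binning is to measure the difference between expected and observed counts and to judge how likely it is that the difference is due to chance alone.
Chi-square statistic:
$\chi^2=\sum_{i=1}^n\sum_{c=1}^m\frac{(A_{ic}-E_{ic})^2}{E_{ic}}$
$n$: number of bins being compared.
$m$: number of classes of y, usually 2.
$A_{ic}$: observed number of samples of class c in bin i.

$E_{ic}$: expected number of samples of class c in bin i, $E_{ic}=\frac{T_i*T_c}{T}$, where $T_i$ is the total of bin i, $T_c$ is the total of class c, and $T$ is the total number of samples.

Steps (numeric variables):
1. Deduplicate and sort the values to obtain the initial bins A1, A2, A3, and so on, and count the samples in each bin.
2. Compute the chi-square value of every pair of adjacent bins (A1 and A2, A2 and A3, and so on).
3. If the smallest chi-square value is below the threshold (derived from the degrees of freedom and the confidence level), merge that pair of adjacent bins into one new bin.
4. Repeat steps 2 and 3 until one of the following stopping conditions holds:
   - the smallest chi-square value is greater than the threshold;
   - the number of bins has reached the specified number.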
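The merging loop can be sketched in plain Python. This is a simplified illustration, not scorecardpy's implementation: each bin is represented only by its (good, bad) counts, and the loop stops on the bin-count condition alone (the threshold condition is left out). The names chi2_pair and chimerge are hypothetical.

```python
def chi2_pair(a, b):
    """Chi-square statistic of two adjacent bins, each given as (good, bad) counts."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for c in range(2):
            # E_ic = T_i * T_c / T
            expected = sum(row) * (a[c] + b[c]) / total
            if expected > 0:
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2

def chimerge(bins, max_bins):
    """Repeatedly merge the adjacent pair with the smallest chi-square value."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins

# Three toy bins: the first two have similar good/bad mixes, so they are merged first
print(chimerge([(10, 0), (9, 1), (0, 10)], max_bins=2))
```

Two bins with a similar good/bad mix produce a small chi-square value, which is why they are the first candidates for merging.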

woebin()

scorecardpy uses decision-tree binning by default (method='tree').
Chi-square binning is used here (method='chimerge').
woebin() returns a dict; use pandas.concat() to view all bins together.

bins = sc.woebin(dt_s,y="creditability",method="chimerge")

bins["installment.rate.in.percentage.of.disposable.income"]

	variable	bin	count	count_distr	good	bad	badprob	woe	bin_iv	total_iv	breaks	is_special_values
0	installment.rate.in.percentage.of.disposable.i...	[-inf,3.0)	367	0.367	271	96	0.261580	-0.190473	0.012789	0.019769	3.0	False
1	installment.rate.in.percentage.of.disposable.i...	[3.0,inf)	633	0.633	429	204	0.322275	0.103961	0.006980	0.019769	inf	False

bins_df = pd.concat(bins).reset_index().drop(columns="level_0")

bins_df

	level_1	variable	bin	count	count_distr	good	bad	badprob	woe	bin_iv	total_iv	breaks	is_special_values
0	0	credit.amount	[-inf,1400.0)	267	0.267	185	82	0.307116	0.033661	0.000305	0.171431	1400.0	False
1	1	credit.amount	[1400.0,1800.0)	105	0.105	87	18	0.171429	-0.728239	0.046815	0.171431	1800.0	False
2	2	credit.amount	[1800.0,2000.0)	60	0.060	39	21	0.350000	0.228259	0.003261	0.171431	2000.0	False
3	3	credit.amount	[2000.0,4000.0)	322	0.322	248	74	0.229814	-0.362066	0.038965	0.171431	4000.0	False
4	4	credit.amount	[4000.0,inf)	246	0.246	141	105	0.426829	0.552498	0.082085	0.171431	inf	False
5	0	age.in.years	[-inf,26.0)	190	0.190	110	80	0.421053	0.528844	0.057921	0.123935	26.0	False
6	1	age.in.years	[26.0,35.0)	358	0.358	246	112	0.312849	0.060465	0.001324	0.123935	35.0	False
7	2	age.in.years	[35.0,37.0)	79	0.079	67	12	0.151899	-0.872488	0.048610	0.123935	37.0	False
8	3	age.in.years	[37.0,inf)	373	0.373	277	96	0.257373	-0.212371	0.016080	0.123935	inf	False
9	0	housing	own	713	0.713	527	186	0.260870	-0.194156	0.025795	0.082951	own	False
10	1	housing	rent%,%for free	287	0.287	173	114	0.397213	0.430205	0.057156	0.082951	rent%,%for free	False
11	0	property	real estate	282	0.282	222	60	0.212766	-0.461035	0.054007	0.112634	real estate	False
12	1	property	building society savings agreement/ life insur...	564	0.564	391	173	0.306738	0.031882	0.000577	0.112634	building society savings agreement/ life insur...	False
13	2	property	unknown / no property	154	0.154	87	67	0.435065	0.586082	0.058050	0.112634	unknown / no property	False
14	0	duration.in.month	[-inf,8.0)	87	0.087	78	9	0.103448	-1.312186	0.106849	0.282618	8.0	False
15	1	duration.in.month	[8.0,16.0)	344	0.344	264	80	0.232558	-0.346625	0.038294	0.282618	16.0	False
16	2	duration.in.month	[16.0,34.0)	399	0.399	270	129	0.323308	0.108688	0.004813	0.282618	34.0	False
17	3	duration.in.month	[34.0,44.0)	100	0.100	58	42	0.420000	0.524524	0.029973	0.282618	44.0	False
18	4	duration.in.month	[44.0,inf)	70	0.070	30	40	0.571429	1.134980	0.102689	0.282618	inf	False
19	0	status.of.existing.checking.account	no checking account	394	0.394	348	46	0.116751	-1.176263	0.404410	0.666012	no checking account	False
20	1	status.of.existing.checking.account	... >= 200 DM / salary assignments for at leas...	63	0.063	49	14	0.222222	-0.405465	0.009461	0.666012	... >= 200 DM / salary assignments for at leas...	False
21	2	status.of.existing.checking.account	0 <= ... < 200 DM	269	0.269	164	105	0.390335	0.401392	0.046447	0.666012	0 <= ... < 200 DM	False
22	3	status.of.existing.checking.account	... < 0 DM	274	0.274	139	135	0.492701	0.818099	0.205693	0.666012	... < 0 DM	False
23	0	installment.rate.in.percentage.of.disposable.i...	[-inf,3.0)	367	0.367	271	96	0.261580	-0.190473	0.012789	0.019769	3.0	False
24	1	installment.rate.in.percentage.of.disposable.i...	[3.0,inf)	633	0.633	429	204	0.322275	0.103961	0.006980	0.019769	inf	False
25	0	savings.account.and.bonds	... >= 1000 DM%,%500 <= ... < 1000 DM%,%unknow...	294	0.294	245	49	0.166667	-0.762140	0.142266	0.189391	... >= 1000 DM%,%500 <= ... < 1000 DM%,%unknow...	False
26	1	savings.account.and.bonds	100 <= ... < 500 DM%,%... < 100 DM	706	0.706	455	251	0.355524	0.252453	0.047125	0.189391	100 <= ... < 500 DM%,%... < 100 DM	False
27	0	present.employment.since	4 <= ... < 7 years%,%... >= 7 years	427	0.427	324	103	0.241218	-0.298717	0.035704	0.082865	4 <= ... < 7 years%,%... >= 7 years	False
28	1	present.employment.since	1 <= ... < 4 years	339	0.339	235	104	0.306785	0.032103	0.000352	0.082865	1 <= ... < 4 years	False
29	2	present.employment.since	unemployed%,%... < 1 year	234	0.234	141	93	0.397436	0.431137	0.046809	0.082865	unemployed%,%... < 1 year	False
30	0	personal.status.and.sex	male : single%,%male : married/widowed	640	0.640	469	171	0.267188	-0.161641	0.016164	0.042633	male : single%,%male : married/widowed	False
31	1	personal.status.and.sex	female : divorced/separated/married	360	0.360	231	129	0.358333	0.264693	0.026469	0.042633	female : divorced/separated/married	False
32	0	credit.history	critical account/ other credits existing (not ...	293	0.293	243	50	0.170648	-0.733741	0.132423	0.291829	critical account/ other credits existing (not ...	False
33	1	credit.history	delay in paying off in the past%,%existing cre...	618	0.618	421	197	0.318770	0.087869	0.004854	0.291829	delay in paying off in the past%,%existing cre...	False
34	2	credit.history	all credits at this bank paid back duly%,%no c...	89	0.089	36	53	0.595506	1.234071	0.154553	0.291829	all credits at this bank paid back duly%,%no c...	False
35	0	purpose	retraining%,%car (used)%,%radio/television	392	0.392	312	80	0.204082	-0.513679	0.091973	0.142092	retraining%,%car (used)%,%radio/television	False
36	1	purpose	furniture/equipment%,%domestic appliances%,%bu...	608	0.608	388	220	0.361842	0.279920	0.050119	0.142092	furniture/equipment%,%domestic appliances%,%bu...	False

woebin_plot()

Plot the distribution of a variable's bins.

bins["age.in.years"]

	variable	bin	count	count_distr	good	bad	badprob	woe	bin_iv	total_iv	breaks	is_special_values
0	age.in.years	[-inf,26.0)	190	0.190	110	80	0.421053	0.528844	0.057921	0.123935	26.0	False
1	age.in.years	[26.0,35.0)	358	0.358	246	112	0.312849	0.060465	0.001324	0.123935	35.0	False
2	age.in.years	[35.0,37.0)	79	0.079	67	12	0.151899	-0.872488	0.048610	0.123935	37.0	False
3	age.in.years	[37.0,inf)	373	0.373	277	96	0.257373	-0.212371	0.016080	0.123935	inf	False

sc.woebin_plot(bins["age.in.years"])

sc.woebin_plot(bins["credit.amount"])

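The plots show that the bad probability of age.in.years and of credit.amount is not monotonic across bins, so the bins need to be adjusted.

Bin adjustment

scorecardpy supports both custom breaks and automatic binning.
Adjusting the breaks by hand, based on business knowledge and practical experience, usually works better.

# Adjustment via woebin_adj()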
# break_adj = sc.woebin_adj(dt_s,y="creditability",bins=bins)

bins["credit.amount"]

	variable	bin	count	count_distr	good	bad	badprob	woe	bin_iv	total_iv	breaks	is_special_values
0	credit.amount	[-inf,1400.0)	267	0.267	185	82	0.307116	0.033661	0.000305	0.171431	1400.0	False
1	credit.amount	[1400.0,1800.0)	105	0.105	87	18	0.171429	-0.728239	0.046815	0.171431	1800.0	False
2	credit.amount	[1800.0,2000.0)	60	0.060	39	21	0.350000	0.228259	0.003261	0.171431	2000.0	False
3	credit.amount	[2000.0,4000.0)	322	0.322	248	74	0.229814	-0.362066	0.038965	0.171431	4000.0	False
4	credit.amount	[4000.0,inf)	246	0.246	141	105	0.426829	0.552498	0.082085	0.171431	inf	False

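Splitting age into [-inf,26.0), [26.0,35.0), [35.0,40.0), [40.0,inf) roughly satisfies monotonicity. Splitting the credit amount into [-inf,1400.0), [1400.0,1900.0), [1900.0,4000.0), [4000.0,inf) roughly satisfies monotonicity as well.

# Manual binning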
break_adj = {
    'age.in.years':[26,35,40],
    'credit.amount':[1400,1900,4000]
}
bins_adj = sc.woebin(dt_s,y="creditability",breaks_list=break_adj)

bins_adj_df = pd.concat(bins_adj).reset_index().drop(columns="level_0")

bins_adj_df[bins_adj_df.variable.isin(["age.in.years",'credit.amount'])]

	level_1	variable	bin	count	count_distr	good	bad	badprob	woe	bin_iv	total_iv	breaks	is_special_values
0	0	credit.amount	[-inf,1400.0)	267	0.267	185	82	0.307116	0.033661	0.000305	0.141144	1400.0	False
1	1	credit.amount	[1400.0,1900.0)	131	0.131	104	27	0.206107	-0.501256	0.029359	0.141144	1900.0	False
2	2	credit.amount	[1900.0,4000.0)	356	0.356	270	86	0.241573	-0.296777	0.029395	0.141144	4000.0	False
3	3	credit.amount	[4000.0,inf)	246	0.246	141	105	0.426829	0.552498	0.082085	0.141144	inf	False
4	0	age.in.years	[-inf,26.0)	190	0.190	110	80	0.421053	0.528844	0.057921	0.112742	26.0	False
5	1	age.in.years	[26.0,35.0)	358	0.358	246	112	0.312849	0.060465	0.001324	0.112742	35.0	False
6	2	age.in.years	[35.0,40.0)	153	0.153	123	30	0.196078	-0.563689	0.042679	0.112742	40.0	False
7	3	age.in.years	[40.0,inf)	299	0.299	221	78	0.260870	-0.194156	0.010817	0.112742	inf	False

sc.woebin_plot(bins_adj["age.in.years"])

sc.woebin_plot(bins_adj['credit.amount'])

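4. WOE transformation

Replace every raw value with the WOE of the bin it falls into. The transformation is optional, but after it:

Values within a variable can be compared with each other
Different variables can be compared with each other
All variables are on the same scale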
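Conceptually the transformation is a lookup from a raw value to the WOE of its bin. The sketch below illustrates this for age.in.years with the adjusted breaks [26, 35, 40] and the WOE values from the adjusted binning table; the helper to_woe is hypothetical, and sc.woebin_ply() does the real work for all variables at once.

```python
from bisect import bisect_right

# Breaks and WOE values of age.in.years, copied from the adjusted binning table
AGE_BREAKS = [26, 35, 40]
AGE_WOE = [0.528844, 0.060465, -0.563689, -0.194156]

def to_woe(value, breaks, woe_values):
    """Return the WOE of the left-closed bin [lower, upper) that contains value."""
    return woe_values[bisect_right(breaks, value)]

ages = [22, 26, 39, 40, 67]
print([to_woe(a, AGE_BREAKS, AGE_WOE) for a in ages])
```

bisect_right is used because the bins are closed on the left: an age of exactly 26 belongs to [26.0,35.0), not to [-inf,26.0).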

train_woe = sc.woebin_ply(train,bins_adj)

test_woe = sc.woebin_ply(test,bins_adj)

train_woe.sample(5)

	creditability	credit.amount_woe	age.in.years_woe	housing_woe	property_woe	duration.in.month_woe	status.of.existing.checking.account_woe	installment.rate.in.percentage.of.disposable.income_woe	savings.account.and.bonds_woe	present.employment.since_woe	personal.status.and.sex_woe	credit.history_woe	purpose_woe
723	0	0.033661	-0.194156	-0.194156	-0.461035	-0.346625	0.614204	0.103961	-0.762140	0.032103	0.264693	0.088319	-0.410063
331	1	-0.501256	0.060465	-0.194156	-0.461035	0.108688	-1.176263	0.103961	0.139552	0.032103	0.264693	-0.733741	0.279920
690	0	0.033661	0.528844	-0.194156	0.028573	-0.346625	0.614204	-0.155466	0.271358	0.032103	0.264693	-0.733741	0.279920
537	0	-0.296777	-0.563689	-0.194156	0.028573	0.108688	0.614204	0.103961	0.271358	-0.235566	0.264693	-0.733741	0.279920
0	0	0.033661	-0.194156	-0.194156	-0.461035	-1.312186	0.614204	0.103961	-0.762140	-0.235566	-0.165548	-0.733741	-0.410063

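5. Building the model

The model is a logistic regression. Its theory is fairly involved and is not covered in detail here.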

from sklearn.linear_model import LogisticRegression

y_train = train_woe.loc[:,"creditability"]
X_train = train_woe.loc[:,train_woe.columns!="creditability"]

y_test = test_woe.loc[:,"creditability"]
X_test = test_woe.loc[:,test_woe.columns!="creditability"]

lr = LogisticRegression(penalty='l1',C=0.9,solver='saga',n_jobs=-1)
lr.fit(X_train,y_train)

LogisticRegression(C=0.9, n_jobs=-1, penalty='l1', solver='saga')

lr.coef_

array([[0.77881419, 0.6892819 , 0.36660545, 0.37598509, 0.59990642,
        0.75916199, 1.68181704, 0.50153176, 0.23641609, 0.70438936,
        0.63125597, 0.99437898]])

lr.intercept_

array([-0.82463787])

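6. Model evaluation

Logistic regression outputs, for each sample, the probability that it belongs to label 1.
A value of 0.6 means the sample belongs to label 1 with probability 0.6. How large must the probability be before a sample is classified as 1? That requires a threshold, which can be chosen using the KS value: samples above the threshold are labelled 1, samples below it are labelled 0.

TPR and FPR:
$TPR=\frac{\text{number of samples predicted 1 with true value 1}}{\text{number of samples with true value 1}}$
$FPR=\frac{\text{number of samples predicted 1 with true value 0}}{\text{number of samples with true value 0}}$
Steps for drawing the ROC curve:

Deduplicate and sort the predicted y_score values to obtain a series of thresholds.
Use each y_score as the threshold, count the samples, and compute TPR and FPR.
With the resulting set of points, plot FPR on the x-axis and TPR on the y-axis.

AUC:

The area between the ROC curve and the x-axis.

KS curve: used to determine the best threshold
$KS = max(TPR - FPR)$

The x-axis is the sequence of thresholds (the bin index works too), and TPR and FPR are plotted on the same axes.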
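To make the KS definition concrete, here is a small pure-Python sketch that scans every distinct score as a threshold, using the standard definitions TPR = TP / (number of actual 1s) and FPR = FP / (number of actual 0s). The function name ks_statistic and the toy labels and scores are assumptions for illustration, and the sketch assumes both classes are present; sc.perf_eva() computes and plots KS for the real model.

```python
def ks_statistic(y_true, y_score):
    """Largest gap between TPR and FPR over all score thresholds."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = 0.0
    for threshold in sorted(set(y_score), reverse=True):
        # A sample is predicted as 1 when its score reaches the threshold
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        best = max(best, tp / positives - fp / negatives)
    return best

# Toy data: 1 = bad, 0 = good
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(ks_statistic(labels, scores))
```

The threshold at which the gap is largest is the natural cut-off candidate, since it separates the two classes best.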

train_pred = lr.predict_proba(X_train)[:,1]
test_pred =  lr.predict_proba(X_test)[:,1]

train_perf = sc.perf_eva(y_train,train_pred,title="train")

test_perf = sc.perf_eva(y_test,test_pred,title="test")

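7. Score stability

PSI (Population Stability Index)

PSI compares the actual distribution (A), obtained on the training data, with the expected distribution (E), obtained on the test set.
$PSI=\sum_{i=1}^n(A_i-E_i)*ln(\frac{A_i}{E_i})$
$A_i$: share of the actual distribution that falls in the i-th bin.

$E_i$: share of the expected distribution that falls in the i-th bin.

The smaller the PSI, the more stable the model. A PSI below 0.1 is usually taken to mean good stability.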
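The PSI formula translates directly into code. This sketch assumes the inputs are already per-bin shares (each list sums to 1) with no empty bins, since a zero share would break the logarithm; the function name psi and the example shares are made up for the illustration.

```python
import math

def psi(actual, expected):
    """PSI = sum((A_i - E_i) * ln(A_i / E_i)); inputs are per-bin shares."""
    if len(actual) != len(expected):
        raise ValueError("both distributions need the same number of bins")
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

# Hypothetical share of customers per score band in train and test
train_share = [0.10, 0.20, 0.40, 0.20, 0.10]
test_share = [0.12, 0.18, 0.38, 0.22, 0.10]
print(psi(train_share, test_share))
```

Identical distributions give a PSI of exactly 0, and every term is non-negative, so the value only grows as the two distributions drift apart.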

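# scorecard_ply() needs the scorecard, so build it first with sc.scorecard()
card = sc.scorecard(bins_adj,lr,X_train.columns)
train_score = sc.scorecard_ply(train, card, print_step=0)
test_score = sc.scorecard_ply(test, card, print_step=0)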

sc.perf_psi(
    score = {'train':train_score,'test':test_score},
    label = {'train':y_train,'test':y_test}
)

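Score mapping

Reference: https://github.com/xsj0609/data_science/tree/master/ScoreCard

Logistic regression result:
$f(x)=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_nx_n$
Score formula:
$Score=A-B*log(\frac{p}{1-p})$, where p is the customer's default probability.
Two conditions must be fixed before scores can be computed:
1. The score $P_0$ that corresponds to a given odds $\theta_0$. scorecardpy defaults: $\theta_0=\frac{1}{19}$, $P_0=600$.
2. The change in score, PDO, when the odds double. scorecardpy default: $PDO=50$.
From these it follows that:
$B=\frac{PDO}{log(2)}$, $A=P_0+B*log(\theta_0)$, $log(\frac{p}{1-p})=f(x)$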
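The derivation can be sketched as follows, assuming that $\theta_0$ denotes the odds $\frac{p}{1-p}$ at the reference score and that the score drops by PDO when the odds double:
$P_0=A-B*log(\theta_0)$
$P_0-PDO=A-B*log(2\theta_0)$
Subtracting the second equation from the first gives $PDO=B*log(2)$, hence $B=\frac{PDO}{log(2)}$. Substituting B back into the first equation gives $A=P_0+B*log(\theta_0)$.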
Worked example:

Computing the base points:

import math

B = 50/math.log(2)
A = 600+B*math.log(1/19)
basepoints=A-B*lr.intercept_[0]
print("A:",A,"B:",B,"basepoints:",basepoints)

A: 387.6036243278207 B: 72.13475204444818 basepoints: 447.0886723193208

How the credit.amount points are computed

bins_adj_df[bins_adj_df["variable"]=="credit.amount"]

	level_1	variable	bin	count	count_distr	good	bad	badprob	woe	bin_iv	total_iv	breaks	is_special_values
0	0	credit.amount	[-inf,1400.0)	267	0.267	185	82	0.307116	0.033661	0.000305	0.141144	1400.0	False
1	1	credit.amount	[1400.0,1900.0)	131	0.131	104	27	0.206107	-0.501256	0.029359	0.141144	1900.0	False
2	2	credit.amount	[1900.0,4000.0)	356	0.356	270	86	0.241573	-0.296777	0.029395	0.141144	4000.0	False
3	3	credit.amount	[4000.0,inf)	246	0.246	141	105	0.426829	0.552498	0.082085	0.141144	inf	False

lr.coef_

array([[0.77881419, 0.6892819 , 0.36660545, 0.37598509, 0.59990642,
        0.75916199, 1.68181704, 0.50153176, 0.23641609, 0.70438936,
        0.63125597, 0.99437898]])

lr.intercept_

array([-0.82463787])

# Points for the [-inf,1400.0) bin; credit.amount is the first column, so its coefficient is 0.77881419
-B*0.77881419*0.033661

-1.8910604547516296

# [1400.0,1900.0)
-B*0.77881419*(-0.501256)

28.160345780190216
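The same calculation for all four credit.amount bins can be written as one expression. This sketch hard-codes the coefficient and WOE values printed above; rounding to the nearest integer is an assumption, chosen because it reproduces the credit.amount rows of the scorecard that sc.scorecard() prints in the next step.

```python
import math

PDO = 50
B = PDO / math.log(2)
COEF = 0.77881419  # lr.coef_ entry for credit.amount_woe, from the output above

# WOE per bin, from the adjusted binning table
credit_amount_woe = {
    "[-inf,1400.0)": 0.033661,
    "[1400.0,1900.0)": -0.501256,
    "[1900.0,4000.0)": -0.296777,
    "[4000.0,inf)": 0.552498,
}

# points = -B * coefficient * WOE, rounded to whole points
points = {name: round(-B * COEF * w) for name, w in credit_amount_woe.items()}
print(points)
```

Bins with a positive WOE (more bad customers than average) lose points, and bins with a negative WOE gain points.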

Computing the points for all bins:

card = sc.scorecard(bins_adj,lr,X_train.columns)

card_df = pd.concat(card)

card_df

		variable	bin	points
basepoints	0	basepoints	NaN	447.0
credit.amount	0	credit.amount	[-inf,1400.0)	-2.0
	1	credit.amount	[1400.0,1900.0)	28.0
	2	credit.amount	[1900.0,4000.0)	17.0
	3	credit.amount	[4000.0,inf)	-31.0
age.in.years	4	age.in.years	[-inf,26.0)	-26.0
	5	age.in.years	[26.0,35.0)	-3.0
	6	age.in.years	[35.0,40.0)	28.0
	7	age.in.years	[40.0,inf)	10.0
housing	8	housing	own	5.0
	9	housing	rent	-11.0
	10	housing	for free	-12.0
property	11	property	real estate	13.0
	12	property	building society savings agreement/ life insur...	-1.0
	13	property	car or other, not in attribute Savings account...	-1.0
	14	property	unknown / no property	-16.0
duration.in.month	15	duration.in.month	[-inf,8.0)	57.0
	16	duration.in.month	[8.0,16.0)	15.0
	17	duration.in.month	[16.0,34.0)	-5.0
	18	duration.in.month	[34.0,44.0)	-23.0
	19	duration.in.month	[44.0,inf)	-49.0
status.of.existing.checking.account	20	status.of.existing.checking.account	no checking account	64.0
	21	status.of.existing.checking.account	... >= 200 DM / salary assignments for at leas...	22.0
	22	status.of.existing.checking.account	0 <= ... < 200 DM%,%... < 0 DM	-34.0
installment.rate.in.percentage.of.disposable.income	23	installment.rate.in.percentage.of.disposable.i...	[-inf,2.0)	30.0
	24	installment.rate.in.percentage.of.disposable.i...	[2.0,3.0)	19.0
	25	installment.rate.in.percentage.of.disposable.i...	[3.0,inf)	-13.0
savings.account.and.bonds	26	savings.account.and.bonds	... >= 1000 DM%,%500 <= ... < 1000 DM%,%unknow...	28.0
	27	savings.account.and.bonds	100 <= ... < 500 DM	-5.0
	28	savings.account.and.bonds	... < 100 DM	-10.0
present.employment.since	29	present.employment.since	4 <= ... < 7 years	7.0
	30	present.employment.since	... >= 7 years	4.0
	31	present.employment.since	1 <= ... < 4 years	-1.0
	32	present.employment.since	unemployed%,%... < 1 year	-7.0
personal.status.and.sex	33	personal.status.and.sex	male : single	8.0
	34	personal.status.and.sex	male : married/widowed	7.0
	35	personal.status.and.sex	female : divorced/separated/married	-13.0
credit.history	36	credit.history	critical account/ other credits existing (not ...	33.0
	37	credit.history	delay in paying off in the past	-4.0
	38	credit.history	existing credits paid back duly till now	-4.0
	39	credit.history	all credits at this bank paid back duly%,%no c...	-56.0
purpose	40	purpose	retraining%,%car (used)	58.0
	41	purpose	radio/television	29.0
	42	purpose	furniture/equipment%,%domestic appliances%,%bu...	-20.0

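The points for every bin of every variable are now available. To score a customer, map the customer's data to the matching bins and add up the points, together with the base points.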
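As an illustration of this last step, the sketch below adds up the points for a hypothetical applicant using only two variables from the card above (credit.amount and age.in.years), so the result is a partial score, not the full one. The names CARD and partial_score are made up; sc.scorecard_ply() performs the complete calculation over all variables.

```python
from bisect import bisect_right

BASEPOINTS = 447
# variable -> (breaks, points per bin), copied from the card above
CARD = {
    "credit.amount": ([1400, 1900, 4000], [-2, 28, 17, -31]),
    "age.in.years": ([26, 35, 40], [-26, -3, 28, 10]),
}

def partial_score(customer):
    """Base points plus the points of the bin each value falls into."""
    score = BASEPOINTS
    for variable, (breaks, points) in CARD.items():
        score += points[bisect_right(breaks, customer[variable])]
    return score

customer = {"credit.amount": 1500, "age.in.years": 36}  # hypothetical applicant
print(partial_score(customer))  # 447 + 28 + 28
```

Categorical variables would need a dictionary lookup from category to points instead of the interval search used here.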

The scorecard model is now complete!

Source code

Link: https://pan.baidu.com/s/1DAI1hxWPHEb6-46erjDaKg?pwd=e4sw
Extraction code: e4sw

RNA-seq数据分析_未完成子诚之组学数据分析数据分析
目录基础分析1.质控（reads）2.比对3.质控（alignment）4.定量5.样本合并差异表达1.质控（cohort）2.差异分析3.可视化（差异）富集分析肿瘤免疫1.免疫组库2.免疫浸润3.免疫响应4.新抗原预测微生物组参考本文主要覆盖了肿瘤样本bulkRNA-seq数据常见的分析步骤，并从实践角度出发，较为具体地介绍了每一步骤依赖的工具和数据集。另外，尽管本文适用于肿瘤样本，但其中的一些
OpenCV（一个C++人工智能领域重要开源基础库）简介愚梦者 OpenCV 人工智能人工智能 opencv c++图像处理计算机视觉开源
返回：OpenCV系列文章目录（持续更新中......）上一篇：OpenCV4.9.0配置选项参考下一篇：OpenCV4.9.0开源计算机视觉库安装概述引言：OpenCV（全称OpenSourceComputerVisionLibrary）是一个基于开放源代码发行的跨平台计算机视觉库，可以用来进行图像处理、计算机视觉和机器学习等领域的开发。该库由英特尔公司于1999年开始开发，最初是为了加速处理器
Python自动化测试web常见框架汇总自动化测试薰儿软件测试技术分享 python 前端开发语言
1、前言目前，有非常多的Python框架，用来帮助你更轻松的创建web应用。这些框架把相应的模块组织起来，使得构建应用的时候可以更快捷，也不用去关注一些细节（例如socket和协议），所以需要的都在框架里了。接下来我们会介绍不同的选项。经过初期的不起眼，Python已经成为互联网最流行的服务端编程语言之一。根据W3Techs的统计，它被用于很多的大流量的站点很多的大流量的站点很多的大流量的站点，超
Java序列化进阶篇 g21121 java序列化
1.transient 类一旦实现了Serializable 接口即被声明为可序列化，然而某些情况下并不是所有的属性都需要序列化，想要人为的去阻止这些属性被序列化，就需要用到transient 关键字。