In summary, tree-based models are among the most widely used algorithms, and their high accuracy is in strong demand. If we can also address variable interpretability, that is, find an easy-to-understand functional relationship between the variables and the final predicted value, we can satisfy the need for model interpretability while largely preserving accuracy.
1. Practical basis
2. Theoretical basis
Building on both of these developments, together with the basics of score transformation, this article empirically verifies and studies the relationship between per-variable scores and the predicted score; for the detailed theory, readers can consult the papers and related materials.
Verifying the functional relationship of the values output under pred_contribs=True:
Collecting xgboost
Downloading https://files.pythonhosted.org/packages/6a/49/7e10686647f741bd9c8918b0decdb94135b542fe372ca1100739b8529503/xgboost-0.82-py2.py3-none-manylinux1_x86_64.whl (114.0MB)
100% |████████████████████████████████| 114.0MB 151kB/s
Requirement already satisfied: numpy in /opt/conda/lib/python3.6/site-packages (from xgboost) (1.13.3)
Requirement already satisfied: scipy in /opt/conda/lib/python3.6/site-packages (from xgboost) (1.1.0)
Installing collected packages: xgboost
Successfully installed xgboost-0.82
import pandas as pd
from sklearn.datasets import load_iris

# load the iris dataset and binarize the target: relabel class 2 as 0,
# turning the problem into class 1 vs. the rest (binary classification)
iris_df = load_iris()
iris_df.target[iris_df.target == 2] = 0
iris_data = pd.DataFrame(iris_df.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
iris_data['target'] = iris_df.target
import xgboost as xgb

# note: columns.difference() sorts the feature names alphabetically
dtrain = xgb.DMatrix(iris_data[iris_data.columns.difference(['target'])], label=iris_data.target)
# specify parameters via map; definitions are the same as in the C++ version
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic', 'seed': 0}
# specify a validation set to watch performance
# watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 20
bst = xgb.train(param, dtrain, num_round)
/opt/conda/lib/python3.6/site-packages/xgboost/core.py:587: FutureWarning: Series.base is deprecated and will be removed in a future version
if getattr(data, 'base', None) is not None and \
/opt/conda/lib/python3.6/site-packages/xgboost/core.py:588: FutureWarning: Series.base is deprecated and will be removed in a future version
data.base is not None and isinstance(data, np.ndarray) \
ypred = bst.predict(dtrain)
ypred[0:5]
array([ 0.00490831, 0.00490831, 0.00490831, 0.00490831, 0.00490831], dtype=float32)
ypred_contribs = bst.predict(dtrain, pred_contribs=True)
ypred_contribs[0:5]
array([[-3.0387814 , 0.65096694, -1.09195888, -0.33184546, -1.50028646],
[-3.0387814 , 0.65096694, -1.09195888, -0.33184546, -1.50028646],
[-3.0387814 , 0.65096694, -1.09195888, -0.33184546, -1.50028646],
[-3.0387814 , 0.65096694, -1.09195888, -0.33184546, -1.50028646],
[-3.0387814 , 0.65096694, -1.09195888, -0.33184546, -1.50028646]], dtype=float32)
score_a = sum(ypred_contribs[0])
print(score_a)
-5.31190526485
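As a cross-check (a minimal sketch, assuming the bst and dtrain objects from above), this row sum should equal the raw margin, i.e., the log-odds value XGBoost produces before the sigmoid:

# the row sum of pred_contribs equals the raw margin (log-odds)
# returned when output_margin=True
margin = bst.predict(dtrain, output_margin=True)[0]
print(margin)  # expected to be ≈ -5.3119, matching score_a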
Using the logis function to express the functional relationship between the pred_contribs values and the predicted probability:
import numpy as np

# standard logistic (sigmoid) function: maps a log-odds value to a probability
def logis(x):
    return 1 / (1 + np.exp(-x))
logis(score_a)
0.0049083096667698247
ypred[0]
0.0049083065
The experiment above shows that under pred_contribs=True, the output pred_contribs values can be interpreted as each feature's contribution to the final score, with the last column being the bias. Summing the row and applying the logistic transformation yields the predicted probability of class 1; the two results agree to at least six decimal places.
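As a minimal sketch (assuming the bst, dtrain, and logis objects defined above), the same additivity identity can be checked for every row at once:

ypred_contribs_all = bst.predict(dtrain, pred_contribs=True)
# row sums (feature columns + bias column) are the log-odds of each prediction
margins = ypred_contribs_all.sum(axis=1)
# the sigmoid of the row sum should reproduce bst.predict() for all rows
print(np.allclose(logis(margins), bst.predict(dtrain), atol=1e-6))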
# standard scorecard transformation: score = A - B * ln(odds), where
# B = PDO / ln(2) (points to double the odds) and A anchors the base
# score (600) at odds of 1:thea
def prob2Score(prob, thea=50, basescore=600, PDO=20):
    B = PDO / np.log(2)
    A = basescore + B * np.log(1 / thea)
    score = A - B * np.log(prob / (1 - prob))
    return score
prob2Score(logis(score_a))
640.3920638700547
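A quick numeric sanity check (a sketch assuming numpy as np and score_a from above): since score_a is already the log-odds, the scorecard formula can be applied to it directly:

B = 20 / np.log(2)            # ≈ 28.8539: points to double the odds
A = 600 + B * np.log(1 / 50)  # ≈ 487.123: the offset A
print(A - B * score_a)        # ≈ 640.392, matching prob2Score(logis(score_a))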
# decompose the score per feature: because ln(p/(1-p)) is the row sum of
# pred_contribs, score = (A - B * bias) + sum_i(-B * beta_i)
def pred_contrib2score(ypred_contribs, thea=50, basescore=600, PDO=20):
    B = PDO / np.log(2)
    A = basescore + B * np.log(1 / thea)
    base_score = A - B * ypred_contribs[-1]  # last element is the bias
    x_score = [-B * beta for beta in ypred_contribs[0:-1]]
    return base_score, x_score, sum(x_score) + base_score
pred_contrib2score(ypred_contribs[0])
(530.41199291737485,
[87.680697252217627,
-18.782935589075397,
31.507273232861802,
9.575036056675879],
640.39206387005481)
The experiment above shows the relationship between each variable's pred_contribs value and the score transformation. Because ln(p/(1-p)) equals the row sum of the pred_contribs values, the total score decomposes into a base score plus one score per variable:

base score: A - B * bias
per-variable score: -B * beta

where beta is the pred_contribs value of the corresponding variable and bias is the last column of pred_contribs.
iris_data.head()
| | sepal_length | sepal_width | petal_length | petal_width | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
iris_data.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'target'], dtype='object')
new_data = pd.DataFrame([{'sepal_length':5.2, 'sepal_width':4.5, 'petal_length':1.2, 'petal_width':0.15}])
new_data
| | petal_length | petal_width | sepal_length | sepal_width |
|---|---|---|---|---|
| 0 | 1.2 | 0.15 | 5.2 | 4.5 |
dtest = xgb.DMatrix(new_data)
new_ypred_contribs = bst.predict(dtest, pred_contribs=True)
new_ypred_contribs
array([[-2.95274711, 0.65096694, -0.49509707, -0.33184546, -1.50028646]], dtype=float32)
new_ypred = bst.predict(dtest)
new_ypred
array([ 0.00967], dtype=float32)
logis(sum(new_ypred_contribs[0]))
0.0096700072210058954
prob2Score(new_ypred)
array([ 620.68786621], dtype=float32)
pred_contrib2score(new_ypred_contribs[0])
(530.41199291737485,
[85.198272152439699,
-18.782935589075397,
14.285481779856159,
9.575036056675879],
620.68784731727123)
In summary, we can score new data: the model yields a predicted probability, and the formulas above convert it into a score. The score computed directly from the probability and the score assembled from the per-variable contributions agree to four decimal places (620.68786... vs. 620.68784...).
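To tie it together, here is a minimal sketch for scoring a batch of new observations with a per-feature breakdown (score_batch is a hypothetical helper; it assumes the bst booster, new_data, and the pred_contrib2score function from above):

# hypothetical helper: score each row of a DataFrame and return the
# (base score, per-feature scores, total score) triple per row
def score_batch(model, df, feature_names):
    contribs = model.predict(xgb.DMatrix(df[feature_names]), pred_contribs=True)
    return [pred_contrib2score(row) for row in contribs]

# feature order must match training; columns.difference() sorted them alphabetically
features = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
for base, parts, total in score_batch(bst, new_data, features):
    print(round(total, 2), [round(p, 2) for p in parts])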
1. LIME algorithm: core theory (figure: ./image/LIME图片.png)
2. SHAP
3. SHAP code
4. Demystifying Black-Box Models with SHAP Value Analysis