Classification Framework for Imbalanced Data

Introduction

Classification is a type of supervised learning in machine learning that deals with categorizing data into classes. In supervised learning, a model is trained on inputs paired with their known outputs so that it can later make useful predictions on new data whose outputs are unknown. Examples of classification problems include spam detection in emails, subscription analysis, handwritten digit recognition and survival prediction. They all involve classifiers that use the training data to learn how the input variables relate to the output (target) variable.

Class imbalance refers to a problem in classification where the distribution of the classes is skewed. This can range from a slight to an extreme imbalance.

This is a problem because most classification algorithms assume the classes are balanced, and consequently achieve low prediction accuracy on the minority class.

An example of class imbalance is credit card fraud detection. The classes, in this case, are fraud and not fraud. Most transactions are not fraud, so fraud is the minority class. Low accuracy on the minority class predictions is problematic because it is the most important class.

This blog covers the steps involved in tackling a classification problem on an imbalanced dataset. The Github repository containing all the code is available here.

Dataset

The data used is from the UCI Machine Learning Repository. It relates to the marketing campaigns of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y).

An effective model can help increase campaign efficiency, since efforts can be directed toward the customers with the highest chances of subscribing.

A sample of the data:

Visualizing the target variable (y) in order to observe the class imbalance:

Image by author.

The circles' sizes represent the value counts of each class. Clearly there is extreme class imbalance; this will be taken care of in the preprocessing section.
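The exact plotting code is in the repository; a minimal sketch that reproduces the idea, with circle area proportional to the class counts and assuming the raw dataframe is loaded as data:

import matplotlib.pyplot as plt

counts = data['y'].value_counts()
# one circle per class, sized by its count
plt.scatter(counts.index, [1] * len(counts), s=counts.values / 10, alpha=0.5)
plt.yticks([])
plt.xlabel('class')
plt.title('Value counts of the target variable y')
plt.show()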

Preprocessing

Preprocessing involves the following steps:


Imputing null values

Missing values need to be handled because they can lead to wrong predictions and can also introduce a high bias into whatever model is used.

The categorical features will be imputed with the column mode, the discrete numerical features with the column median and the continuous numerical features with the column mean.

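A standalone sketch of this imputation scheme (the real logic lives in the DataPrep class shown further down; the three column lists are assumed to be precomputed):

def treat_null(df, categorical_cols, discrete_cols, continuous_cols):
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode()[0])  # most frequent value
    for col in discrete_cols:
        df[col] = df[col].fillna(df[col].median())
    for col in continuous_cols:
        df[col] = df[col].fillna(df[col].mean())
    return df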

Treating outliers

Outliers are a problem for many machine learning algorithms because they can obscure important insights or distort real results, which eventually leads to less accurate models.

The outliers will be clipped at the 10th and 90th percentiles.
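A sketch of percentile clipping with pandas (numeric_cols is an assumed list of the numerical columns):

def outlier_correcter(df, numeric_cols):
    for col in numeric_cols:
        lower = df[col].quantile(0.10)  # 10th percentile
        upper = df[col].quantile(0.90)  # 90th percentile
        df[col] = df[col].clip(lower=lower, upper=upper)
    return df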

Feature Generation

Generating new features from already existing features makes additional information accessible during model training, which can increase model accuracy.
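The actual generated features are defined in the repository; a purely illustrative sketch using two columns that do exist in the bank dataset:

import pandas as pd

def generate_features(df):
    # illustrative only: total contacts across the current and previous campaigns
    df['total_contacts'] = df['campaign'] + df['previous']
    # illustrative only: coarse age buckets
    df['age_group'] = pd.cut(df['age'], bins=[0, 30, 45, 60, 100],
                             labels=['young', 'mid', 'senior', 'retired'])
    return df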

Scaling numerical variables

Numerical features are standardized with a StandardScaler to remove the differences brought about by different units of measurement.
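A sketch of the standardization step, which rescales each feature to zero mean and unit variance, i.e. z = (x - mean) / std (numeric_cols is an assumed list of the numerical columns):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])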

Encoding categorical variables

Most machine learning and neural network algorithms require numerical inputs, so to make use of the categorical features we have to remap them.

The one-hot encoding technique is applied. It takes a column containing categorical data and splits it into multiple columns, with the entries replaced by zeros and ones depending on which category each row holds.
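A tiny illustration with pandas (contact is one of the categorical columns in this dataset):

import pandas as pd

df = pd.DataFrame({'contact': ['cellular', 'telephone', 'cellular']})
print(pd.get_dummies(df, columns=['contact'], dtype=int))
#    contact_cellular  contact_telephone
# 0                 1                  0
# 1                 0                  1
# 2                 1                  0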

Resampling the unbalanced dataset

Resampling involves creating a new version of our imbalanced dataset.


There are two main approaches to resampling:

  • Oversampling: randomly duplicating entries in the minority class. Appropriate for small datasets.

  • Undersampling: randomly deleting entries from the majority class. Appropriate for large datasets.

Our dataset has 41,188 rows and 21 columns, so it is safe to use oversampling.
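A minimal sketch of random oversampling with plain pandas (a reconstruction; the repository's DataPrep.over_sample may differ in detail, and the 'yes'/'no' labels are those of the bank dataset's target):

import pandas as pd

def over_sample(df, target='y', minority='yes', majority='no'):
    maj = df[df[target] == majority]
    mino = df[df[target] == minority]
    # duplicate random minority rows until both classes have equal counts
    upsampled = mino.sample(n=len(maj), replace=True, random_state=42)
    # recombine and shuffle
    return pd.concat([maj, upsampled]).sample(frac=1, random_state=42).reset_index(drop=True)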

The Preprocessing class:


Preprocessing class code.
# calling the class and its methods
d = DataPrep()
path = '/content/bank-additional-full.csv'
data = d.read_data(path)
data = d.treat_null(data)
data = d.outlier_correcter(data)
data = d.generate_features(data)
data = d.scaler(data)
print('After scaling:', data.shape)
data = d.encoder(data)
data = d.over_sample(data)
data.head()

The output:


Notice the changes in the categorical and numerical columns.


Modeling

Photo by Hunter Harritt on Unsplash

Moving straight from preprocessing to making predictions, we begin by training and evaluating our models using the training and validation data. First, though, we have to separate the target and predictor variables and split them into training, validation and test sets. The test set is not provided separately, so we carve it out of our dataset.

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# split the data to have the predictor and predicted variables
x = data.drop(['y'], axis=1)
y = data['y']  # a 1-d Series, as LabelEncoder expects

# encode the labels in the target
encoder = LabelEncoder()
y = encoder.fit_transform(y)

# get the train, validation and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.10, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.20, random_state=42)

Algorithms to be explored:


  • XGBoost:


The XGBoost classifier is a tree-based ensemble learning algorithm and an implementation of gradient boosted machines.

It is optimized for speed and performance.


  • Multilayer Perceptron:

A multilayer perceptron (MLP) is a class of feed-forward artificial neural network. It consists of at least three layers of nodes: an input layer, a hidden layer and an output layer.


We explore it for its ability to distinguish data that is not linearly separable.

  • Logistic Regression:

Logistic regression is a simple yet very effective classification algorithm that uses the log odds ratio to predict group membership.

We explore it for its simplicity and its use of log odds in place of raw probabilities.
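Concretely, logistic regression fits a linear function to the log odds of the positive class:

log(p / (1 - p)) = b0 + b1*x1 + ... + bn*xn

Inverting this gives the predicted probability p = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn)), and a threshold on p (typically 0.5) decides group membership.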

Cross Validation (CV)

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.


Techniques used:


  • K-fold: K-Fold CV splits a given dataset into n folds; each fold is used once as the test set while the remaining folds form the training set.

  • Stratified K-fold: StratifiedKFold shuffles the data and splits it into n folds, each of which is used in turn as the test set. Stratification preserves the class balance of the dataset (each fold keeps the same ratio of target classes), which makes this strategy the best choice for imbalanced data.

The default scoring metric in both techniques is the accuracy score (the number of correct predictions divided by the total number of predictions made).
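The implementation below relies on two helper functions from the repository, overall_score and overall__stratified_score. A minimal reconstruction of what they presumably do, assuming five folds and that y_train is available in the enclosing scope:

from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

def overall_score(model, x):
    # mean accuracy across 5 K-fold splits (y_train taken from the enclosing scope)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    return cross_val_score(model, x, y_train, cv=kf, scoring='accuracy').mean()

def overall__stratified_score(model, x):
    # mean accuracy across 5 stratified K-fold splits
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return cross_val_score(model, x, y_train, cv=skf, scoring='accuracy').mean()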

Implementation:


Now we observe our models' performance on the training data using these techniques.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# using grid search to find optimal parameters
regressor = LogisticRegression(solver='liblinear')  # liblinear supports both l1 and l2
grid_values = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
model = GridSearchCV(regressor, param_grid=grid_values)
model.fit(x_train, y_train)
print(model.best_score_)
print(model.best_params_)

# refit using the optimal parameters printed out
regressor = LogisticRegression(solver='liblinear', C=1000, penalty='l2')
regressor.fit(x_train, y_train)

# using kfolds
print('Logistic Regression mean accuracy score using kfold:', overall_score(regressor, x_train))
# stratified KFold
print('Logistic Regression mean accuracy score using Stratifiedkfold :', overall__stratified_score(regressor, x_train))

Output:


Logistic Regression mean accuracy score using kfold: 0.742437522099093
Logistic Regression mean accuracy score using Stratifiedkfold : 0.7420879248958712

Repeating the same procedure for the XGBoost and the MLP.


from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

xgb = XGBClassifier(silent=True, max_depth=6, n_estimators=200)
xgb.fit(x_train, y_train)

# using kfolds
print('xgb mean score on the original dataset (kfold):', overall_score(xgb, x_train))
# stratified KFold
print('xgb mean score on the original dataset (stratified kfold):', overall__stratified_score(xgb, x_train))

mlp = MLPClassifier()  # working with default parameters
mlp.fit(x_train, y_train)

# using kfolds
print('mlp mean score on the original dataset (kfold):', overall_score(mlp, x_train))
# stratified KFold
print('mlp mean score on the original dataset (stratified kfold):', overall__stratified_score(mlp, x_train))

In all three models, K-fold yields the highest accuracy, although the difference from Stratified K-fold is very slight. As mentioned earlier, stratification is best for imbalanced data, but since we already resampled our data during preprocessing it offers little extra benefit here.

Further Evaluation

We can explore other evaluation metrics using our validation set.

  • ROC — the ROC curve is a plot of the true positive rate (y-axis) against the false positive rate (x-axis) for a number of different candidate threshold values between 0.0 and 1.0.

  • PRECISION AND RECALL — precision is the number of correctly identified positive results divided by the number of all predicted positive results, including those identified incorrectly; recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive.

  • F1 SCORE — the F1 score is a measure of a test's accuracy. It is calculated from the precision and recall of the test as their harmonic mean (see the sketch after this list).
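All three metrics are easy to compute on the validation set with scikit-learn; a minimal sketch, assuming a fitted classifier such as the mlp trained above:

from sklearn.metrics import precision_score, recall_score, f1_score

val_pred = mlp.predict(x_val)
print('Precision:', precision_score(y_val, val_pred))
print('Recall:', recall_score(y_val, val_pred))
# F1 = 2 * (precision * recall) / (precision + recall)
print('F1 score:', f1_score(y_val, val_pred))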

ROC curves should be used when there are roughly equal numbers of observations for each class. Precision-Recall curves should be used when there is moderate to large class imbalance.


Since we resampled our data to be balanced, ROC is our best choice.

Implementation of the ROC plots:

The plots are implemented for other models (RandomForest, CatBoost and LGBM) and can be edited to fit whichever model. (The code is available in the Github repository linked above.)

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# forest, cat and lgbm are fitted RandomForest, CatBoost and LGBM models (see the repo)
# predict probabilities
ns_probs = [0 for _ in range(len(x_val))]  # no-skill classifier
f_prob = forest.predict_proba(x_val)
c_prob = cat.predict_proba(x_val)
l_prob = lgbm.predict_proba(x_val)

# keep probabilities for the positive outcome only
f_prob = f_prob[:, 1]
c_prob = c_prob[:, 1]
l_prob = l_prob[:, 1]

# calculate AUC scores then print them
f_auc = roc_auc_score(y_val, f_prob)
c_auc = roc_auc_score(y_val, c_prob)
l_auc = roc_auc_score(y_val, l_prob)
ns_auc = roc_auc_score(y_val, ns_probs)
print('RandomForest:', f_auc)
print('CatBoost: ', c_auc)
print('LGBM:', l_auc)
print('No Skill:', ns_auc)

# calculate roc curves
f_fpr, f_tpr, _ = roc_curve(y_val, f_prob)
c_fpr, c_tpr, _ = roc_curve(y_val, c_prob)
l_fpr, l_tpr, _ = roc_curve(y_val, l_prob)
ns_fpr, ns_tpr, _ = roc_curve(y_val, ns_probs)

# plot the roc curve for each model
plt.figure(figsize=(12, 7))
plt.plot(f_fpr, f_tpr, marker='.', label='random forest')
plt.plot(l_fpr, l_tpr, marker='.', label='lgbm')
plt.plot(c_fpr, c_tpr, marker='.', label='catboost')
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.legend()
plt.title('ROC curves for different models')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

Output:


Image by author.

The bigger the area under the curve, the better the model's performance.

A no-skill classifier is one that cannot discriminate between the classes and would predict a random class or a constant class in all cases. (Our benchmark).


Predictions

Using the MLP, the best performer of our three initial models.

import pandas as pd

# predict on the test set with the MLP
pred = mlp.predict(x_test)
pred_df = pd.DataFrame(pred)
pred_df.columns = ['y']
pred_df.to_csv('pred_df.csv')  # export to a csv file

Conclusion

We could go on with our classification problem, exploring further techniques such as dimensionality reduction in an attempt to achieve a better performing model. For now, we know that having imbalanced data does not prevent one from making predictions; it simply calls for the appropriate techniques to avoid poor forecasts on the minority class.

PS: Respect to all my fellow learners and tutors at 10academy.org for their endless support.


Translated from: https://towardsdatascience.com/classification-framework-for-imbalanced-data-9a7961354033
