Hands-On Guide to Plotting a Decision Surface for ML in Python

Introduction

Lately, I have been struggling to visualize the results of a classification model, relying only on the classification report and the confusion matrix to weigh the model's performance.

However, visualizing the classification results has its own charm and makes the model easier to reason about. So I built a decision surface, and once I succeeded, I decided to write about it, both as a learning exercise and for anyone who might be stuck on the same issue.

Tutorial content

In this tutorial, I will start with the built-in dataset package within the Sklearn library to focus on the implementation steps. After that, I will use a pre-processed dataset (with no missing data or outliers) to plot the decision surface after applying the standard scaler.

  • Decision Surface
  • Importing important libraries
  • Dataset generation
  • Generating decision surface
  • Applying to real data

Decision Surface

Classification in machine learning means training a model on your data so that it can assign labels to input examples.

Each input feature defines an axis in the feature space. A plane is characterized by a minimum of two input features, with dots representing the coordinates of the input examples. If there were three input variables, the feature space would be a three-dimensional volume.

The ultimate goal of classification is to partition the feature space so that labels are assigned to its points as correctly as possible.

This separation is called a decision surface or decision boundary, and it works as a demonstrative tool for explaining a model on a classification predictive modeling task. If you have more than two input features, you can create a decision surface for each pair of input features, as in the sketch below.

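The following is a minimal sketch of that idea, assuming a labeled dataset (X, y) with more than two columns; the helper name plot_pairwise_surfaces and the choice of fitting a logistic regression per pair are my own illustration, not part of the original article:

from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression

def plot_pairwise_surfaces(X, y, step=0.1):
    # fit and plot one decision surface per pair of input features
    for i, j in combinations(range(X.shape[1]), 2):
        X_pair = X[:, [i, j]]
        clf = LogisticRegression().fit(X_pair, y)
        # grid covering the two selected features, padded by one unit
        min1, max1 = X_pair[:, 0].min() - 1, X_pair[:, 0].max() + 1
        min2, max2 = X_pair[:, 1].min() - 1, X_pair[:, 1].max() + 1
        g1, g2 = np.meshgrid(np.arange(min1, max1, step),
                             np.arange(min2, max2, step))
        # predict a class for every grid point and draw the surface
        z = clf.predict(np.c_[g1.ravel(), g2.ravel()]).reshape(g1.shape)
        plt.contourf(g1, g2, z, cmap='Pastel1')
        plt.scatter(X_pair[:, 0], X_pair[:, 1], c=y, cmap='Pastel1')
        plt.xlabel('feature %d' % i)
        plt.ylabel('feature %d' % j)
        plt.show()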

Importing important libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

Generate dataset

I will use the make_blobs() function within the datasets module from the Sklearn library to generate a custom dataset; doing so keeps the focus on the implementation rather than on cleaning data. The steps, however, are the same and follow a typical pattern. Let's start by defining the dataset variables with 1,000 samples, only two features, and a standard deviation of 3, for simplicity's sake.

X, y = datasets.make_blobs(n_samples=1000,
                           centers=2,
                           n_features=2,
                           random_state=1,
                           cluster_std=3)

Once the dataset is generated, we can draw a scatter plot to see the variability between the variables.

# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create scatter of these samples
    plt.scatter(X[row_ix, 0], X[row_ix, 1])

# show the plot
plt.show()

Here we looped over the class labels and plotted the points of X colored by their class label. In the next step, we need to build a predictive classification model to predict the class of unseen points. Logistic regression can be used in this case since we have only two categories.

scatter_plot_1

Develop the logistic regression model

regressor = LogisticRegression()
# fit the regressor on X and y
regressor.fit(X, y)
# apply the predict method
y_pred = regressor.predict(X)

The predictions in y_pred can be evaluated using the accuracy_score function from the sklearn library.

accuracy = accuracy_score(y, y_pred)
print('Accuracy: %.3f' % accuracy)
## Accuracy: 0.972
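
Since confusion_matrix is already imported, we can also print it as a quick optional sanity check (a small sketch of my own, not part of the original walk-through):

# optional: inspect the confusion matrix as well
cm = confusion_matrix(y, y_pred)
print(cm)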

Generating decision surface

matplotlib provides a handy function called contourf(), which can fill the space between points with colors. However, as the documentation suggests, we need to define a grid of points covering the feature space. The starting point is to find the minimum and maximum value of each feature, then extend the range by one to make sure the whole space is covered.

min1, max1 = X[:, 0].min() - 1, X[:, 0].max() + 1 #1st feature
min2, max2 = X[:, 1].min() - 1, X[:, 1].max() + 1 #2nd feature

Then we can define the scale of the coordinates using the arange() function from the numpy library with a 0.1 resolution to get the scale range.

x1_scale = np.arange(min1, max1, 0.1)
x2_scale = np.arange(min2, max2, 0.1)

The next step would be converting x1_scale and x2_scale into a grid. The function meshgrid() within the numpy library is what we need.

x_grid, y_grid = np.meshgrid(x1_scale, x2_scale)

The generated x_grid is a 2-D array. To be able to use it, we need to reduce it to a one-dimensional array using the flatten() method from the numpy library.

# flatten each grid to a vector
x_g, y_g = x_grid.flatten(), y_grid.flatten()
x_g, y_g = x_g.reshape((len(x_g), 1)), y_g.reshape((len(y_g), 1))

Finally, we stack the vectors side by side as columns of an input dataset, shaped like the original dataset but at a much higher resolution.

grid = np.hstack((x_g, y_g))

Now, we can feed the grid into the fitted model to predict values.

# make predictions for the grid
y_pred_2 = regressor.predict(grid)
# predict the probabilities
p_pred = regressor.predict_proba(grid)
# keep just the probabilities for class 0
p_pred = p_pred[:, 0]
# reshape the results to match the grid
pp_grid = p_pred.reshape(x_grid.shape)

Now a grid of values across the feature space, together with the predicted class labels, has been generated.

Subsequently, we will plot those grids as a filled contour plot using contourf(). The contourf() function needs a separate grid per axis; to achieve that, we can use x_grid and y_grid together with the reshaped predictions pp_grid, which now share the same shape.

# plot the grid of x, y and z values as a surface
surface = plt.contourf(x_grid, y_grid, pp_grid, cmap='Pastel1')
plt.colorbar(surface)

# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create scatter of these samples
    plt.scatter(X[row_ix, 0], X[row_ix, 1], cmap='Pastel1')

# show the plot
plt.show()
decision surface for two features

Apply to real data

Now it is time to apply the previous steps to real data and connect everything. As I mentioned earlier, this dataset is already cleaned, with no missing values. It represents the car-purchase history of a sample of people according to their age and yearly salary.

dataset = pd.read_csv('../input/logistic-reg-visual/Social_Network_Ads.csv')
dataset.head()
Social_Network_Ads dataset

The dataset has two features, Age and EstimatedSalary, and one dependent variable, Purchased, as a binary column. A value of 0 means a person of that age and salary did not purchase a car, while 1 means the person did. The next step is to separate the dependent variable from the features as X and y.

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=0)

Feature scaling

We need this step because Age and EstimatedSalary are not on the same scale.

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Building the logistic model and fitting the training data

classifier = LogisticRegression(random_state=0)
# fit the classifier on the training data
classifier.fit(X_train, y_train)
# predict the values of y
y_pred = classifier.predict(X_test)

Plot the decision surface — training results

#1. reverse the standard scaler on X_train
X_set, y_set = sc.inverse_transform(X_train), y_train

#2. generate decision surface boundaries
min1, max1 = X_set[:, 0].min() - 10, X_set[:, 0].max() + 10      # for Age
min2, max2 = X_set[:, 1].min() - 1000, X_set[:, 1].max() + 1000  # for salary

#3. set coordinates scale accuracy
x_scale, y_scale = np.arange(min1, max1, 0.25), np.arange(min2, max2, 0.25)

#4. convert into a grid
X1, X2 = np.meshgrid(x_scale, y_scale)

#5. flatten X1 and X2 and return the output as a numpy array
X_flatten = np.array([X1.ravel(), X2.ravel()])

#6. transform the grid back into the scaled space the model expects
X_transformed = sc.transform(X_flatten.T)

#7. generate the predictions and reshape them to the same shape as X1
Z_pred = classifier.predict(X_transformed).reshape(X1.shape)

#8. set the plot size
plt.figure(figsize=(20, 10))

#9. plot the contour function (the color pair mirrors the scatter colors below)
plt.contourf(X1, X2, Z_pred,
             alpha=0.75,
             cmap=ListedColormap(('red', 'green')))

#10. set the axes limits
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

#11. scatter plot of the points ([age, salary] colored by class, training set)
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0],
                X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i),
                label=j)

#12. plot labels and adjustments
plt.title('Logistic Regression (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
decision surface — training set

Decision plot for the test set

It is exactly the same as the previous code, but using the test set instead of the training set, as in the sketch below.

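A minimal sketch of that variant (the same steps as above, only swapping X_train/y_train for X_test/y_test):

#1. reverse the standard scaler on X_test
X_set, y_set = sc.inverse_transform(X_test), y_test

#2. rebuild the grid boundaries from the test points
min1, max1 = X_set[:, 0].min() - 10, X_set[:, 0].max() + 10
min2, max2 = X_set[:, 1].min() - 1000, X_set[:, 1].max() + 1000
X1, X2 = np.meshgrid(np.arange(min1, max1, 0.25), np.arange(min2, max2, 0.25))

#3. predict over the grid in the scaled space and plot, exactly as before
Z_pred = classifier.predict(
    sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape)
plt.figure(figsize=(20, 10))
plt.contourf(X1, X2, Z_pred, alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Logistic Regression (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()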

Decision plot — test set

Conclusion

Finally, I hope this boilerplate helps in visualizing classification model results. I recommend applying the same steps with another classification model, for example an SVM, or with more than two features; a sketch of that swap follows below. Thanks for reading, and I look forward to any constructive comments.

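As a hint of how little needs to change, here is a hedged sketch of swapping in sklearn's SVC (the kernel choice is my own illustration); everything downstream of the classifier, the grid construction, sc.transform, predict, and contourf calls, stays identical:

from sklearn.svm import SVC

# the only change: a different classifier with the same fit/predict interface
classifier = SVC(kernel='rbf', random_state=0)
classifier.fit(X_train, y_train)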

Translated from: https://towardsdatascience.com/hands-on-guide-to-plotting-a-decision-surface-for-ml-in-python-149710ee2a0e
