Machine Learning Methods: Decision trees and forests
This post contains our crib notes on the basics of decision trees and forests. We first discuss the construction of individual trees, and then introduce random and boosted forests. We also discuss efficient implementations of greedy tree construction algorithms, showing that a single tree can be constructed in $O(k \times n \log n)$ time, where $n$ is the number of training examples and $k$ is the number of features.
Decision trees constitute a class of simple functions that are frequently used for carrying out regression and classification. They are constructed by hierarchically splitting a feature space into disjoint regions, where each split divides one of the existing regions into two. In most common implementations, the splits are always taken along one of the feature axes, which causes the regions to be rectangular in shape. An example is shown in Fig. 1 below. In this example, a two-dimensional feature space is first split by the tree along one of its feature axes, and the resulting regions are then split further along the feature axes, producing the rectangular partition shown in the figure.
Once a decision tree is constructed, it can be used for making predictions on unlabeled feature vectors — i.e., points in feature space not included in our training set. This is done by first deciding which of the regions a new feature vector belongs to, and then returning as its hypothesis label an average over the training example labels within that region: The mean of the region’s training labels is returned for regression problems and the mode for classification problems. For instance, the tree in Fig. 1 would return an average over the five training examples sitting in the region to which the test point belongs.
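As a concrete illustration of this lookup-and-average rule, here is a minimal sketch for a regression tree consisting of a single split (the threshold and training points below are invented for the example and are not those of Fig. 1):

```python
import numpy as np

# a toy, one-split partition of a 2D feature space: two regions separated at x0 = 0.5
threshold = 0.5
train_X = np.array([[0.1, 0.9], [0.3, 0.2], [0.7, 0.4], [0.9, 0.8]])
train_y = np.array([1.0, 2.0, 5.0, 7.0])

def predict(x):
    """Return the mean training label of the region containing x (regression)."""
    if x[0] <= threshold:
        in_region = train_X[:, 0] <= threshold
    else:
        in_region = train_X[:, 0] > threshold
    return train_y[in_region].mean()

print(predict([0.2, 0.5]))  # left region  -> (1.0 + 2.0) / 2 = 1.5
print(predict([0.8, 0.1]))  # right region -> (5.0 + 7.0) / 2 = 6.0
```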
The art and science of tree construction is in deciding how many splits should be taken and where those splits should take place. The goal is to find a tree that provides a reasonable, piece-wise constant approximation to the underlying distribution or function that has generated the training data provided. This can be attempted through choosing a tree that breaks space up into regions such that the examples in any given region have identical — or at least similar — labels. We discuss some common approaches to finding such trees in the next section.
Individual trees have the important benefit of being easy to interpret and visualize, but they are often not as accurate as other common machine learning algorithms. However, individual trees can be used as simple building blocks with which to construct more complex, competitive models. In the third section of this note, we discuss three very popular constructions of this sort: bagging, random forests (a variant on bagging), and boosting. We then discuss the runtime complexity of tree/forest construction and conclude with a summary, exercises, and an appendix containing example python code.
Regression tree construction typically proceeds by attempting to minimize a squared error cost function: Given a training set $\{(\vec{x}_i, y_i)\}_{i=1,\ldots,n}$, one seeks the partition of feature space into regions $R_1, \ldots, R_M$ that minimizes

$$J = \sum_{m=1}^{M} \sum_{i:\, \vec{x}_i \in R_m} \left(y_i - \bar{y}_{R_m}\right)^2,$$

where $\bar{y}_{R_m}$ is the mean label of the training examples that sit in region $R_m$.
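In code, this cost can be evaluated as follows; the sketch assumes the tree's partition is summarized by an array assigning each training example to a region (the function name is ours):

```python
import numpy as np

def regression_tree_cost(y, region_ids):
    """Squared-error cost J: sum over regions of the squared deviations
    of each training label from its region's mean label."""
    cost = 0.0
    for r in np.unique(region_ids):
        region_labels = y[region_ids == r]
        cost += np.sum((region_labels - region_labels.mean()) ** 2)
    return cost

# example: four training labels split across two regions
print(regression_tree_cost(np.array([1.0, 2.0, 5.0, 7.0]), np.array([0, 0, 1, 1])))
```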
Unfortunately, actually minimizing this cost over all possible trees is computationally infeasible: The number of candidate partitions grows far too quickly with the size of the training set. Instead, trees are typically grown greedily. Starting from a single region containing all of the training data, one repeatedly chooses the single axis-aligned split that most reduces the cost, applies it, and then recurses on the resulting sub-regions.
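The following is a brute-force sketch of this greedy, recursive procedure for regression (names are ours; every distinct feature value is tried as a cut point, which is simple but inefficient; a faster split search is described in the runtime section below):

```python
import numpy as np

def build_tree(X, y, depth=0, max_depth=3, min_leaf=5):
    """Greedily grow a regression tree; leaves store their region's mean label."""
    if depth == max_depth or len(y) < 2 * min_leaf:
        return {'leaf': True, 'value': y.mean()}
    best = None
    for j in range(X.shape[1]):                # candidate feature directions
        for t in np.unique(X[:, j])[:-1]:      # candidate cut points
            left = X[:, j] <= t
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            # squared-error cost of the two sub-regions this cut would create
            cost = np.sum((y[left] - y[left].mean()) ** 2) + \
                   np.sum((y[~left] - y[~left].mean()) ** 2)
            if best is None or cost < best[0]:
                best = (cost, j, t)
    if best is None:                           # no admissible split found
        return {'leaf': True, 'value': y.mean()}
    _, j, t = best
    left = X[:, j] <= t
    return {'leaf': False, 'feature': j, 'threshold': t,
            'left': build_tree(X[left], y[left], depth + 1, max_depth, min_leaf),
            'right': build_tree(X[~left], y[~left], depth + 1, max_depth, min_leaf)}

def predict_one(tree, x):
    """Descend to the region containing x and return that region's mean label."""
    while not tree['leaf']:
        tree = tree['left'] if x[tree['feature']] <= tree['threshold'] else tree['right']
    return tree['value']
```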
In classification problems, the training labels take on a discrete set of values, often having no numerical significance. This means that a squared-error cost function, like that written above for regression, is not appropriate. Instead, classification trees are usually grown by greedily minimizing a measure of region impurity. Writing $\hat{p}_{R,c}$ for the fraction of training examples in region $R$ that belong to class $c$, three common choices are the misclassification error rate, $1 - \max_c \hat{p}_{R,c}$; the Gini index, $\sum_c \hat{p}_{R,c} (1 - \hat{p}_{R,c})$; and the cross-entropy, $-\sum_c \hat{p}_{R,c} \log \hat{p}_{R,c}$.
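The three measures can be computed from a region's training labels as in the following sketch (function name ours):

```python
import numpy as np

def region_impurities(labels):
    """Misclassification rate, Gini index, and cross-entropy of one region."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions within the region
    error_rate = 1.0 - p.max()
    gini = np.sum(p * (1.0 - p))
    cross_entropy = -np.sum(p * np.log(p))
    return error_rate, gini, cross_entropy

# a region holding four examples of class 0 and two of class 1
print(region_impurities(np.array([0, 0, 0, 0, 1, 1])))
```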
Although the error rate is perhaps the most natural of these measures, the Gini index and cross-entropy are usually preferred for guiding tree growth in practice: Both are smooth functions of the class proportions and reward splits that produce nearly pure sub-regions, even when such splits do not immediately change a region's majority class (and so leave the error rate untouched).
Decision trees that are allowed to split indefinitely will have low bias but will over-fit their training data. Placing different stopping criteria on a tree’s growth can ameliorate this latter effect. Two typical conditions often used for this purpose are given by a) placing an upper bound on the number of levels permitted in the tree, or b) requiring that each region (tree leaf) retains at least some minimum number of training examples. To optimize over such constraints, one can apply cross-validation.
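For instance, using the current sk-learn API these two constraints correspond to the max_depth and min_samples_leaf parameters, and cross-validation over them can be carried out along the following lines (the grid values here are arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

# 5-fold cross-validation over tree depth and minimum leaf size
param_grid = {'max_depth': [2, 4, 8, None], 'min_samples_leaf': [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
search.fit(digits.data, digits.target)
print(search.best_params_)
```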
Another approach to alleviating the high-variance, over-fitting issue associated with decision trees is to average over many of them. This approach is motivated by the observation that the sum of $N$ independent, identically distributed random variables, each of variance $\sigma^2$, has variance $N \sigma^2$, so that their average has variance only $\sigma^2 / N$. Averaging the predictions of many approximately independent trees can therefore be expected to reduce the variance of the resulting model without increasing its bias.
Bootstrap aggregation, or “bagging”, provides one common method for constructing ensemble tree models. In this approach, one samples with replacement to obtain a set of bootstrapped copies of the training data, each of the same size as the original. A separate decision tree is then fit to each copy, and the ensemble’s prediction for a new point is taken to be the average of the individual tree predictions (for regression) or their majority vote (for classification).
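A bare-bones sketch of this procedure for regression follows, assuming generic training arrays X and y (helper names ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged_trees(X, y, n_trees=100, seed=0):
    """Fit each tree to its own bootstrap sample, drawn with replacement."""
    rng = np.random.RandomState(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X), size=len(X))   # bootstrap sample of indices
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    """Average the individual tree predictions (use a majority vote for classification)."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)
```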
One nice thing about bagging methods, in general, is that one can train on the entire set of available labeled training data and still obtain an estimate of the generalization error. Such estimates are obtained by considering the error on each point in the training set, in each case averaging only over those trees that did not train on the point in question. The resulting estimate, called the out-of-bag error, typically provides a slight overestimate to the generalization error. This is because accuracy generally improves with growing ensemble size, and the full ensemble is usually about three times larger than the sub-ensemble used to vote on any particular training example in the out-of-bag error analysis.
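sk-learn exposes this estimate directly; for example (parameter values are arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
forest = RandomForestClassifier(n_estimators=200, oob_score=True)
forest.fit(digits.data, digits.target)
print(forest.oob_score_)   # out-of-bag accuracy, i.e., one minus the out-of-bag error
```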
Random forests provide a popular variation on the bagging method. The individual decision trees making up a random forest are, again, each fit to an independent, bootstrapped subsample of the training data. However, at each step in their recursive construction process, these trees are restricted in that they are only allowed to split on a small, randomly chosen subset of the available features (a common choice is roughly the square root of the total number of features). This restriction serves to decorrelate the individual trees, which further reduces the variance of the averaged model.
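In sk-learn, this per-split restriction is controlled by the max_features parameter; choosing the square root of the total number of features is a common rule of thumb:

```python
from sklearn.ensemble import RandomForestClassifier

# each split may only consider a random subset of sqrt(k) of the k features
forest = RandomForestClassifier(n_estimators=200, max_features='sqrt')
```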
The final method we’ll discuss is boosting, which again consists of a set of individual trees that collectively determine the ultimate prediction returned by the model. However, in the boosting scenario, one fits each of the trees to the full data set, rather than to a small sample. Because they are fit to the full data set, these trees are usually restricted to being only two or three levels deep, so as to avoid over-fitting. Further, the individual trees in a boosted forest are constructed sequentially. For instance, in regression, the process typically works as follows: In the first step, a tree is fit to the full, original training set. In the second, the residuals of this first fit are computed, i.e., the differences between the training labels and the first tree’s predictions, and a second tree is fit to these residuals. The process then repeats, with each new tree fit to the residuals left by the current ensemble, and the final model’s prediction is obtained by summing the (often down-weighted) outputs of the individual trees.
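A sketch of this residual-fitting loop for squared-error boosting follows (the learning_rate shrinkage factor is a standard extra ingredient, not part of the description above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Sequentially fit shallow trees, each to the residuals of the current ensemble."""
    trees = []
    residual = y.astype(float).copy()
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        residual -= learning_rate * tree.predict(X)   # what is left for the next tree to fit
    return trees

def predict_boosted(trees, X, learning_rate=0.1):
    """The ensemble prediction is the scaled sum of the individual tree outputs."""
    return learning_rate * np.sum([tree.predict(X) for tree in trees], axis=0)
```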
Boosted classification tree ensembles are constructed in a fashion similar to that above. However, in contrast to the regression scenario, the same, original training labels are used to fit each new tree in the ensemble (as opposed to an evolving residual). To bring about a similar, gradual learning process, boosted classification ensembles instead sample from the training set with weights that are sample-dependent and that change over time: When constructing a new tree for the ensemble, one more heavily weights those examples that have been poorly fit in prior iterations. AdaBoost is a popular algorithm for carrying out boosted classification. This and other generalizations are covered in the text Elements of Statistical Learning.
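A minimal example using sk-learn's AdaBoostClassifier (by default the weak learners are depth-one trees; the split of the digits data below is arbitrary):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier

digits = load_digits()

# boosted ensemble of shallow trees; poorly fit examples get larger weights over time
clf = AdaBoostClassifier(n_estimators=200)
clf.fit(digits.data[:1000], digits.target[:1000])
print(clf.score(digits.data[1000:], digits.target[1000:]))
```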
Before concluding, we take a moment here to consider the runtime complexity of tree construction. This exercise gives one a sense of how tree algorithms are implemented efficiently in practice. We begin by considering the greedy construction of a single classification tree. The extension to regression trees is straightforward.
Consider the problem of greedily training a single classification tree on a set of $n$ labeled training examples, each characterized by $k$ features.
Focus on an intermediate moment in the construction process where one particular node has just been split, resulting in two new regions. To continue the construction, we must find the best available split within each of these regions, which requires evaluating the cost function for every candidate cut along each of the $k$ feature directions. This search is made efficient by keeping the examples of each region sorted along every feature direction.
The left side of Fig. 4 illustrates one method for efficiently carrying out these test cuts: For each feature direction, we proceed sequentially through that direction’s ordered list, considering one cut at a time. In the first cut, we take only one example in the left sub-region induced, and all others on the right. In the second cut, we have the first two examples in the left sub-region, etc. Proceeding in this way, it turns out that the cost function of each new candidate split considered can always be evaluated in constant time, independent of the region’s size: Moving the cut forward by one position simply transfers a single example from the right sub-region to the left, so the class counts that determine the cost of the two sub-regions can be updated incrementally rather than recomputed from scratch. Scanning all candidate cuts along one feature direction for a region holding $m$ examples therefore takes only $O(m)$ time, or $O(k \times m)$ over all feature directions.
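The sketch below carries out this incremental scan along a single, pre-sorted feature direction, using the Gini index as the cost (names ours):

```python
import numpy as np

def best_split_along_feature(values, labels, n_classes):
    """Scan every cut along one feature direction, already sorted by `values`.

    Each candidate cut is scored by moving a single example's class count
    from the right sub-region to the left, rather than recounting from scratch.
    `labels` must be an integer array of class indices.
    """
    n = len(labels)
    left_counts = np.zeros(n_classes)
    right_counts = np.bincount(labels, minlength=n_classes).astype(float)

    def gini(counts):
        p = counts / counts.sum()
        return np.sum(p * (1.0 - p))

    best_cost, best_cut = np.inf, None
    for i in range(n - 1):                 # cut between examples i and i + 1
        left_counts[labels[i]] += 1        # constant-time update of the two sub-regions
        right_counts[labels[i]] -= 1
        n_left, n_right = i + 1, n - i - 1
        cost = (n_left * gini(left_counts) + n_right * gini(right_counts)) / n
        if cost < best_cost:
            best_cost, best_cut = cost, 0.5 * (values[i] + values[i + 1])
    return best_cut, best_cost
```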
The above analysis gives the time needed to search for the optimal split within a single region: $O(k \times m)$ for a region holding $m$ examples, provided its sorted feature lists are available. To estimate the cost of constructing an entire tree, this must be summed over all of the regions generated. For an approximately balanced tree, the regions at any given level together contain all $n$ training examples, so each level costs $O(k \times n)$; a tree of depth $O(\log n)$ therefore requires $O(k \times n \log n)$ work, the same order as the initial sorting of the feature lists (which can be maintained through each split at linear cost).
In summary, we see that achieving an efficient, $O(k \times n \log n)$ greedy construction rests on two ingredients: keeping the training examples sorted along each feature direction, and updating the split cost incrementally so that each candidate cut is scored in constant time.
If a forest of $N$ such trees is desired, the total construction cost is simply $N$ times that of a single tree.
In this note, we’ve quickly reviewed the basics of tree-based models and their constructions. Looking back over what we have learned, we can now consider some of the reasons why tree-based methods are so popular among practitioners. First — and very importantly — individual trees are often useful for gaining insight into the geometry of datasets in high dimensions. This is because tree structures can be visualized using simple diagrams, like that in Fig. 1. In contrast, most other machine learning algorithm outputs cannot be easily visualized — consider, e.g., support-vector machines, which return hyper-plane decision boundaries. A related point is that tree-based approaches are able to automatically fit non-linear decision boundaries. In contrast, linear algorithms can only fit such boundaries if appropriate non-linear feature combinations are constructed. This requires that one first identify these appropriate feature combinations, which can be a challenging task for feature spaces that cannot be directly visualized. Three additional positive qualities of decision trees are given by a) the fact that they are insensitive to feature scale, which reduces the need for related data preprocessing, b) the fact that they can make use of data missing certain feature values, and c) that they are relatively robust against outliers and noisy-labeling issues.
Although boosted and random forests are not as easily visualized as individual decision trees, these ensemble methods are popular because they are often quite competitive. Boosted forests typically have a slightly lower generalization error than their random forest counterparts. For this reason, they are often used when accuracy is highly-valued — see Fig. 5 for an example learning curve consistent with this rule of thumb: Generalization error rate versus training set size for a hand-written digits learning problem. However, the individual trees in a bagged forest can be constructed in parallel. This benefit — not shared by boosted forests — can favor random forests as a go-to, out-of-box approach for treating large-scale machine learning problems.
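In sk-learn, this parallelism is exposed through the n_jobs parameter of the forest classes, e.g.:

```python
from sklearn.ensemble import RandomForestClassifier

# grow the individual trees of the forest in parallel, using all available cores
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1)
```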
Exercises follow that detail some further points of interest relating to decision trees and their construction.
[1] Elements of Statistical Learning, by Hastie, Tibshirani, Friedman
[2] An Introduction to Statistical Learning, by James, Witten, Hastie, and Tibshirani
[3] Random Forests, by Breiman (Machine Learning, 45, 2001).
[4] Sk-learn documentation on runtime complexity, see section 1.8.4.
a) Consider a real function
b) Consider binary tree classification guided by the minimization of the error rate (
c) How about if (
Suppose one has constructed an approximately balanced decision tree, where each node contains one of the
a) Consider a region
b) Show that a region’s Gini coefficient (
Consider a region
Adapted from [3].
a) Let
b) Consider a binary classification problem aimed at fitting a sampled function
c) Show that
d) Writing,
Here, we provide the python/sk-learn code used to construct Fig. 5 in the body of this note: Learning curves on sk-learn’s “digits” dataset for a single tree, a random forest, and a boosted forest.
```python
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import matplotlib.pyplot as plt

# load data: digits.data and digits.target,
# arrays of features and labels, resp.
digits = load_digits(n_class=10)

n_train = []
t1_accuracy = []
t2_accuracy = []
t3_accuracy = []

# below, we average over "trials" fits at each sample size
# in order to estimate the average generalization error.
trials = 25
clf = DecisionTreeClassifier()
clf2 = GradientBoostingClassifier(max_depth=3)
clf3 = RandomForestClassifier()
num_test = 500

# loop over different training set sizes
for num_train in range(2, len(digits.target) - num_test, 25):
    acc1, acc2, acc3 = 0, 0, 0
    for j in range(trials):
        # resample until the training slice contains at least two classes
        perm = [0]
        while len(set(digits.target[perm[:num_train]])) < 2:
            perm = np.random.permutation(len(digits.data))
        clf = clf.fit(digits.data[perm[:num_train]],
                      digits.target[perm[:num_train]])
        acc1 += clf.score(digits.data[perm[-num_test:]],
                          digits.target[perm[-num_test:]])
        clf2 = clf2.fit(digits.data[perm[:num_train]],
                        digits.target[perm[:num_train]])
        acc2 += clf2.score(digits.data[perm[-num_test:]],
                           digits.target[perm[-num_test:]])
        clf3 = clf3.fit(digits.data[perm[:num_train]],
                        digits.target[perm[:num_train]])
        acc3 += clf3.score(digits.data[perm[-num_test:]],
                           digits.target[perm[-num_test:]])
    n_train.append(num_train)
    t1_accuracy.append(acc1 / trials)
    t2_accuracy.append(acc2 / trials)
    t3_accuracy.append(acc3 / trials)

# plot the learning curves: single tree (red), boosted (green), random forest (blue)
plt.plot(n_train, t1_accuracy, color='red')
plt.plot(n_train, t2_accuracy, color='green')
plt.plot(n_train, t3_accuracy, color='blue')
plt.show()
```