Since we have been looking at regression problems recently, the main point of this article is to extend the support vector machine to regression while keeping its sparsity. In a simple linear regression model, we minimize a regularized error function.
To obtain a sparse solution, the quadratic error function is replaced by an ϵ-insensitive error function: if the absolute difference between the prediction y(x) and the target t is less than ϵ, the error given by this error function is zero.
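Written out, the ϵ-insensitive error function (stated here in its standard form, with $y(x)$ the prediction and $t$ the target) is:

$$
E_\epsilon\bigl(y(x) - t\bigr) =
\begin{cases}
0, & \text{if } |y(x) - t| < \epsilon \\
|y(x) - t| - \epsilon, & \text{otherwise}
\end{cases}
$$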
The introduction of a kernel function handles the non-linear case: samples are mapped from the original space into a higher-dimensional feature space, so that samples which are not linearly separable in the original space become separable by a hyperplane in the feature space (for instance, an elliptical boundary in the original space becomes a plane after the mapping). To achieve this we introduce a mapping function and search for a separating hyperplane on the mapped sample points. This technique is not only used in SVMs but also in other statistical tasks.
However, the mapping function itself is not the key point; the kernel function is. Consider the concepts mentioned in “Statistical Learning Method”:
There is a functional relationship between the mapping function and the kernel function. In general we state the definition of the kernel function, not the definition of the mapping function, partly because computing the kernel function is simpler than computing the mapping. For instance, if we map a two-dimensional space and choose the new space to be the combination of all first-order and second-order terms of the original space, we obtain five dimensions; if the original space is three-dimensional, we obtain a 19-dimensional new space. This number grows explosively, which makes the computation very difficult, and if the feature space is infinite-dimensional the explicit mapping cannot be computed at all. This is why the kernel is required. Moreover, a given kernel function does not uniquely determine the feature space and the mapping function; even when the feature space is fixed, the mapping function may differ. For example:
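As a minimal numerical sketch of this non-uniqueness (toy vectors assumed): for the kernel $k(x, z) = (x^T z)^2$ on a two-dimensional input space, two different mapping functions reproduce exactly the same kernel value.

```python
import numpy as np

# Two different mappings phi1 (3-D) and phi2 (4-D) that both realize the
# kernel k(x, z) = (x^T z)^2 on 2-D inputs; x and z are assumed toy vectors.
def phi1(x):
    # (x1^2, sqrt(2) * x1 * x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def phi2(x):
    # (x1^2, x1 * x2, x2 * x1, x2^2)
    return np.array([x[0] ** 2, x[0] * x[1], x[1] * x[0], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct = (x @ z) ** 2
print(k_direct, phi1(x) @ phi1(z), phi2(x) @ phi2(z))  # all three values agree
```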
Commonly used kernel functions and comparison:
-Linear Kernel
$k(x_i, x_j) = x_i^{T} x_j$
The linear kernel function is the simplest kernel function and is mainly used in the linearly separable case. It finds the optimal linear classifier in the original space and has the advantages of few parameters and fast computation. If we apply the linear kernel in KPCA, we find that the derivation is exactly the same as that of the original PCA algorithm; this is an example of the linear kernel simply giving an equivalent form of an existing method.
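As a quick check of that claim, a sketch (random toy data assumed) comparing scikit-learn's KernelPCA with a linear kernel against ordinary PCA; the projections should agree up to the sign of each component.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))          # assumed toy data, one sample per row

Z_pca = PCA(n_components=2).fit_transform(X)
Z_kpca = KernelPCA(n_components=2, kernel="linear").fit_transform(X)

# Kernel PCA with a linear kernel gives the same projections as PCA,
# up to a sign flip of each component.
print(np.allclose(np.abs(Z_pca), np.abs(Z_kpca)))   # expected: True
```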
-Polynomial Kernel
$k(x_i, x_j) = (x_i^{T} x_j)^d$
There is also a more general form:
$k(x_i, x_j) = (a x_i^{T} x_j + b)^d$
Here $d \ge 1$ is the degree of the polynomial, and there are more parameters than in the simple form. The polynomial kernel is a non-standard kernel function that is well suited to orthonormalized data. It is a global kernel function and maps a low-dimensional input space into a high-dimensional feature space. The larger the parameter $d$, the higher the dimension of the mapping and the larger the entries of the kernel matrix, so it is prone to overfitting.
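A small sketch (toy vectors and parameter values assumed) of why a large $d$ tends to overfit: the kernel values blow up quickly as the degree increases.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # assumed toy inputs
z = np.array([2.0, 1.0, 0.5])
a, b = 1.0, 1.0                 # assumed kernel parameters

for d in (1, 2, 3, 5):
    k = (a * (x @ z) + b) ** d  # k(x, z) = (a x^T z + b)^d
    print(d, k)                 # values grow rapidly with d
```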
-Radial Basis Function (RBF) Kernel
$k(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||^2}{2\sigma^2}\right)$
Here $\sigma > 0$ is the bandwidth of the Gaussian kernel. This is the classic robust radial basis kernel, i.e. the Gaussian kernel function. The robust radial basis kernel has good resistance to noise in the data, and its parameter determines the range over which the function acts; beyond this range, the effect of the data “basically disappears”. The Gaussian kernel is an excellent representative of this family of kernels and is one that should always be tried: it performs well for both large and small samples, so in most cases, when it is not clear which kernel to use, the radial basis kernel is the first choice.
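A minimal sketch (toy points assumed) of how the bandwidth $\sigma$ controls the range of influence: with a small $\sigma$ the kernel value dies off quickly, and beyond that range a data point's effect basically disappears.

```python
import numpy as np

def rbf(x, z, sigma):
    # Gaussian / RBF kernel: exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.zeros(2)
for dist in (0.5, 1.0, 2.0, 4.0):
    z = np.array([dist, 0.0])
    # narrow bandwidth (0.5) versus wide bandwidth (2.0)
    print(dist, rbf(x, z, sigma=0.5), rbf(x, z, sigma=2.0))
```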
-Laplacian Kernel
$k(x_i, x_j) = \exp\left(-\frac{||x_i - x_j||}{\sigma}\right)$
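For comparison, a short sketch ($\sigma = 1$ assumed) of how the Laplacian kernel decays relative to the Gaussian kernel: it falls off exponentially in the distance itself rather than in the squared distance, so it has heavier tails.

```python
import numpy as np

sigma = 1.0                      # assumed bandwidth
for d in (0.5, 1.0, 2.0, 4.0):   # distances ||x_i - x_j||
    k_laplace = np.exp(-d / sigma)
    k_gauss = np.exp(-d ** 2 / (2 * sigma ** 2))
    print(d, k_laplace, k_gauss)
```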
-Sigmoid Kernel
$k(x_i, x_j) = \tanh(\alpha x_i^{T} x_j + c)$
Using the sigmoid kernel function, the support vector machine implements something like a multilayer perceptron neural network.
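A sketch (toy data and parameters assumed) of the sigmoid kernel Gram matrix: applying tanh to scaled inner products is what makes the resulting machine resemble a perceptron-style network with tanh activations.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))      # assumed toy data, one sample per row
alpha, c = 0.5, -1.0             # assumed kernel parameters

K = np.tanh(alpha * (X @ X.T) + c)   # K[i, j] = tanh(alpha x_i^T x_j + c)
print(K)
```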
As in the case of classification, there is an alternative form of the SVM for regression in which the parameter controlling complexity has a more intuitive interpretation (Schölkopf et al., 2000). In particular, instead of fixing the width ϵ of the insensitive region, we fix a parameter ν that bounds the fraction of data points lying outside the tube. This involves maximizing the corresponding dual objective.
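As a practical illustration (not the derivation itself), a sketch using scikit-learn's NuSVR, one common implementation of this ν formulation; the synthetic data below are assumed. Here ν upper-bounds the fraction of points lying outside the tube and lower-bounds the fraction of support vectors.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(80, 1)), axis=0)   # assumed toy inputs
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)          # noisy targets

# nu replaces the fixed epsilon-tube width; the tube width adapts to the data.
model = NuSVR(nu=0.2, C=1.0, kernel="rbf", gamma="scale")
model.fit(X, y)
print(len(model.support_), "support vectors out of", len(y))
```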