A Quick Review of Machine Learning: 5 SVM

Since we have been looking at regression problems recently, the focus of this article is extending the support vector machine to regression while preserving its sparsity. In a simple linear regression model, we minimize a regularized error function.

To obtain sparse solutions, the quadratic error function is replaced by an ϵ-insensitive error function: if the absolute difference between the prediction y(x) and the target t is less than ϵ, the error assigned by this function is zero.

[Figure 1: the ϵ-insensitive error function]
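To make the ϵ-insensitive idea concrete, here is a minimal sketch using scikit-learn's SVR (the data and parameter values are illustrative assumptions, not from the original article); widening the tube leaves more points with zero error and hence fewer support vectors, i.e. a sparser solution:

```python
# A minimal sketch of epsilon-insensitive support vector regression with
# scikit-learn's SVR. The data and parameter values are illustrative only.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
t = np.sin(X).ravel() + 0.1 * rng.randn(80)

# A wider insensitive tube (larger epsilon) leaves more points with zero
# error, so fewer of them become support vectors and the solution is sparser.
for eps in (0.01, 0.1, 0.3):
    model = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, t)
    print(f"epsilon={eps}: {len(model.support_)} support vectors out of {len(X)}")
```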

Kernel

Introducing a kernel function handles the non-linear case: the samples are mapped from the original space into a higher-dimensional feature space in which a separating hyperplane exists, so that, for example, a distribution that can only be separated by an elliptical surface in the original space becomes linearly separable after the mapping. To perform this transformation we can refer to a mapping function that sends each sample point into the new space, where the hyperplane is then found. This technique is used not only in SVMs but also in other statistical tasks.
However, the mapping function itself is not the key object; the kernel function is. See the definition given in “Statistical Learning Method”:
[Definition of a kernel function]
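Paraphrasing that definition (standard notation, not copied from the book's figure): a function $k$ is a kernel function if there exists a mapping function $\phi: \mathcal{X} \rightarrow \mathcal{H}$ from the input space to a feature space such that, for all $x, z \in \mathcal{X}$,

$$k(x, z) = \langle \phi(x), \phi(z) \rangle .$$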

There is a functional relationship between the mapping function and the kernel function. In practice we usually give the definition of the kernel function rather than the mapping function, partly because the kernel is far simpler to compute. Suppose we map a two-dimensional space into a new space consisting of all first- and second-order combinations of the original coordinates: we obtain five dimensions; if the original space is three-dimensional and we also include third-order combinations, we get a 19-dimensional new space. This number grows explosively, which makes the computation very difficult, and in the infinite-dimensional case the mapping cannot be computed at all; this is why the kernel is needed. Note, however, that a given kernel function does not uniquely determine the feature space and the mapping function, and even for the same feature space the mapping function may differ. For example:
[Example: kernel function and mapping function]
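As a concrete numerical check of that last point (my own sketch, not the original figure): for the kernel $k(x, z) = (x^T z)^2$ on $\mathbb{R}^2$, two different feature maps, one into $\mathbb{R}^3$ and one into $\mathbb{R}^4$, induce exactly the same kernel.

```python
# Two different mapping functions that induce the same kernel k(x, z) = (x.z)^2.
import numpy as np

def phi_3d(x):
    # maps R^2 -> R^3
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def phi_4d(x):
    # maps R^2 -> R^4
    return np.array([x[0] ** 2, x[0] * x[1], x[1] * x[0], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k = (x @ z) ** 2
print(np.isclose(phi_3d(x) @ phi_3d(z), k))  # True
print(np.isclose(phi_4d(x) @ phi_4d(z), k))  # True
```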

It can be shown that mapping the vectors into a sufficiently high-dimensional space makes them much more likely to be linearly separable, which is why the kernel method is so effective.

Commonly used kernel functions and comparison:
- Linear Kernel
$k(x_i, x_j) = x_i^T x_j$
The linear kernel is the simplest kernel function and is used mainly for linearly separable problems. It finds the optimal linear classifier in the original space and has the advantages of few parameters and fast computation. If we apply the linear kernel in KPCA, the derivation turns out to be exactly the same as the ordinary PCA algorithm; in other words, the linear kernel simply gives an equivalent formulation (a numerical check is sketched below).
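One way to see that equivalence is to compare the two decompositions numerically; a minimal sketch, assuming scikit-learn is available (the projections agree up to the sign of each component):

```python
# KPCA with a linear kernel recovers the same projections as ordinary PCA.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5)

proj_pca = PCA(n_components=2).fit_transform(X)
proj_kpca = KernelPCA(n_components=2, kernel="linear").fit_transform(X)

# Each principal component is only defined up to a sign flip.
print(np.allclose(np.abs(proj_pca), np.abs(proj_kpca), atol=1e-6))  # True
```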
- Polynomial Kernel
$k(x_i, x_j) = (x_i^T x_j)^d$
There is also a more general form:
$k(x_i, x_j) = (a x_i^T x_j + b)^d$
Here $d \ge 1$ is the degree of the polynomial, and the kernel has more parameters to tune. The polynomial kernel is a non-standard kernel function that is well suited to orthonormalized data. It is a global kernel and can map a low-dimensional input space into a high-dimensional feature space. The larger the parameter $d$, the higher the dimension of the mapping and the larger the entries of the kernel matrix become, so it is prone to overfitting.
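For reference, this general form matches scikit-learn's pairwise polynomial kernel, with gamma playing the role of $a$ and coef0 the role of $b$; a quick check with illustrative values:

```python
# The general polynomial kernel (a * x_i^T x_j + b)^d, checked against
# scikit-learn's implementation.
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.RandomState(0)
X = rng.randn(5, 3)
a, b, d = 0.5, 1.0, 3

K_manual = (a * X @ X.T + b) ** d
K_sklearn = polynomial_kernel(X, degree=d, gamma=a, coef0=b)
print(np.allclose(K_manual, K_sklearn))  # True
```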

- Radial Basis Function (RBF) Kernel

$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

$\sigma > 0$ is the bandwidth of the Gaussian kernel. This is the classic robust radial basis kernel, i.e. the Gaussian kernel function. Robust radial basis kernels resist noise in the data well, and the parameter determines the range over which the function acts; beyond that range the influence of the data “essentially disappears.” The Gaussian kernel is an excellent representative of this family of kernel functions and is one that should always be tried: it performs well on both large and small samples, so in most cases, when it is unclear which kernel to use, the radial basis kernel is the first choice.
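Here is a minimal numpy sketch of the formula above (data and $\sigma$ values are illustrative); a small $\sigma$ makes the similarity decay very quickly with distance, while a large $\sigma$ keeps distant points relevant:

```python
# Gaussian (RBF) kernel matrix computed directly from the formula.
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # pairwise squared Euclidean distances via broadcasting
    sq_dists = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
for sigma in (0.5, 2.0):
    print(f"sigma={sigma}:")
    print(np.round(rbf_kernel(X, X, sigma), 4))
```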

- Laplacian Kernel
$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{\sigma}\right)$

- Sigmoid Kernel
$k(x_i, x_j) = \tanh(\alpha x_i^T x_j + c)$
Using the sigmoid kernel function, the support vector machine implements a kind of multilayer perceptron neural network.

It is also interesting that what the SVM does is quite similar to ReLU plus L2 regularization (the hinge loss is a shifted ReLU and the margin term is a weight penalty), so it might be natural to build a neural-network-style model out of SVM components.
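A tiny numerical illustration of that remark (my own sketch): the hinge loss used by the SVM is just a ReLU applied to a shifted margin.

```python
# hinge(margin) = max(0, 1 - margin) = relu(1 - margin)
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hinge(margin):
    # margin = t * y(x) with targets t in {-1, +1}
    return np.maximum(0.0, 1.0 - margin)

m = np.linspace(-2.0, 3.0, 11)
print(np.allclose(hinge(m), relu(1.0 - m)))  # True
```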

As with the classification problem, there is an alternative formulation of the SVM for regression in which the parameter controlling complexity has a more intuitive interpretation (Schölkopf et al., 2000). In particular, instead of fixing the width ϵ of the insensitive region, we fix a parameter ν that bounds the fraction of data points lying outside the tube. This involves maximizing a modified dual objective.
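As a practical illustration of this ν formulation, here is a minimal scikit-learn sketch (data and parameter values are illustrative assumptions); in NuSVR the parameter ν upper-bounds the fraction of training errors and lower-bounds the fraction of support vectors, so ϵ does not need to be fixed in advance:

```python
# nu-SVR: nu replaces the fixed epsilon of ordinary SVR.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
t = np.sin(X).ravel() + 0.1 * rng.randn(100)

for nu in (0.1, 0.5):
    model = NuSVR(kernel="rbf", C=1.0, nu=nu).fit(X, t)
    print(f"nu={nu}: {len(model.support_)} support vectors out of {len(X)}")
```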
