Feature Engineering: Dimensionality Reduction

Dimensionality Reduction Based on Feature Selection

Dimensionality reduction based on feature selection means selecting, according to certain rules and experience, a subset of the original dimensions to take part directly in subsequent computation and modeling, so that the selected dimensions stand in for the full set; no new dimensions are created in the process.
There are four general approaches to feature-selection-based dimensionality reduction:
· Experience-based: weigh the past experience of business experts or data experts, the actual state of the data, and the depth of business understanding. Business experts rely on domain knowledge to pick, from the many candidate features, those with the greatest influence on the outcome, while data experts rely on hands-on data experience, selecting or excluding dimensions based on the basic properties of the data and their impact on later processing and modeling, for example dropping features with many missing values.
· Trial-and-test: repeatedly try different subsets of dimensions in the computation, and use the results to validate and adjust the selection until the best feature set is found.
· Statistical analysis: use correlation analysis to measure the linear correlation between dimensions and manually remove or filter among the highly correlated ones; or compute the mutual information between dimensions, find the feature sets with high mutual information, and then drop or keep one of the features in each set.
· Machine learning: use a machine learning algorithm to obtain scores or weights for the features, and then select the features with the larger weights. Figure 3-2 shows the variable importance obtained from a CART decision tree model; features are then selected according to the actual weight values (see the sketch after the figure).
[Figure 3-2: Variable importance obtained from a CART decision tree model]
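As a rough illustration of the statistical and machine-learning approaches above (an addition, not part of the original text), the sketch below assumes a Python environment with NumPy and scikit-learn; the bundled breast-cancer data, the 0.95 correlation threshold, and the choice of keeping the top 10 features are arbitrary placeholders.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: any feature matrix X with label vector y works the same way
X, y = load_breast_cancer(return_X_y=True)

# 1) Correlation analysis: flag pairs of dimensions with high linear correlation
corr = np.corrcoef(X, rowvar=False)                      # feature-by-feature correlation matrix
high_corr_pairs = np.argwhere(np.triu(np.abs(corr), k=1) > 0.95)
print("highly correlated feature pairs:\n", high_corr_pairs)

# 2) Mutual information between each feature and the target
mi = mutual_info_classif(X, y, random_state=0)
print("features ranked by mutual information:", np.argsort(mi)[::-1])

# 3) CART-style decision tree importance (the idea behind Figure 3-2)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importance = tree.feature_importances_

# Keep, for example, the 10 features with the largest importance
top10 = np.argsort(importance)[::-1][:10]
X_selected = X[:, top10]
print("selected feature indices:", top10)
```

The correlation threshold, the number of retained features, and the model used to score importance are all choices to be tuned against the actual task; scikit-learn's DecisionTreeClassifier implements an optimized form of CART, so its importances play the same role as the weights described above.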

Dimensionality Reduction Based on Dimension Transformation

Dimensionality reduction based on dimension transformation applies a mathematical transformation: given a set of related variables (dimensions), a mathematical model maps the data points from the high-dimensional space into a lower-dimensional space, and the characteristics of the mapped variables are then used to represent the overall characteristics of the original variables. This approach does create new dimensions: the transformed dimensions are not the original dimensions themselves, but expressions obtained by combining, transforming, or mapping several of the original dimensions. A brief PCA sketch follows as an illustration.
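As a minimal sketch of such a transformation (an addition, assuming scikit-learn and its bundled iris data rather than anything from the original text), the PCA example below maps 4-dimensional points into a 2-dimensional space; each new dimension is a linear combination of the original dimensions, which is exactly the kind of derived expression described above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Example data: 150 observations, 4 original dimensions
X, _ = load_iris(return_X_y=True)

# Map the 4-dimensional points into a 2-dimensional space
pca = PCA(n_components=2)
X_mapped = pca.fit_transform(X)

print(X_mapped.shape)                 # (150, 2): two new, derived dimensions
print(pca.explained_variance_ratio_)  # share of the original variance each new dimension keeps
print(pca.components_)                # each row expresses a new dimension as a mix of the originals
```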

Help documentation for the compute_mapping function in the drtoolbox toolbox

compute_mapping.m
 COMPUTE_MAPPING Performs dimensionality reduction on a dataset A
 
    mappedA = compute_mapping(A, type)
    mappedA = compute_mapping(A, type, no_dims)
    mappedA = compute_mapping(A, type, no_dims, ...)
 
  Performs a technique for dimensionality reduction on the data specified 
  in A, reducing the data to a lower dimensionality in mappedA.
  The data on which dimensionality reduction is performed is given in A
  (rows correspond to observations, columns to dimensions). A may also be a
  (labeled or unlabeled) PRTools dataset.
  The type of dimensionality reduction used is specified by type. Possible
  values are 'PCA', 'LDA', 'MDS', 'ProbPCA', 'FactorAnalysis', 'GPLVM', 
  'Sammon', 'Isomap', 'LandmarkIsomap', 'LLE', 'Laplacian', 'HessianLLE', 
  'LTSA', 'MVU', 'CCA', 'LandmarkMVU', 'FastMVU', 'DiffusionMaps', 
  'KernelPCA', 'GDA', 'SNE', 'SymSNE', 'tSNE', 'LPP', 'NPE', 'LLTSA', 
  'SPE', 'Autoencoder', 'LLC', 'ManifoldChart', 'CFA', 'NCA', 'MCML', and 'LMNN'. 
  The function returns the low-dimensional representation of the data in the 
  matrix mappedA. If A was a PRTools dataset, then mappedA is a PRTools 
  dataset as well. For some techniques, information on the mapping is 
  returned in the struct mapping.
  The variable no_dims specifies the number of dimensions in the embedded
  space (default = 2). For the supervised techniques ('LDA', 'GDA', 'NCA', 
  'MCML', and 'LMNN'), the labels of the instances should be specified in 
  the first column of A (using numeric labels). 
 
    mappedA = compute_mapping(A, type, no_dims, parameters)
    mappedA = compute_mapping(A, type, no_dims, parameters, eig_impl)
 
  Free parameters of the techniques can be defined as well (in place of
  the dots). These parameters differ per technique, and are listed below.
  For techniques that perform spectral analysis of a sparse matrix, one can 
  also specify in eig_impl the eigenanalysis implementation that is used. 
  Possible values are 'Matlab' and 'JDQR' (default = 'Matlab'). We advise
  using 'Matlab' for datasets with 10,000 or fewer datapoints; 
  for larger problems, 'JDQR' might prove to be more fruitful. 
  The free parameters for the techniques are listed below (the parameters 
  should be provided in this order):
 
    PCA:            - none
    LDA:            - none
    MDS:            - none
    ProbPCA:        - <int> max_iterations -> default = 200
    FactorAnalysis: - none
    GPLVM:          - <double> sigma -> default = 1.0
    Sammon:         - none
    Isomap:         - <int> k -> default = 12
    LandmarkIsomap: - <int> k -> default = 12
                    - <double> percentage -> default = 0.2
    LLE:            - <int> k -> default = 12
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    Laplacian:      - <int> k -> default = 12
                    - <double> sigma -> default = 1.0
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    HessianLLE:     - <int> k -> default = 12
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    LTSA:           - <int> k -> default = 12
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    MVU:            - <int> k -> default = 12
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    CCA:            - <int> k -> default = 12
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    LandmarkMVU:    - <int> k -> default = 5
    FastMVU:        - <int> k -> default = 5
                    - <logical> finetune -> default = true
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    DiffusionMaps:  - <double> t -> default = 1.0
                    - <double> sigma -> default = 1.0
    KernelPCA:      - <char[]> kernel -> {'linear', 'poly', ['gauss']}
                    - kernel parameters: type HELP GRAM for info
    GDA:            - <char[]> kernel -> {'linear', 'poly', ['gauss']}
                    - kernel parameters: type HELP GRAM for info
    SNE:            - <double> perplexity -> default = 30
    SymSNE:         - <double> perplexity -> default = 30
    tSNE:           - <int> initial_dims -> default = 30
                    - <double> perplexity -> default = 30
    LPP:            - <int> k -> default = 12
                    - <double> sigma -> default = 1.0
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    NPE:            - <int> k -> default = 12
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    LLTSA:          - <int> k -> default = 12
                    - <char[]> eig_impl -> {['Matlab'], 'JDQR'}
    SPE:            - <char[]> type -> {['Global'], 'Local'}
                    - if 'Local': <int> k -> default = 12
    Autoencoder:    - <double> lambda -> default = 0
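compute_mapping and drtoolbox are MATLAB tools; as a hedged sketch for Python readers (an addition, not from the original text), scikit-learn offers counterparts to several of the techniques listed above. The digits dataset is only an example, and the k = 12 and perplexity = 30 values simply mirror the drtoolbox defaults shown in the help text.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE, Isomap, LocallyLinearEmbedding

X, _ = load_digits(return_X_y=True)   # 64-dimensional example data
no_dims = 2                           # target dimensionality, like no_dims in compute_mapping

# Rough scikit-learn counterparts of a few drtoolbox techniques
mapped_pca = PCA(n_components=no_dims).fit_transform(X)
mapped_kpca = KernelPCA(n_components=no_dims, kernel="rbf").fit_transform(X)  # analogue of the 'gauss' kernel
mapped_isomap = Isomap(n_components=no_dims, n_neighbors=12).fit_transform(X)
mapped_lle = LocallyLinearEmbedding(n_components=no_dims, n_neighbors=12).fit_transform(X)
mapped_tsne = TSNE(n_components=no_dims, perplexity=30).fit_transform(X)

print(mapped_pca.shape)               # (n_samples, 2) for every technique above
```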
