Programming Collective Intelligence 读书总结

  • Making Recommendations (Collaborative Filtering)
    • User-based
      • Finding similar users
        • User as vector based on item score
          • Euclidean distance
          • Pearson correlation
        • Reverse users and items, we can find similar items to a given item
      • Sort and recommend items based on
        • sum(user similarity * user’s item score) for each other user
    • Item-based
      • Find item similarities
        • These results can be cached and periodically updated
      • Sort and recommend items based on
        • sum((item similarity * user’s item score) / sum(item similarity)) for each user’s item
      • Significantly faster and better for sparse dataset
  • Discovering Groups (Clustering)
    • Supervised Learning
      • use example inputs and outputs
      • neural networks, decision trees, support-vector machines, and Bayesian filtering
    • Word Vectors of texts
    • Hierarchical Clustering
      • choose two nearest vectors to combine
      • results in binary tree
    • Can cluster articles or words
      • transpose the matrix
    • Dendrogram drawing
    • K-Means clustering
      • randomly place k centroids
      • assign every item to the nearest centroid, and move the centroid to the average location of all items assigned to them
  • Searching and Ranking
    • word index stored in relational database
    • ranking
      • content-based
        • various metrics: word frequency, document location, word distance
      • use inbound links
        • simple count
        • PageRank algorithm
          • random walk
          • sparse matrix multiplication iterations
        • use link text
      • learning from clicks
        • click-tracking neuro-network (multilayer perception network, i.e. MLP network)
          • one hidden layer
  • Optimization
    • stochastic optimization
      • numerical solution
      • cost function
    • random searching
    • hill climbing
      • increase the most promising dimension of a vector
    • simulated annealing
      • variable: temperature, starts very high and gradually gets lower
      • worse solution being accepted depending on temperature
    • generic algorithms
      • mutate, crossover, …
  • Document Filtering (to be expanded…)
    • use words as features
    • naive Bayesian classifier
    • the Fisher method
  • Modeling with Decision Trees
    • Algorithm: CART (Classification and Regression Trees)
      • choose the best split from all possible splits
        • Gini impurity
        • information entropy
          • sum of p(x)log(p(x))
      • recursively build the whole tree
      • then can be used to classify new observations
      • pruning the tree
        • when it becomes overfitted
        • checking pairs of nodes that have a common parent to see if merging them would increase the entropy by less than a specified threshold
    • Dealing with
      • missing data
        • use both branches
      • numerical outcomes
        • use variance instead of entropy
  • Building Price Models
    • k-nearest neighbors (kNN)
      • weighted
      • may need scaling or normalizing
      • to estimate the probability density
    • cross-validation
      • divide data into training sets and test sets
  • Advanced Classification: Kernel Methods and SVMs
    • basic linear classification
      • using dot-products to determine distance
    • kernel methods
      • define another dot-product == move the points into different space
    • support-vector machines
      • find the line that is as far away as possible from classes
  • Finding Independent Features
    • non-negative matrix factorization
      • factor the article-word matrix into two matrix
        • the features matrix: row for features, column for words
        • the weight matrix: row for articles, column for features
  • Evolving Intelligence
    • creating an algorithm that creating algorithms
    • mutation, crossover/breeding
    • use trees to represent algorithm to enable evolving
      • use to guess numerical functions or, game AI
  • Algorithm Summary
    • Supervised Learning
      • Bayesian Classifier
      • Decision Tree Classifier
      • Neural Networks
      • Support-Vector Machines
    • Unsupervised Learning
      • k-Nearest Neighbors
      • Clustering
      • Multidimensional Scaling
      • Non-Negative Matrix Factorization
    • Optimization

你可能感兴趣的:(程序园)