csdn:https://blog.csdn.net/linxid/article/details/80494922
Answer: used to describe massive structured and unstructured data that are so large that they are difficult to process using traditional database and software techniques.
Answer: Volume: large scale; Velocity: high speed; Variety: diverse forms; Veracity: truthfulness and accuracy.
Answer: Under acceptable computational efficiency limitations, applying data analysis and discovery algorithms to produce a particular enumeration of patterns over the data.
Answer:
1. Association rule mining (two steps: find all frequent itemsets, then generate strong association rules from the frequent itemsets):
finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. A minimal sketch of the first step follows below.
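As an illustration of the first step, here is a minimal level-wise (Apriori-style) frequent-itemset search in Python; the toy baskets and the `min_support` threshold are made-up examples:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search: an itemset can only be
    frequent if all of its subsets are frequent."""
    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items
             if sum(s <= t for t in transactions) >= min_support}
    result = set(level)
    while level:
        k = len(next(iter(level)))
        # candidate (k+1)-itemsets from frequent k-itemsets
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support}
        result |= level
    return result

# toy example: items bought together in four baskets
baskets = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("BC")]
print(frequent_itemsets(baskets, min_support=2))
```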
Answer:
Supervised learning; unsupervised learning; semi-supervised learning
Answer:
Increase the sample size; remove outliers; decrease model complexity; use train-validation-test splits (cross-validation); apply regularization.
KNN; Naive Bayes; Decision Tree; SVM
Bagging -> Random Forest; Boosting -> AdaBoost; Stacking
K-means; Hierarchical Clustering; DBSCAN; Apriori
To address challenges such as the curse of dimensionality, storage cost, and query speed.
Definition: the minhash value of a column is the number of the first row (in the permuted order) in which that column has a 1.
How to compute the Signature matrix
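A minimal sketch of filling the signature matrix, assuming the characteristic matrix is given as one set of row indices per column; the random permutations and the toy matrix are illustrative, not the lecture's example:

```python
import random

def minhash_signatures(columns, n_rows, n_hashes, seed=0):
    """columns: list of sets of row indices where the column has a 1.
    For each (permutation, column) pair, the signature entry is the
    index of the first row, in permuted order, where the column has a 1."""
    rng = random.Random(seed)
    sig = [[0] * len(columns) for _ in range(n_hashes)]
    for h in range(n_hashes):
        perm = list(range(n_rows))
        rng.shuffle(perm)  # one random row permutation per hash function
        for c, rows in enumerate(columns):
            sig[h][c] = min(perm[r] for r in rows)
    return sig

# toy characteristic matrix with 5 rows and 3 columns
cols = [{0, 3}, {2}, {0, 2, 4}]
for row in minhash_signatures(cols, n_rows=5, n_hashes=3):
    print(row)
```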
https://blog.csdn.net/linxid/article/details/79745964
Sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population.
Two advantages of sampling:
the cost is lower and data collection is faster than measuring the entire population.
The problem: generating values of X which are distributed according to a given distribution.
Sampling based on the inverse of the Cumulative Distribution Function.
Algorithm:
* 1. Generate a random number u from the standard uniform distribution on the interval [0,1].
* 2. Compute the value x such that F_X(x) = u.
* 3. Take x to be the random number drawn from the distribution described by F_X (a sketch follows below).
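When the inverse CDF is available in closed form, the three steps above are a few lines of Python; the exponential distribution below is just a convenient illustration:

```python
import math
import random

def sample_exponential(lam):
    """Inverse transform sampling for Exp(lam):
    F(x) = 1 - exp(-lam * x), so F^{-1}(u) = -ln(1 - u) / lam."""
    u = random.random()              # step 1: u ~ Uniform(0, 1)
    return -math.log(1.0 - u) / lam  # steps 2-3: x = F^{-1}(u)

samples = [sample_exponential(2.0) for _ in range(100000)]
print(sum(samples) / len(samples))  # should be close to 1/lam = 0.5
```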
Drawback: it is often hard to obtain the inverse function in closed form.
Rejection Sampling: the inefficiency of plain rejection sampling can be addressed by adaptive rejection sampling.
Algorithm:
* 1. Obtain a sample y from the proposal distribution g and a sample u from Uniform(0,1) (the uniform distribution over the unit interval).
* 2. If u < f(y) / (M g(y)), accept y as a sample drawn from the target distribution f; otherwise, reject y and return to step 1 (see the sketch below).
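A minimal sketch of the procedure, with an illustrative target f(x) = 2x on [0, 1] and a uniform proposal g with envelope constant M = 2 (these choices are mine, not from the notes):

```python
import random

def target_pdf(x):
    """Target density f(x) = 2x on [0, 1]."""
    return 2.0 * x

def rejection_sample():
    """Proposal g = Uniform(0, 1); M = 2 guarantees f(x) <= M * g(x)."""
    M = 2.0
    while True:
        y = random.random()                # step 1: y ~ g, u ~ Uniform(0, 1)
        u = random.random()
        if u < target_pdf(y) / (M * 1.0):  # step 2: accept with prob f(y)/(M g(y))
            return y                       # accepted sample follows f

samples = [rejection_sample() for _ in range(100000)]
print(sum(samples) / len(samples))  # E[X] = 2/3 for f(x) = 2x
```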
Importance Sampling: do not reject, but assign a weight to each instance so that the correct distribution is targeted.
Markov property:
The conditional probability distribution of future states of the process depends only upon the present state, not on the sequence of states that preceded it.
Markov chain:
A Markov chain is a sequence of random variables X1, X2, X3, … with the Markov property.
MH Sampling:
* Does not have a 100% acceptance rate
* Does not require knowing the full conditionals
* The acceptance threshold is crucial (a sketch follows after the Gibbs notes below)
Gibbs Sampling:
* 100% acceptance rate
* Requires knowing the full conditionals
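A minimal Metropolis-Hastings sketch, assuming a standard normal target and a symmetric random-walk proposal (so the acceptance ratio reduces to f(x')/f(x)); all names below are illustrative:

```python
import math
import random

def target(x):
    """Unnormalized standard normal density."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, step=1.0):
    x = 0.0
    chain = []
    for _ in range(n_samples):
        x_new = x + random.uniform(-step, step)   # symmetric proposal
        # acceptance threshold: min(1, f(x') / f(x))
        if random.random() < target(x_new) / target(x):
            x = x_new                             # accept; otherwise keep x
        chain.append(x)
    return chain

chain = metropolis_hastings(100000)
print(sum(chain) / len(chain))  # should be near 0 for a standard normal
```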
https://blog.csdn.net/itplus/article/details/19168937
https://blog.csdn.net/google19890102/article/details/51755242
https://en.wikipedia.org/wiki/Sampling_(statistics)
https://blog.csdn.net/ustbxy/article/details/45458725
https://blog.csdn.net/baimafujinji/article/details/51407703
Bishop, *Pattern Recognition and Machine Learning*
Concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways.
In other words, the underlying probability distribution changes over time.
Real concept drift:
p(y|x) changes
Virtual concept drift:
p(x) changes, but not p(y|x)
Drift detection: monitor the change of the data distributions, or monitor the change of the classification performance.
Algorithm:
* Calculate the information gain for the attributes and determine the best two attributes.
* At each node, check the condition: delta(G) = G(a) - G(b) > epsilon (see the sketch below).
* If the condition is satisfied, create child nodes based on the test at the node.
* If not, stream in more examples and repeat the calculation until the condition is satisfied.
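A sketch of the split test, assuming the information gains have already been computed; the Hoeffding bound epsilon = sqrt(R^2 ln(1/delta) / (2n)) is the standard VFDT choice, but the function names and the delta value below are mine:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean is within epsilon of the
    observed mean with probability 1 - delta after n examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, n, value_range=1.0, delta=1e-7):
    """Split when delta(G) = G(a) - G(b) > epsilon; otherwise wait
    for more streaming examples."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)

print(should_split(0.30, 0.20, n=2000))  # True: enough evidence to split
print(should_split(0.30, 0.29, n=2000))  # False: near-tie, keep streaming
```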
Strengths:
* Scales better than traditional methods
* Incremental
Weakness:
* Could spend a lot of time on ties
* Memory used grows with tree expansion
* Number of candidate attributes
* Online phase: summarize the data into memory-efficient data structures (see the sketch below).
* Offline phase: use a clustering algorithm to find the data partition.
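One common memory-efficient summary is the cluster feature vector (N, LS, SS) used by BIRCH/CluStream-style methods; the 1-D class below is an illustrative sketch, not the notes' algorithm:

```python
class ClusterFeature:
    """Summary of a set of 1-D points: count, linear sum, square sum.
    Enough to recover mean and variance without storing the points."""
    def __init__(self):
        self.n, self.ls, self.ss = 0, 0.0, 0.0

    def add(self, x):
        """Absorb one streaming point in O(1) time and memory."""
        self.n += 1
        self.ls += x
        self.ss += x * x

    def mean(self):
        return self.ls / self.n

    def variance(self):
        return self.ss / self.n - self.mean() ** 2

cf = ClusterFeature()
for x in [1.0, 2.0, 3.0]:
    cf.add(x)
print(cf.mean(), cf.variance())  # 2.0 0.666...
```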
https://machinelearningmastery.com/gentle-introduction-concept-drift-machine-learning/
http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift
https://www.hindawi.com/journals/tswj/2015/235810/
The average distance between two random individuals in the USA: 6
The average distance between two randomly chosen users on Facebook (721 million active users, 69 billion links): 4.74
Advantage:
Disadvantage:
Explanation:
Prune all nodes with degree 1 until no degree-1 nodes are left in the network; the pruned nodes are assigned ks = 1. Similarly, prune the nodes having degree 2 and assign them ks = 2. Repeat until the graph becomes empty. A sketch of this pruning follows below.
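A minimal sketch of this pruning on an adjacency-list graph (the example graph is made up):

```python
def k_shell(adj):
    """adj: dict node -> set of neighbors. Returns the ks value per node
    by repeatedly pruning all nodes of degree <= k."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    ks = {}
    k = 1
    while adj:
        # prune nodes with degree <= k until none remain; assign ks = k
        while True:
            to_prune = [u for u, vs in adj.items() if len(vs) <= k]
            if not to_prune:
                break
            for u in to_prune:
                ks[u] = k
                for v in adj[u]:
                    adj[v].discard(u)  # remove u from neighbors' lists
                del adj[u]
        k += 1
    return ks

# triangle A-B-C with a pendant node D attached to A
graph = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
print(k_shell(graph))  # D gets ks = 1, the triangle gets ks = 2
```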
If a page is linked to by many highly cited pages, then it will gain a high PageRank score.
We assume a surfer can use a URL to jump to any page; this random jump solves the problem of a node having no outlinks.
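A minimal power-iteration sketch with the random-jump term; the damping factor 0.85 is the conventional choice, and the toy link graph is made up:

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict page -> list of outlinked pages. Dangling pages
    (no outlinks) are treated as if the surfer jumps to a uniformly
    chosen URL."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}  # random-jump (teleport) term
        for p in pages:
            out = links[p]
            if out:   # distribute rank along outlinks
                for q in out:
                    new[q] += d * rank[p] / len(out)
            else:     # dangling node: spread rank over all pages
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
print(pagerank(toy))
```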
Explanation:
http://blog.jobbole.com/23286/
https://www.cnblogs.com/rubinorth/p/5799848.html
https://en.wikipedia.org/wiki/PageRank#Algorithm
Minimum cut:
may return an imbalanced partition.
Ratio Cut & Normalized Cut:
Both can be calculated using the spectral clustering algorithm.
Modularity Maximization:
Measures the strength of a community by taking into account the degree distribution; the formula is given below.
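For reference, the standard Newman-Girvan modularity (the textbook definition, not something specific to these notes):

```latex
% A is the adjacency matrix, k_i the degree of node i,
% m the number of edges, c_i the community of node i.
Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
```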
http://blog.sciencenet.cn/blog-3075-982948.html
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework.
Namenode (name node): stores the file system metadata and manages the namespace.
Datanodes (data nodes): store the actual data blocks.
* Advantages
Automatic parallelization & distribution, fault tolerance and automatic recovery, a clean and simple programming abstraction
* Main properties of HDFS
Large scale, block replication, failure handling, fault tolerance
* Hadoop vs other systems
| | Distributed databases | Hadoop |
|---|---|---|
| Computing model | Transactions, concurrency control | Jobs, no concurrency control |
| Data model | Structured data, read/write mode | Any data in any format, read-only mode |
| Cost model | Expensive servers | Cheap commodity machines |
| Fault tolerance | Failures are rare | Failures are common, simple yet efficient fault tolerance |
| Key characteristics | Efficiency, optimizations, fine-tuning | Scalability, flexibility, fault tolerance |
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
Users only provide the “Map” and “Reduce” functions.
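The canonical word-count example, written here as a plain-Python simulation of the model; in a real cluster the framework performs the shuffle/group step between the two user-provided functions:

```python
from collections import defaultdict

def map_fn(document):
    """User-provided Map: emit (word, 1) for every word."""
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    """User-provided Reduce: sum the counts for one key."""
    yield word, sum(counts)

def mapreduce(documents):
    # shuffle: group all intermediate values by key
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # reduce each key group
    return dict(kv for key, values in groups.items()
                   for kv in reduce_fn(key, values))

print(mapreduce(["big data big ideas", "data streams"]))
# {'big': 2, 'data': 2, 'ideas': 1, 'streams': 1}
```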
Spark generalizes MapReduce to support new applications, in particular multi-pass algorithms, within the same engine.

| MapReduce | Spark |
|---|---|
| Great at one-pass computation, but inefficient for multi-pass algorithms | Extends programming languages with a distributed collection data structure (RDD) |
| No efficient primitives for data sharing | Clean APIs in Java, Scala, Python, R |
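A short PySpark sketch of the RDD abstraction: `cache()` keeps the distributed collection in memory across passes, which is exactly the data-sharing primitive MapReduce lacks (assumes a local Spark installation; the numbers are a toy example):

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

# one distributed collection, reused by several passes
nums = sc.parallelize(range(1, 1001)).cache()

total = nums.sum()                                 # pass 1
evens = nums.filter(lambda x: x % 2 == 0).count()  # pass 2 reuses cached data
print(total, evens)

sc.stop()
```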
https://www.yiibai.com/hadoop/hadoop_introduction_to_hadoop.html
https://blog.csdn.net/qq_26437925/article/details/78467216