Spark RDD Operations

Reproduced from:

  • Spark RDD Operations-Transformation & Action with Example
  • Spark RDD Programming (Python and Scala Versions)

RDD Transformation

A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output.

Applying transformations builds an RDD lineage that records all the parent RDDs of the final RDD(s). The RDD lineage, also known as the RDD operator graph or RDD dependency graph, is a logical execution plan: a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.

Transformations are lazy: they are not executed immediately, but only when an action is called. The two most basic transformations are map() and filter().
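
The lineage and laziness described above can be sketched in plain Python. This is a toy model, not the Spark API: the class ToyRDD and its methods are hypothetical names invented for illustration, and it runs in a single process with no partitions. It only shows that map() and filter() record a step in the lineage, while the work happens when collect() is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD (NOT the Spark API): records lineage, computes lazily."""

    def __init__(self, source=None, parent=None, op="source", fn=None):
        self._source = source   # data, only set on the root node
        self.parent = parent    # the parent ToyRDD this one was derived from
        self.op = op            # name of the step that produced this node
        self._fn = fn           # how to derive this node's items from the parent's

    def map(self, f):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(parent=self, op="map", fn=lambda it: (f(x) for x in it))

    def filter(self, pred):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(parent=self, op="filter",
                      fn=lambda it: (x for x in it if pred(x)))

    def lineage(self):
        # Walk the parent chain back to the source (a linear DAG in this toy).
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))

    def _compute(self):
        if self.parent is None:
            return iter(self._source)
        return self._fn(self.parent._compute())

    def collect(self):
        # Action: forces evaluation of the whole lineage and returns a list.
        return list(self._compute())


calls = []  # records every time the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

base = ToyRDD(source=[1, 2, 3, 4])
evens = base.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # still 0: no transformation has executed
result = evens.collect()          # the action triggers the computation
```

After collect() returns, calls holds every input element, which is the point at which square actually ran. Note that base is left unchanged by the chained calls; each transformation produced a new object.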

After a transformation, the resulting RDD is always a different RDD from its parent. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()), or the same size (e.g. map()).

There are two types of transformations:

  • Narrow transformation
    In a narrow transformation, all the elements required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of operations such as map() and filter().
  • Wide transformation
    In a wide transformation, the elements required to compute the records in a single partition may live in many partitions of the parent RDD. Wide transformations are the result of operations such as groupByKey(), join(), and reduceByKey().
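
The difference between the two kinds of dependency can be illustrated with nested Python lists standing in for partitions. This is a minimal sketch under stated assumptions, not how Spark implements either operation: the function names narrow_map_values and wide_group_by_key, and the byte-sum partitioner, are hypothetical and chosen only to keep the example deterministic.

```python
from collections import defaultdict

# A toy "parent RDD" of (key, value) pairs split across two partitions.
parent = [
    [("a", 1), ("b", 2)],  # partition 0
    [("a", 3), ("c", 4)],  # partition 1
]

def narrow_map_values(partitions, f):
    """Narrow: output partition i is computed from input partition i alone."""
    return [[(k, f(v)) for k, v in part] for part in partitions]

def wide_group_by_key(partitions, num_partitions):
    """Wide: an output partition may need records from every input partition."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:  # every parent partition is read (the "shuffle")
        for k, v in part:
            # Hypothetical deterministic partitioner: sum of the key's bytes.
            target = sum(k.encode()) % num_partitions
            buckets[target][k].append(v)
    return [sorted(b.items()) for b in buckets]

doubled = narrow_map_values(parent, lambda v: v * 2)
grouped = wide_group_by_key(parent, 2)
```

In doubled, each output partition depends on exactly one parent partition, so the partitions could be processed independently. In grouped, the values for key "a" come from both parent partitions, which is why records have to be moved between partitions before the result can be built.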

A transformation returns a new RDD object.

RDD Action

Transformations create RDDs from one another, but when we want to work with the actual dataset, an action is performed. When an action is triggered, no new RDD is formed, unlike with a transformation. Actions are therefore the Spark RDD operations that return non-RDD values: what an action returns is not an RDD object. The values produced by an action are returned to the driver or stored in an external storage system. Triggering an action is what sets the lazily deferred RDD computation in motion.

An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks. Some of Spark's actions are count(), collect(), top(), reduce(), and foreach(). An action computes a result from an RDD and either returns that result to the driver program or saves it to an external storage system (such as HDFS).
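
The actions listed above can be mimicked in plain Python to show what "returns a non-RDD value" means. This is a hedged sketch, not Spark code: the nested list partitions stands in for data held by executors, and the helper functions are hypothetical re-implementations that only borrow the names of the Spark actions. Where it makes sense, each helper first computes a partial result per partition and then combines the partials, loosely imitating executors sending results to the driver.

```python
import heapq
from functools import reduce
from operator import add

# Toy data: three partitions, as if held by three executors.
partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]

def count(parts):
    # Per-partition sizes are summed on the "driver"; the result is an int.
    return sum(len(p) for p in parts)

def collect(parts):
    # All elements are brought back to the "driver" as one list.
    return [x for p in parts for x in p]

def reduce_action(parts, op):
    # Reduce inside each non-empty partition, then combine the partials.
    partials = [reduce(op, p) for p in parts if p]
    return reduce(op, partials)

def top(parts, n):
    # Take the n largest per partition, then the n largest of those.
    partials = [heapq.nlargest(n, p) for p in parts]
    return heapq.nlargest(n, (x for p in partials for x in p))

def foreach(parts, f):
    # Runs f for its side effects only; nothing is returned.
    for p in parts:
        for x in p:
            f(x)

total_count = count(partitions)
all_items = collect(partitions)
total_sum = reduce_action(partitions, add)
largest_three = top(partitions, 3)

seen = []
foreach_result = foreach(partitions, seen.append)
```

Every value produced here is an ordinary Python object (an int, a list, or None), not another dataset to transform further, which is the distinction the text draws between actions and transformations.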

Conclusion

In conclusion, applying a transformation to an RDD creates another RDD; the original is never modified, because RDDs are immutable. The result is computed only when an action is invoked on an RDD. This lazy evaluation reduces computation overhead and makes the system more efficient.
