[Spark Source Code Study] How reduceByKey and groupByKey Are Implemented in Terms of combineByKey

groupByKey and reduceByKey are two of the most commonly used functions in Spark.

Normally both functions give the same correct result, but reduceByKey is better suited to large datasets, and the usual advice is to avoid groupByKey where possible (this was the common recommendation in earlier days). Why? Because when Spark executes reduceByKey it first combines the data within each partition and only then moves it across the network, whereas groupByKey moves the data first and combines it afterwards, so far more data has to be shuffled.
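As a quick illustration (a minimal sketch, assuming an already-created SparkContext `sc`), the two jobs below produce the same word counts, but the reduceByKey version pre-aggregates inside each partition before anything is shuffled:

    from operator import add

    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    # reduceByKey: combine inside each partition first, then shuffle the partial sums
    counts_reduce = sorted(pairs.reduceByKey(add).collect())

    # groupByKey: shuffle every individual value first, then aggregate per key
    counts_group = sorted(pairs.groupByKey().mapValues(sum).collect())

    assert counts_reduce == counts_group == [("a", 2), ("b", 1)]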

In the current source (Spark 2.4), the core of both groupByKey and reduceByKey is a call to combineByKey. The differences:

groupByKey implements the same logic as combineByKey; the only difference is that groupByKey comes with its three functions built in, whereas combineByKey lets you supply your own.

reduceByKey has no such built-in functions either; it also calls combineByKey, but the functions it passes in are quite simple, so what it can express is limited. If you have your own requirements, you need to write the three combining functions yourself, following combineByKey's contract.
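A classic case that a plain reduce function cannot express is the per-key average; with three custom functions it falls out naturally. A minimal sketch (the function names are mine, and `sc` is assumed to be an existing SparkContext):

    scores = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 4.0)])

    def create_combiner(v):        # V -> C: start a (sum, count) pair from the first value
        return (v, 1)

    def merge_value(c, v):         # fold one more V into an existing C (map side)
        return (c[0] + v, c[1] + 1)

    def merge_combiners(c1, c2):   # merge two Cs coming from different partitions (reduce side)
        return (c1[0] + c2[0], c1[1] + c2[1])

    sums_counts = scores.combineByKey(create_combiner, merge_value, merge_combiners)
    averages = sums_counts.mapValues(lambda c: c[0] / c[1])
    print(sorted(averages.collect()))  # [('a', 2.0), ('b', 4.0)]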

So it is worth having a look at the source before you use them; it is quite interesting!

reduceByKey source:

    def reduceByKey(self, func, numPartitions=None, partitionFunc=portable_hash):
        """
        Merge the values for each key using an associative and commutative reduce function.

        This will also perform the merging locally on each mapper before
        sending results to a reducer, similarly to a "combiner" in MapReduce.

        Output will be partitioned with C{numPartitions} partitions, or
        the default parallelism level if C{numPartitions} is not specified.
        Default partitioner is hash-partition.

        >>> from operator import add
        >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
        >>> sorted(rdd.reduceByKey(add).collect())
        [('a', 2), ('b', 1)]
        """
        return self.combineByKey(lambda x: x, func, func, numPartitions, partitionFunc)
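The whole method is a one-liner: the combiner for a key is just its first value (the identity function), and the same reduce function serves as both mergeValue and mergeCombiners. So, assuming an existing SparkContext `sc`, a call such as `rdd.reduceByKey(add)` should behave like the hand-written equivalent below:

    from operator import add

    rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    # identity as createCombiner, the reduce function as both mergeValue and mergeCombiners
    print(sorted(rdd.combineByKey(lambda x: x, add, add).collect()))  # [('a', 2), ('b', 1)]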

groupByKey source:

    def groupByKey(self, numPartitions=None, partitionFunc=portable_hash):
        """
        Group the values for each key in the RDD into a single sequence.
        Hash-partitions the resulting RDD with numPartitions partitions.

        .. note:: If you are grouping in order to perform an aggregation (such as a
            sum or average) over each key, using reduceByKey or aggregateByKey will
            provide much better performance.

        >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
        >>> sorted(rdd.groupByKey().mapValues(len).collect())
        [('a', 2), ('b', 1)]
        >>> sorted(rdd.groupByKey().mapValues(list).collect())
        [('a', [1, 1]), ('b', [1])]
        """
        def createCombiner(x):
            return [x]

        def mergeValue(xs, x):
            xs.append(x)
            return xs

        def mergeCombiners(a, b):
            a.extend(b)
            return a

        memory = self._memory_limit()
        serializer = self._jrdd_deserializer
        agg = Aggregator(createCombiner, mergeValue, mergeCombiners)

        def combine(iterator):
            merger = ExternalMerger(agg, memory * 0.9, serializer)
            merger.mergeValues(iterator)
            return merger.items()

        locally_combined = self.mapPartitions(combine, preservesPartitioning=True)
        shuffled = locally_combined.partitionBy(numPartitions, partitionFunc)

        def groupByKey(it):
            merger = ExternalGroupBy(agg, memory, serializer)
            merger.mergeCombiners(it)
            return merger.items()

        return shuffled.mapPartitions(groupByKey, True).mapValues(ResultIterable)
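In other words, groupByKey is essentially combineByKey with list-building functions baked in (plus external merging and a ResultIterable wrapper on the output). A hand-rolled sketch using the public combineByKey API, assuming an existing SparkContext `sc` (the real groupByKey returns a ResultIterable rather than a plain list):

    rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    grouped = rdd.combineByKey(
        lambda x: [x],            # createCombiner: wrap the first value in a list
        lambda xs, x: xs + [x],   # mergeValue: add one more value to the list
        lambda a, b: a + b)       # mergeCombiners: concatenate lists from two partitions

    print(sorted(grouped.collect()))  # [('a', [1, 1]), ('b', [1])]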

combineByKey() source:

    def combineByKey(self, createCombiner, mergeValue, mergeCombiners,
                     numPartitions=None, partitionFunc=portable_hash):
        """
        Generic function to combine the elements for each key using a custom
        set of aggregation functions.

        Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined
        type" C.

        Users provide three functions:

            - C{createCombiner}, which turns a V into a C (e.g., creates
              a one-element list)
            - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
              a list)
            - C{mergeCombiners}, to combine two C's into a single one (e.g., merges
              the lists)

        To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
        modify and return their first argument instead of creating a new C.

        In addition, users can control the partitioning of the output RDD.

        .. note:: V and C can be different -- for example, one might group an RDD of type
            (Int, Int) into an RDD of type (Int, List[Int]).

        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
        >>> def to_list(a):
        ...     return [a]
        ...
        >>> def append(a, b):
        ...     a.append(b)
        ...     return a
        ...
        >>> def extend(a, b):
        ...     a.extend(b)
        ...     return a
        ...
        >>> sorted(x.combineByKey(to_list, append, extend).collect())
        [('a', [1, 2]), ('b', [1])]
        """
        if numPartitions is None:
            numPartitions = self._defaultReducePartitions()

        serializer = self.ctx.serializer
        memory = self._memory_limit()
        agg = Aggregator(createCombiner, mergeValue, mergeCombiners)

        def combineLocally(iterator):
            merger = ExternalMerger(agg, memory * 0.9, serializer)
            merger.mergeValues(iterator)
            return merger.items()

        locally_combined = self.mapPartitions(combineLocally, preservesPartitioning=True)
        shuffled = locally_combined.partitionBy(numPartitions, partitionFunc)

        def _mergeCombiners(iterator):
            merger = ExternalMerger(agg, memory, serializer)
            merger.mergeCombiners(iterator)
            return merger.items()

        return shuffled.mapPartitions(_mergeCombiners, preservesPartitioning=True)
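Note how the implementation mirrors the docstring: the three user functions go into an Aggregator, mergeValues runs map-side inside each partition, partitionBy performs the shuffle, and mergeCombiners runs over the shuffled data. The caller's numPartitions and partitionFunc are simply forwarded to partitionBy, which is what "control the partitioning of the output RDD" means in practice. A small check, again assuming an existing SparkContext `sc`:

    x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

    combined = x.combineByKey(lambda v: [v],
                              lambda xs, v: xs + [v],
                              lambda a, b: a + b,
                              numPartitions=4)

    print(combined.getNumPartitions())  # 4
    print(sorted(combined.collect()))   # [('a', [1, 2]), ('b', [1])]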

A more detailed analysis is still to come.
