Hadoop Distributed System Architecture - MapReduce - 02

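1. Introduction to MapReduce

The idea behind MapReduce shows up everywhere in daily life, and most people have met it in one form or another. Its core is "divide and conquer", which suits large, complex processing tasks (large-scale data processing).
Map handles the "divide" part: it breaks a complex task into a number of "simple tasks" that are processed in parallel. Splitting is only possible when these small tasks can be computed in parallel and have almost no dependencies on one another.
Reduce handles the "combine" part: it performs a global aggregation of the results of the map phase.
MapReduce runs on a YARN cluster, which consists of: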

ResourceManager
NodeManager

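2. MapReduce Programming Conventions

Developing a MapReduce job involves eight steps: 2 in the Map phase, 4 in the Shuffle phase, and 2 in the Reduce phase.

Map phase, 2 steps:

Set the InputFormat class, which splits the data into Key-Value pairs (K1 and V1) and feeds them to the second step
Write custom Map logic that converts the result of the first step into different Key-Value pairs (K2 and V2) and emits them

Shuffle phase, 4 steps:

Partition the emitted Key-Value pairs
Sort the data in each partition by Key
(Optional) Run a preliminary combine on the data to reduce the amount copied over the network
Group the data, putting the Values that share a Key into one collection

Reduce phase, 2 steps:

Sort and merge the results of multiple Map tasks, then write a Reduce function with your own logic that processes the input Key-Value pairs and emits new Key-Value pairs (K3 and V3)
Set the OutputFormat, which processes and saves the Key-Value data emitted by Reduce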
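
The eight steps can be traced on a tiny in-memory example. The following is only a sketch in plain Java with no Hadoop dependency: the class name EightStepsSketch and its methods are made up for illustration, and the real framework distributes these stages across machines instead of running them in one process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EightStepsSketch {

    // Steps 1-2: read each line and emit (word, 1) pairs, i.e. (K2, V2).
    static List<Map.Entry<String, Long>> map(List<String> lines) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    out.add(Map.entry(word, 1L));
                }
            }
        }
        return out;
    }

    // Steps 3-6: partition by key hash, then sort and group the keys inside each partition.
    // A TreeMap keeps the keys sorted and collects the values that share a key.
    static List<TreeMap<String, List<Long>>> shuffle(List<Map.Entry<String, Long>> pairs,
                                                     int numPartitions) {
        List<TreeMap<String, List<Long>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Long> pair : pairs) {
            int p = (pair.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
            partitions.get(p)
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return partitions;
    }

    // Steps 7-8: sum the grouped values per key and return the (K3, V3) output.
    static TreeMap<String, Long> reduce(TreeMap<String, List<Long>> grouped) {
        TreeMap<String, Long> result = new TreeMap<>();
        grouped.forEach((key, values) ->
                result.put(key, values.stream().mapToLong(Long::longValue).sum()));
        return result;
    }

    // Run all stages and gather the output of every partition into one sorted map.
    static TreeMap<String, Long> run(List<String> lines, int numPartitions) {
        TreeMap<String, Long> all = new TreeMap<>();
        for (TreeMap<String, List<Long>> partition : shuffle(map(lines), numPartitions)) {
            all.putAll(reduce(partition));
        }
        return all;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(List.of("hello world", "hello hadoop"), 2));
    }
}
```

In a real job each partition would be written to its own output file; the sketch merges them only to make the result easy to print.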

2.2. Word Count Example

Custom map-phase code:

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class wordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] strings = value.toString().split(" ");
        for (String string : strings) {
            context.write(new Text(string),new LongWritable(1));
        }
    }
}

Custom reduce-phase code:

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class wordCountReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key,new LongWritable(count));
    }
}

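Writing the main method:
While writing the main method, keep in mind the eight steps from the MapReduce programming conventions above (see the comments in the code).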


/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class JobMain extends Configured implements Tool {

    @Override
    public int run(String[] strings) throws Exception {
        Job job = Job.getInstance(super.getConf(), JobMain.class.getSimpleName());
        // When the program is packaged and run on a cluster, name the class that holds the main method so the jar can be located.
        job.setJarByClass(JobMain.class);
        // Step 1: read the file and parse it into key-value pairs
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job,new Path("hdfs://master:8020/a.txt"));
        // Step 2: set the custom Mapper class
        job.setMapperClass(wordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Steps 3, 4, 5 and 6 are omitted (defaults are used)
        // Step 7: set the custom Reducer class
        job.setReducerClass(wordCountReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Step 8: set the output format class and the output path
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("hdfs://master:8020/wordCount_out"));

        boolean b = job.waitForCompletion(true);

        return b?0:1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Tool tool = new JobMain();
        int run = ToolRunner.run(configuration, tool, args);
        System.exit(run);
    }
}

2.3. Partitioning

A custom Partitioner can define partitioning logic. When the program runs, records that match the same partitioning rule are sent to the same Reduce task.

public class Mypartition extends Partitioner<Text, NullWritable> {
    @Override
    public int getPartition(Text text, NullWritable nullWritable, int i) {
        String number = text.toString().split("\t")[5];
        if (Integer.parseInt(number) > 15){
            return 1;
        } else {
            return 0;
        }
    }
}

After writing the custom Partitioner, the main method also needs calls to setPartitionerClass and setNumReduceTasks.

        job.setPartitionerClass(Mypartition.class);
        job.setNumReduceTasks(2);
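
To see what a partitioner actually computes, here is a small plain-Java sketch without any Hadoop dependency. The class name PartitionRuleSketch is hypothetical. The hashPartition method mirrors the formula used by Hadoop's default HashPartitioner, and thresholdPartition repeats the rule of the custom partitioner above; the field values in main are made-up sample data.

```java
public class PartitionRuleSketch {

    // Default rule: clear the sign bit of the hash, then take it modulo the number of reducers.
    static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Custom rule from the text: the sixth field (index 5) decides the partition.
    static int thresholdPartition(String tabSeparatedLine) {
        int number = Integer.parseInt(tabSeparatedLine.split("\t")[5]);
        return number > 15 ? 1 : 0;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition.
        System.out.println(hashPartition("hello", 2) == hashPartition("hello", 2));
        // Made-up records with six tab-separated fields.
        System.out.println(thresholdPartition("f0\tf1\tf2\tf3\tf4\t20")); // prints 1
        System.out.println(thresholdPartition("f0\tf1\tf2\tf3\tf4\t7"));  // prints 0
    }
}
```

The return value must stay in the range 0 to numReduceTasks - 1, which is why the number of reduce tasks is set to 2 for a partitioner that returns 0 or 1.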

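2.4. Sorting and Serialization in MapReduce

1. Serialization is the conversion of a structured object into a byte stream.
2. Deserialization is the reverse process: it converts a byte stream back into a structured object. Objects must be serialized into a byte stream when they are passed between processes or persisted, and a byte stream that is received or read from disk must be deserialized to become an object again.
3. Java serialization (Serializable) is a heavyweight framework. A serialized object carries a lot of extra information (checksums, headers, the inheritance hierarchy and so on), which makes it inefficient to send over the network. Hadoop therefore developed its own compact and efficient serialization mechanism (Writable). It does not transfer multi-level parent-child relationships the way Java objects do; only the attribute values that are needed are transferred, which greatly reduces network overhead.
4. Writable is Hadoop's serialization format. Hadoop defines a Writable interface, and a class becomes serializable simply by implementing it.
5. Writable has a sub-interface, WritableComparable, which supports both serialization and key comparison. Sorting can be implemented by having a custom key implement WritableComparable.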
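
The size difference described in point 3 can be observed with the JDK alone. The sketch below is an illustration, not Hadoop code: SerializationSizeSketch and JavaPair are hypothetical names, and fieldOnlySize only imitates what a Writable's write(DataOutput) method does by writing the raw field values.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationSizeSketch {

    // Plain Java-serializable pair, used only for the comparison.
    static class JavaPair implements Serializable {
        private static final long serialVersionUID = 1L;
        final String first;
        final int second;

        JavaPair(String first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    // Size with Java's built-in serialization (stream header and class descriptor included).
    static int javaSerializedSize(String first, int second) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new JavaPair(first, second));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Size when only the field values are written, in the style of Writable.write.
    static int fieldOnlySize(String first, int second) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(first);  // 2-byte length prefix plus the encoded characters
            out.writeInt(second); // 4 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("Java serialization:  " + javaSerializedSize("a", 1) + " bytes");
        System.out.println("Field-only encoding: " + fieldOnlySize("a", 1) + " bytes"); // 7 bytes
    }
}
```

For the pair ("a", 1) the field-only encoding takes 2 + 1 + 4 = 7 bytes, while the Java-serialized form is several times larger because it also describes the class.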

2.4.2. A Small Example

Given the following dataset:
a 1
a 9
b 3
a 7
b 8
b 10
a 5
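Requirements:
1. Sort the first column in dictionary order.
2. When the first column is equal, sort by the second column in ascending order.

Analysis:
1. In the Map phase the file is read so that K1 is the byte offset of a line and V1 is the content of that line.
2. Map converts <K1, V1> into <K2, V2>, where K2 is a PairWritable and V2 is second.
3. PairWritable defines the comparison, so the sort in the shuffle phase orders K2 automatically.
4. The Reduce phase converts <K2, V2> into the output <K3, V3>, where K3 and V3 form the "string number" format we want.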
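
Before looking at the Hadoop implementation, the expected ordering can be checked with a plain-Java sketch. SecondarySortSketch and its nested Pair class are hypothetical stand-ins for the composite key; they use an ordinary Comparator and are not the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {

    // In-memory stand-in for the composite key (first, second).
    static final class Pair {
        final String first;
        final int second;

        Pair(String first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public String toString() {
            return first + " " + second;
        }
    }

    // Compare by first (dictionary order), then by second (ascending).
    static final Comparator<Pair> ORDER =
            Comparator.comparing((Pair p) -> p.first).thenComparingInt(p -> p.second);

    // The sample dataset from the text, in its original order.
    static final List<Pair> SAMPLE = List.of(
            new Pair("a", 1), new Pair("a", 9), new Pair("b", 3), new Pair("a", 7),
            new Pair("b", 8), new Pair("b", 10), new Pair("a", 5));

    // Return a sorted copy and leave the input untouched.
    static List<Pair> sort(List<Pair> input) {
        List<Pair> sorted = new ArrayList<>(input);
        sorted.sort(ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // prints [a 1, a 5, a 7, a 9, b 3, b 8, b 10]
        System.out.println(sort(SAMPLE));
    }
}
```

Note that the second column is compared as a number, so 10 sorts after 8; comparing it as text would put "10" before "3".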

Implementation:
Step 1. Custom type and comparator

package MapReduce.SerializationTest;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class PairWritable implements WritableComparable<PairWritable> {

    private String first;
    private int second;
    public PairWritable(){

    }
    public PairWritable(String first,int second){
        this.set(first,second);
    }
    public void set(String first,int second){
        this.first = first;
        this.second = second;
    }


    // Serialization
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(first);
        dataOutput.writeInt(second);
    }


    // Deserialization
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.first = dataInput.readUTF();
        this.second = dataInput.readInt();
    }

    @Override
    public int compareTo(PairWritable o) {
        // Order by first; break ties with second in ascending order
        int comp = this.first.compareTo(o.first);
        if (comp != 0) {
            return comp;
        } else {
            return Integer.compare(this.second, o.getSecond());
        }
    }

    public String getFirst() {
        return first;
    }

    public void setFirst(String first) {
        this.first = first;
    }

    public int getSecond() {
        return second;
    }

    public void setSecond(int second) {
        this.second = second;
    }

    @Override
    public String toString() {
        return "PairWritable{" +
                "first='" + first + '\'' +
                ", second=" + second +
                '}';
    }
}

Step 2. Write the Map method

package MapReduce.SerializationTest;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class SortMapperTest extends Mapper<LongWritable, Text, PairWritable, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] s = value.toString().split("\t");

        context.write(new PairWritable(s[0], Integer.valueOf(s[1])), new IntWritable(Integer.valueOf(s[1])));
    }
}

Step 3. Write the Reduce method

package MapReduce.SerializationTest;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class SortReducerTest extends Reducer<PairWritable, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(PairWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        for (IntWritable value : values) {
            context.write(new Text(key.getFirst()),value);
        }
    }
}

Step 4. Write the main method

package MapReduce.SerializationTest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class SortMain extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        Configuration conf = super.getConf();
        conf.set("mapreduce.framework.name","local");
        Job job = Job.getInstance(conf, SortMain.class.getSimpleName());
        job.setJarByClass(SortMain.class);

        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job,new Path("D:\\input\\Sort"));

        job.setMapperClass(SortMapperTest.class);
        job.setMapOutputKeyClass(PairWritable.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(SortReducerTest.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job,new Path("D:\\input\\SortResult"));

        boolean b = job.waitForCompletion(true);

        return b?0:1;

    }



    public static void main(String[] args) throws Exception {
        // Let failures propagate so the process does not exit with status 0 after an error
        int run = ToolRunner.run(new Configuration(), new SortMain(), args);
        System.exit(run);
    }
}

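2.5. Partition

To compute statistics, a batch of similar records can be sent to the same Reduce task and counted there. Partitioning is simply a way of routing records of the same type, records that have something in common, to one place for processing. The logic that decides which records go to which Reduce task is controlled by the Partitioner.
By default there is only one Reduce task, and therefore only one partition.

Example:
Requirement:
The sixth field of the text file partition.csv holds the lottery draw result. Records with a result greater than 15 and records with a result of 15 or less must be saved into two separate files.

Implementation:
Step 1. Define the Mapper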

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class ParitionMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        context.write(new IntWritable(Integer.valueOf(split[5])),value);
    }
}

Step 2. Define the Partitioner

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class MyPartition extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable intWritable, Text text, int i) {
        if (intWritable.get() > 15) return 1;
        else return 0;
    }
}

Step 3. Define the Reducer

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class PartitionReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(value,NullWritable.get());
        }

    }
}

Step 4. Define Main

/**
 * @author : HaiLiang Huang
 * @author : Always Best Sign X
 */
public class PartitionMain extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        Job job = Job.getInstance(super.getConf(), "partition");

        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.setInputPaths(job, new Path("file:///D:\\input\\partition"));

        job.setMapperClass(ParitionMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(PartitionReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        job.setPartitionerClass(MyPartition.class);
        job.setNumReduceTasks(2);

        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("file:///D:\\output_out\\partition22"));

        boolean b = job.waitForCompletion(true);

        return b ? 0 : 1;

    }
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        PartitionMain partitionMain = new PartitionMain();
        int run = ToolRunner.run(configuration, partitionMain, args);
        System.exit(run);
    }
}

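2.6. Combiner

2.6.1. Concept

Each map task can produce a large amount of local output. A Combiner merges the map-side output first, which reduces the amount of data transferred between map and reduce nodes and improves network I/O performance. It is one of MapReduce's optimization techniques.

(1) A combiner is a component of an MR program in addition to the Mapper and the Reducer
(2) The parent class of a combiner is Reducer
(3) A combiner differs from a reducer in where it runs
    (3.1) The Combiner runs on the node of each MapTask
    (3.2) The Reducer receives the output of all Mappers globally
(4) The purpose of a combiner is to aggregate the output of each MapTask locally in order to reduce network traffic

2.6.2. Implementation

(1) Define a custom combiner that extends Reducer and overrides the reduce method
(2) Register it on the job with job.setCombinerClass(CustomCombiner.class)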
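
The effect of local aggregation can be illustrated without Hadoop. In the sketch below, CombinerSketch, combine and reduce are hypothetical names, and the two lists stand for the output of two map tasks in a word count; it only shows that summing locally shrinks the number of records sent to the reducer without changing the final totals.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {

    // Local aggregation on one map task's output: sum the values per key before anything is sent.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> mapOutput) {
        Map<String, Long> combined = new TreeMap<>();
        for (Map.Entry<String, Long> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return combined;
    }

    // Global aggregation over the already combined outputs of all map tasks.
    static Map<String, Long> reduce(List<Map<String, Long>> combinedOutputs) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : combinedOutputs) {
            partial.forEach((key, value) -> total.merge(key, value, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up output of two map tasks.
        List<Map.Entry<String, Long>> task1 =
                List.of(Map.entry("a", 1L), Map.entry("a", 1L), Map.entry("b", 1L));
        List<Map.Entry<String, Long>> task2 =
                List.of(Map.entry("a", 1L), Map.entry("b", 1L), Map.entry("b", 1L));

        Map<String, Long> c1 = combine(task1);
        Map<String, Long> c2 = combine(task2);

        System.out.println(task1.size() + task2.size()); // 6 records without a combiner
        System.out.println(c1.size() + c2.size());       // 4 records with a combiner
        System.out.println(reduce(List.of(c1, c2)));     // {a=3, b=3}
    }
}
```

This works because addition can be applied in stages. An operation such as an average cannot simply be reused as a combiner, since the average of partial averages is generally not the overall average.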

3. How MapReduce Runs

3.1. How MapTask Works


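The overall flow of the Map phase:

(1) The input-reading component InputFormat (TextInputFormat by default) uses its getSplits method to plan logical splits over the files in the input directory. One MapTask is started per split; by default the split size equals the HDFS block size.
(2) After the input has been divided into splits, a RecordReader (LineRecordReader by default) reads it one line at a time, using \n as the delimiter, and returns <key, value>. The key is the byte offset of the start of the line and the value is the text of the line.
(3) Each <key, value> that is read is passed to the user's Mapper subclass, where the overridden map function runs. map is called once for every line the RecordReader reads.
(4) When the Mapper logic finishes, each result is collected through context.write. During collection the record is first partitioned, using HashPartitioner by default.

MapReduce provides the Partitioner interface. Its job is to decide, based on the Key or Value and the number of Reducers, which ReduceTask should process the current output pair. The default is to hash the Key and take the result modulo the number of Reducers. This default only aims to balance the load across Reducers; users with specific requirements can write their own Partitioner and set it on the Job.

(5) Next, the data is written to an area of memory called the ring buffer. The buffer collects Mapper results in batches to reduce the impact of disk I/O. The Key/Value pairs and their partition results are all written to it, and before writing, both Key and Value are serialized into byte arrays.

1. The ring buffer is really an array. It stores the serialized Key and Value data together with their metadata: the partition, the start position of the Key, the start position of the Value, and the length of the Value. The ring structure is an abstraction.
2. The buffer has a size limit, 100 MB by default. When the Mapper produces a lot of output, memory could overflow, so under certain conditions the buffered data is written to disk temporarily and the buffer is reused. Writing data from memory to disk is called a spill. The spill is performed by a separate thread and does not affect the thread that writes Mapper results into the buffer. Because the spill thread must not block Mapper output, the buffer has a spill ratio, spill.percent, which defaults to 0.8. When the buffered data reaches the threshold buffer size * spill percent = 100 MB * 0.8 = 80 MB, the spill thread starts, locks these 80 MB and spills them. Mapper output can still be written to the remaining 20 MB, so the two do not interfere.

(6) Once the spill thread has started, it sorts the Keys within these 80 MB (Sort). Sorting is the default behavior of the MapReduce model, and it is performed on the serialized bytes.

1. If the Job has a Combiner, this is when it is applied. Values of Key/Value pairs that share a Key are added together, which reduces the amount of data spilled to disk. The Combiner optimizes the intermediate results of MapReduce, so it may be applied several times within the model.
2. When can a Combiner be used? Its output is the Reducer's input, so a Combiner must never change the final result. It should only be used when the Reduce input and output Key/Value types are identical and the final result is unaffected, for example for sums or maximums. Use a Combiner with care: used well, it improves Job efficiency; used incorrectly, it corrupts the Reducer's final result.

(7) Merging the spill files. Every spill produces a temporary file on disk (a check for a Combiner is made before writing). If the Mapper output is really large, several spills happen and several temporary files exist on disk. When all data has been processed, the temporary files are merged, because there must be only one final file. That file is written to disk together with an index file that records the offset of each reducer's data.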
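
Steps (5) to (7) can be imitated in plain Java. This is a simplified sketch under stated assumptions: SpillMergeSketch is a hypothetical class, the buffer is limited by record count instead of bytes, spilling is synchronous, and "spill files" are in-memory lists. It shows the threshold arithmetic, sorted spill runs, and a k-way merge into one sorted output.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class SpillMergeSketch {

    // Threshold in bytes: buffer size * spill percent (100 MB * 0.8 = 80 MB with the defaults above).
    static long spillThresholdBytes(int bufferMb, double spillPercent) {
        return (long) (bufferMb * 1024L * 1024L * spillPercent);
    }

    // Collect keys into a bounded buffer; whenever it is full, sort it and spill it as one run.
    static List<List<String>> spill(List<String> keys, int bufferCapacity) {
        List<List<String>> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String key : keys) {
            buffer.add(key);
            if (buffer.size() == bufferCapacity) {
                Collections.sort(buffer);
                runs.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) { // final flush of whatever is left
            Collections.sort(buffer);
            runs.add(buffer);
        }
        return runs;
    }

    // K-way merge of the sorted runs into a single sorted output.
    static List<String> merge(List<List<String>> runs) {
        // Each queue entry is {run index, position in run}, ordered by the key it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (x, y) -> runs.get(x[0]).get(x[1]).compareTo(runs.get(y[0]).get(y[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(spillThresholdBytes(100, 0.8)); // 83886080 bytes = 80 MB
        List<List<String>> runs = spill(List.of("d", "b", "a", "c", "e"), 2);
        System.out.println(runs);        // [[b, d], [a, c], [e]]
        System.out.println(merge(runs)); // [a, b, c, d, e]
    }
}
```

Because every run is already sorted, the merge only ever compares the current head of each run, which is why the final file can be produced in a single pass.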

3.2. How ReduceTask Works

[Figure: ReduceTask.png]

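(1) Copy phase: data is simply pulled. The Reduce process starts several data copy threads (Fetchers) that request the files belonging to it from the MapTasks over HTTP.
(2) Merge phase: this merge resembles the map-side merge, except that the array holds values copied from different map tasks. Copied data first goes into an in-memory buffer whose size is more flexible than on the map side. There are three forms of merge: memory to memory, memory to disk, and disk to disk. The first form is disabled by default. When the data in memory reaches a threshold, the memory-to-disk merge starts. As on the map side, this is a spill, and a Combiner is applied here too if one is set, producing many spill files on disk. The second form keeps running until no more map-side data arrives, and then the third form, disk to disk, starts and produces the final file.
(3) Merge sort: after the scattered data has been merged into one large dataset, the merged data is sorted once more.
(4) The reduce method is called on the sorted key-value pairs, once for each group of pairs with equal keys. Each call produces zero or more key-value pairs, and the output pairs are finally written to files in HDFS.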
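
Step (4) relies on the input being sorted, so that equal keys are adjacent. The plain-Java sketch below (ReduceGroupingSketch is a hypothetical name, and summing stands in for arbitrary reduce logic) walks a sorted list and performs one "reduce call" per run of equal keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReduceGroupingSketch {

    // Walk a key-sorted list of pairs and apply the reduce logic once per run of equal keys.
    static List<String> reduceSorted(List<Map.Entry<String, Long>> sortedPairs) {
        List<String> output = new ArrayList<>();
        int i = 0;
        while (i < sortedPairs.size()) {
            String key = sortedPairs.get(i).getKey();
            long sum = 0;
            // Consume every value that belongs to the current key.
            while (i < sortedPairs.size() && sortedPairs.get(i).getKey().equals(key)) {
                sum += sortedPairs.get(i).getValue();
                i++;
            }
            // In this sketch one reduce call emits exactly one record.
            output.add(key + "\t" + sum);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> sorted =
                List.of(Map.entry("a", 1L), Map.entry("a", 2L), Map.entry("b", 5L));
        for (String line : reduceSorted(sorted)) {
            System.out.println(line);
        }
    }
}
```

If the input were not sorted, the same key could appear in several separate runs and would be reduced more than once, which is why the sort precedes the reduce call.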

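3.3. The Shuffle Process

Shuffle is the core of MapReduce. Its key operations are partitioning, sorting, grouping and combining, which correspond to the Shuffle steps among the three phases and eight steps of MapReduce.

(1) Collect phase: MapTask output is written to a ring buffer with a default size of 100 MB. The buffer holds the key/value pairs, the partition information and so on.
(2) Spill phase: when the amount of data in memory reaches a threshold, it is written to local disk. The data is sorted before it is written, and if a combiner is configured, records with the same partition number and key are also combined.
(3) Merge phase: all spilled temporary files are merged so that each MapTask ends up with a single intermediate data file.
(4) Copy phase: ReduceTask starts Fetcher threads that copy its share of the data from the nodes where MapTasks have completed. The data is kept in an in-memory buffer by default and is written to disk when the buffer reaches a threshold.
(5) Merge phase: while ReduceTask copies data remotely, two background threads merge the data in memory and the data files on local disk.
(6) Sort phase: sorting takes place while the data is merged. Because MapTask has already sorted its data locally, ReduceTask only needs to make sure the copied data is valid as a whole. The buffer size used in Shuffle affects the efficiency of a MapReduce program: in principle, the larger the buffer, the fewer disk I/O operations and the faster the job runs.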

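4. MapReduce Examples

The descriptions and code for all of the following examples are on GitHub (hailiang9615/big-data-Classic-contact-case-: a collection of classic small examples for big data computing frameworks (github.com))

4.1. MapReduce Example: Traffic Statistics

(1) Requirement 1: Sums

For each phone number, compute the total number of upstream packets, the total number of downstream packets, the total upstream traffic and the total downstream traffic.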
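
The aggregation this requirement asks for can be sketched in plain Java. Everything here is an assumption for illustration: the class FlowSumSketch, the Flow record layout, and the sample phone numbers and counters are made up, and the actual input format of the case is not shown in this section. In a MapReduce job the phone number would typically be the map output key and the four counters the value.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FlowSumSketch {

    // Hypothetical record layout: a phone number plus four counters.
    static final class Flow {
        final String phone;
        final long upPackets;
        final long downPackets;
        final long upBytes;
        final long downBytes;

        Flow(String phone, long upPackets, long downPackets, long upBytes, long downBytes) {
            this.phone = phone;
            this.upPackets = upPackets;
            this.downPackets = downPackets;
            this.upBytes = upBytes;
            this.downBytes = downBytes;
        }
    }

    // Sum the four counters per phone number.
    // Result array layout: {upPackets, downPackets, upBytes, downBytes}.
    static Map<String, long[]> sumByPhone(List<Flow> records) {
        Map<String, long[]> totals = new TreeMap<>();
        for (Flow r : records) {
            long[] t = totals.computeIfAbsent(r.phone, k -> new long[4]);
            t[0] += r.upPackets;
            t[1] += r.downPackets;
            t[2] += r.upBytes;
            t[3] += r.downBytes;
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample records.
        List<Flow> records = List.of(
                new Flow("13500000001", 2, 3, 100, 200),
                new Flow("13500000002", 1, 1, 10, 20),
                new Flow("13500000001", 4, 5, 300, 400));
        sumByPhone(records).forEach((phone, t) ->
                System.out.println(phone + "\t" + t[0] + "\t" + t[1] + "\t" + t[2] + "\t" + t[3]));
    }
}
```

Because the four sums are independent and additive, the same logic could also serve as a combiner on the map side.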