Hadoop入门案例WordCount

wordcount可以说是hadoop的入门案例,也是基础案例

主要体现思想就是mapreduce核心思想

原始文件为hadoop.txt,内容如下:

hello,java
hello,java,linux,hadoop
hadoop,java,linux
hello,java,linux
linux,c,java
c,php,java

在整个文件中单词所出现的次数

Hadoop思维:

Mapreduce  -----》

        Map:

                Mapper类-------》获取一个切片数据

                        map()-------》获取一个切片中的一行数据

                        {把一行数据切分成若干个单词---》[k,1]}

shuffle过程,会把key值相同聚合,根据key进行默认排序,根据key%numR分配到不同的reduce上

        Reduce:

                Reducer类------》获取map传入reduce的数据

                        reduce()------》处理一个key的数据

                        {遍历values值,累加,求value个数-------》写出 [k,vNum]}

目录结构:

Hadoop入门案例WordCount_第1张图片 

代码:MapReduce典型的填空式编程

package org.shixun;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCount {
    /**
     * 执行当前类的主方法
     * */
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        //1、获取当前hadoop环境
        Configuration configuration = new Configuration();
        //2、设置提交的job
        Job job = Job.getInstance();
        //3、配置执行的主类
        job.setJarByClass(WordCount.class);
        //4、配置Mapper类
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        //5、配置Reducer类
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        //6、设置输入路径  默认输入类型 --   <偏移量,文件信息>
        FileInputFormat.addInputPath(job,new Path("D:\\code\\shixun\\mr\\input\\hadoop.txt"));
        //7、设置输出路径
        FileOutputFormat.setOutputPath(job,new Path("D:\\code\\shixun\\mr\\output\\out2"));
        //8、执行job
        System.out.println(job.waitForCompletion(true) ? "success":"failed");
    }
    /**
     * MapReduce核心类Mapper类
     * */
    public static class MyMapper extends Mapper{//根据分析,类型指定完了 处理一个切片的内容
        private LongWritable longWritable = new LongWritable(1);
        //处理一行的内容
        @Override
        protected void map(LongWritable key, Text value, Mapper.Context context) throws IOException, InterruptedException {
            //获取一行的数据
            String line = value.toString();
            //对数据经行变形处理
            String[] splits = line.split(",");
            for (String split : splits) {
                context.write(new Text(split),longWritable); //[hello,1]
            }
        }
    }
    /**
     * shuffle:
     *        合并key------value [v1,v2,v3,v4]
     *        key.hashcode()%reduceNum -----0
     *        自然排序
     * */
    public static class MyReducer extends Reducer{
        @Override
        protected void reduce(Text key, Iterable values, Reducer.Context context) throws IOException, InterruptedException {
            long nums=0;
            for (LongWritable value : values) {
                nums+=value.get(); //value----1
            }
            context.write(key,new LongWritable(nums));
        }
    }
}

依赖pom.xml:




  4.0.0

  org.shixun
  Hadoop
  1.0-SNAPSHOT

  Hadoop
  
  http://www.example.com

  
    UTF-8
    1.8
    1.8
  

  
    
    
    
      org.apache.hadoop
      hadoop-common
      2.7.3
    
    
    
      org.apache.hadoop
      hadoop-client
      2.7.3
      provided
    
    
    
      org.apache.hadoop
      hadoop-hdfs
      2.7.3
    
    
    
      org.apache.hadoop
      hadoop-mapreduce-client-core
      2.7.3
    

    
    
    
      org.apache.hadoop
      hadoop-mapreduce-client-common
      2.7.3
    
    
    
      log4j
      log4j
      1.2.17
    
  

  
    
      
        
        
          maven-clean-plugin
          3.1.0
        
        
        
          maven-resources-plugin
          3.0.2
        
        
          maven-compiler-plugin
          3.8.0
        
        
          maven-surefire-plugin
          2.22.1
        
        
          maven-jar-plugin
          3.0.2
        
        
          maven-install-plugin
          2.5.2
        
        
          maven-deploy-plugin
          2.8.2
        
        
        
          maven-site-plugin
          3.7.1
        
        
      
    
  

运行输出文件 part-r-00000 结果: 

Hadoop入门案例WordCount_第2张图片

你可能感兴趣的:(hadoop,mapreduce,大数据)