Hadoop - MapReduce

MapReduce Concept

MapReduce is natively Java, allowed to interface with Python under Streaming

  • Like dictionary in data structure


    Hadoop - MapReduce_第1张图片
    image.png
  • Shuffle Keys and Sort Values


    Hadoop - MapReduce_第2张图片
    image.png
  • Reducer


    Hadoop - MapReduce_第3张图片
    image.png
  • So overall operation is like:


    Hadoop - MapReduce_第4张图片
    image.png
  • Handle of failure


    Hadoop - MapReduce_第5张图片
    image.png

MapReduce Coding

  • Problem:


    Hadoop - MapReduce_第6张图片
    image.png
  • Map Function:

def mapper_get_ratings(self, _, line):
    (userID, movieID,rating,Timestamp)=line.split('\t')
     yield rating,1

  • Reduce Funtion:

def reducer_count_ratings(self, key, values):
    yield key, sum(values)

  • After putting them together:


    image.png

Installation & Preparation

  • After log into the command line interface:


    Hadoop - MapReduce_第7张图片
    image.png
  • Run locally:

python RatingsBreakdown.py u.data

  • Run with Hadoop:

python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data

  • Result should be :


    Hadoop - MapReduce_第8张图片
    image.png
  • Plus, codes for more complex problems (sorted for movie numbers):


    Hadoop - MapReduce_第9张图片
    image.png

你可能感兴趣的:(Hadoop - MapReduce)