Writing a Simple MapReduce Program with mincemeat




Problem description:

Several source files to be analyzed are provided. Each line has the following form:

journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.

which represents:

paper-id:::author1::author2::...::authorN:::title

For each author, we need to count how many times each term appears in the titles of that author's papers. For example:

The result for author Alberto Pettorossi is: program:3, transformation:2, transforming:2, using:2, programs:2, logic:2.

Note: each document id maps to multiple authors, and each author maps to multiple terms. Terms exclude stop words; single letters and hyphens are not counted either.
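
As a quick illustration of the record format (a minimal sketch, separate from the full program below), each line can be split on ':::' to get the paper id, the author list, and the title, and the author list can then be split on '::':

# Minimal sketch: splitting one record (the same splitting is used in mapfn below)
line = "journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2."
paper_id, author_field, title = line.split(':::')
authors = author_field.split('::')
print paper_id   # journals/cl/SantoNR90
print authors    # ['Michele Di Santo', 'Libero Nigro', 'Wilma Russo']
print title      # Programmer-Defined Control Abstractions in Modula-2.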


The source code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob
import mincemeat

text_files=glob.glob('E:\\hw3data\\*')   # all source files in the data directory

def file_contents(file_name):
    f=open(file_name)
    try:
        return f.read()
    finally:
        f.close()

# the datasource: file name -> file contents; mincemeat hands each pair to mapfn
source=dict((file_name,file_contents(file_name))
            for file_name in text_files)

# setup map and reduce functions

def mapfn(key, value):
    # stop_words is defined inside mapfn because mincemeat sends only the
    # function itself to the workers, not module-level globals
    stop_words=['all', 'herself', 'should', 'to', 'only', 'under', 'do', 'weve',
            'very', 'cannot', 'werent', 'yourselves', 'him', 'did', 'these',
            'she', 'havent', 'where', 'whens', 'up', 'are', 'further', 'what',
            'heres', 'above', 'between', 'youll', 'we', 'here', 'hers', 'both',
            'my', 'ill', 'against', 'arent', 'thats', 'from', 'would', 'been',
            'whos', 'whom', 'themselves', 'until', 'more', 'an', 'those', 'me',
            'myself', 'theyve', 'this', 'while', 'theirs', 'didnt', 'theres',
            'ive', 'is', 'it', 'cant', 'itself', 'im', 'in', 'id', 'if', 'same',
            'how', 'shouldnt', 'after', 'such', 'wheres', 'hows', 'off', 'i',
            'youre', 'well', 'so', 'the', 'yours', 'being', 'over', 'isnt',
            'through', 'during', 'hell', 'its', 'before', 'wed', 'had', 'lets',
            'has', 'ought', 'then', 'them', 'they', 'not', 'nor', 'wont',
            'theyre', 'each', 'shed', 'because', 'doing', 'some', 'shes',
            'our', 'ourselves', 'out', 'for', 'does', 'be', 'by', 'on',
            'about', 'wouldnt', 'of', 'could', 'youve', 'or', 'own', 'whats',
            'dont', 'into', 'youd', 'yourself', 'down', 'doesnt', 'theyd',
            'couldnt', 'your', 'her', 'hes', 'there', 'hed', 'their', 'too',
            'was', 'himself', 'that', 'but', 'hadnt', 'shant', 'with', 'than',
            'he', 'whys', 'below', 'were', 'and', 'his', 'wasnt', 'am', 'few',
            'mustnt', 'as', 'shell', 'at', 'have', 'any', 'again', 'hasnt',
            'theyll', 'no', 'when','other', 'which', 'you', 'who', 'most',
            'ours', 'why', 'having', 'once', 'a', '-', '.', ',']
    for line in value.splitlines():
        fields=line.split(':::')        # [paper-id, author list, title]
        authors=fields[1].split('::')
        title=fields[2]
        for author in authors:
            for term in title.split():
                if term not in stop_words:
                    if term.isalnum():
                        yield author,term.lower()
                    elif len(term)>1:
                        # strip punctuation, keep letters and digits,
                        # and turn hyphens into spaces
                        temp=''
                        for ichar in term:
                            if ichar.isalpha() or ichar.isdigit():
                                temp+=ichar
                            elif ichar=='-':
                                temp+=' '
                        yield author,temp.lower()

def reducefn(key, value):
    # value is the list of all terms emitted for this author; count each term
    terms = value
    result={}
    for term in terms:
        if term in result:
            result[term]+=1
        else:
            result[term]=1
    return result

# start the server

s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
#print results

result_file=open('hw3_result','w')
for result in results:
    result_file.write(result+' : ')
    for term in results[result]:
        result_file.write(term+':'+str(results[result][term])+',')
    result_file.write('\n')
result_file.close()

mapfn is the Map step: it reads the document contents, extracts (author, term) pairs, and emits them to the Reduce step. reducefn is the Reduce step: it counts how often each term occurs for a given author.
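
To run the job, start the script above (it acts as the mincemeat server) and then start one or more workers pointed at it; the mincemeat 0.1.2 example does this with python mincemeat.py -p changeme localhost. Independently of the server, the two functions can also be sanity-checked on a single record; the snippet below is a minimal sketch that assumes mapfn and reducefn from above are available in the current session:

# Minimal local sanity check (no server or workers involved); assumes mapfn
# and reducefn defined above are available in the current session.
sample = ("journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::"
          "Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.")

grouped = {}
for author, term in mapfn('sample', sample):
    grouped.setdefault(author, []).append(term)   # group the emitted terms by author

for author, terms in grouped.items():
    print author, reducefn(author, terms)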

A screenshot of the program running for the course assignment:

[Screenshot: running the program]


Attachments:

[1] Source documents to process: http://pan.baidu.com/share/link?shareid=475665&uk=1812733245

[2] mincemeat download: https://github.com/michaelfairley/mincemeatpy/zipball/v0.1.2


Update, 2013-04-17:

To sort the term frequencies for each author: mapfn is unchanged; the modified reducefn and main program are as follows:

def reducefn(key, value):
    terms = value
    counts={}
    for term in terms:
        if term in counts:
            counts[term]+=1
        else:
            counts[term]=1
    # sort the counts: swap each (term, count) pair to (count, term),
    # sort in descending order, then swap back to (term, count)
    items=counts.items()
    reverse_items=[ [v[1],v[0]] for v in items]
    reverse_items.sort(reverse=True)
    result=[]
    for i in reverse_items:
        result.append([i[1],i[0]])
    return result

# start the server

s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")

#save results
result_file=open('hw3_result_sorted','w')
for result in results:
    result_file.write(result+' : ')
    for term in results[result]:
        result_file.write(term[0]+':'+str(term[1])+',')
    result_file.write('\n')
result_file.close()

Because Python dictionaries are stored as hash tables, they are iterated in arbitrary order, so the dictionary is converted to a list and sorted before being returned.

First, convert the dictionary into a list of [term, count] pairs:

[['creating', 2], ['lifelong', 1], ['quality', 3], ['learners', 5], ['assurance', 1]]

Then swap the two values in each sublist:

[[2, 'creating'], [1, 'lifelong'], [3, 'quality'], [5, 'learners'], [1, 'assurance']]

Sort by the first value of each sublist, in descending order:

[[5, 'learners'], [3, 'quality'], [2, 'creating'], [1, 'lifelong'], [1, 'assurance']]

Finally, swap the values back:

[['learners', 5], ['quality', 3], ['creating', 2], ['lifelong', 1], ['assurance', 1]]
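
The same transformation can be checked interactively; a minimal sketch using the sample counts above:

counts = {'creating': 2, 'lifelong': 1, 'quality': 3, 'learners': 5, 'assurance': 1}
reverse_items = [[c, t] for t, c in counts.items()]   # swap to [count, term]
reverse_items.sort(reverse=True)                      # descending by count
print [[t, c] for c, t in reverse_items]              # swap back to [term, count]
# [['learners', 5], ['quality', 3], ['creating', 2], ['lifelong', 1], ['assurance', 1]]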



Update, 2013-04-19: sorting a dictionary directly

result={'lifelong':1,'learners':3,'quality':5,'creating':1,'assurance':1}
a=sorted(result.items(),key=lambda item:item[1],reverse=True)
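
For the sample dictionary above this produces (term, count) tuples in descending count order; because sorted() is stable, entries with equal counts keep the arbitrary order in which the dictionary was iterated:

print a
# e.g. [('quality', 5), ('learners', 3), ('lifelong', 1), ('creating', 1), ('assurance', 1)]
# (the three 1-count entries may appear in any relative order)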


Update, 2013-05-19:

A website for running MapReduce jobs online:

http://jsmapreduce.com/


