Problem description:
Several source files are provided for analysis. Every line in them has the following form:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2.
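which stands for:
paper-id:::author1::author2::...::authorN:::title
For each author, count how many times each term occurs in the titles of that author's papers. For example:
The result for the author Alberto Pettorossi is: program:3, transformation:2, transforming:2, using:2, programs:2, logic:2.
Note: each paper id corresponds to multiple authors, and each author corresponds to multiple terms. Terms exclude stop words; single letters and hyphens are not counted either.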
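As a rough illustration of the record format described above, the sketch below splits the sample line into its three fields. The helper name `parse_record` is hypothetical and is not part of the original program; it also assumes that `:::` never appears inside a title.

```python
def parse_record(line):
    """Split 'paper-id:::a1::a2:::title' into (paper_id, authors, title)."""
    # ':::' separates the three fields; '::' separates the authors
    # inside the middle field. Assumes the title contains no ':::'.
    paper_id, author_field, title = line.strip().split(':::')
    authors = author_field.split('::')
    return paper_id, authors, title


sample = ('journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo'
          ':::Programmer-Defined Control Abstractions in Modula-2.')
paper_id, authors, title = parse_record(sample)
```

Every author in `authors` is then paired with every term of `title`, which is what the map step has to emit.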
The source code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob
import mincemeat

text_files = glob.glob('E:\\hw3data\\/*')

def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

source = dict((file_name, file_contents(file_name)) for file_name in text_files)

# setup map and reduce functions
def mapfn(key, value):
    # the stop-word list lives inside mapfn so that the function is self-contained
    stop_words = ['all', 'herself', 'should', 'to', 'only', 'under', 'do', 'weve', 'very', 'cannot',
                  'werent', 'yourselves', 'him', 'did', 'these', 'she', 'havent', 'where', 'whens', 'up',
                  'are', 'further', 'what', 'heres', 'above', 'between', 'youll', 'we', 'here', 'hers',
                  'both', 'my', 'ill', 'against', 'arent', 'thats', 'from', 'would', 'been', 'whos',
                  'whom', 'themselves', 'until', 'more', 'an', 'those', 'me', 'myself', 'theyve', 'this',
                  'while', 'theirs', 'didnt', 'theres', 'ive', 'is', 'it', 'cant', 'itself', 'im',
                  'in', 'id', 'if', 'same', 'how', 'shouldnt', 'after', 'such', 'wheres', 'hows',
                  'off', 'i', 'youre', 'well', 'so', 'the', 'yours', 'being', 'over', 'isnt',
                  'through', 'during', 'hell', 'its', 'before', 'wed', 'had', 'lets', 'has', 'ought',
                  'then', 'them', 'they', 'not', 'nor', 'wont', 'theyre', 'each', 'shed', 'because',
                  'doing', 'some', 'shes', 'our', 'ourselves', 'out', 'for', 'does', 'be', 'by',
                  'on', 'about', 'wouldnt', 'of', 'could', 'youve', 'or', 'own', 'whats', 'dont',
                  'into', 'youd', 'yourself', 'down', 'doesnt', 'theyd', 'couldnt', 'your', 'her', 'hes',
                  'there', 'hed', 'their', 'too', 'was', 'himself', 'that', 'but', 'hadnt', 'shant',
                  'with', 'than', 'he', 'whys', 'below', 'were', 'and', 'his', 'wasnt', 'am',
                  'few', 'mustnt', 'as', 'shell', 'at', 'have', 'any', 'again', 'hasnt', 'theyll',
                  'no', 'when', 'other', 'which', 'you', 'who', 'most', 'ours', 'why', 'having',
                  'once', 'a', '-', '.', ',']
    for line in value.splitlines():
        word = line.split(':::')
        authors = word[1].split('::')
        title = word[2]
        for author in authors:
            for term in title.split():
                # compare in lower case so capitalised stop words ("The", "Of") are dropped too
                if term.lower() not in stop_words:
                    if term.isalnum():
                        yield author, term.lower()
                    elif len(term) > 1:
                        # keep letters and digits, turn hyphens into spaces, drop other punctuation
                        temp = ''
                        for ichar in term:
                            if ichar.isalpha() or ichar.isdigit():
                                temp += ichar
                            elif ichar == '-':
                                temp += ' '
                        yield author, temp.lower()

def reducefn(key, value):
    # key is an author, value is the list of terms emitted for that author
    terms = value
    result = {}
    for term in terms:
        if term in result:
            result[term] += 1
        else:
            result[term] = 1
    return result

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")

#print results
result_file = open('hw3_result', 'w')
for result in results:
    result_file.write(result + ' : ')
    for term in results[result]:
        result_file.write(term + ':' + str(results[result][term]) + ',')
    result_file.write('\n')
result_file.close()
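To check a map/reduce pair without starting a mincemeat server and client, the same flow can be imitated in a single process. The following is only a sketch: `run_local`, `toy_mapfn`, `toy_reducefn` and the toy data are hypothetical names introduced here, and the sketch assumes the contract used by the program above, namely that mapfn yields (key, value) pairs and reducefn receives a key together with the list of all values emitted for it.

```python
from collections import defaultdict


def run_local(datasource, mapfn, reducefn):
    """Run map -> group by key -> reduce in one process (no networking)."""
    grouped = defaultdict(list)
    for key, value in datasource.items():
        for out_key, out_value in mapfn(key, value):
            grouped[out_key].append(out_value)
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())


# Simplified stand-ins for the real mapfn/reducefn (no stop-word filtering).
def toy_mapfn(key, value):
    for line in value.splitlines():
        fields = line.split(':::')
        for author in fields[1].split('::'):
            for term in fields[2].lower().split():
                yield author, term


def toy_reducefn(key, values):
    counts = {}
    for term in values:
        counts[term] = counts.get(term, 0) + 1
    return counts


toy_source = {
    'demo.txt': 'id1:::Ann::Bob:::logic programs\nid2:::Ann:::logic synthesis',
}
toy_results = run_local(toy_source, toy_mapfn, toy_reducefn)
```

Passing the real `mapfn` and `reducefn` to `run_local` in place of the toy versions should give the same per-author counts as the distributed run, which makes it a convenient way to debug the tokenizing rules on a small file.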
A screenshot of the program running for the course is shown below:
Attachments:
[1] Source documents to be processed: http://pan.baidu.com/share/link?shareid=475665&uk=1812733245
[2] mincemeat download: https://github.com/michaelfairley/mincemeatpy/zipball/v0.1.2
2013.4.17 addendum:
Sort each author's terms by frequency. mapfn is unchanged; the modified reducefn and main program are as follows:
def reducefn(key, value):
    terms = value
    counts = {}
    for term in terms:
        if term in counts:
            counts[term] += 1
        else:
            counts[term] = 1
    items = counts.items()
    # sort the counts
    reverse_items = [[v[1], v[0]] for v in items]
    reverse_items.sort(reverse=True)
    result = []
    for i in reverse_items:
        result.append([i[1], i[0]])
    return result

# start the server
s = mincemeat.Server()
s.datasource = source
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")

#save results
result_file = open('hw3_result_sorted', 'w')
for result in results:
    result_file.write(result + ' : ')
    for term in results[result]:
        result_file.write(term[0] + ':' + str(term[1]) + ',')
    result_file.write('\n')
result_file.close()
2013.4.19 addendum: sorting a dictionary
result = {'lifelong': 1, 'learners': 3, 'quality': 5, 'creating': 1, 'assurance': 1}
a = sorted(result.items(), key=lambda result: result[1], reverse=True)
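The sort above orders the terms by count only, so terms with equal counts keep whatever order the dictionary happens to give them. A possible variant, shown here as a sketch, breaks ties alphabetically and uses the standard library's `collections.Counter` (Python 2.7 or later) to pick the most frequent terms. The names `ranked` and `top_two` are introduced only for this illustration.

```python
from collections import Counter

result = {'lifelong': 1, 'learners': 3, 'quality': 5, 'creating': 1, 'assurance': 1}

# Highest count first; equal counts are ordered alphabetically by term.
ranked = sorted(result.items(), key=lambda item: (-item[1], item[0]))

# Counter.most_common(n) returns the n (term, count) pairs with the largest counts.
top_two = Counter(result).most_common(2)
```

Here `ranked` starts with `('quality', 5)` and `('learners', 3)`, followed by the three terms that occur once in alphabetical order.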
2013.5.19
A website for running MapReduce computations online:
http://jsmapreduce.com/