Overview:
Version: PyDev 1.6.3.2010100513
(the PyDev plugin for Eclipse)
Parsing local and web-hosted HTML pages
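1. Parse a local HTML page and count how many times each tag appears
Key points:
1) How to open a file
2) The file parsing strategy
3) The dictionary and list data structures and their methods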
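The three points above can be tried out on a small inline sample before running them on a real page. This is only a sketch: the sample string and the temporary file name `sample.html` are made up for illustration, and the regular expression is the one used in the code of this section. Note that this pattern only matches a tag name followed by a space, so a bare `<div>` without attributes is not counted.

```python
import os
import re
import tempfile

# Hypothetical sample page; the second <div> has no attributes.
sample = '<div class="x"><a href="#">link</a><div><script src="s.js"></script></div>'

# 1) Open a file: write the sample to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "w") as out_file:
        out_file.write(sample)
    with open(path, "r") as in_file:
        page = in_file.read()

# 2) Parsing strategy: find every "<name " occurrence (tag name plus a space).
pattern = re.compile(r"<[a-zA-Z]* ")
matches = pattern.findall(page)

# 3) Dictionary and list: strip "<" and the trailing space, then tally.
counts = {"div": 0, "script": 0, "a": 0}
for item in matches:
    tag = item[1:-1]
    if tag in counts:
        counts[tag] += 1

print(matches)
print(counts)
```

The attribute-less `<div>` and all closing tags are skipped, which is worth keeping in mind when reading the counts produced by this approach.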
Code:
# coding=gbk
# Count how often selected tags (<a ...>, <script ...>, ...) occur in a page.
# Uses the re module and a dictionary.
import re

def ParserHtml():
    # Read the whole local page; "with" closes the file afterwards.
    with open("baidu.html", "r") as input_file:
        pageDoc = input_file.read()

    # Regex step: build the list of matches.
    # Matches strings such as "<a ", "<div ", "<script ".
    pattern = re.compile(r"<[a-zA-Z]* ")
    matches = pattern.findall(pageDoc)

    # Tally the matches in the dictionary's count values.
    calc = {"div": 0, "script": 0, "a": 0}
    for listItem in matches:
        tag = listItem[1:-1]  # strip the leading "<" and the trailing space
        if tag in calc:
            calc[tag] += 1

    for tagName, count in calc.items():
        print("Tag:", tagName, "\tCount:", count)

if __name__ == '__main__':
    ParserHtml()
Output:
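2. Next, I wanted to open a web page directly, for example "www.baidu.com" (done once with each of two Python versions)
Notes:
1) The Python used by the Eclipse plugin is version 3.2, so the import of urlopen differs: there it is imported from urllib.request
2) The versions also differ in whether print needs parentheses (a function call in 3.x, a statement in 2.x)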
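Both differences can be bridged in one script. The following is a minimal sketch, not part of the original code: it tries the Python 3 import first and falls back to the Python 2 one, and it builds each output row as a single string so that `print(...)` behaves the same under both versions. The helper name `format_row` and the sample values are hypothetical.

```python
# Import urlopen from wherever the running interpreter provides it.
try:
    from urllib.request import urlopen  # Python 3.x
except ImportError:
    from urllib import urlopen  # Python 2.x

def format_row(tag_name, count):
    """Build one output row as a single string (hypothetical helper)."""
    return "tag: %s ; appearance: %d" % (tag_name, count)

# print with one parenthesized string works as a statement in 2.x
# and as a function call in 3.x.
print(format_row("div", 4))
```

Passing a single pre-formatted string avoids the tuple output that `print("a", "b")` produces under Python 2.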
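Key points:
1) Opening a web page and doing a simple parse
2) Setting the character encoding with decode("gbk")
3) Inserting elements into a dictionary dynamically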
Code:
# coding=gbk
# python 3.2
# Count how often each tag (<a ...>, <script ...>, ...) occurs in a page.
# Uses the re module and a dictionary.
import re
from urllib.request import urlopen

def ParserWebHtml():
    url = "http://www.baidu.com"
    # Fetch the page and decode the bytes as GBK.
    pageDoc = urlopen(url).read().decode("gbk")

    # Regex step: build the list of matches.
    # Matches strings such as "<a ", "<div ", "<script ".
    pattern = re.compile(r"<[a-zA-Z]* ")
    matches = pattern.findall(pageDoc)

    # Parse and count in one pass.
    calc = {"div": 0, "script": 0, "a": 0}
    for listItem in matches:
        tag = listItem[1:-1]  # strip the leading "<" and the trailing space
        if tag in calc:
            calc[tag] += 1
        elif tag != "":  # compare by value; "is not" would test identity
            calc[tag] = 1  # first time this tag is seen

    # Output
    for tagName, count in calc.items():
        print("Tag:", tagName, "\tCount:", count)

if __name__ == '__main__':
    ParserWebHtml()
Code (Python 2.5):
# coding=gbk
# python 2.5
# Count the tags in a web page, e.g. the number of <a ...>, <script ...> tags.
# Uses: urlopen function, regular expression, dictionary structure.
import re
import urllib  # Python 2.5; on Python 3.0+ use: from urllib.request import urlopen

def ParserWebHtml():
    url = "http://www.baidu.com"
    pageDoc = urllib.urlopen(url).read().decode("gbk")

    # Regular expression to get the tag info.
    # Matches strings like "<a ", "<div ", "<script ".
    pattern = re.compile(r"<[a-zA-Z]* ")
    matches = pattern.findall(pageDoc)

    # Analyze and count.
    calc = {"div": 0, "script": 0, "a": 0}
    for listItem in matches:
        tag = listItem[1:-1]  # strip the leading "<" and the trailing space
        if tag in calc:
            calc[tag] += 1
        # Skip the empty name left by a bare "< " match. "!=" compares by
        # value; "is not" compares identity and lets an empty unicode
        # string through.
        elif tag != "":
            calc[tag] = 1

    # Output
    for tagName, count in calc.items():
        print "tag: ", tagName, "; appearance: ", count

if __name__ == '__main__':
    ParserWebHtml()
Output:
tag: a ; appearance: 26
tag: map ; appearance: 1
tag: span ; appearance: 5
tag: form ; appearance: 1
tag: img ; appearance: 2
tag: script ; appearance: 3
tag: area ; appearance: 1
tag: li ; appearance: 1
tag: p ; appearance: 6
tag: meta ; appearance: 1
tag: input ; appearance: 4
tag: div ; appearance: 4
tag: ; appearance: 3
tag: ul ; appearance: 1
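The regular expression used in this post only sees tags that are followed by a space, so tags without attributes are missed and a bare "< " yields an empty name. As a point of comparison, and not the method used above, the same count can be sketched with the standard library's `html.parser` module on Python 3. The class name `TagCounter`, the function `count_tags`, and the sample string are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts every opening tag the parser reports (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called for <tag ...>; self-closing tags such as <br/> also end up
        # here through the default handle_startendtag implementation.
        self.counts[tag] += 1

def count_tags(html_text):
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts

# Invented sample page with attribute-less and self-closing tags.
sample = '<div><a href="#">x</a><a>y</a><br/></div>'
counts = count_tags(sample)
for tag_name, count in counts.items():
    print("tag: %s ; appearance: %d" % (tag_name, count))
```

To use it on a downloaded page, pass the decoded text (for example the `pageDoc` string from the code above) to `count_tags`.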