Parsing Local and Web Pages with Python

Overview:

Version: Python 1.6.3.2010100513

PyDev plugin on Eclipse

Parse local pages and pages fetched from the web


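1. Parse a local web page and count how many times each tag of interest appears


Key points:

1) How to open a file

2) A strategy for parsing the file

3) The dictionary and list data structures and how to call their methods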
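As a rough sketch of point 3, the counting step can be tried on an in-memory string before touching a file. SAMPLE_PAGE and count_tags are hypothetical names used only for illustration, and collections.Counter is a swapped-in convenience in place of a hand-built dictionary; the pattern is the one used in this article.

```python
import re
from collections import Counter

# Hypothetical sample standing in for the contents of a saved page.
SAMPLE_PAGE = (
    '<div id="top"><a href="#">home</a> <a href="/x">x</a></div>'
    '<script src="s.js"></script>'
)

def count_tags(page_text):
    """Count tag names found by the article's "<name " pattern."""
    # findall returns a list of matches such as "<div " or "<a ".
    matches = re.findall(r"<[a-zA-Z]* ", page_text)
    # Strip the leading "<" and the trailing space from each match.
    names = [m[1:-1] for m in matches]
    # Counter is a dict subclass; empty names are skipped.
    return Counter(name for name in names if name)

counts = count_tags(SAMPLE_PAGE)
for tag_name, count in sorted(counts.items()):
    print("tag:", tag_name, "\tappearance:", count)
```

A Counter returns 0 for keys it has never seen, so no initial entries are needed.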


Code:

# coding=gbk

# Count how often the chosen tags (<a .../>, <script .../>, <div .../>) occur in a page.
# Uses the re module and a dictionary.

import re

def ParserHtml():
    # Read the whole saved page; "with" closes the file afterwards.
    with open("baidu.html", "r") as input_file:
        pageDoc = input_file.read()

    # Apply the regular expression and collect the matches in a list.
    pattern = re.compile(r"<[a-zA-Z]* ")  # matches strings like "<a ", "<div ", "<script "
    tagList = pattern.findall(pageDoc)

    # Tally the matches in the dictionary.
    calc = {
        "div": 0,
        "script": 0,
        "a": 0
    }
    for listItem in tagList:
        tag = listItem[1:-1]  # drop the leading "<" and the trailing space
        if tag in calc:
            calc[tag] += 1
    for tagName, count in calc.items():
        print("tag:", tagName, "\tappearance:", count)

if __name__ == '__main__':
    ParserHtml()



Output:


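2. Next, do the same thing by opening a web page directly, for example "www.baidu.com" (shown for two Python versions)

Notes:

1) The Python used by the Eclipse plugin is version 3.2, so the import of urlopen differs: it is imported from urllib.request

2) The two versions differ in whether print needs parentheses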
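The two notes above can be sketched as a single script that runs under either version. This is only a minimal sketch: it tries the Python 3 import location first and falls back to the Python 2 one, and it does not fetch anything.

```python
import sys

try:
    # Python 3: urlopen lives in urllib.request.
    from urllib.request import urlopen
except ImportError:
    # Python 2: urlopen lives directly in urllib.
    from urllib import urlopen

# print(...) with a single argument behaves the same in both versions,
# because the parentheses then just group one expression.
print("urlopen imported under Python %d.%d" % sys.version_info[:2])
```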

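Key points:

1) Open a web page and do some simple parsing

2) Set the character encoding with the decode("gbk") function

3) Insert elements into a dictionary dynamically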
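Points 2 and 3 can be tried without a network connection. The sketch below assumes Python 3; raw_bytes is a hypothetical stand-in for what urlopen(url).read() returns, and the tag list is made-up sample data.

```python
# Hypothetical byte string standing in for the raw response body.
raw_bytes = "<title>百度</title>".encode("gbk")

# decode("gbk") turns the raw bytes back into text.
page_text = raw_bytes.decode("gbk")
print(len(raw_bytes), "bytes ->", len(page_text), "characters")

# Start with three known tags, then let unseen tags be added on the fly.
calc = {"div": 0, "script": 0, "a": 0}
for tag in ["a", "span", "a", "div", "span", "li"]:
    # get() returns 0 for a missing key, so a new tag is inserted
    # the first time it is seen.
    calc[tag] = calc.get(tag, 0) + 1

for tag_name, count in sorted(calc.items()):
    print("tag:", tag_name, "\tappearance:", count)
```

Using get() with a default is one way to fold the "already present" and "new key" branches into a single line; an explicit if/elif does the same job.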


Code:

# coding=gbk
# python 3.2

# Count how often tags such as <a .../> and <script .../> occur in a page.
# Uses the re module and a dictionary.

import re
from urllib.request import urlopen

def ParserWebHtml():
    url = "http://www.baidu.com"
    pageDoc = urlopen(url).read().decode("gbk")

    # Apply the regular expression and collect the matches in a list.
    pattern = re.compile(r"<[a-zA-Z]* ")  # matches strings like "<a ", "<div ", "<script "
    tagList = pattern.findall(pageDoc)

    # Parse and count in one pass.
    calc = {
        "div": 0,
        "script": 0,
        "a": 0
    }
    for listItem in tagList:
        tag = listItem[1:-1]  # drop the leading "<" and the trailing space
        if tag in calc:
            calc[tag] += 1
        elif tag != "":  # compare by value; "is not" only tests identity
            calc[tag] = 1  # first occurrence: add the tag to the dictionary
    # Output
    for tagName, count in calc.items():
        print("tag:", tagName, "\tappearance:", count)

if __name__ == '__main__':
    ParserWebHtml()


Output:

[Image 1: Parsing Local and Web Pages with Python]

Code for Python 2.5:

# coding=gbk
# python 2.5

# Count the tags in a web page, e.g. the number of <a .../> and <script .../> tags.
# Uses: the urlopen function, regular expressions, the dictionary structure.

import re
import urllib  # Python 2.5; on Python 3.0+ use "from urllib.request import urlopen"

def ParserWebHtml():
    url = "http://www.baidu.com"
    pageDoc = urllib.urlopen(url).read().decode("gbk")

    # Use a regular expression to collect the tag matches.
    pattern = re.compile(r"<[a-zA-Z]* ")  # matches strings like "<a ", "<div ", "<script "
    tagList = pattern.findall(pageDoc)

    # Parse and count in one pass.
    calc = {
        "div": 0,
        "script": 0,
        "a": 0
    }
    for listItem in tagList:
        tag = listItem[1:-1]  # drop the leading "<" and the trailing space
        if tag in calc:
            calc[tag] += 1
        elif tag != "":  # compare by value; "is not" only tests identity
            calc[tag] = 1  # first occurrence: add the tag to the dictionary

    # Output
    for tagName, count in calc.items():
        print "tag: ", tagName, ";  appearance: ", count

if __name__ == '__main__':
    ParserWebHtml()

Output (recorded with the 'is not ""' identity check, which does not filter out the empty tag name):

tag:  a ;  appearance:  26
tag:  map ;  appearance:  1
tag:  span ;  appearance:  5
tag:  form ;  appearance:  1
tag:  img ;  appearance:  2
tag:  script ;  appearance:  3
tag:  area ;  appearance:  1
tag:  li ;  appearance:  1
tag:  p ;  appearance:  6
tag:  meta ;  appearance:  1
tag:  input ;  appearance:  4
tag:  div ;  appearance:  4
tag:   ;  appearance:  3
tag:  ul ;  appearance:  1
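The row with an empty tag name in the listing above comes from "< " sequences in the page, which the pattern also accepts; tags written without attributes, such as "<p>", are not matched at all because the pattern requires a trailing space. As a possible alternative (not the approach used in this article), the standard library's html.parser can do the counting. The sketch below assumes Python 3; TagCounter and SAMPLE_PAGE are hypothetical names, and the sample markup is made up.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Hypothetical helper: counts every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lower-cased.
        self.counts[tag] += 1

# Made-up sample: one tag with attributes, several without, and a
# script whose body contains a "< " comparison.
SAMPLE_PAGE = (
    '<div><a href="#">one</a><a>two</a><p>text</p>'
    '<script>if (a < b) {}</script></div>'
)

parser = TagCounter()
parser.feed(SAMPLE_PAGE)
parser.close()
for tag_name, count in sorted(parser.counts.items()):
    print("tag:", tag_name, "; appearance:", count)
```

The parser treats the body of a script element as plain data, so the comparison inside it is not reported as a tag.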


