Computing the mean, median, and variance of XPath lengths from an lxml-parsed DOM tree in Python

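    This article continues from the previous one, which computed the maximum depth of a DOM tree. Here I take the same XPath data and compute three further statistics on the path lengths: the mean, the median, and the variance.

    The mean and the median measure central tendency: they show the value around which the path lengths of the DOM tree cluster, in other words the range in which most paths fall. That is useful input for the analysis that follows.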
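    To make the difference between the two measures concrete, here is a minimal sketch using the `statistics` module from the Python 3 standard library. The sample paths and the helper name `path_depth` are made up for illustration and do not come from the page analyzed in this article.

```python
import statistics


def path_depth(xpath):
    # Number of steps in an absolute XPath such as '/html/body/div[1]';
    # [1:] drops the empty string in front of the leading slash
    return len(xpath.split('/')[1:])


# Hypothetical paths: four shallow ones and one deep outlier
sample_paths = [
    '/html',
    '/html/head',
    '/html/body',
    '/html/body/p',
    '/html/body/div/div/div/div/div/div/ul/li/a',
]

depths = [path_depth(p) for p in sample_paths]
mean_depth = statistics.mean(depths)
median_depth = statistics.median(depths)

print('depths:', depths)
print('mean:', mean_depth)
print('median:', median_depth)
```

    In this toy list the single deep path pulls the mean well above the median, which is why looking at both values says more about where most paths sit than either one alone.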

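    The variance measures how much the path lengths fluctuate across the whole tree. It indicates whether the distribution of path lengths is fairly flat or, like a Gaussian distribution, bunched in the middle and dropping off sharply on both sides. The larger the variance, the further the lengths spread from the mean.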
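    As a small illustration of that idea, the sketch below compares two made-up length lists that share the same mean but differ in spread. It uses `statistics.pvariance` (population variance, dividing by the number of values) from the Python 3 standard library; both lists are hypothetical.

```python
import statistics

# Hypothetical path lengths, both with a mean of 6
flat_depths = [2, 4, 6, 8, 10]    # spread evenly over a wide range
peaked_depths = [5, 6, 6, 6, 7]   # bunched tightly around the middle

flat_mean = statistics.mean(flat_depths)
peaked_mean = statistics.mean(peaked_depths)

# Population variance: mean of the squared deviations from the mean
flat_var = statistics.pvariance(flat_depths)
peaked_var = statistics.pvariance(peaked_depths)

print('flat:   mean', flat_mean, 'variance', flat_var)
print('peaked: mean', peaked_mean, 'variance', peaked_var)
```

    The means are identical, so only the variance tells the two shapes apart: the evenly spread list has a much larger variance than the bunched one.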

    With that brief background, here is a simple implementation:

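#!/usr/bin/env python
# encoding: utf-8

'''
__author__: 沂水寒城
Purpose: compute the DOM tree depth of a page from its XPaths, plus the
mean, median, and variance of the path lengths
'''

import math

# Helper module from the previous article
from get_all_node_xpath import *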


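def get_tree_max_deepth(all_xpath_list):
    '''
    Return the length of the longest path in the XPath list of an HTML
    page, i.e. the maximum depth of the DOM tree.
    '''
    # '/html/body'.split('/') gives ['', 'html', 'body']; [1:] drops the
    # empty string in front of the leading slash
    tree_deepth_list = [len(one_xpath.split('/')[1:]) for one_xpath in all_xpath_list]
    return max(tree_deepth_list)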


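def calculate_fangcha(average_length_value, length_list):
    '''
    Compute the variance ("fangcha") of the values in a list, i.e. the
    population variance, without taking the square root.
    Input: the mean and the list
    Output: the variance
    '''
    total_sum = 0
    for one_num in length_list:
        total_sum += math.pow(one_num - average_length_value, 2)
    return total_sum / len(length_list)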



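def get_xpath_statics_features(all_xpath_list):
    '''
    Input: the XPath list of a page's DOM tree
    Output: the mean, median, and variance of the path lengths
    '''
    length_list = [len(one_xpath.split('/')[1:]) for one_xpath in all_xpath_list]
    # float() forces true division; with int / int, Python 2 truncates the
    # mean to an integer, which is how the sample output in this article
    # ends up showing a mean of 6
    average_length_value = float(sum(length_list)) / len(length_list)
    length_sorted_list = sorted(length_list)
    # Median: the middle element for an odd count, the average of the two
    # middle elements for an even count
    mid = len(length_sorted_list) // 2
    if len(length_sorted_list) % 2:
        middle_num = length_sorted_list[mid]
    else:
        middle_num = (length_sorted_list[mid - 1] + length_sorted_list[mid]) / 2.0
    fangcha = calculate_fangcha(average_length_value, length_list)
    return average_length_value, middle_num, fangcha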


if __name__ == '__main__':
    with open('../baidu.txt') as f:
        baidu = f.read()
    baidu_tree, baidu_xpath_list = get_clean_allnodes_xpath(baidu)
    max_tree_deepth = get_tree_max_deepth(baidu_xpath_list)
    for one_xpath in baidu_xpath_list:
        print(one_xpath)
    # %-formatting inside print(...) behaves the same on Python 2 and 3
    print('max_tree_deepth is: %s' % max_tree_deepth)
    average_length_value, middle_num, fangcha = get_xpath_statics_features(baidu_xpath_list)
    print('average_length_value is: %s' % average_length_value)
    print('middle_num is: %s' % middle_num)
    print('fangcha is: %s' % fangcha)

The output is as follows:

/html
/html/head
/html/head/meta[1]
/html/head/meta[2]
/html/head/meta[3]
/html/head/meta[4]
/html/head/title
/html/body
/html/body/p
/html/body/p/comment()[1]
/html/body/p/comment()[2]
/html/body/p/comment()[3]
/html/body/p/meta
/html/body/div[1]
/html/body/div[1]/div[1]
/html/body/div[1]/div[1]/div
/html/body/div[1]/div[1]/div/div[1]
/html/body/div[1]/div[1]/div/div[1]/div
/html/body/div[1]/div[1]/div/div[1]/div/div[1]
/html/body/div[1]/div[1]/div/div[1]/div/a
/html/body/div[1]/div[1]/div/div[1]/div/form
/html/body/div[1]/div[1]/div/div[1]/div/form/span[1]
/html/body/div[1]/div[1]/div/div[1]/div/form/span[2]
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/div
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/div/span
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/ul
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/ul/li[1]
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/ul/li[1]/a
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/ul/li[2]
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/ul/li[2]/a
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/ul/li[3]
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/ul/li[4]
/html/body/div[1]/div[1]/div/div[1]/div/form/span[3]/span/ul/li[4]/a
/html/body/div[1]/div[1]/div/div[1]/div/div[2]
/html/body/div[1]/div[1]/div/div[2]
/html/body/div[1]/div[1]/div/div[2]/a[1]
/html/body/div[1]/div[1]/div/div[2]/a[2]
/html/body/div[1]/div[1]/div/div[2]/a[3]
/html/body/div[1]/div[1]/div/div[3]
/html/body/div[1]/div[1]/div/div[3]/a[1]
/html/body/div[1]/div[1]/div/div[3]/a[2]
/html/body/div[1]/div[1]/div/div[3]/a[3]
/html/body/div[1]/div[1]/div/div[3]/a[4]
/html/body/div[1]/div[1]/div/div[3]/a[5]
/html/body/div[1]/div[1]/div/div[3]/a[6]
/html/body/div[1]/div[1]/div/div[3]/a[7]
/html/body/div[1]/div[1]/div/div[3]/a[8]
/html/body/div[1]/div[1]/div/div[3]/a[9]
/html/body/div[1]/div[2]
/html/body/div[1]/div[2]/a[1]
/html/body/div[1]/div[2]/a[2]
/html/body/div[1]/div[2]/a[3]
/html/body/div[1]/div[2]/a[4]
/html/body/div[1]/div[2]/a[5]
/html/body/div[1]/div[2]/a[6]
/html/body/div[1]/div[2]/a[7]
/html/body/div[1]/div[2]/a[8]
/html/body/div[1]/div[2]/a[9]
/html/body/div[1]/div[3]
/html/body/div[1]/div[3]/div
/html/body/div[1]/div[3]/div/div
/html/body/div[1]/div[3]/div/div/div[1]
/html/body/div[1]/div[3]/div/div/div[2]
/html/body/div[1]/div[3]/div/div/div[2]/p
/html/body/div[1]/div[4]
/html/body/div[1]/div[4]/div
/html/body/div[1]/div[4]/div/div
/html/body/div[1]/div[4]/div/div/p[1]
/html/body/div[1]/div[4]/div/div/p[1]/a[1]
/html/body/div[1]/div[4]/div/div/p[1]/a[2]
/html/body/div[1]/div[4]/div/div/p[1]/a[3]
/html/body/div[1]/div[4]/div/div/p[1]/a[4]
/html/body/div[1]/div[4]/div/div/p[2]
/html/body/div[1]/div[4]/div/div/p[2]/a[1]
/html/body/div[1]/div[4]/div/div/p[2]/a[2]
/html/body/div[1]/div[4]/div/div/p[2]/a[3]
/html/body/div[1]/div[5]
/html/body/div[2]
/html/body/div[3]
/html/body/div[4]
max_tree_deepth is: 13
average_length_value is: 6
middle_num is: 7
fangcha is: 8.29268292683
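
    The printed mean of 6 is an integer because the run shown used Python 2, where `/` between two ints truncates, and the variance of 8.29 was computed around that truncated mean. As a cross-check, the sketch below rebuilds the length list from a tally of the 82 paths in the listing and recomputes the statistics with the Python 3 `statistics` module. The `depth_counts` dictionary was counted by hand from the listing and is not part of the original script, so treat it as an assumption.

```python
import statistics

# depth -> number of paths with that depth, tallied by hand from the listing
depth_counts = {1: 1, 2: 2, 3: 10, 4: 9, 5: 12, 6: 5, 7: 17,
                8: 12, 9: 3, 10: 1, 11: 2, 12: 5, 13: 3}

# Expand the tally back into a sorted list of path lengths
lengths = [depth for depth, count in sorted(depth_counts.items())
           for _ in range(count)]

exact_mean = statistics.mean(lengths)
median = statistics.median(lengths)
exact_variance = statistics.pvariance(lengths)

# Reproduce the Python 2 figures: truncated mean, variance around it
truncated_mean = sum(lengths) // len(lengths)
variance_around_truncated = (
    sum((x - truncated_mean) ** 2 for x in lengths) / len(lengths)
)

print('paths: %d, max depth: %d' % (len(lengths), max(lengths)))
print('exact mean: %.4f (truncated: %d)' % (exact_mean, truncated_mean))
print('median: %s' % median)
print('exact variance: %.4f' % exact_variance)
print('variance around truncated mean: %.4f' % variance_around_truncated)
```

    With this tally the truncated figures agree with the printed output (mean 6, variance about 8.29), while the exact mean comes out at about 6.51 and the population variance at about 8.03. The median stays at 7 either way.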

    That is all for the statistical features of DOM tree paths. Comments and discussion are welcome!
