Fetching nationwide secondary-school data (province/city/county) from Renren

A recent project of mine needed data on secondary schools across the whole country, and compiling it by hand was clearly infeasible. Then I came across the article "Scraping Renren's university database with HttpClient" (http://www.iteye.com/topic/826988) and wondered whether the school data could be pulled from Renren the same way. The steps are below:

1 Install the HttpFox add-on in Firefox to monitor HTTP requests.

2 Log in to Renren, open the page for editing your basic profile, and click to edit your school information.

3 Open HttpFox and click Start to begin capturing HTTP traffic, as shown below.



4 Click the high-school entry under school information; a school-selection dialog pops up. Watch the HTTP requests HttpFox captures.

(image 1)

(image 2)



Look at the highlighted row: that is the school data coming back from the server. The Content field holds the response, a chunk of plain HTML. Paste the request URL into a browser and the page below appears, which leads to a useful conclusion: Renren serves this resource without any session check, saving a lot of trouble.

(image 3)


Click through a few other provinces to compare:


Each province requests a different URL. Analysis shows that each city maps to one HTML file, while the two upper levels, province and city, can be found in the file cityArray.js, whose structure looks like this:

var _city_1=["110101:\u4e1c\u57ce\u533a","110102:\u897f\u57ce\u533a","110103:\u5d07\u6587\u533a","110104:\u5ba3\u6b66\u533a","110105:\u671d\u9633\u533a","110106:\u4e30\u53f0\u533a","110107:\u77f3\u666f\u5c71\u533a","110108:\u6d77\u6dc0\u533a","110109:\u95e8\u5934\u6c9f\u533a","110111:\u623f\u5c71\u533a","110112:\u901a\u5dde\u533a","110113:\u987a\u4e49\u533a","110114:\u660c\u5e73\u533a","110115:\u5927\u5174\u533a","110116:\u6000\u67d4\u533a","110117:\u5e73\u8c37\u533a","110228:\u5bc6\u4e91\u53bf","110229:\u5ef6\u5e86\u53bf"];
var _city_2=["310101:\u9ec4\u6d66\u533a","310103:\u5362\u6e7e\u533a","310104:\u5f90\u6c47\u533a","310105:\u957f\u5b81\u533a","310106:\u9759\u5b89\u533a","310107:\u666e\u9640\u533a","310108:\u95f8\u5317\u533a","310109:\u8679\u53e3\u533a","310110:\u6768\u6d66\u533a","310112:\u95f5\u884c\u533a","310113:\u5b9d\u5c71\u533a","310114:\u5609\u5b9a\u533a","310115:\u6d66\u4e1c\u65b0\u533a","310116:\u91d1\u5c71\u533a","310117:\u677e\u6c5f\u533a","310118:\u9752\u6d66\u533a","310119:\u5357\u6c47\u533a","310120:\u5949\u8d24\u533a","310230:\u5d07\u660e\u53bf"];
var _city_3=["120101:\u548c\u5e73\u533a","120102:\u6cb3\u4e1c\u533a","120103:\u6cb3\u897f\u533a","120104:\u5357\u5f00\u533a","120105:\u6cb3\u5317\u533a","120106:\u7ea2\u6865\u533a","120107:\u5858\u6cbd\u533a","120108:\u6c49\u6cbd\u533a","120109:\u5927\u6e2f\u533a","120110:\u4e1c\u4e3d\u533a","120111:\u897f\u9752\u533a","120112:\u6d25\u5357\u533a","120113:\u5317\u8fb0\u533a","120114:\u6b66\u6e05\u533a","120115:\u5b9d\u577b\u533a","120221:\u5b81\u6cb3\u53bf","120223:\u9759\u6d77\u53bf","120225:\u84df\u53bf"];
Each line holds one province's city data, with the Chinese names Unicode-escaped. They can be decoded programmatically; in Python 2, the following decodes escapes of the form \u897f\u57ce:

s="\u6cb3"
s.decode('unicode_escape')
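That snippet relies on Python 2, where `str.decode('unicode_escape')` exists. In Python 3 the equivalent round-trips through bytes:

```python
# Python 3: str has no .decode(), so encode to bytes first
s = "\\u6cb3"  # the literal backslash-u sequence as it appears in the file
decoded = s.encode("ascii").decode("unicode_escape")
print(decoded)  # 河
```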

From this file we can reconstruct the two-level province/city structure. The code:

def getProvinceData():
    content = open("/home/xiyang/workspace/school-data/cityArray.js")
    # extract each city-level id and name
    pattern = re.compile("(\d+):([\w\d\\\\]+)")
    provinceList = []
    for line in content.readlines():
        data = pattern.findall(line)
        citys = []
        province = {}
        for s in data:
            if len(s[0]) == 4:  # a prefecture-level city has a 4-digit id
                citys.append({"id": s[0], "name": s[1].decode('unicode_escape')})

        # the province id is the first entry's id, or its first 4 digits for a
        # direct-controlled municipality, whose entries are 6-digit district ids
        province_id = len(data[0][0]) == 4 and data[0][0] or data[0][0][0:4]

        # only keep the provinces listed in provinceMap
        if provinceMap.has_key(int(province_id)):
            province['id'] = province_id
            province['name'] = provinceMap[int(province_id)]
            province['citys'] = citys
            provinceList.append(province)

    return provinceList
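A quick self-contained check of this extraction against a sample line from cityArray.js (written in Python 3 with an equivalent raw-string pattern, hence the bytes round-trip when decoding):

```python
import re

# one line of cityArray.js; the \u sequences are literal text in the file
line = r'var _city_1=["110101:\u4e1c\u57ce\u533a","110102:\u897f\u57ce\u533a"];'

pattern = re.compile(r"(\d+):([\w\\]+)")  # numeric id, then the escaped name
for city_id, raw_name in pattern.findall(line):
    name = raw_name.encode("ascii").decode("unicode_escape")
    print(city_id, name)  # 110101 东城区 / 110102 西城区
```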
HttpFox shows that each city maps to one HTML file containing all of that city's districts/counties and schools. The request URL looks similar to http://support.renren.com/juniorschool/1101.html, and a fragment of the returned data:

<ul id="schoolCityQuList" class="module-qulist"><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370102')">历下区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370103')">市中区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370104')">槐荫区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370105')">天桥区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370112')">历城区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370113')">长清区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370124')">平阴县</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370125')">济阳县</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370126')">商河县</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370181')">章丘市</a></li></ul>

<ul id="city_qu_370102" style="display:none;">
<li><a onclick='if(SchoolComponent.cl_school){return SchoolComponent.cl_school(event,40019572)}' href="40019572">山师大附中</a></li>
<li><a onclick='if(SchoolComponent.cl_school){return SchoolComponent.cl_school(event,40033777)}' href="40033777">济南二十四中</a></li>
<li><a onclick='if(SchoolComponent.cl_school){return SchoolComponent.cl_school(event,40033962)}' href="40033962">济宁育才学校</a></li>

These pages are easy to fetch with urllib2:

# fetch the school page for one city-level region; for a direct-controlled
# municipality this is the page for the whole municipality
def getTownHtml(town_id):
    try:
        url = "http://support.renren.com/juniorschool/%s.html" % town_id
        print "fetching:", url
        return urllib2.urlopen(url).read()
    except urllib2.URLError:
        print "network error!"
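For reference, a Python 3 sketch of the same fetch (urllib2 was merged into urllib.request there; the timeout value is just an example):

```python
import urllib.request

def town_url(town_id):
    # same URL scheme as getTownHtml above
    return "http://support.renren.com/juniorschool/%s.html" % town_id

def get_town_html(town_id):
    try:
        with urllib.request.urlopen(town_url(town_id), timeout=10) as resp:
            return resp.read()
    except OSError as e:  # URLError is a subclass of OSError
        print("network error:", e)
        return None
```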

5 There are several ways to parse this data:

  • Parse the HTML directly in the browser with jQuery and save the result via the file APIs. Parsing is convenient this way, but file handling in Chrome and Firefox is awkward, so I didn't pursue it.
  • Use regular expressions. Extracting the districts this way is fairly simple, but pulling out the complete structure gets unwieldy.
  • Use an HTML parser to walk the structure and extract the data. This is what I settled on.
Python ships several HTML-parsing tools, such as HTMLParser, sgmllib and htmllib, but they are all event-driven and not well suited to this kind of structured extraction. I ended up with BeautifulSoup, which handles this format trivially. The code:
# parse all the districts/counties and their schools for one city-level region
def getCitySchool(content):
    soup = BeautifulSoup(content)

    # the schools of this city, grouped by district
    citySchoolData = []
    # the district/county links
    townlist = soup.findAll('a', href="#highschool_anchor")

    for town in townlist:
        d = {}
        d['name'] = getUnicodeStr(town.string)
        # slice "city_qu_XXXXXX" out of the onclick attribute
        d['id'] = town['onclick'][24:38]
        townSchools = []
        # the schools of each district
        for school in soup.find('ul', id=d['id']).findChildren('a'):
            townSchools.append(getUnicodeStr(school.string))
        d['schoollist'] = townSchools

        citySchoolData.append(d)

    return citySchoolData
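The hard-coded slice `town['onclick'][24:38]` works because the onclick text is always `SchoolComponent.tihuan('city_qu_XXXXXX')`; a regex makes the same extraction less brittle if that prefix ever changes:

```python
import re

onclick = "SchoolComponent.tihuan('city_qu_370102')"
# capture whatever sits between the quotes instead of counting characters
ul_id = re.search(r"tihuan\('([^']+)'\)", onclick).group(1)
print(ul_id)           # city_qu_370102
print(onclick[24:38])  # the fixed slice yields the same value
```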

The function above parses the HTML and returns every district and school for one city.

The result of a run:
(image 4)

From here, further processing depends on your own needs.
The complete code:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
#============================================
# Author:[email protected]
# date:2012-12-29
# description: parse Renren's nationwide secondary-school data
# Approach:
# 1 Get the nationwide province/city data (download cityArray.js and parse it with a regex)
# 2 Each city (municipalities included) maps to one HTML file holding all of its
#   districts/counties and schools; fetch it over the network with urllib2
# 3 Parse the fetched HTML with BeautifulSoup and extract the data
# 4 Store the result in MongoDB
#============================================
import urllib2
import re
from BeautifulSoup import BeautifulSoup
from pymongo import MongoClient

db_host = "127.0.0.1"
db_port = 27017
db_name = "openclass"

provinceMap = {
    "北京":1101,
    "上海":3101,
    "天津":1201,
    "重庆":5001,
    "黑龙江":2301,
    "吉林":2201,
    "辽宁":2101,
    "山东":3701,
    "山西":1401,
    "陕西":6101,
    "河北":1301,
    "河南":4101,
    "湖北":4201,
    "湖南":4301,
    "海南":4601,
    "江苏":3201,
    "江西":3601,
    "广东":4401,
    "广西":4501,
    "云南":5301,
    "贵州":5201,
    "四川":5101,
    "内蒙古":1501,
    "宁夏":6401,
    "甘肃":6201,
    "青海":6301,
    "西藏":5401,
    "新疆":6501,
    "安徽":3401,
    "浙江":3301,
    "福建":3501,
    "香港":8101,
}
provinceMap = dict([[v,k] for k,v in provinceMap.items()])

# decode a string of numeric HTML entities, e.g. "&#21271;&#20140;" -> 北京
def getUnicodeStr(s):
    name = []
    for word in s.split(";"):
        try:
            name.append(unichr(int(word[2:])))  # word looks like "&#21271"
        except:
            pass
    return "".join(name)

# fetch the school page for one city-level region; for a direct-controlled
# municipality this is the page for the whole municipality
def getTownHtml(town_id):
    try:
        url = "http://support.renren.com/juniorschool/%s.html" % town_id
        print "fetching:", url
        return urllib2.urlopen(url).read()
    except urllib2.URLError:
        print "network error!"
         
def getProvinceData():
    content = open("/home/xiyang/workspace/school-data/cityArray.js")
    # extract each city-level id and name
    pattern = re.compile("(\d+):([\w\d\\\\]+)")
    provinceList = []
    for line in content.readlines():
        data = pattern.findall(line)
        citys = []
        province = {}
        for s in data:
            if len(s[0]) == 4:  # a prefecture-level city has a 4-digit id
                citys.append({"id": s[0], "name": s[1].decode('unicode_escape')})

        # the province id is the first entry's id, or its first 4 digits for a
        # direct-controlled municipality, whose entries are 6-digit district ids
        province_id = len(data[0][0]) == 4 and data[0][0] or data[0][0][0:4]

        # only keep the provinces listed in provinceMap
        if provinceMap.has_key(int(province_id)):
            province['id'] = province_id
            province['name'] = provinceMap[int(province_id)]
            province['citys'] = citys
            provinceList.append(province)

    return provinceList


# parse all the districts/counties and their schools for one city-level region
def getCitySchool(content):
    soup = BeautifulSoup(content)

    # the schools of this city, grouped by district
    citySchoolData = []
    # the district/county links
    townlist = soup.findAll('a', href="#highschool_anchor")

    for town in townlist:
        d = {}
        d['name'] = getUnicodeStr(town.string)
        # slice "city_qu_XXXXXX" out of the onclick attribute
        d['id'] = town['onclick'][24:38]
        townSchools = []
        # the schools of each district
        for school in soup.find('ul', id=d['id']).findChildren('a'):
            townSchools.append(getUnicodeStr(school.string))
        d['schoollist'] = townSchools

        citySchoolData.append(d)

    return citySchoolData

conn = MongoClient(db_host)
db = conn.openclass
juniorschool = db.juniorschool

if __name__ == "__main__":
    provinceList = getProvinceData()
    print provinceList

    for province in provinceList:
        citys = province['citys']
        if citys:  # has cities, so not a direct-controlled municipality
            for city in citys:
                data = {"id": city['id'], "name": city['name'], "data": getCitySchool(getTownHtml(city['id']))}
                print "insert into mongodb:", city['name']
                juniorschool.insert(data)

        else:  # a direct-controlled municipality
            data = {"id": province['id'], "name": province['name'], "data": getCitySchool(getTownHtml(province['id']))}
            print "insert into mongodb:", province['name']
            juniorschool.insert(data)
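Incidentally, the decoding of numeric HTML entities that getUnicodeStr does by hand is covered by Python 3's standard library:

```python
import html

decoded = html.unescape("&#21271;&#20140;")  # numeric entities for 北京
print(decoded)  # 北京
```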
After the script finishes, the data is stored in MongoDB, as shown:
(image 5)
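Each document inserted into the juniorschool collection has the shape sketched below (the values are illustrative); flattening it into (city, district, school) rows for downstream use is a one-liner:

```python
# shape of one stored document (illustrative values)
doc = {
    "id": "3701",
    "name": "济南",
    "data": [
        {"name": "历下区", "id": "city_qu_370102",
         "schoollist": ["山师大附中", "济南二十四中"]},
    ],
}

# one row per school, tagged with its city and district
rows = [(doc["name"], d["name"], s)
        for d in doc["data"] for s in d["schoollist"]]
print(rows)  # [('济南', '历下区', '山师大附中'), ('济南', '历下区', '济南二十四中')]
```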

