Python in Practice Homework: Week 1, Assignment 3 (Scraping Rental Listings)

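This is the third assignment of the week: scraping rental listings from Xiaozhu (小猪短租, xiaozhu.com). The final result looks like this: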

[Screenshot of the scraper's output: QQ截图20160521182308.png]

The code is as follows:

from bs4 import BeautifulSoup
import requests
from multiprocessing import Pool
header = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36',
    'Cookie':'abtest_ABTest4SearchDate=b; xzuuid=485d5ab9; _ga=GA1.2.549673269.1462555281; __utma=29082403.549673269.1462555281.1462688303.1463800484.5; __utmb=29082403.4.10.1463800484; __utmc=29082403; __utmz=29082403.1462555283.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); OZ_1U_2282=vid=v72cd27f8d3be5.0&ctime=1463801293&ltime=1463800538; OZ_1Y_2282=erefer=-&eurl=http%3A//bj.xiaozhu.com/&etime=1463800480&ctime=1463801293&ltime=1463800538&compid=2282',
    'Connection': 'keep-alive'
}

def get_page_url(path, page):  # collect the detail-page URLs from the first `page` search result pages
    info_urls = []
    for i in range(1, page + 1):
        url = '{}search-duanzufang-p{}-0/'.format(path, i)
        wb_data = requests.get(url, headers=header)
        soup    = BeautifulSoup(wb_data.text,'lxml')
        for info_url in soup.select('#page_list > ul > li > a'):
            info_urls.append(info_url.get('href'))
    return info_urls

def get_gender(info_url):  # determine the host's gender from the detail page
    wb_data = requests.get(info_url, headers=header)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # .get('class') returns a list: ['member_ico1'] for a female host, ['member_ico'] for a male host
    get_host_gender = soup.select('div.member_pic div')[0].get('class')
    # print(get_host_gender)
    if get_host_gender == ['member_ico1']:
        return 'female'
    if get_host_gender == ['member_ico']:
        return 'male'
    return None  # unrecognized icon class

def info_page(info_url):  # scrape the fields of one detail page
    wb_data = requests.get(info_url, headers=header)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    title = soup.select('div.pho_info > h4 > em')[0].text
    address = soup.select('span.pr5')[0].text
    price = soup.select('div.day_l span')[0].text
    host_host = soup.select('a.lorder_name')[0].text
    image = soup.select('#curBigImage')[0].get('src')
    data = {
        'title'       : title,
        'address'     : address,
        'price'       : price,
        'host_host'   : host_host,
        'host_gender' : get_gender(info_url),
        'image'       : image,
        'url'         : info_url
    }
    print(data)

def all_info_page(info_url):  # worker function run by each process in the pool
    # info_page() already calls get_gender(), so no separate call is needed here
    info_page(info_url)

if __name__ == '__main__':
    # Build the URL list and start the pool only in the main process, and only after
    # every function above is defined, so that the worker processes can find them.
    page_lists = get_page_url('http://bj.xiaozhu.com/', 10)
    with Pool() as pool:
        pool.map(all_info_page, page_lists)

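Summary:
I was catching up on the week 1 homework while actually already in week 3 of the course, so I raised the difficulty for myself and used multiple processes for the scraping.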
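As a rough illustration of the multi-process approach, here is a minimal sketch of Pool.map with no network access. The describe and crawl functions and the example.invalid URLs are hypothetical stand-ins, not part of the assignment code; describe simply takes the place of a real page-scraping function.

```python
from multiprocessing import Pool


def describe(url):
    # Hypothetical stand-in for a scraping function: it does no network
    # I/O and just returns a small record derived from the URL.
    return {'url': url, 'length': len(url)}


def crawl(urls, processes=2):
    # Pool.map distributes `urls` across the worker processes and
    # returns the results in the same order as the input list.
    with Pool(processes) as pool:
        return pool.map(describe, urls)


if __name__ == '__main__':
    # Placeholder URLs, not real listings.
    sample = ['http://example.invalid/fangzi/{}.html'.format(i) for i in range(4)]
    for record in crawl(sample):
        print(record)
```

The worker function has to live at module level so it can be sent to the worker processes, and the pool is started under the `if __name__ == '__main__':` guard so that child processes which re-import the module (the default on Windows) do not start pools of their own.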
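The tricky part was getting the host's gender. I defined a get_gender function for this: the class obtained from div.member_pic > div turns out to be 'member_ico' when the host is male and 'member_ico1' when the host is female, so if statements on that class determine the gender.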
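The class-to-gender mapping can also be sketched on its own, separate from the page fetching. This is an illustrative variant rather than the code used above: GENDER_BY_CLASS and gender_from_classes are hypothetical names, and the input is assumed to be the list of class names that BeautifulSoup returns from .get('class').

```python
# Mapping taken from the observation above: 'member_ico' marks a male
# host and 'member_ico1' a female host.
GENDER_BY_CLASS = {
    'member_ico': 'male',
    'member_ico1': 'female',
}


def gender_from_classes(classes):
    # `classes` is a list of class names, or None when the element has
    # no class attribute. Checking each name separately still works if
    # the element carries additional classes.
    for name in classes or []:
        if name in GENDER_BY_CLASS:
            return GENDER_BY_CLASS[name]
    return 'unknown'


print(gender_from_classes(['member_ico']))
print(gender_from_classes(['member_ico1']))
print(gender_from_classes(None))
```

A dictionary lookup keeps the two known class names in one place and gives an explicit 'unknown' result when the page uses a class that was not seen during the assignment.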
