A First Taste of Python Web Scraping: A Simple Example

A simple scraping example

  • Target URL: http://www.ci123.com/baike/nbnc/31
    Expected output: a table (Excel or a database) with three fields: category, title, and HTML rich text. The code below writes to Excel; a database variant is sketched after the listing.
  • The scraper code is as follows:
import requests
from bs4 import BeautifulSoup
import xlwt

url = 'http://www.ci123.com/baike/nbnc/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
resp = requests.get(url, headers=headers)
# print(resp.text)
main_page = BeautifulSoup(resp.text, "html.parser")
aList = main_page.find("dl", class_="catagory").find_all('a')
# print(aList)
cateSrc = {}  # category name -> category page URL
index = 1     # current worksheet row
book = xlwt.Workbook(encoding='utf-8')
worksheet = book.add_sheet('sheet')
# header row: 序号 (index), 分类 (category), 标题 (title), html富文本 (HTML rich text)
worksheet.write(0, 0, "序号")
worksheet.write(0, 1, "分类")
worksheet.write(0, 2, "标题")
worksheet.write(0, 3, "html富文本")
for a in aList:
    src = a.get('href')
    name = a.string
    # collect each category's name and link
    cateSrc[name] = src
for k, v in cateSrc.items():
    resp1 = requests.get(v, headers=headers)
    main_page1 = BeautifulSoup(resp1.text, "html.parser")  # specify the HTML parser explicitly
    # "class" is a Python keyword, so BeautifulSoup takes the class_ argument instead
    aCate = main_page1.find('ul', class_="food-list").find_all('div', class_="detail")
    cateTitleSrc = {}  # reset per category, otherwise earlier categories' titles accumulate and get re-crawled
    for cate in aCate:
        cate_src = cate.find('a').get('href')
        cate_title = cate.find('a').string
        # record each article's title and detail-page link for the crawl below
        cateTitleSrc[cate_title] = cate_src
    for k1, v1 in cateTitleSrc.items():
        resp1_1 = requests.get(v1, headers=headers)
        main_page1_1 = BeautifulSoup(resp1_1.text, "html.parser")
        # grab the rich-text HTML block of the detail page
        cateHtml = main_page1_1.find('div', class_="container")
        print("分类:" + k, '标题:' + k1)
        print('\n')
        print(cateHtml)

        worksheet.write(index, 0, index)
        worksheet.write(index, 1, k)
        worksheet.write(index, 2, k1)
        # note: xlwt caps a cell at 32767 characters, so very long pages may fail here
        worksheet.write(index, 3, str(cateHtml))
        index += 1

book.save('test1.xls')  # save once at the end instead of rewriting the file on every row
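
The output spec above also allows a database instead of Excel. Below is a minimal sketch of that variant using the standard-library sqlite3 module; the scraper.db filename and the baike table name are assumptions, not from the original.

import sqlite3

conn = sqlite3.connect('scraper.db')  # hypothetical database file
conn.execute(
    'CREATE TABLE IF NOT EXISTS baike ('
    'id INTEGER PRIMARY KEY, category TEXT, title TEXT, html TEXT)'
)
# inside the inner loop, in place of the worksheet.write(...) calls:
# conn.execute('INSERT INTO baike (category, title, html) VALUES (?, ?, ?)',
#              (k, k1, str(cateHtml)))
conn.commit()
conn.close()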
  • The fetch-and-parse steps above (request a URL, get its text, build a BeautifulSoup object) could be wrapped in a single helper method, as sketched below.
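
A minimal sketch of that refactor, using the same requests/BeautifulSoup stack; the name get_soup is a hypothetical choice, not from the original:

import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}

def get_soup(url):
    """Fetch a URL and return the parsed BeautifulSoup document."""
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()  # fail loudly instead of parsing an error page
    return BeautifulSoup(resp.text, "html.parser")

# usage: replaces each requests.get(...) + BeautifulSoup(...) pair above, e.g.
# main_page = get_soup('http://www.ci123.com/baike/nbnc/')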
