This article shows how to crawl AJAX-generated data with Python, using the comments of a Kindle product on Tmall as the example. The tools used are:
Tools
- Chrome browser (to find the dynamic comment URL)
- Python 3.5 (to run the code)
- MySQL (to store the crawled data)
First, we need to locate the comment list on the Kindle product page.
Unlike an ordinary static page, the URL of dynamically loaded content does not show up in the browser's address bar, so it cannot be read off directly. It can, however, be found with the following steps:
Right click - Inspect - Network
Scroll down to the comment list and click page 2. Several dynamically generated requests appear under Name on the left; among them is the list_detail_rate.htm request, which is exactly the dynamic link we are after.
The link looks like this:
https://rate.tmall.com/list_detail_rate.htm?itemId=522680881881&spuId=337259102&sellerId=2099020602&order=3&currentPage=2&append=0&content=1&tagId=&posi=&picture=&ua=098%23E1hvQ9vyv3wvjQCkvvvvvjiPPLFO6jtjPsLv6jljPmP9tj3vn2S9QjD8RF59Qj8CvpvZzC1XfqdNznsYWf1ftszG8a197IVCvpvZzPQHwPbNznsYLirft%2FhwBS137IQjvpvjzn147kvWEpwCvvNwzHi4UnKvRphvCvvvvvvjvpvjzn147kmbNOhCvvswjVi37%2FMwzP0UDxurvpvEvvFR9ziTEUxrRphvCvvvvvmCvpvWz2QXOHqSznQGOhC49phv2nQGV7CJzYswPh287u6CvvyvhWv21%2BOWbj%2BtvpvhvvvvvUhCvvswMvX9OYMwznsJHlItvpvhvvvvvvwCvvNwzHi4zA9KRphvCvvvvvmtvpvIvvvvvhCvvvvvvUUdphvU79vv9krvpvQvvvmm86CvmVWvvUUdphvUOgyCvvOUvvVvay7ivpvUvvmvWRwENM6EvpCWCjgqvvw1tb2XSfpAOH2%2BFOcn%2B3C1oGex6aZtn0vHfJClYb8rwZxl%2BExreCIaUExrgjZ7%2B3%2BFaNoxfX94jLVDYExrj8tMoYswtRkw5vhCvvOvCvvvphvPvpvhMMGvv8wCvvpvvUmm3QhvCvvhvvmrvpvEvvLj9zazvH9VRphvCvvvvvmrvpvEvvpz9t%2FNEmmb9phv2nQwK0HmzYswMweG7FyCvvpvvvvvCQhvCYsw7DI9yTOjvpvjzn147kvwdpwCvvNwzHi4U2sbdphvmpvhlQ%2BkUkB8SUhCvvswMWBQWYMwzPlp3DurvpvEvvQvk36aE8Uj&isg=Au_vsj0y6SFkq--P45gTBL6VfgM5PEJVJrby_QF8jd5lUA1SEmTTBu0SpnYV&needFold=0&_ksTS=1514646571651_2607&callback=jsonp2608
The itemId differs from product to product; you can find it on the corresponding product's detail page. To crawl several pages in succession, you only need to change the currentPage parameter (currentPage=2 above).
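Rather than editing the long link by hand, the query string can also be rebuilt with the standard library. A minimal sketch, assuming the endpoint accepts the request without the ua, isg and _ksTS parameters captured in Chrome (the itemId, spuId and sellerId values are the ones from the link above):

from urllib.parse import urlencode

# Build the comment URL for a given item and page number; the extra
# browser parameters (ua, isg, _ksTS) are omitted here for readability.
def build_comment_url(item_id, spu_id, seller_id, page):
    params = {
        'itemId': item_id,
        'spuId': spu_id,
        'sellerId': seller_id,
        'order': 3,
        'currentPage': page,
        'append': 0,
        'content': 1,
    }
    return 'https://rate.tmall.com/list_detail_rate.htm?' + urlencode(params)

print(build_comment_url(522680881881, 337259102, 2099020602, 2))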
This article first downloads the JSON data from the link, parses it, walks through every record, and stores the extracted fields in MySQL, so the prerequisite is that a matching table exists in MySQL (a sketch follows below).
If you are not comfortable with SQL statements, you can create the table visually in Navicat instead.
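The original table screenshot is not reproduced here, so the following is a minimal sketch of a schema that matches the insert statement used in the crawler; the column names come from the code, while the column types and lengths are assumptions:

import pymysql

# Create the tb_kindle table in the jd database. The column names match
# the insert statement in the crawler; types and lengths are assumptions.
connection = pymysql.connect(host='localhost', user='root',
                             password='password', db='jd', charset='utf8')
try:
    with connection.cursor() as cursor:
        cursor.execute("""
            create table if not exists `tb_kindle` (
                `uid`             varchar(32),
                `aliMallSeller`   varchar(8),
                `anony`           varchar(8),
                `auctionSku`      varchar(255),
                `buyCount`        varchar(16),
                `cmsSource`       varchar(32),
                `displayUserNick` varchar(64),
                `fromMall`        varchar(8),
                `fromMemory`      varchar(16),
                `gmtCreateTime`   varchar(32),
                `goldUser`        varchar(8),
                `rateContent`     text,
                `rateDate`        varchar(32),
                `sellerId`        varchar(32)
            ) default charset=utf8
        """)
finally:
    connection.close()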
Next we start crawling. The complete code follows:
# -*- coding: utf-8 -*-
import urllib.request
import json
import time
import random
import pymysql.cursors
# Download the JSON data from the given URL, parse it, and store the key fields in MySQL
def crawlProductComment(url):
    # Fetch the raw response (note: this endpoint is served in gbk encoding)
    html = urllib.request.urlopen(url).read().decode('gbk')
    # Strip the jsonp wrapper and keep the JSON payload
    # (everything from the first '{' to the last '}')
    jsondata = html[html.find('{'):html.rfind('}') + 1]
    print(jsondata)
    # Decode the JSON string into a Python object
    data = json.loads(jsondata)
    # Iterate over the comment list (the jsonp payload nests it under rateDetail.rateList)
    for i in data['rateDetail']['rateList']:
        uid = i['id']
        aliMallSeller = i['aliMallSeller']
        anony = i['anony']
        auctionSku = i['auctionSku']
        buyCount = i['buyCount']
        cmsSource = i['cmsSource']
        displayUserNick = i['displayUserNick']
        fromMall = i['fromMall']
        fromMemory = i['fromMemory']
        gmtCreateTime = i['gmtCreateTime']
        goldUser = i['goldUser']
        rateContent = i['rateContent']
        rateDate = i['rateDate']
        sellerId = i['sellerId']
        # Print the key information of each comment
        print("Comment time: {}".format(rateDate))
        print("-----------------------------")
        # Open a database connection
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='password',
                                     db='jd',
                                     charset='utf8')
        try:
            with connection.cursor() as cursor:
                # Build the insert statement
                sql = "insert into `tb_kindle` (`uid`,`aliMallSeller`,`anony`,`auctionSku`,`buyCount`,`cmsSource`,`displayUserNick`,`fromMall`,`fromMemory`,`gmtCreateTime`," \
                      "`goldUser`,`rateContent`,`rateDate`,`sellerId`) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                # Execute it with the values extracted above
                cursor.execute(sql, (uid, aliMallSeller, anony, auctionSku, buyCount, cmsSource, displayUserNick, fromMall, fromMemory, gmtCreateTime,
                                     goldUser, rateContent, rateDate, sellerId))
                # Commit the transaction
                connection.commit()
        finally:
            connection.close()

# Crawl the comment pages in a loop
if __name__ == '__main__':
    for i in range(1, 500):
        print("Fetching page {} of comment data!".format(i))
        # Comment URL for the kindle item; the currentPage parameter is changed to read successive pages
        url = 'https://rate.tmall.com/list_detail_rate.htm?itemId=522680881881&spuId=337259102&sellerId=2099020602&order=3&currentPage='+str(i)+'&append=0&content=1&tagId=&posi=&picture=&ua=098%23E1hv7pvovLWvUvCkvvvvvjiPPLMOQjnhPLSpgjEUPmPpQjrUR2cwtjEvn2FWtjrURphvCvvvphmCvpvWzPQ3w3cNznswO6a4dphvmpvCWomFvvv7E46Cvvyv9ET7tvvvk%2BhtvpvhvvCvpUwCvvpv9hCviQhvCvvvpZpPvpvhvv2MMqyCvm9vvhCvvvvvvvvvBBWvvvHbvvCHhQvv9pvvvhZLvvvCfvvvBBWvvvH%2BuphvmvvvpoViwCEXkphvC9hvpyPOsvyCvhACFKLyjX7re8TxEcqvaB4AdB9aUU31K39XVoE%2FlwvXeXyKnpcUA2WKK33ApO7UHd8re169kU97%2Bu04jo2v%2BboJ5E3Apf2XrqpAhjvnvphvC9mvphvvv2yCvvpvvhCv9phv2nsGM7VkqYswzPld7u6Cvvyvvog0XpvvjBUtvpvhvvCvpUhCvCLwPPC1ErMwznQyCxSSmPsSzha49p%3D%3D&isg=AqSkEzQqoiDsXtSOfIGIVQlMdaJWlclEcT_pvL7FSm88aUYz5k2YN9rbXfcK&needFold=0&_ksTS=1513608734625_1700&callback=jsonp1701'
        crawlProductComment(url)
        # Sleep for a random interval between requests
        time.sleep(random.randint(30, 70))
Taobao has fairly elaborate anti-crawler mechanisms, so although this code runs, it will start throwing errors after a while; when that happens, simply reset the starting page number and rerun the script to keep collecting data. There are of course cleverer ways to harden the code against the anti-crawler measures; my knowledge here is limited, so that will have to wait for a later post.
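One simple first step, offered as a hedge rather than a full solution: send a browser-like User-Agent (and, if needed, the Cookie header copied from Chrome's Network panel) with each request. A minimal sketch, assuming the endpoint otherwise behaves as it does in the browser:

import urllib.request

# Fetch a URL with browser-like headers attached; whether this alone is
# enough to get past Taobao's anti-crawler checks is not guaranteed.
def open_with_headers(url):
    req = urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3239.84 Safari/537.36',
        # 'Cookie': 'paste the cookie captured in Chrome here if required',
    })
    return urllib.request.urlopen(req).read().decode('gbk')

Replacing the urllib.request.urlopen(url).read().decode('gbk') line in crawlProductComment with open_with_headers(url) leaves the rest of the code unchanged.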
Postscript: data is crawled in order to be analyzed, so a follow-up post will use Python to do text mining on the Tmall comments and dig into what tens of millions of comments can tell us.