This article is a digest of Web Scraping with Python (2015).
Book download: https://bitbucket.org/xurongzhong/python-chinese-library/downloads
Source code: https://bitbucket.org/wswp/code
Demo site: http://example.webscraping.com/
Demo site code: http://bitbucket.org/wswp/places
Recommended Python primer: http://www.diveintopython.net
HTML and JavaScript basics: http://www.w3schools.com
Author's blog: http://my.oschina.net/u/1433482/
This article: http://my.oschina.net/u/1433482/blog/620858
Discussion: Python development & automated testing group 291184506; Python/Java unit & white-box testing group 144081101
Why scrape the web?
When shopping online, you may want to compare prices across sites, which is essentially what the Huihui shopping assistant does. An API makes this easy, but most sites do not offer one, and that is where web scraping comes in.
Is web scraping legal?
Scraping data for personal use is generally not illegal, but commercial use or republication requires thinking about licensing, and you should also crawl politely. Judging from cases already decided abroad, factual data such as locations and phone numbers can usually be republished, but original (creative) content cannot.
Further reading:
http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf
http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html
http://caselaw.findlaw.com/us-supreme-court/499/340.html
Background research
robots.txt and the Sitemap help you gauge a site's scale and structure; Google search and WHOIS are also useful tools.
For example: http://example.webscraping.com/robots.txt
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
For more about web robots, see http://www.robotstxt.org.
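As a minimal sketch (not from the book), Python's built-in robotparser module (urllib.robotparser in Python 3) can check whether a given user agent is allowed to fetch a URL before crawling it; the 'GoodCrawler' name below is just an illustrative agent string:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()

# Per the robots.txt above: BadCrawler is blocked entirely, while other agents
# may fetch anything except /trap (and should respect the 5-second Crawl-delay).
print rp.can_fetch('BadCrawler', 'http://example.webscraping.com/view/Afghanistan-1')   # False
print rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/view/Afghanistan-1')  # True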
The Sitemap protocol is described at http://www.sitemaps.org/protocol.html. For example:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
  ...
</urlset>
Sitemaps are often incomplete.
Estimating site size:
Use a Google site: query, for example: site:automationtesting.sinaapp.com
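As a rough sketch (not from the book), the site: query could be automated with urllib2; note that Google may throttle or block automated requests, and the "About N results" text matched below is only an assumption about its current markup:

import re
import urllib2

def estimate_site_size(domain):
    # Issue a Google "site:" query; the browser-like agent string and the
    # result-count pattern are assumptions and may stop working at any time.
    url = 'https://www.google.com/search?q=site:' + domain
    request = urllib2.Request(url, headers={'User-agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(request).read()
    match = re.search(r'About ([\d,]+) results', html)
    return match.group(1) if match else None

print estimate_site_size('automationtesting.sinaapp.com')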
Identifying the technology a site is built with:
# pip install builtwith
# ipython
In [1]: import builtwith

In [2]: builtwith.parse('http://automationtesting.sinaapp.com/')
Out[2]:
{u'issue-trackers': [u'Trac'],
 u'javascript-frameworks': [u'jQuery'],
 u'programming-languages': [u'Python'],
 u'web-servers': [u'Nginx']}
Finding out who owns a site:
# pip install python-whois
# ipython
In [1]: import whois

In [2]: print whois.whois('http://automationtesting.sinaapp.com')
{
  "updated_date": "2016-01-07 00:00:00",
  "status": [
    "serverDeleteProhibited https://www.icann.org/epp#serverDeleteProhibited",
    "serverTransferProhibited https://www.icann.org/epp#serverTransferProhibited",
    "serverUpdateProhibited https://www.icann.org/epp#serverUpdateProhibited"
  ],
  "name": null,
  "dnssec": null,
  "city": null,
  "expiration_date": "2021-06-29 00:00:00",
  "zipcode": null,
  "domain_name": "SINAAPP.COM",
  "country": null,
  "whois_server": "whois.paycenter.com.cn",
  "state": null,
  "registrar": "XIN NET TECHNOLOGY CORPORATION",
  "referral_url": "http://www.xinnet.com",
  "address": null,
  "name_servers": [
    "NS1.SINAAPP.COM",
    "NS2.SINAAPP.COM",
    "NS3.SINAAPP.COM",
    "NS4.SINAAPP.COM"
  ],
  "org": null,
  "creation_date": "2009-06-29 00:00:00",
  "emails": null
}
Crawling your first website
The code for a simple crawler is as follows:
import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
    return html
We can retry downloads based on the error code. HTTP status codes: https://tools.ietf.org/html/rfc7231#section-6. 4xx errors are not worth retrying, but 5xx errors are worth a retry.
import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
    return html
http://httpstat.us/500 always returns a 500 status code, so we can use it to test the retry logic:
>>> download('http://httpstat.us/500')
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Setting the user agent:
urllib2's default user agent is "Python-urllib/2.7", which many sites block. It is better to use an agent that looks like a real browser, for example:
Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0
To do this we add a user agent setting to download():
import urllib2

def download(url, user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors, passing user_agent through
                # so the retry keeps the same headers
                return download(url, user_agent, num_retries-1)
    return html
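A quick interactive check that the browser-like default agent is now sent (expected output, assuming the demo site responds normally):

>>> html = download('http://example.webscraping.com')
Downloading: http://example.webscraping.com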
Crawling the sitemap:
import re

def crawl_sitemap(url):
    # download the sitemap file
    sitemap = download(url)
    # extract the sitemap links
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # download each link
    for link in links:
        html = download(link)
        # scrape html here
        # ...
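For example, pointing it at the demo site's sitemap (listed in the robots.txt above) should download each country page in turn, producing output along these lines:

>>> crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/view/Afghanistan-1
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Albania-3
...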