The Scrapy shell is an interactive shell. Once you get used to it, you will find it an excellent tool for developing and testing your spiders.
Before using the Scrapy shell, it is highly recommended to install IPython: Scrapy uses it automatically when it is available, and falls back to the standard Python console otherwise. (On Windows, prebuilt IPython packages can be found at http://www.lfd.uci.edu/~gohlke/pythonlibs/.)
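On most platforms the simplest route is pip (a hedged example; adjust for your own Python environment):

pip install ipython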
Launching the shell

You can launch the shell with the following command:

scrapy shell <url>

where <url> is the URL of the page you want to scrape. For example:

scrapy shell http://blog.csdn.net/php_fly --nolog

After this command runs, the Scrapy downloader fetches the page at the given URL and prints the list of available objects and shortcuts:
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000000002AEF7B8>
[s]   item       {}
[s]   request    <GET http://blog.csdn.net/php_fly>
[s]   response   <200 http://blog.csdn.net/php_fly>
[s]   sel        <Selector xpath=None data=u'<html xmlns="http://www.w3.org/1999/xhtm'>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <Spider 'default' at 0x4cdb940>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
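As a quick illustration of these shortcuts (the URL below is just the one from this example; any page works):

In [1]: fetch("http://blog.csdn.net/php_fly")   # re-download the page and update request/response/sel

In [2]: response.status                         # HTTP status of the page just fetched
Out[2]: 200

In [3]: view(response)                          # open the downloaded page in your default browser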
With the sel shortcut, you can extract the links of all articles on the page via an XPath expression:

In [9]: sel.xpath("//span[@class='link_title']/a/@href").extract()
Out[9]:
[u'/php_fly/article/details/19364913',
 u'/php_fly/article/details/18155421',
 u'/php_fly/article/details/17629021',
 u'/php_fly/article/details/17619689',
 u'/php_fly/article/details/17386163',
 u'/php_fly/article/details/17266889',
 u'/php_fly/article/details/17172381',
 u'/php_fly/article/details/17171985',
 u'/php_fly/article/details/17145295',
 u'/php_fly/article/details/17122961',
 u'/php_fly/article/details/17117891',
 u'/php_fly/article/details/14533681',
 u'/php_fly/article/details/13162011',
 u'/php_fly/article/details/12658277',
 u'/php_fly/article/details/12528391',
 u'/php_fly/article/details/12421473',
 u'/php_fly/article/details/12319943',
 u'/php_fly/article/details/12293587',
 u'/php_fly/article/details/12293381',
 u'/php_fly/article/details/12289803']
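Along the same lines, you could grab the visible title text rather than the href attribute; a small sketch assuming the same link_title markup as above:

In [10]: sel.xpath("//span[@class='link_title']/a/text()").extract()

which returns the article titles as a list of unicode strings.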
You can also modify the request inside the shell and fetch it again, for example switching the HTTP method to POST:

>>> request = request.replace(method="POST")
>>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...
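replace() accepts any Request constructor argument, so headers or the request body can be swapped the same way (the header and body values here are purely illustrative):

>>> request = request.replace(headers={"User-Agent": "my-test-agent"}, body="key=value")
>>> fetch(request)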
Invoking the shell from spiders to inspect responses

Sometimes you want to inspect a response at a particular point in your spider. You can do this by calling scrapy.shell.inspect_response from a callback:

from scrapy.spider import Spider


class MySpider(Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response)
        # Rest of parsing code.

When you start the spider, the console will print information similar to the following:
2014-02-20 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-02-20 17:48:31-0400 [myspider] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
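Inside this embedded shell you can try extraction expressions against the exact response the spider received; for example (the XPath here is only illustrative):

>>> sel.xpath("//title/text()").extract()   # test an expression against this exact response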
Note: the fetch shortcut cannot be used in a shell launched this way, because the Scrapy engine is blocked by the shell. However, when you exit the shell, the spider will continue crawling from where it stopped.
Author: 曾是土木人 (http://blog.csdn.net/php_fly)
Original article: http://blog.csdn.net/php_fly/article/details/19555969
Reference: Scrapy shell