Today I tried logging into Amazon's site with Scrapy. I started by searching around, mostly in the official Scrapy docs, but the material there is fragmented and there is no complete end-to-end example. So here is a fairly complete one, in the hope that it saves you the painful fiddling I went through:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import FormRequest, Request
from scrapy import log
class AmazonSpider(CrawlSpider):
    name = 'AmazonSpider'
    allowed_domains = ['amazon.cn']
    #start_urls = ['http://associates.amazon.cn/gp/associates/network/main.html']
    start_urls = ['https://associates.amazon.cn/gp/associates/network/reports/report.html']

    def __init__(self, username, password, *args, **kwargs):
        super(AmazonSpider, self).__init__(*args, **kwargs)
        self.http_user = username
        self.http_pass = password
        # login form fields
        self.formdata = {
            'create': '0',
            'email': self.http_user,
            'password': self.http_pass,
        }
        self.headers = {
            'Accept-Charset': 'GBK,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'gzip,deflate,sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
        }
        self.id = 0

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            yield FormRequest(url,
                              meta={'cookiejar': i},
                              #formdata=self.formdata,
                              headers=self.headers,
                              callback=self.login)  # jump to the login page

    def _log_page(self, response, filename):
        with open(filename, 'w') as f:
            f.write("%s\n%s\n%s\n" % (response.url, response.headers, response.body))

    def login(self, response):
        self._log_page(response, 'amazon_login.html')
        # fill in and submit the login form found in the response
        return [FormRequest.from_response(response,
                                          formdata=self.formdata,
                                          headers=self.headers,
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          callback=self.parse_item)]  # logged in

    def parse_item(self, response):
        self._log_page(response, 'after_login.html')
        hxs = HtmlXPathSelector(response)
        report_urls = hxs.select('//div[@id="menuh"]/ul/li[4]/div//a/@href').extract()
        for report_url in report_urls:
            #print "list:" + report_url
            # _ab_path resolves the relative link to an absolute URL (helper not shown)
            yield Request(self._ab_path(response, report_url),
                          headers=self.headers,
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_report)

    def parse_report(self, response):
        self.id += 1
        self._log_page(response, "%d.html" % self.id)
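The key call in the spider is FormRequest.from_response, which reads the form out of the login page and merges our formdata over the pre-filled fields (such as hidden tokens) before submitting. A minimal sketch of that merge step using only the standard library's html.parser — the form HTML and field names below are made up for illustration, not Amazon's real ones:

```python
from html.parser import HTMLParser

class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> fields in a page, roughly the
    first thing FormRequest.from_response does with the login page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            a = dict(attrs)
            if 'name' in a:
                self.fields[a['name']] = a.get('value', '')

# A made-up login page with a hidden field, like the 'create' field above.
login_page = """
<form action="/ap/signin" method="post">
  <input type="hidden" name="create" value="0">
  <input type="text" name="email" value="">
  <input type="password" name="password" value="">
</form>
"""

parser = FormFieldCollector()
parser.feed(login_page)

# Our formdata overrides the page's defaults, as from_response does.
formdata = {'email': 'user@example.com', 'password': 'secret'}
submitted = {**parser.fields, **formdata}
print(submitted)
# {'create': '0', 'email': 'user@example.com', 'password': 'secret'}
```

Since the credentials come in through __init__, the spider itself can be launched with Scrapy's spider-argument flag: scrapy crawl AmazonSpider -a username=... -a password=...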
One thing to note: every request must carry the cookies returned by the previous response; that is what keeps the session alive.
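The 'cookiejar' meta key tells Scrapy's cookie middleware which session jar a request belongs to; as long as each Request copies it from the previous response's meta, cookies flow through the whole chain. A toy model of that bookkeeping, with plain dicts standing in for real cookie storage (all names here are invented for illustration):

```python
# Toy model: one jar per session id, each jar a dict of cookie name -> value.
jars = {}

def send_request(jar_id, set_cookies=None):
    """Simulate one request/response cycle: attach the jar's cookies to the
    outgoing request, then merge any Set-Cookie values from the response."""
    jar = jars.setdefault(jar_id, {})
    sent = dict(jar)               # cookies that go out with the request
    if set_cookies:
        jar.update(set_cookies)    # cookies that came back in the response
    return sent

# First hit: no cookies yet; the server sets a session id.
assert send_request(0, {'session-id': 'abc123'}) == {}
# Second hit (the form submit): the session cookie rides along.
assert send_request(0, {'x-main': 'token'}) == {'session-id': 'abc123'}
# A different jar id is an independent session, which is why the spider
# threads response.meta['cookiejar'] through every request it makes.
assert send_request(1) == {}
```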
I go straight to https://associates.amazon.cn/gp/associates/network/reports/report.html because the https://associates.amazon.cn page has two forms, and hitting it directly kept redirecting me to another page, https://affiliate-program.amazon.com. I also noticed that when I am not logged in, Amazon redirects me to the login page, and after a successful login it sends me back to the page I originally asked for. So I simply request the page I actually want, then submit the form on the page the 302 redirect lands me on (the login page); that logs me in directly, and from there I can crawl whatever I want. Throughout the crawl, remember to carry the response's cookies; otherwise the session breaks and you get bounced back to the login page.
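The redirect dance described above can be modeled as a tiny state machine: request a protected URL, get bounced to the login page with the original URL remembered, submit the form, and land back where you started. A sketch in pure Python, with all names invented:

```python
def serve(url, logged_in, pending=None):
    """Toy server: unauthenticated requests 302 to the login page and
    remember the originally requested URL; once logged in, pages serve."""
    if not logged_in:
        return ('302', '/login', url)          # status, location, remembered url
    return ('200', url, None)

def login_and_fetch(url):
    status, page, pending = serve(url, logged_in=False)
    assert (status, page) == ('302', '/login')  # bounced to the login form
    # Submitting the form logs us in; the server sends us back to `pending`.
    status, page, _ = serve(pending, logged_in=True)
    return status, page

print(login_and_fetch('/gp/associates/network/reports/report.html'))
# ('200', '/gp/associates/network/reports/report.html')
```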