Batch-Crawling CSDN Blog Posts with Python Scrapy

Today I suddenly felt like crawling the content of all the blog posts I've written, which also makes for good Scrapy practice. The targets: each post's title, URL, and body text:

I created the spider with the scrapy genspider -t crawl command and then edited the generated spider file. The main code is simple:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BlogSpider(CrawlSpider):
    name = 'blog'
    allowed_domains = ['csdn.net']
    # Pages 1-5 of the blog's article list
    start_urls = ['https://blog.csdn.net/weixin_44521703/article/list/{}?'.format(i) for i in range(1, 6)]

    rules = (
        # Follow every article-detail link found on the list pages
        Rule(LinkExtractor(allow=r'https://blog\.csdn\.net/weixin_44521703/article/details/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['name'] = response.xpath('//title/text()').extract_first()
        item['url'] = response.url
        # All text nodes inside the article body, flattened into one string
        item['text'] = response.xpath('//div[@id="content_views"]//text()').extract()
        item['text'] = ''.join(item['text']).replace('\n', '')
        yield item
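
The spider yields plain dicts, which Scrapy accepts as-is. If you would rather declare the fields up front, a minimal items.py sketch could look like this (BlogItem is a hypothetical name of my own; the fields simply mirror the dict keys above):

import scrapy


class BlogItem(scrapy.Item):
    # One field per key produced in parse_item
    name = scrapy.Field()
    url = scrapy.Field()
    text = scrapy.Field()

parse_item would then build item = BlogItem() instead of {}, and everything downstream stays the same, since Scrapy Items behave like dicts.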

We enable ITEM_PIPELINES in settings.py and set a download delay, so we aren't unfriendly to the server~~:

ITEM_PIPELINES = {
   'CSDN.pipelines.CsdnPipeline': 300,
}
LOG_LEVEL = 'WARNING'    # keep console output quiet
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 3       # wait 3 seconds between requests
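
Besides running scrapy crawl blog from the command line, the spider can also be launched from a short script. Here is a sketch using Scrapy's CrawlerProcess (run it from the project root so the project settings above are picked up):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project settings (pipelines, delay, etc.) and run the 'blog' spider
process = CrawlerProcess(get_project_settings())
process.crawl('blog')
process.start()   # blocks until the crawl finishes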

The pipeline then receives each item and stores it in MongoDB:

import pymongo
import pprint


class CsdnPipeline(object):
    def __init__(self):
        # Connect to a local MongoDB on the default host/port
        self.client = pymongo.MongoClient()
        self.db = self.client['CSDN']
        self.collection = self.db['Blog']

    def process_item(self, item, spider):
        # Insert a copy: insert_one() adds an _id field to the dict it receives
        self.collection.insert_one(dict(item))
        pprint.pprint(item)
        return item

    def close_spider(self, spider):
        self.client.close()
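
To sanity-check what actually landed in the database, here is a quick read-back sketch (assuming the same local MongoDB, database CSDN, and collection Blog as above):

import pymongo

client = pymongo.MongoClient()
collection = client['CSDN']['Blog']

# Count the stored posts and print their titles and URLs
print(collection.count_documents({}))
for doc in collection.find({}, {'name': 1, 'url': 1, '_id': 0}):
    print(doc['name'], doc['url'])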

Here are the crawl results:

(Figure 1: screenshot of the crawled results)
In total I ran the spider once as a test and once for the actual crawl, which probably gave each post 2 extra page views. CSDN seems to be able to detect IPs, though, and I didn't use rotating proxies, so don't use this to pad your view counts~~
