A Complete Record of Deploying Crawlab, a Crawler Management and Scheduling Platform, to Production

Preface

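When the business is small, the crawler scripts we write can simply be run by hand on a single local machine. Once the workload grows, with crawl jobs that must finish automatically on schedule and thousands of them to manage, a crawler management and scheduling platform becomes necessary.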

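Our company's production environment currently uses SpiderKeeper to manage crawl jobs. Its main drawback is that once the number of jobs grows, jobs stop running on schedule and the scheduler easily becomes blocked. To stop getting up in the middle of the night every day to manually run the blocked crawl jobs, replacing SpiderKeeper has become urgent.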

A comparison of the main crawler management platforms (quoted from the Crawlab GitHub repository):

| Framework | Technology | Pros | Cons |
| --- | --- | --- | --- |
| Crawlab | Golang + Vue | Not limited to Scrapy, available for all programming languages and frameworks. Beautiful UI interface. Naturally support distributed spiders. Support spider management, task management, cron job, result export, analytics, notifications, configurable spiders, online code editor, etc. | Not yet support spider versioning |
| ScrapydWeb | Python Flask + Vue | Beautiful UI interface, built-in Scrapy log parser, stats and graphs for task execution, support node management, cron job, mail notification, mobile. Full-feature spider management platform. | Not support spiders other than Scrapy. Limited performance because of Python Flask backend. |
| Gerapy | Python Django + Vue | Gerapy is built by web crawler guru Germey Cui. Simple installation and deployment. Beautiful UI interface. Support node management, code edit, configurable crawl rules, etc. | Again not support spiders other than Scrapy. A lot of bugs based on user feedback in v1.0. Look forward to improvement in v2.0 |
| SpiderKeeper | Python Flask | Open-source Scrapyhub. Concise and simple UI interface. Support cron job. | Perhaps too simplified, not support pagination, not support node management |
