
Scrapy: resuming interrupted crawls and running multiple spiders at once


# Custom Scrapy command that runs every spider in the project.
# Resume an interrupted crawl:  scrapy crawl spider_name -s JOBDIR=crawls/spider_name
# Run all spiders:              scrapy crawlall
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Note: newer Scrapy releases expose the spider loader as
        # self.crawler_process.spider_loader instead of .spiders.
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            # Schedule every spider, then start the reactor once for all of them.
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()

Running multiple spiders at the same time

For scrapy crawlall to be recognized, the project settings must declare COMMANDS_MODULE = 'project.commands', pointing at the package that contains the command file above (a minimal layout sketch follows below).

Then run: scrapy crawlall

How it works: the command reads the user-initialized crawler_process.spiders registry to get the name of every spider in the project, then iterates over that list, schedules a crawl for each name, and finally starts them all together.
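As a rough illustration, assuming the Scrapy project package is called myproject and the command class above is saved as myproject/commands/crawlall.py (both names are made up for this example), the registration would look roughly like this:

# myproject/commands/__init__.py  -- may stay empty; it only marks the package

# myproject/settings.py
# Point Scrapy at the extra command package; the file name crawlall.py
# becomes the sub-command name, i.e. "scrapy crawlall".
COMMANDS_MODULE = 'myproject.commands'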

Resuming an interrupted crawl

# Resume an interrupted crawl: scrapy crawl spider_name -s JOBDIR=crawls/spider_name


Run this command in the terminal. The JOBDIR setting tells Scrapy to persist the request queue and the set of already-seen requests to that directory, so the crawl can be stopped and later resumed from where it left off.

For details, see the developer documentation:

https://doc.scrapy.org/en/latest/topics/jobs.html?highlight=jobdir
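The same page also covers keeping spider state between runs: when JOBDIR is set, whatever the spider stores in its self.state dict is serialized on shutdown and restored on the next run. A minimal sketch, with a made-up spider name and URL:

import scrapy


class PersistentSpider(scrapy.Spider):
    # Hypothetical spider, used only to illustrate JOBDIR state persistence.
    name = 'persistent_example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # self.state is a plain dict; with JOBDIR set, Scrapy pickles it when
        # the crawl stops and loads it back when the crawl is resumed.
        self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1
        yield {'url': response.url, 'pages_seen_so_far': self.state['pages_seen']}

Start it with scrapy crawl persistent_example -s JOBDIR=crawls/persistent_example, stop it gracefully with a single Ctrl-C (a second Ctrl-C forces an unclean shutdown), and re-run the same command to resume with the queue, the duplicate filter and self.state intact.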

