In this section we are going to scrape the information of the Top 100 movies on Maoyan. First, visit the target site and take a look at where the information we need sits on the page and how it is structured.
(Figure: the Maoyan Top 100 list page)
After looking at the page, the plan is fairly clear: use the requests library to fetch one index page, match the fields we need with regular expressions, serialize each record as a JSON string and append it to a file, then loop over the ten pages, or use multiple threads in Python to speed the crawl up.
2. Code Implementation
spider.py
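The spider itself is short; what follows is a minimal sketch of the approach just described. The board URL (https://maoyan.com/board/4 with an offset parameter), the regular expression, and the result file name are my assumptions about the page at the time of writing, not guaranteed details.

```python
import json
import re
import requests

# Assumed URL pattern of the Maoyan Top 100 board; each page lists 10 movies
BASE_URL = 'https://maoyan.com/board/4?offset={}'
HEADERS = {'User-Agent': 'Mozilla/5.0'}  # a browser-like UA, to look less like a bot

def get_one_page(offset):
    """Fetch the HTML of one index page, or None on failure."""
    resp = requests.get(BASE_URL.format(offset), headers=HEADERS, timeout=10)
    return resp.text if resp.status_code == 200 else None

def parse_one_page(html):
    """Match rank, title, actors and release date with one regex over the page."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?name"><a.*?>(.*?)</a>'
        r'.*?star">(.*?)</p>.*?releasetime">(.*?)</p>', re.S)
    for rank, title, actors, release in pattern.findall(html):
        yield {'rank': rank, 'title': title.strip(),
               'actors': actors.strip(), 'release': release.strip()}

def save_to_file(item):
    """Append one movie as a JSON string, one record per line."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

if __name__ == '__main__':
    # Ten pages in a plain loop; a thread pool could parallelize this step
    for offset in range(0, 100, 10):
        html = get_one_page(offset)
        if html:
            for item in parse_one_page(html):
                save_to_file(item)
```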
3. Running Result
(Figure: output of the run)
0X02 Simulating Ajax Requests to Scrape Toutiao Street-Photography Images
1. Analyzing the Page and Planning the Approach
First, open the Toutiao street-photography (街拍) page. The links to the detail pages cannot be found directly in the page source, so we need to inspect the Ajax requests and see whether the content is loaded that way. Open the browser console and filter for XHR requests, and something turns up, as shown below:
(Figure: the filtered XHR requests in the browser console)
In the XHR request with offset 0, the data field of the response clearly shows the detail-page entries we are looking for. As the scroll bar is dragged down, the page keeps firing new XHR requests, with offset growing by 10 each time. In other words, the index is requested via Ajax, and every returned JSON document carries 10 detail-page records; that is how the page keeps feeding us new street-photography entries.
With the entries in hand, we still need to open each one to get the actual photos, so let's see how the images inside an entry are loaded, as shown below:
(Figures: a detail page and the requests that load its images)
Clearly, the real image URLs are exposed through a JS variable in the page, so we consider extracting them with a regular expression. Also, the page's first title tag carries the detail page's name, which we can pull out with BeautifulSoup.
Approach summary:
(1) Use the requests library to request the site and obtain the JSON returned by the index page (the Ajax request URL).
(2) Extract the detail-page URLs from the index response and go on to fetch each detail page.
(3) Match the image links in the detail page with a regex, download them locally, and save the page info together with the image URLs to a local MongoDB.
(4) Loop the crawl over multiple index pages, parallelizing it (a multiprocessing Pool in the code below) to improve efficiency.
2. Code Implementation
config.py
```python
MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
GROUP_START = 0
GROUP_END = 5
KEYWORD = '街拍'
IMAGE_DIR = 'DOWNLOADED'
```
spider.py
```python
import requests
import re
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import json
from requests.exceptions import RequestException
from config import *
import pymongo
import os
from hashlib import md5
from multiprocessing import Pool

# Declare the MongoDB database object
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]
```
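Only the imports and the database declaration survive above, so here is a minimal sketch of the remaining functions, continuing spider.py and following the four steps just listed. The Ajax URL and its parameters, the article_url and gallery JSON fields, and the .jpg extension are assumptions about the interface as observed in the console at the time:

```python
def get_index_page(offset, keyword):
    """Step 1: request one index page through the Ajax interface."""
    params = {'offset': offset, 'format': 'json', 'keyword': keyword,
              'autoload': 'true', 'count': 10}
    url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        resp = requests.get(url)
        return resp.text if resp.status_code == 200 else None
    except RequestException:
        return None

def parse_index_page(text):
    """Step 2: yield the detail-page URLs contained in the returned JSON."""
    data = json.loads(text)
    for item in data.get('data') or []:
        if item.get('article_url'):
            yield item['article_url']

def parse_detail_page(html, url):
    """Step 3: the title via BeautifulSoup, image URLs via a regex on a JS variable."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string if soup.title else ''
    result = re.search(r'gallery: JSON\.parse\("(.*?)"\)', html, re.S)
    images = []
    if result:
        try:
            gallery = json.loads(result.group(1).replace('\\', ''))
            images = [img.get('url') for img in gallery.get('sub_images', [])]
        except json.JSONDecodeError:
            pass
    return {'title': title, 'url': url, 'images': images}

def save_image(content):
    """Name each downloaded image by the MD5 of its bytes to avoid duplicates."""
    os.makedirs(IMAGE_DIR, exist_ok=True)
    path = os.path.join(IMAGE_DIR, md5(content).hexdigest() + '.jpg')
    if not os.path.exists(path):
        with open(path, 'wb') as f:
            f.write(content)

def save_to_mongo(result):
    db[MONGO_TABLE].insert_one(result)

def main(offset):
    text = get_index_page(offset, KEYWORD)
    if not text:
        return
    for url in parse_index_page(text):
        result = parse_detail_page(requests.get(url).text, url)
        for image_url in result['images']:
            save_image(requests.get(image_url).content)
        save_to_mongo(result)

if __name__ == '__main__':
    # Step 4: crawl index pages GROUP_START..GROUP_END with a process pool
    pool = Pool()
    pool.map(main, [x * 10 for x in range(GROUP_START, GROUP_END + 1)])
```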
3. Running Result
(Figure: output of the run)
0X03 Using Selenium to Simulate a Browser and Scrape Taobao Food Listings
As everyone knows, Taobao's pages are very complex; the approach above of simulating Ajax requests and parsing the returned JSON no longer works well here. So we bring out the ultimate weapon: Selenium. This library can drive a browser through its driver, or PhantomJS, to issue requests. With it we can steer a browser from a script, so dynamically loaded data no longer has to be fetched by hand, which is very convenient.
1. Analyzing the Page and Planning the Approach
Open Taobao, type "美食" (food), and press Enter.
(Figure: Taobao search results for "美食")
We want the images loaded on the page, but when we inspect the result of the page's original request, the footer code shows up after barely any scrolling, while the actual page body is nowhere to be seen. I dug through the XHR requests and it still wasn't obvious. In a case like this, to lighten the scraping work we can use Selenium with ChromeDriver to obtain the fully rendered page and then extract what we need from its source, which makes everything easy.
Approach summary:
(1) Use Selenium with ChromeDriver to open Taobao, enter the search keyword "美食", and obtain the product list.
(2) Read the page count and simulate mouse clicks to fetch the products on the following pages.
(3) Parse the page source with PyQuery to get each product's details.
(4) Store the product records in the MongoDB database.
2. Code Implementation
config.py
```python
MONGO_URL = 'localhost'
MONGO_DB = 'taobao'
MONGO_TABLE = 'product'
```
spider.py
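The original listing was lost here, so below is a minimal sketch of the four steps, assuming Chrome with ChromeDriver. Every CSS selector (#q, .btn-search, .total, .J_Input, .m-itemlist ...) is a guess at Taobao's markup at the time and will need adjusting when the page changes:

```python
import re
import pymongo
from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from config import *

client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)

def search(keyword):
    """Step 1: open Taobao, type the keyword, and submit the search."""
    browser.get('https://www.taobao.com')
    box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#q')))
    submit = wait.until(EC.element_to_be_clickable(
        (By.CSS_SELECTOR, '#J_TSearchForm .btn-search')))
    box.send_keys(keyword)
    submit.click()
    # Read the total page count, e.g. "共 100 页,"
    total = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '.total'))).text
    return int(re.search(r'(\d+)', total).group(1))

def next_page(page_number):
    """Step 2: jump to a given page via the page-number input box."""
    box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.J_Input')))
    submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.J_Submit')))
    box.clear()
    box.send_keys(page_number)
    submit.click()

def get_products():
    """Step 3: parse the rendered source with PyQuery; Step 4: save to MongoDB."""
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '.m-itemlist .items .item')))
    doc = pq(browser.page_source)
    for item in doc('.m-itemlist .items .item').items():
        product = {
            'image': item.find('.pic .img').attr('src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
        }
        db[MONGO_TABLE].insert_one(product)

if __name__ == '__main__':
    total = search('美食')
    for page in range(2, total + 1):
        next_page(page)
        get_products()
    browser.quit()
```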
3. Running Result
(Figure: output of the run)
4. Remaining Problems
In fact, this script is not fully automated: when the Taobao page opened by selenium + ChromeDriver runs a search, a login dialog pops up, and we have to log in manually before the crawl can proceed. That sounds harmless, since logging in these days is just scanning a QR code, but it means we cannot crawl silently with Chrome's headless mode, which is irritating, so the code needs further improvement.
5. Attempted Solution
My thinking on the headless problem is this: since we want to log in by QR code, a visible window is required, but that window is only ever used for logging in. So I use two drivers. One is dedicated to logging in and saves the post-login cookies into a file; the other, which does the crawling, runs in headless mode and reads the locally stored cookies in a loop before visiting the site. That solves the problem quite elegantly. The improved code follows:
spider.py
```python
# Database configuration
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]
# Global settings
```
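Only a fragment of the improved script survives above, so here is a minimal sketch of the two-driver cookie hand-off just described. The cookie file name and the manual wait for the QR scan are my assumptions:

```python
import json
from selenium import webdriver

COOKIE_FILE = 'cookies.json'  # hypothetical file name for the saved cookies

def login_and_save_cookies():
    """Driver 1 (visible): let the user scan the QR code, then persist cookies."""
    browser = webdriver.Chrome()
    browser.get('https://login.taobao.com')
    input('Scan the QR code to log in, then press Enter here...')
    with open(COOKIE_FILE, 'w', encoding='utf-8') as f:
        json.dump(browser.get_cookies(), f)
    browser.quit()

def headless_browser():
    """Driver 2 (headless): load the saved cookies, then crawl as usual."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    browser = webdriver.Chrome(options=options)
    # The domain must already be open before add_cookie() is allowed
    browser.get('https://www.taobao.com')
    with open(COOKIE_FILE, encoding='utf-8') as f:
        for cookie in json.load(f):
            cookie.pop('expiry', None)  # some driver versions reject this field
            browser.add_cookie(cookie)
    browser.get('https://www.taobao.com')  # reload, now logged in
    return browser
```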
0X04 Maintaining a Proxy Pool with Flask + Redis
1. Why Maintain a Proxy Pool
We know that many websites have anti-crawling mechanisms, so we need to disguise our IP. For the same reason there are many free proxy IPs available online, but their quality varies widely and they need further filtering, so it is worth maintaining our own pool of proxies that actually work; that is the goal of this section. Redis is used to store the proxy IP records, and Flask mainly provides a convenient query interface.
2. Basic Requirements for the Proxy Pool
(1) Crawl from multiple sites, and check the proxies asynchronously
(2) Filter on a schedule so the pool stays fresh
(3) Expose an interface that makes proxies easy to fetch
3. Architecture of the Proxy Pool
(Figure: architecture of the proxy pool)
4. Code Implementation
(1) Entry point: run.py
```python
import io
import sys
from scheduler import Scheduler

# Force UTF-8 output so the log prints cleanly regardless of console encoding
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

def main():
    try:
        # The scheduler spins up the whole proxy-pool framework
        s = Scheduler()
        s.run()
    except Exception:
        main()  # restart from scratch if anything goes wrong

if __name__ == '__main__':
    main()
```
(2) Scheduling center: scheduler.py
```python
import time
from multiprocessing import Process
from api import app
from getter import Getter
from test import Tester  # the testing module test.py shown below
from config import *

class Scheduler():
    def schedule_tester(self, cycle=TESTER_CYCLE):
        """Test the stored proxies on a fixed cycle."""
        tester = Tester()
        while True:
            print('Tester starts running')
            tester.run()
            time.sleep(cycle)

    def schedule_getter(self, cycle=GETTER_CYCLE):
        """Fetch new proxies on a fixed cycle."""
        getter = Getter()
        while True:
            print('Starting to crawl proxies')
            getter.run()
            time.sleep(cycle)

    def schedule_api(self):
        """Start the API server."""
        app.run(API_HOST, API_PORT)

    def run(self):
        print('Proxy pool starts running')
        # Run the three key components as separate processes
        if TESTER_ENABLED:
            # The tester checks whether stored IPs are still usable
            tester_process = Process(target=self.schedule_tester)
            tester_process.start()
        if GETTER_ENABLED:
            # The getter crawls proxy IPs from the source sites
            getter_process = Process(target=self.schedule_getter)
            getter_process.start()
        if API_ENABLED:
            # The API exposes the pool over HTTP, backed by the database
            api_process = Process(target=self.schedule_api)
            api_process.start()
```
(3) Fetching proxy IPs
getter.py
```python
import sys
from crawler import Crawler
from db import RedisClient  # assumed module name for the Redis wrapper
from config import POOL_UPPER_THRESHOLD

class Getter():
    def __init__(self):
        self.redis = RedisClient()
        self.crawler = Crawler()

    def is_over_threshold(self):
        """Check whether the pool has reached its size limit."""
        if self.redis.count() >= POOL_UPPER_THRESHOLD:
            return True
        else:
            return False

    def run(self):
        print('Getter starts running')
        if not self.is_over_threshold():
            # Use the attributes set by the metaclass (the list of crawl
            # methods and their count) to call each proxy source in turn
            for callback_label in range(self.crawler.CrawlFuncCount):
                callback = self.crawler.CrawlFunc[callback_label]
                # Fetch proxies from this source
                proxies = self.crawler.get_proxies(callback)
                sys.stdout.flush()
                for proxy in proxies:
                    self.redis.add(proxy)
```
crawler.py
```python
# A metaclass intercepts class creation and adds two attributes to the class:
# CrawlFunc, a list recording every crawl method's name, and
# CrawlFuncCount, the number of crawl methods that were defined
```
Key techniques explained:
Although I covered the key points in the comments, this technique is important enough to spell out once more.
(1) Coordinating many crawlers
Because there are many websites offering proxy IPs, we end up writing many crawlers, so how to schedule them becomes an important question. The ideal is to call one site at a time and have it hand back proxy IPs one at a time for storage in the database. The first thing that comes to mind is to build every crawler around yield as its return mechanism: this not only enforces a uniform, self-defined return format, but also lets each call return one item and resume from the same spot on the next call.
Beyond that, running the crawlers together also requires a uniform calling interface. This is implemented callback-style: the method name is passed in as a parameter and executed through eval(), as sketched after the metaclass discussion below.
(2) Dynamically discovering method names and counts
This part is the more magical one, and the key thing to learn here: a metaclass is used to hijack the construction of the class and attach the needed attributes to it. In Python everything is an object, and a metaclass, put simply, is the object that creates classes. Let's look at the code again:
```python
class ProxyMetaclass(type):
    def __new__(cls, name, bases, attrs):
        count = 0
        attrs['CrawlFunc'] = []
        for k, v in attrs.items():
            if 'crawl_' in k:
                attrs['CrawlFunc'].append(k)
                count += 1
        attrs['CrawlFuncCount'] = count
        return type.__new__(cls, name, bases, attrs)
```
Explanation:
__new__ is a special method invoked before __init__; it creates the object and returns it. Its parameters are as follows:
```python
# cls:   the class object that is about to be created
# name:  the name of the class
# bases: the collection of the class's parent classes
# attrs: the class's attributes and methods, as a dict
```
Through attrs we can reach all of the class's attributes and methods, so all we need is a uniform naming convention for the methods we want: here every crawl method's name starts with the string crawl_, which lets us collect and count them quickly, as the sketch below shows.
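Putting the two techniques together, here is a minimal sketch of a Crawler that the metaclass above can introspect. The proxy source URL and its table layout are placeholders, not a real site:

```python
import requests
from pyquery import PyQuery as pq

class Crawler(object, metaclass=ProxyMetaclass):
    def get_proxies(self, callback):
        """Uniform entry point: call a crawl method by its name string."""
        proxies = []
        # eval() turns the name, e.g. 'crawl_example_site', into a method call
        for proxy in eval('self.{}()'.format(callback)):
            print('Got proxy', proxy)
            proxies.append(proxy)
        return proxies

    def crawl_example_site(self):
        """Hypothetical source: yield proxies one by one in ip:port form."""
        html = requests.get('http://example.com/free-proxies').text
        doc = pq(html)
        for row in doc('table tr').items():
            ip = row('td:nth-child(1)').text()
            port = row('td:nth-child(2)').text()
            if ip and port:
                yield ':'.join([ip, port])
```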
(4) Testing module: test.py
Explanation:
For high-latency IO work like crawling, coroutines are a powerful tool with many advantages: when one operation blocks, the current task can be suspended while another runs, with events registered on a single loop, giving concurrent execution of many tasks. It reportedly scales past the C10K limit, though I have never pushed it to that extreme.
A quick word on basic coroutine usage. Suppose you crawl a site with 100 pages. Normally you send one request and wait for it to return before sending the next, which is very inefficient. With coroutines you can issue all 100 requests (strictly speaking, still one after another); the difference is that a coroutine never sits waiting for a response: it sends one request, suspends, sends the next, suspends again, until 100 are in flight, and then waits on all 100 responses at once, a roughly hundredfold gain. Think of it as doing 100 things at the same time; compared with multithreading, the scheduling is done by your own program instead of being handed to the OS, so the control flow stays predictable, resources are saved, and efficiency improves enormously.
The concrete usage is already covered in the code comments above; here is a brief recap of the key steps, followed by a consolidated sketch:
1. Define a connector and disable SSL verification.
2. Create a session object.
3. Use the created session object to request the target site:
```python
async with session.get(TEST_URL, proxy=real_proxy, timeout=15,
                       allow_redirects=False) as response:
```
4. Create an event loop with asyncio.get_event_loop:
```python
loop = asyncio.get_event_loop()
```
5. Bundle the individual tasks together:
```python
tasks = [self.test_single_proxy(proxy) for proxy in test_proxies]
```
6. run_until_complete registers the coroutines on the event loop and starts it, running the tasks concurrently:
```python
loop.run_until_complete(asyncio.wait(tasks))
```
(5) External interface: api.py
```python
from flask import Flask, g
from db import RedisClient  # assumed module name for the Redis wrapper

__all__ = ['app']
app = Flask(__name__)

def get_conn():
    if not hasattr(g, 'redis'):
        g.redis = RedisClient()
    return g.redis

@app.route('/')
def index():
    return '<h2>Welcome to Proxy Pool System</h2>'

# External endpoint: return a random proxy straight from the database
@app.route('/random')
def get_proxy():
    """
    Get a proxy
    :return: a random proxy
    """
    conn = get_conn()
    return conn.random()

# External endpoint: return the number of proxies in the pool
@app.route('/count')
def get_counts():
    """
    Get the count of proxies
    :return: total size of the pool
    """
    conn = get_conn()
    return str(conn.count())

if __name__ == '__main__':
    app.run()
```