One Technique to Beat Them All: A Web Scraper

steveay, 2020/01/17

Python is what drew me into the mysterious and magnificent world of code, and within Python the first technique I picked up was web scraping.

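Scraping techniques have evolved almost in lockstep with front-end technology and the anti-scraping measures that come with it. Like black-hat and white-hat hackers on the network, it is a constant contest between spear and shield.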

But technology itself is innocent. In this post I describe what is, to my mind, the one scraping method that beats them all.

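Selenium is a big name in the scraping world. It drives a real browser automatically to collect information, which makes it a very convenient approach. Its drawback is that it is slow, but if you only use it to obtain cookies and a small amount of essential data, it is an excellent choice.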

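But tall trees catch the wind. Front-end engineers at the big companies have certainly noticed Selenium scrapers, and anti-Selenium detection keeps getting stricter, so plain out-of-the-box Selenium often can no longer scrape the data. Still, while the shield improves by the day, the spear keeps getting sharper too.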

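Here is a way to have Selenium take over a browser that you have already opened yourself. This effectively keeps the page from detecting Selenium and blocking the scraper.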

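First, open a terminal and run the following. The path shown is Chrome's default location on macOS, so adjust it for your system. Note that `~` is not expanded inside double quotes, so `$HOME` is used instead.

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir="$HOME/ChromeProfile"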

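This starts Chrome with remote debugging enabled on port 9222 and points it at a dedicated profile folder on your machine, which Chrome creates if it does not exist, so the session lives in a fresh environment separate from your everyday profile.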
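If you would rather start the browser from Python, the launch step can be sketched roughly as follows. This is only a minimal sketch: the helper names (`build_chrome_command`, `is_port_open`, `launch_chrome`) are made up for illustration, and `CHROME_BINARY` assumes the default macOS install location.

```python
import os
import socket
import subprocess

# Assumption: default Chrome location on macOS; change this for your system.
CHROME_BINARY = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def build_chrome_command(port=9222, profile_dir="~/ChromeProfile", binary=CHROME_BINARY):
    """Return the argument list that starts Chrome with remote debugging enabled."""
    # Expand "~" here, because Chrome receives the argument without shell expansion.
    profile = os.path.expanduser(profile_dir)
    return [binary, f"--remote-debugging-port={port}", f"--user-data-dir={profile}"]


def is_port_open(host="127.0.0.1", port=9222, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def launch_chrome(port=9222, profile_dir="~/ChromeProfile"):
    """Start Chrome unless the debugging port is already in use; return the process or None."""
    if is_port_open(port=port):
        return None  # a browser is probably already listening, so reuse it
    return subprocess.Popen(build_chrome_command(port, profile_dir))
```

Calling `launch_chrome()` once before attaching Selenium should be enough. Checking the port first avoids starting a second browser on the same profile.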

You also need to find the path where the WebDriver executable (chromedriver) is installed on your machine.
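One way to look the path up, rather than hard-coding it, is to search `PATH`. This is a small sketch with a hypothetical helper name, `find_chromedriver`. The fallback value is simply the path used in this post.

```python
import shutil


def find_chromedriver(name="chromedriver", fallback="/usr/local/bin/chromedriver"):
    """Return the driver path found on PATH, or the fallback if none is found."""
    # shutil.which returns None when the executable is not on PATH.
    return shutil.which(name) or fallback


print(find_chromedriver())
```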
Then, in your scraper script, add the following setup:

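from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains

options = webdriver.ChromeOptions()  # I use Chrome
# Attach to the Chrome instance already listening on port 9222 instead of launching a new one
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
# Left commented out so that no error is raised
# options.add_experimental_option('excludeSwitches', ['enable-automation'])
chromedriver_path = "/usr/local/bin/chromedriver"  # path to your chromedriver executable
# Selenium 4 style; on Selenium 3 use webdriver.Chrome(executable_path=chromedriver_path, options=options)
browser = webdriver.Chrome(service=Service(chromedriver_path), options=options)
url = ''  # fill in the page you want to scrape
browser.get(url)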

After that, you can happily scrape away.

For example, you can save the cookies to a file and then fetch the data with an ordinary, non-browser scraper, which sidesteps Selenium's slowness.

# Save the cookies to a file
import json

cookies = {}
cks = browser.get_cookies()  # list of dicts, one per cookie
for ck in cks:
    cookies[ck['name']] = ck['value']
with open('ck.txt', 'w') as f:
    json.dump(cookies, f)

print(browser.page_source)  # print the page's HTML source
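To round this off, here is a rough sketch of how the saved ck.txt might be reused by a conventional scraper. It uses only the standard library. The helper names (`load_cookies`, `cookie_header`, `build_request`) and the default User-Agent string are illustrative assumptions, not part of the setup above.

```python
import json
import urllib.request


def load_cookies(path="ck.txt"):
    """Read the name -> value cookie dict written by the Selenium step."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def cookie_header(cookies):
    """Join a cookie dict into the value of an HTTP Cookie header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_request(url, cookies, user_agent="Mozilla/5.0"):
    """Build a request object that carries the saved cookies."""
    headers = {"Cookie": cookie_header(cookies), "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


# Usage (the URL is a placeholder):
# req = build_request("https://example.com/", load_cookies())
# html = urllib.request.urlopen(req).read().decode("utf-8")
```

If you use the requests library instead, the same dict can be passed directly as `requests.get(url, cookies=load_cookies())`. Some sites also tie a session to the browser's User-Agent, so it may be necessary to send the same one the browser used.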