fake_useragent, a Python module for getting around anti-crawler checks

Using the Python 3 fake_useragent module and fixing its errors

https://blog.csdn.net/yilovexing/article/details/89044980?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1-89044980-blog-105354439.pc_relevant_default&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7Edefault-1-89044980-blog-105354439.pc_relevant_default&utm_relevant_index=1

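When writing a crawler in Python, we need to disguise the request headers to get past a site's anti-scraping measures. The third-party module fake_useragent handles this well: it returns a random, ready-to-use User-Agent string that we can put straight into our requests.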

Using fake_useragent

Installing fake_useragent

pip install fake_useragent

Example:

from fake_useragent import UserAgent

# Instantiate the UserAgent class
ua = UserAgent()

# User-Agent strings for specific browsers
print(ua.ie)
print(ua.opera)
print(ua.chrome)
print(ua.firefox)
print(ua.safari)

# Return a random User-Agent string (recommended)
print(ua.random)
Output:

(adnice) adnice:Downloads zhangyi$ python3 fake.py
Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322)
Opera/9.80 (Windows NT 6.1; U; fi) Presto/2.7.62 Version/11.00
Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:23.0) Gecko/20131011 Firefox/23.0
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.2117.157 Safari/537.36
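
To show where such a string ends up, here is a minimal sketch that attaches a User-Agent to a standard-library request object. The helper names pick_user_agent and build_request, the fallback constant, and the https://example.com/ URL are assumptions made for illustration; they are not part of fake_useragent. The sketch only builds the request and sends nothing over the network.

```python
import urllib.request

# One of the strings printed in the sample output above, used as a fallback.
FALLBACK_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:23.0) "
    "Gecko/20131011 Firefox/23.0"
)


def pick_user_agent(fallback=FALLBACK_UA):
    """Return a random User-Agent, or the fallback if fake_useragent fails."""
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:
        # Module not installed, or it could not load its browser data.
        return fallback


def build_request(url, user_agent):
    """Create a urllib Request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# No network traffic happens here; the request is only constructed.
request = build_request("https://example.com/", FALLBACK_UA)
print(request.get_header("User-agent"))
```

In a real crawler, build_request(url, pick_user_agent()) would give each request a different header. Note that urllib stores header names capitalized, which is why the lookup uses "User-agent".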
fake_useragent errors and fixes

Error message:

socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\programdata\anaconda3\lib\site-packages\fake_useragent\utils.py", line 166, in load
    verify_ssl=verify_ssl,
  File "d:\programdata\anaconda3\lib\site-packages\fake_useragent\utils.py", line 122, in get_browser_versions
    verify_ssl=verify_ssl,
  File "d:\programdata\anaconda3\lib\site-packages\fake_useragent\utils.py", line 84, in get
    raise FakeUserAgentError('Maximum amount of retries reached')
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
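First, pick out the key error message:

fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached

Roughly, this means the module hit its maximum number of retries while requesting something.

Reading the module's source shows that the library depends on online resources: it makes several attempts to request JSON data from a website, and when those requests time out for whatever reason, it raises this error.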

fake_useragent\settings.py

# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals

import os
import tempfile

__version__ = '0.1.11'

DB = os.path.join(
    tempfile.gettempdir(),
    'fake_useragent_{version}.json'.format(
        version=__version__,
    ),
)

CACHE_SERVER = 'https://fake-useragent.herokuapp.com/browsers/{version}'.format(
    version=__version__,
)

BROWSERS_STATS_PAGE = 'https://www.w3schools.com/browsers/default.asp'

BROWSER_BASE_PAGE = 'http://useragentstring.com/pages/useragentstring.php?name={browser}'  # noqa

BROWSERS_COUNT_LIMIT = 50

REPLACEMENTS = {
    ' ': '',
    '_': '',
}

SHORTCUTS = {
    'internet explorer': 'internetexplorer',
    'ie': 'internetexplorer',
    'msie': 'internetexplorer',
    'edge': 'internetexplorer',
    'google': 'chrome',
    'googlechrome': 'chrome',
    'ff': 'firefox',
}

OVERRIDES = {
    'Edge/IE': 'Internet Explorer',
    'IE/Edge': 'Internet Explorer',
}

HTTP_TIMEOUT = 5

HTTP_RETRIES = 2

HTTP_DELAY = 0.1
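
The way a retry budget like HTTP_RETRIES and HTTP_DELAY can end in "Maximum amount of retries reached" can be pictured with a small loop. This is a sketch written for this article, not the library's actual code: fetch_with_retries, RetriesExhausted, and always_times_out are hypothetical names, and the failing fetch function simply simulates a server that never answers.

```python
import socket
import time

HTTP_RETRIES = 2   # same values as in settings.py above
HTTP_DELAY = 0.1


class RetriesExhausted(Exception):
    """Hypothetical stand-in for FakeUserAgentError."""


def fetch_with_retries(fetch, retries=HTTP_RETRIES, delay=HTTP_DELAY):
    """Call fetch() until it succeeds or the retry budget is used up."""
    attempts = 0
    while True:
        try:
            return fetch()
        except OSError:  # socket.timeout is a subclass of OSError
            attempts += 1
            if attempts >= retries:
                raise RetriesExhausted("Maximum amount of retries reached")
            time.sleep(delay)


calls = []


def always_times_out():
    """Simulate a server that never responds in time."""
    calls.append(1)
    raise socket.timeout("timed out")


try:
    fetch_with_retries(always_times_out)
except RetriesExhausted as exc:
    message = str(exc)
    print(message)
```

With a budget of 2 and a timeout of a few seconds per attempt, an unreachable server makes instantiation fail after only a short wait, which matches the traceback shown earlier.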
Fixes:

First, upgrade fake_useragent

pip install --upgrade fake_useragent

  1. Pass some parameters when instantiating

Disable the cache server

ua = UserAgent(use_cache_server=False)

Do not cache data

ua = UserAgent(cache=False)

Skip SSL verification

ua = UserAgent(verify_ssl=False)
In most cases the options above are enough, but I was out of luck and the error persisted.
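
The three options above can also be tried one after another, with a fixed string as a last resort. The sketch below is one possible way to do that. first_working_user_agent and the stub class are hypothetical names, the keyword arguments are the ones used in this article, and other versions of the library may not accept them (the broad except also covers that case). A stub factory is used so the example runs without the library or a network connection.

```python
# Keyword arguments taken from the article; other library versions may reject them.
OPTION_SETS = [
    {"use_cache_server": False},
    {"cache": False},
    {"verify_ssl": False},
]


def first_working_user_agent(factory, option_sets, fallback):
    """Try each set of options in turn; return the fallback if all fail."""
    for kwargs in option_sets:
        try:
            return factory(**kwargs).random
        except Exception:
            # Covers network errors as well as unsupported keyword arguments.
            continue
    return fallback


class AlwaysFailing:
    """Stub that behaves like a UserAgent class that cannot load its data."""

    def __init__(self, **kwargs):
        raise RuntimeError("simulated failure")


result = first_working_user_agent(AlwaysFailing, OPTION_SETS, "my-fallback-agent")
print(result)
```

With the real module, the call would look like first_working_user_agent(UserAgent, OPTION_SETS, some_fixed_string) after importing UserAgent from fake_useragent.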

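  2. Use a temporary JSON file

fake_useragent\settings.py contains several URLs, some of which cannot be opened. So we save the JSON data from the URL that does open to a local file

wget https://fake-useragent.herokuapp.com/browsers/0.1.11

This gives us a file named 0.1.11; rename it to fake_useragent_0.1.11.json

mv 0.1.11 fake_useragent_0.1.11.json

Then find the temporary directory (it differs from system to system; on Ubuntu, for example, it is /tmp)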

(edison) adnice:T zhangyi$ python3
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import tempfile
>>> tempfile.gettempdir()
'/var/folders/6_/p67xz49j5wd5lzx7s2cz1cdr0000gn/T'

Finally, copy the file into the temporary directory

cp fake_useragent_0.1.11.json /var/folders/6_/p67xz49j5wd5lzx7s2cz1cdr0000gn/T/

The next time we instantiate UserAgent, it reads the local temporary file first, so instantiation no longer fails.
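
The rename and copy steps can also be done from Python, which avoids looking up the temporary directory by hand. This is a minimal sketch: cache_path and install_cache are hypothetical helper names, and the file name pattern is the one from the DB setting in settings.py shown earlier.

```python
import os
import shutil
import tempfile

VERSION = "0.1.11"  # version string used in this article


def cache_path(version=VERSION):
    """Path where settings.py (shown earlier) expects the cached JSON file."""
    name = "fake_useragent_{version}.json".format(version=version)
    return os.path.join(tempfile.gettempdir(), name)


def install_cache(downloaded_file, version=VERSION):
    """Copy a downloaded JSON file to the expected cache location."""
    destination = cache_path(version)
    shutil.copyfile(downloaded_file, destination)
    return destination


# Show where the file has to go on this machine.
print(cache_path())
```

After downloading the file with wget as described above, install_cache("0.1.11") would place it under the right name in one step.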

Reference: https://blog.csdn.net/huiyanshizhu/article/details/84952093