urllib Library Operations (Part 3): Exception Handling in Python 3

Original post by 小哥, published 2022-11-16. Tag: #大杂烩

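Exception handling

urllib's error module defines the exceptions raised by the request module. When something goes wrong, the request module raises an exception defined in the error module.

1. URLError:

(1) Comes from the error module of the urllib library. It inherits from the OSError class and is the base class of the error module's exceptions, so any exception produced by the request module can be handled by catching it.
(2) Its reason attribute returns the cause of the error.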

from urllib import request, error

try:
    response = request.urlopen('https://blog.asdn.net/hy592070616')
except error.URLError as e:
    # reason describes why the request failed
    print(e.reason)

Output:

[Errno 11001] getaddrinfo failed

Handling the exception keeps the program from terminating abnormally and lets it deal with the error in a controlled way.

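2. HTTPError:

(1) A subclass of URLError, dedicated to HTTP request errors such as a failed authentication request.
(2) It has three attributes:
code: the HTTP status code, for example 404 for a page that does not exist or 500 for an internal server error.
reason: the cause of the error.
headers: the headers of the HTTP response.

Because HTTPError is a subclass of URLError, you can catch the subclass error first and then catch the parent class error.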
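The inheritance chain described above (HTTPError under URLError, URLError under OSError) can be checked without any network access. This is only a small sanity check using the built-in issubclass, not part of the original article:

```python
from urllib import error

# HTTPError is a subclass of URLError, which in turn is a subclass of OSError
print(issubclass(error.HTTPError, error.URLError))  # True
print(issubclass(error.URLError, OSError))          # True
```

This is why an `except error.URLError` clause also catches HTTPError, and why the more specific clause has to be listed first.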

from urllib import request, error

try:
    response = request.urlopen('https://blog.asdn.net/hy592070616')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)   # print the reason, code, and headers attributes

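# A better way to handle the exceptions:
# catch the subclass exception first, then the parent class exception.
from urllib import request, error

try:
    response = request.urlopen('https://blog.csdn.net/Daycym/article/details/11')
except error.HTTPError as e:
    # HTTP-level failure: a status code and response headers are available
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    # Any other request failure, such as a DNS lookup error
    print(e.reason)
else:
    # Runs only when no exception was raised
    print('Request Successfully')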
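To see which except branch handles which exception without depending on a live site, the exceptions can be constructed by hand. The sketch below is illustrative only: the describe helper and the example.com URL are assumptions made for this demonstration, and the exceptions are raised manually rather than by urlopen.

```python
from urllib import error


def describe(exc):
    """Raise exc and report which except branch handled it (hypothetical helper)."""
    try:
        raise exc
    except error.HTTPError as e:
        # The subclass is listed first, so HTTP errors land here
        return f"HTTPError {e.code}: {e.reason}"
    except error.URLError as e:
        # Everything else derived from URLError lands here
        return f"URLError: {e.reason}"


# HTTPError(url, code, msg, hdrs, fp); headers and body are left empty here
http_err = error.HTTPError("http://example.com/missing", 404, "Not Found", None, None)
url_err = error.URLError("getaddrinfo failed")

print(describe(http_err))  # HTTPError 404: Not Found
print(describe(url_err))   # URLError: getaddrinfo failed
```

If the two except clauses were swapped, the URLError branch would handle both cases and the status code would never be reported.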

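Parsing URLs

The urllib library also provides the parse module, which defines a standard interface for processing URLs, such as extracting, combining, and converting the parts of a URL. It supports URL handling for the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet, and wais.

urlparse: splits a URL into its component parts. The result is a ParseResult, a named tuple of six items: scheme, netloc, path, params, query, and fragment.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: required, the URL to parse.
scheme: the default scheme (for example http or https). It is used only when the URL itself carries no scheme information.
allow_fragments: whether to parse the fragment. If it is set to False, the fragment is not split off; it stays part of the path, params, or query, and the fragment field is empty.

The three arguments are the URL, the default scheme, and whether to parse the anchor (fragment).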

from urllib.parse import urlparse
# Parse a full URL
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

from urllib.parse import urlparse
# No scheme in the URL, so the scheme argument is used
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse
# The URL already has a scheme, so the scheme argument is ignored
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

from urllib.parse import urlparse
# With allow_fragments=False the fragment stays attached to the query
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

from urllib.parse import urlparse
# With allow_fragments=True (the default) the fragment is split off
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=True)
print(result)
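Since the value returned by urlparse is a named tuple, its parts can be read by attribute name, by index, or by unpacking. A minimal sketch using the same Baidu URL as the snippets above (the variable names are just for illustration):

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# Attribute access and index access return the same values
print(result.scheme, result[0])   # http http
print(result.netloc, result[1])   # www.baidu.com www.baidu.com

# The six components can also be unpacked in order
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)  # /index.html user id=5 comment

# Without a scheme in the URL, the host ends up in path, not netloc
no_scheme = urlparse('www.baidu.com/index.html', scheme='https')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)
```

The last call shows why a URL without `//` in front of the host is treated as a relative path: netloc is empty and the whole string lands in path.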

urlunparse: the inverse of urlparse

It accepts an iterable whose length must be exactly 6; otherwise it raises an error about too few or too many values.

from urllib.parse import urlunparse
# The inverse of urlparse: build a URL from its six components
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']  # the length must be 6
print(urlunparse(data))

Output:

http://www.baidu.com/index.html;user?a=6#comment

urlsplit()

The urlsplit() method is very similar to urlparse(), except that it no longer parses params separately: params is merged into path, and only 5 results are returned.
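A short sketch of urlsplit(), reusing the Baidu URL from the urlparse examples, shows params staying inside path:

```python
from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

# Only five components come back; ';user' remains part of the path
print(len(result))   # 5
print(result.path)   # /index.html;user
print(result.scheme, result[0])  # http http
```

Like ParseResult, the returned SplitResult is a named tuple, so both attribute and index access work.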

urlunsplit()

Like urlunparse(), urlunsplit() combines the parts of a link into a complete URL. Its argument is also an iterable, such as a list or tuple; the only difference is that the length must be 5.
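A minimal urlunsplit() sketch, modeled on the urlunparse example with the params item dropped (the data values are illustrative):

```python
from urllib.parse import urlunsplit

# scheme, netloc, path, query, fragment: the length must be 5
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
url = urlunsplit(data)
print(url)  # http://www.baidu.com/index.html?a=6#comment
```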

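urljoin: joins URLs, resolving a new link against a base link to generate a complete link.

Pass a base_url (the base link) as the first argument and the new link as the second. The method analyzes the scheme, netloc, and path of base_url and uses them to fill in whatever is missing from the new link, then returns the result. In other words, base_url supplies three items: scheme, netloc, and path. If one of these is missing from the new link, it is filled in from base_url; if the new link already has it, the new link's own part is used. The params, query, and fragment in base_url have no effect.
[Note] The new link takes precedence; where it is incomplete, base_url supplies the missing parts.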

from urllib.parse import urljoin
# Join a base URL with a new link
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://baidu.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Output:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://baidu.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

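urlencode: converts a dictionary into GET request parameters

It is very useful when constructing GET request parameters, because it serializes a dictionary into a GET query string. To build parameters more conveniently, we often hold them in a dictionary first; converting them into URL parameters then takes a single call to this method.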

from urllib.parse import urlencode
# Convert a dictionary into GET request parameters
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output:

http://www.baidu.com?name=germey&age=22

parse_qs(): converts a string of GET request parameters back into a dictionary

from urllib.parse import parse_qs
query = 'name=germey&age=22'
print(parse_qs(query))

Output:

{'name': ['germey'], 'age': ['22']}

parse_qsl(): converts the parameters into a list of tuples

from urllib.parse import parse_qsl
query = 'name=germey&age=22'
print(parse_qsl(query))

Output:

[('name', 'germey'), ('age', '22')]
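The three helpers fit together as a round trip: urlencode serializes, while parse_qs and parse_qsl deserialize. The sketch below reuses the name/age parameters from the examples above; note that every value comes back as a string, so the integer 22 is not restored.

```python
from urllib.parse import urlencode, parse_qs, parse_qsl

params = {'name': 'germey', 'age': 22}

query = urlencode(params)        # dictionary -> query string
as_dict = parse_qs(query)        # query string -> dict of lists
as_pairs = parse_qsl(query)      # query string -> list of (key, value) tuples
rebuilt = urlencode(as_pairs)    # urlencode also accepts a list of pairs

print(query)     # name=germey&age=22
print(as_dict)   # {'name': ['germey'], 'age': ['22']}
print(as_pairs)  # [('name', 'germey'), ('age', '22')]
print(rebuilt == query)  # True
```

parse_qs wraps each value in a list because a key may appear more than once in a query string; parse_qsl keeps the pairs in their original order instead.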

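quote()

Converts content into URL-encoded form. A URL that contains Chinese characters can end up garbled, and this method converts those characters into URL encoding.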
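As an illustration, the sketch below percent-encodes a Chinese search keyword before appending it to a URL. The keyword and the Baidu search URL are assumptions chosen for the example, not taken from the article.

```python
from urllib.parse import quote

keyword = '壁纸'  # illustrative keyword
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)  # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
```

By default quote() encodes the text as UTF-8 and escapes each byte as %XX.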

unquote()

Decodes a URL-encoded string.
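A minimal decoding sketch; the encoded URL here is an assumed example whose query value is the UTF-8 percent-encoding of two Chinese characters:

```python
from urllib.parse import unquote

encoded = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
decoded = unquote(encoded)
print(decoded)  # https://www.baidu.com/s?wd=壁纸
```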

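Analyzing the Robots protocol

The Robots protocol

Also called the crawler protocol or robot protocol, its full name is the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. It usually takes the form of a text file named robots.txt, generally placed in the root directory of the site.

robotparser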

from urllib.robotparser import RobotFileParser

# Create a RobotFileParser object first, then set the robots.txt link with set_url()
robotparser = RobotFileParser()
# Or: robotparser = RobotFileParser('http://www.jianshu.com/robots.txt')
robotparser.set_url('http://www.jianshu.com/robots.txt')
robotparser.read()
print(robotparser.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))  # check whether the page may be crawled
print(robotparser.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))

Output:
False
False

Here is a sample robots.txt:

User-agent: *
Disallow: /
Allow: /public/

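User-agent gives the name of the crawler that the rules apply to. User-agent: * means the rules apply to every crawler,
while User-agent: Baiduspider means they apply only to the Baidu crawler. If there are several User-agent records, several crawlers are subject to the restrictions, but at least one record must be specified.
Disallow specifies the directories that must not be crawled. In the example above it is set to /, which means that no page may be crawled.
Allow is generally used together with Disallow rather than on its own, to carve out exceptions to a restriction. Here it is set to /public/, so taken together the rules mean that no page may be crawled except those under the public directory.

Save the content above as a robots.txt file and place it in the root directory of the site, alongside the site's entry files (such as index.php, index.html, and index.jsp).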

The rules that forbid all crawlers from accessing any directory are:

User-agent: *
Disallow: /

The rules that allow all crawlers to access any directory are:

User-agent: *
Disallow:

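The rules that forbid all crawlers from accessing specific directories of the site are:

User-agent: *
Disallow: /private/
Disallow: /tmp/

The rules that allow only one crawler to access the site are:

User-agent: webCrawler
Disallow:
User-agent: *
Disallow: /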
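Rule sets like these can be tried out offline, because RobotFileParser.parse() accepts the lines of a robots.txt file directly. The sketch below checks the "only one crawler" rules; the blank line between the two records, the OtherBot name, and the example.com URL are assumptions added for the demonstration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: webCrawler
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # feed the rules in without any network request

url = 'http://www.example.com/index.html'  # placeholder URL
print(parser.can_fetch('webCrawler', url))  # True: the empty Disallow permits everything
print(parser.can_fetch('OtherBot', url))    # False: falls back to the * record
```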

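The robotparser module provides the RobotFileParser class.
Declaration:

urllib.robotparser.RobotFileParser(url='')

The URL may be omitted at construction, in which case it defaults to empty and is set afterwards with the set_url() method.

Commonly used methods of this class:

set_url(): sets the link to the robots.txt file. If a link was already passed in when the RobotFileParser object was created, this method is not needed.
read(): fetches the robots.txt file and parses it. It returns nothing but performs the fetch and parse. If robots.txt has never been read, every later check returns False, so be sure to call this method.
parse(): parses a robots.txt file. The argument is the lines of robots.txt content, which are analyzed according to the robots.txt syntax rules.
can_fetch(): takes two arguments, the User-Agent and the URL to be crawled, and returns True or False to indicate whether that crawler may fetch the URL.
mtime(): returns the time robots.txt was last fetched and parsed. This matters for long-running crawlers, which need to check periodically so they can fetch the latest robots.txt.
modified(): also useful for long-running crawlers; it sets the time robots.txt was last fetched and parsed to the current time.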
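The methods listed above can be exercised together without a network connection. In this sketch the rules are the /private/ and /tmp/ example from earlier, passed to parse() as a list of lines; the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

lines = [
    'User-agent: *',
    'Disallow: /private/',
    'Disallow: /tmp/',
]

parser = RobotFileParser()

# Before any rules are loaded, the timestamp is 0 and every check is False
print(parser.mtime())                                                # 0
print(parser.can_fetch('*', 'http://www.example.com/index.html'))    # False

parser.parse(lines)  # parse() also records the time of the last check

print(parser.mtime() > 0)                                            # True
print(parser.can_fetch('*', 'http://www.example.com/index.html'))    # True
print(parser.can_fetch('*', 'http://www.example.com/private/a.html'))  # False
print(parser.can_fetch('*', 'http://www.example.com/tmp/cache.html'))  # False
```

A long-running crawler could compare mtime() against the current time and call read() again once the cached rules are older than some chosen interval.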