# The problem

import requests
import bs4

response = requests.get('http://www.jjwxc.net/fenzhan/noyq/')
soup = bs4.BeautifulSoup(response.text, "html.parser")
print(soup.title)  # the Chinese title text comes out garbled

# Analysis

The page itself declares its encoding in a meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=gb18030" />

When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.
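To make the header-based guess concrete, here is a minimal sketch using only the standard library. It is a simplified imitation, not the code Requests actually runs, and the function name `guess_encoding_from_content_type` is made up for illustration. It reads the charset from a Content-Type value and, as Requests does, falls back to ISO-8859-1 for text types that declare none.

```python
from email.message import Message


def guess_encoding_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or a fallback.

    Hypothetical helper: a simplified stand-in for the header-based guess,
    not the code Requests itself runs.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased, or None if absent
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        # No charset on a text/* response: assume the HTTP/1.1 default.
        return "ISO-8859-1"
    return None


print(guess_encoding_from_content_type("text/html; charset=gb18030"))
print(guess_encoding_from_content_type("text/html"))
print(guess_encoding_from_content_type("image/png"))
```

A server that sends a bare `text/html` header therefore produces the ISO-8859-1 guess, whatever the page's own meta tag says.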

Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document’s encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object. Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document’s encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding.
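One clue an encoding detector can use is the charset a document declares about itself. The sketch below, standard library only, pulls that declaration out of the raw bytes with a regular expression and decodes with it. It is a rough illustration, not Unicode, Dammit's implementation; `sniff_declared_charset`, `search_limit`, and the sample page are made up for the example.

```python
import re

# Matches charset=... near the start of an HTML document.
# Searching bytes works because the declaration itself is plain ASCII.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE)


def sniff_declared_charset(raw, search_limit=2048):
    """Return the charset a page declares in its markup, or None."""
    match = CHARSET_RE.search(raw[:search_limit])
    return match.group(1).decode("ascii").lower() if match else None


# Hypothetical page: ASCII markup plus a GB18030-encoded title.
raw_page = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb18030" />'
    b"<title>" + "晋江文学城".encode("gb18030") + b"</title>"
)

declared = sniff_declared_charset(raw_page)
# Fall back to UTF-8 (an assumption of this sketch) when nothing is declared.
text = raw_page.decode(declared or "utf-8", errors="replace")
print(declared)
print(text)
```

The sniffing only works on undecoded bytes. Once the bytes have been turned into text with the wrong codec, the declaration is still readable but the damage is already done.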

OK, so let's check whether Requests and Beautiful Soup guessed the page's encoding correctly.

response = requests.get('http://www.jjwxc.net/fenzhan/noyq/')
print(response.encoding)        # encoding Requests guessed from the HTTP headers
soup = bs4.BeautifulSoup(response.text, "html.parser")
print(soup.original_encoding)   # encoding Beautiful Soup detected, if any

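ISO-8859-1
None

Both guesses miss. Requests reports ISO-8859-1, its default for text responses whose HTTP headers carry no charset, instead of the gb18030 the page declares. Beautiful Soup reports None because it was handed response.text, which is already decoded, so there was nothing left for it to detect.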
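What a wrong guess does to the text can be shown without any network access. In this sketch (the sample string is made up), GB18030 bytes decoded as ISO-8859-1 turn into mojibake, while decoding with the right codec restores the text. Because ISO-8859-1 maps every byte to exactly one character, the wrong decode is reversible.

```python
original = "晋江文学城"
raw = original.encode("gb18030")  # bytes as a server might send them

wrong = raw.decode("iso-8859-1")  # what an ISO-8859-1 guess yields
right = raw.decode("gb18030")     # what the declared encoding yields

print(repr(wrong))
print(right)

# ISO-8859-1 is a one-byte-per-character codec, so no information is lost:
# re-encoding the mojibake gives back the original bytes.
recovered = wrong.encode("iso-8859-1").decode("gb18030")
print(recovered == original)
```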

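Since the page's encoding is known, override the guess by setting response.encoding before reading response.text:

response = requests.get('http://www.jjwxc.net/fenzhan/noyq/')
response.encoding = 'gb18030'   # must be set before response.text is accessed
soup = bs4.BeautifulSoup(response.text, "html.parser")
print(soup.title)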

(Screenshot: BS_encoding.png)

response = requests.get('http://www.jjwxc.net/fenzhan/noyq/')
response.encoding = 'gb18030'
soup = bs4.BeautifulSoup(response.text, "html.parser")
# Passing an encoding makes both calls return bytes in that encoding.
print(soup.title.prettify('gb18030'))
print(soup.title.encode('gb18030'))

<title>

</title>
<title>非言情小说网|同人小说|古代纯爱小说|现代纯爱小说|同人纯爱小说|动漫同人小说【晋江文学城】bl小说站</title>
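Passing an encoding to encode or prettify yields bytes rather than text, which matters when the console expects a particular encoding. Here is a small sketch of the same idea with plain `str.encode` from the standard library instead of Beautiful Soup. The helper `encode_for_console` and the sample title are made up for illustration, and characters the target encoding cannot represent are replaced with `?`.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text for a console, replacing unencodable characters with '?'.

    Hypothetical helper; defaults to the encoding Python reports for stdout.
    """
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace")


title = "<title>晋江文学城</title>"
print(encode_for_console(title, "gb18030"))
print(encode_for_console(title, "ascii"))
```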

(Screenshot: BS_prettify.png)
