
Beautiful Soup is a [[Python]] library that makes it easy to extract data from [[HTML]] or [[XML]] files.

==Introduction==
===Timeline===

===Installation===
Beautiful Soup 4 can be installed in any of the following three ways.
====Installing with pip====
Beautiful Soup 4 is published on [[PyPI]], so it can be installed with [[pip]]:
 pip install beautifulsoup4

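====Installing with the system package manager====
On a recent version of [[Debian]] or [[Ubuntu]], Beautiful Soup can also be installed with the system package manager:
 apt-get install python3-bs4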

====Installing from source====
[https://www.crummy.com/software/BeautifulSoup/bs4/download/ Download the Beautiful Soup 4 source], then install it with setup.py:
 python setup.py install

{{了解更多
|[https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id9 Beautiful Soup 4 documentation: Installing Beautiful Soup]
|}}
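
Whichever installation method is used, a quick check from Python can confirm that the package is importable. The snippet below is a minimal sketch; the helper name <code>bs4_installed</code> is made up for this illustration and is not part of Beautiful Soup.

```python
import importlib.util


def bs4_installed() -> bool:
    """Return True if the bs4 package can be found on the import path."""
    # find_spec returns None when no top-level module named "bs4" exists.
    return importlib.util.find_spec("bs4") is not None


if bs4_installed():
    import bs4
    print("Beautiful Soup version:", bs4.__version__)
else:
    print("bs4 is not installed; run: pip install beautifulsoup4")
```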

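==Basics==
===Fetching a page===
A page can be fetched with the [https://docs.python.org/zh-cn/3/library/urllib.request.html request] module of Python's standard urllib package, or with the third-party [[requests]] library.

The following example fetches a page with the [[requests]] library, then uses BeautifulSoup to extract the body tag.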
<syntaxhighlight lang="python">
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.baidu.com')

bs = BeautifulSoup(response.content, 'html.parser')
tag = bs.body
</syntaxhighlight>

The following example fetches a page with Python's urllib.request module, then uses BeautifulSoup to extract the title string:
<syntaxhighlight lang="python">
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.baidu.com')

bs = BeautifulSoup(html.read(), 'html.parser')
title = str(bs.title.string)  # the title text
</syntaxhighlight>


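===Document parsers===
Beautiful Soup supports the HTML parser in Python's standard library, [https://docs.python.org/zh-cn/3/library/html.parser.html html.parser], as well as several third-party parsers. If no parser is named in the constructor call, Beautiful Soup picks one of the installed parsers automatically, in this order of preference: [[lxml]], [[html5lib]], [https://docs.python.org/zh-cn/3/library/html.parser.html html.parser]. Different parsers can produce different results for the same document, so it is best to name the parser explicitly to keep a program's behaviour stable. The supported document formats are [[HTML]], HTML5 and [[XML]]; only the lxml parser can parse XML documents. The table below lists the currently supported parsers: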
{| class="wikitable"  style="width: 100%;"
! Parser
! Description
! Installation
! Usage
|-
| Python standard library html.parser
| Moderate speed; built into Python's standard library
| No installation needed
| BeautifulSoup(markup, "html.parser")
|-
| lxml HTML parser
| Fast
| pip install lxml
| BeautifulSoup(markup, "lxml")
|-
| lxml XML parser
| Fast; the only parser that supports XML
| pip install lxml
| BeautifulSoup(markup, "lxml-xml") <br />BeautifulSoup(markup, "xml")
|-
| html5lib
| Slow; the most lenient; parses the document the way a browser does and produces an HTML5 document
| pip install html5lib
| BeautifulSoup(markup, "html5lib")
|}


{{了解更多
|[https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id13 Beautiful Soup 4 documentation: Installing a parser]
|[https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id53 Beautiful Soup 4 documentation: Specifying the parser to use]
}}
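
To see how the choice of parser affects the result, the same invalid fragment can be fed to each parser that happens to be installed. The following is a minimal sketch: the helper <code>parse_with</code> is a name invented for this example, and it relies on Beautiful Soup raising <code>FeatureNotFound</code> when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Return the serialized tree built by `parser`, or None if it is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised when the named parser library cannot be found.
        return None


broken = "<a></p>"  # deliberately invalid: a stray closing tag
for parser in ("html.parser", "lxml", "html5lib"):
    result = parse_with(broken, parser)
    print(parser, "->", result if result is not None else "(not installed)")
```

With html.parser the stray closing tag is dropped and the result is <code><nowiki><a></a></nowiki></code>; the other parsers, if installed, repair the fragment differently, which is why naming the parser explicitly keeps results reproducible.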

==Objects==
Beautiful Soup turns a complex [[HTML]] document into a tree of Python objects. Every node is a Python object, and all of the objects fall into four kinds:
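{| class="wikitable"  style="width: 100%;"
! Class
! Description
! How it is created
! Operations
|-
|BeautifulSoup
|A BeautifulSoup object represents the whole document. It can mostly be treated as a Tag object and supports most of the methods described under navigating the tree and searching the tree.
|For example <code>BeautifulSoup(markup, "html.parser")</code>
|
|-
|Tag
|A Tag object corresponds to a tag in the original XML or HTML document.
|Obtained by navigating or searching the tree from a BeautifulSoup object. For example, if bs is a BeautifulSoup object, <code>bs.p</code>
|The tag name is available as <code>.name</code>, for example tag.name.<br />A single attribute is read with <code>['attribute name']</code> or <code>.get('attribute name')</code>, for example tag['class'] or tag.get('class').<br />All attributes are available as <code>.attrs</code>, which returns a dictionary.
|-
|NavigableString
|The text inside a tag is wrapped in the NavigableString class. A NavigableString behaves like a Python Unicode string and also supports some of the features described under navigating the tree and searching the tree.
|Obtained through a Tag object's string attribute.<br />For example <code>tag.string</code>
|
|-
|Comment
|Comments in the document are wrapped in the Comment class. A Comment object is a special kind of NavigableString object.
|Obtained through a Tag object's string attribute when the tag's only child is a comment.<br />For example <code>tag.string</code>
|
|}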

{{了解更多
|[https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id15 Beautiful Soup 4 documentation: Kinds of objects]
}}
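===BeautifulSoup ===
A BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object, and it supports most of the methods described under navigating the tree and searching the tree. Because a BeautifulSoup object does not correspond to a real HTML or XML tag, it has no tag name or attributes of its own; its .name is the special value "[document]".

The following creates a BeautifulSoup object bs, where markup is the document content:
 bs = BeautifulSoup(markup, "html.parser")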
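
As a small illustration of treating the BeautifulSoup object like a Tag, the sketch below parses a short literal document (the markup is invented for this example) and reads both the document object's own name and tags reached through it.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from any real page.
markup = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
bs = BeautifulSoup(markup, "html.parser")

print(bs.name)         # [document]
print(bs.title.name)   # title
print(bs.p.string)     # Hi
```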
 

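===Tag===
A Tag object corresponds to a tag in the original [[XML]] or [[HTML]] document. A Tag has many methods and attributes; the two most important features are its name and its attributes. For example:
<syntaxhighlight lang="python">
from bs4 import BeautifulSoup

bs = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = bs.b    # get a Tag object
type(tag)     # <class 'bs4.element.Tag'>

# Get the tag's name
tag.name     # 'b'
tag.name = "blockquote"  # the tag name can also be changed

# Read attributes
tag.attrs      # all attributes of the tag: {'class': ['boldest']}
tag['class']   # the value of one attribute: ['boldest']

# Attributes are added, removed and modified the same way as dictionary entries.
tag['class'] = 'verybold'   # change the class attribute to verybold
tag['id'] = 1               # add an id="1" attribute
del tag['id']               # remove the id attribute
</syntaxhighlight>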



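===NavigableString ===
Strings usually sit inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class, which is what a tag's .string attribute returns. A NavigableString behaves like a Python Unicode string and also supports some of the features described under navigating the tree and searching the tree. Python's str() function converts a NavigableString object directly into a plain Unicode string: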

<syntaxhighlight lang="python">
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.baidu.com')
bs = BeautifulSoup(html.read(), 'html.parser')
tag = bs.title

tag.string         #'百度一下，你就知道'
type(tag.string)   # <class 'bs4.element.NavigableString'>

title = str(tag.string)    #'百度一下，你就知道'
type(title)   # <class 'str'>
</syntaxhighlight>

===Comment ===
A Comment holds a comment from the document and is one of the special string types. A Comment object is a special kind of NavigableString object.
<syntaxhighlight lang="python">
from bs4 import BeautifulSoup

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')

comment1 = soup.b.string
type(comment1)    # <class 'bs4.element.Comment'>
comment1          # 'Hey, buddy. Want to buy a used parser?'
</syntaxhighlight>
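
Because Comment is a subclass of NavigableString, code that processes text often needs to tell the two apart. The following is a minimal sketch using <code>isinstance</code> checks; the markup and variable names are invented for this example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Illustrative markup: ordinary text on both sides of a comment.
markup = "<p>Hello<!-- hidden note -->world</p>"
soup = BeautifulSoup(markup, "html.parser")

# Every Comment is also a NavigableString, so test for Comment first.
comments = [str(node) for node in soup.p.contents if isinstance(node, Comment)]
texts = [
    str(node)
    for node in soup.p.contents
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]

print(comments)  # [' hidden note ']
print(texts)     # ['Hello', 'world']
```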

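==Navigating the tree==
===Children===
A Tag may contain several strings or other Tags; these are all children of that Tag. Beautiful Soup provides many attributes for working with and iterating over children. The simplest way to navigate the tree is to ask for the tag you want by name.

In the table below, bs is a BeautifulSoup object and tag is a Tag object: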
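{| class="wikitable"  style="width: 100%;"
! Attribute
! Description
! Example
|-
| .tag name
| Returns the first tag with that name below the object.
| <code>bs.p</code> returns the first tag named p in bs.<br /><code>bs.p.b</code> returns the first b tag inside that p tag.
|-
| .contents
| Returns the object's children as a list. Direct children only.
| <code>bs.contents</code>
|-
| .children
| Returns an iterator over the object's children. Direct children only.
| <code>bs.children</code>
|-
| .descendants
| Returns a generator that walks all descendants recursively.
| <code>bs.descendants</code>
|-
| .string
| Returns the node's NavigableString object when the node has a single child. Returns None when the node has more than one child.
| <code>tag.string</code>
|-
| .strings
| Returns a generator over all NavigableString objects below the node.
| <code>tag.strings</code>
|-
| .stripped_strings
| Like .strings, but strips leading and trailing whitespace from each string and skips strings that consist only of whitespace.
| <code>tag.stripped_strings</code>
|}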
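
The attributes in the table can be tried on a small document. The sketch below uses markup invented for this example; the expected values in the comments follow from how html.parser builds the tree for this particular input.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative markup: a div with two paragraphs, one containing a nested tag.
markup = "<div><p>One <b>two</b></p><p> three </p></div>"
bs = BeautifulSoup(markup, "html.parser")
div = bs.div

child_names = [child.name for child in div.children]   # direct children only
descendant_names = [node.name for node in div.descendants if isinstance(node, Tag)]
all_strings = [str(s) for s in div.strings]
clean_strings = list(div.stripped_strings)

print(len(div.contents))   # 2
print(child_names)         # ['p', 'p']
print(descendant_names)    # ['p', 'b', 'p']
print(div.string)          # None, because div has more than one child
print(bs.b.string)         # two
print(all_strings)         # ['One ', 'two', ' three ']
print(clean_strings)       # ['One', 'two', 'three']
```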


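===Parents===
Every Tag or NavigableString has a parent, which is the node that contains it:

{| class="wikitable"  style="width: 100%;"
! Attribute
! Description
! Example
|-
|.parent
|Returns the element's parent node.
|<code>tag.parent</code>
|-
|.parents
|Returns a generator over all of the element's ancestors, from the nearest outward.
|<code>tag.parents</code>
|}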
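
A short sketch of walking upward through the tree, using markup invented for this example:

```python
from bs4 import BeautifulSoup

# Illustrative markup with a few levels of nesting.
markup = "<html><body><p><b>bold</b></p></body></html>"
bs = BeautifulSoup(markup, "html.parser")
b = bs.b

ancestor_names = [parent.name for parent in b.parents]

print(b.parent.name)          # p
print(b.string.parent.name)   # b, since the string's parent is the tag holding it
print(ancestor_names)         # ['p', 'body', 'html', '[document]']
print(bs.parent)              # None, the document object has no parent
```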
==Searching the tree==



==Resources==
===Websites===
*[https://www.crummy.com/software/BeautifulSoup/ Beautiful Soup]
*[https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ Beautiful Soup 4 documentation]

==References==
*[https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser) Wikipedia: Beautiful Soup]
*[https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ Beautiful Soup 4 documentation]

[[分类:数据获取]]