
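A web crawler, also called a web spider, is a program or script that automatically fetches information from the Web. Search engines and some other websites use web crawlers to keep their content up to date.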

==Introduction==


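==Crawler basics==
The examples below are crawlers written in [[Python]] 3.
===Fetching pages===
====request====
request is a module in Python's urllib standard library package, so it can be used without installing anything. It makes fetching the content of a URL straightforward.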

<syntaxhighlight lang="python">
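from urllib.request import urlopen

# Open the URL and print the raw response body (bytes);
# the with block closes the connection when done
with urlopen('https://www.baidu.com') as response:
    print(response.read())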
</syntaxhighlight>
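
The basic call above can be extended with a request header, a timeout, and error handling. The following is a minimal sketch, not a definitive implementation: the function name <code>fetch_text</code>, the User-Agent string, and the 10-second timeout are illustrative assumptions that do not come from the urllib documentation.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_text(url, timeout=10):
    """Return the decoded body of url, or None if the request fails.

    fetch_text is a hypothetical helper name used for illustration.
    """
    # Some sites reject urllib's default User-Agent; this value is only an example.
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0 (example crawler)'})
    try:
        with urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the response does not declare a charset.
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset, errors='replace')
    except URLError:
        # HTTPError is a subclass of URLError, so HTTP error statuses land here too.
        return None
```

It might be used like this:
 text = fetch_text('https://www.baidu.com')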

{{了解更多 |[https://docs.python.org/zh-cn/3/library/urllib.request.html Python documentation: urllib.request, an extensible library for opening URLs] }}

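====requests====
[[requests]] is a third-party HTTP library for Python. Its goal is to make HTTP requests simpler and more human-friendly.

It can be installed with [[pip]]:
 pip install requests

Sending a request with Requests is simple. To fetch a web page, for example, call the get function. It returns a Response object, from which we can read everything we need.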
<syntaxhighlight lang="python">
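import requests

# Send a GET request; get() returns a Response object
response = requests.get('https://www.baidu.com')
# content holds the raw response body as bytes
print(response.content)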
</syntaxhighlight>
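
A Response object carries more than the raw bytes. The sketch below checks the HTTP status and returns the decoded text instead. The function name <code>fetch_page</code> and the 10-second timeout are assumptions chosen for illustration; <code>raise_for_status()</code>, <code>text</code>, and <code>requests.exceptions.RequestException</code> are part of the Requests API.

```python
import requests


def fetch_page(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_page is a hypothetical helper name used for illustration.
    """
    try:
        # Without a timeout, requests can wait indefinitely for a slow server.
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException:
        # Covers connection errors, timeouts, invalid URLs, and HTTP errors.
        return None
    # text is the body decoded with the encoding Requests detected.
    return response.text
```

It might be used like this:
 page = fetch_page('https://www.baidu.com')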

{{了解更多 
| [https://requests.readthedocs.io/zh_CN/latest/ Requests documentation]
| [https://github.com/psf/requests Requests source code]
}}

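===Parsing pages===
====BeautifulSoup====
[[Beautiful Soup]] is a third-party Python library for parsing HTML documents and extracting the data you need.

It can be installed with pip:
 pip install beautifulsoup4

For example, to extract a page's title with BeautifulSoup: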
<syntaxhighlight lang="python">
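from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page, then parse it with Python's built-in HTML parser
html = urlopen('https://www.baidu.com')
bs = BeautifulSoup(html.read(), 'html.parser')

# Print the page's title element
print(bs.title)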
</syntaxhighlight>
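
Beyond the title, a crawler usually needs the links on a page. The sketch below parses an inline HTML string so that it runs without network access. The sample markup and the helper names <code>extract_title</code> and <code>extract_links</code> are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample page used only for this illustration.
SAMPLE_HTML = """
<html><head><title>Example page</title></head>
<body>
<a href="/first">First</a>
<a href="https://example.com/second">Second</a>
<a>No href</a>
</body></html>
"""


def extract_title(html):
    """Return the text of the title element, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.title.string if soup.title else None


def extract_links(html):
    """Return the href value of every a element that has one."""
    soup = BeautifulSoup(html, 'html.parser')
    # href=True skips anchors that have no href attribute.
    return [a['href'] for a in soup.find_all('a', href=True)]


print(extract_title(SAMPLE_HTML))
print(extract_links(SAMPLE_HTML))
```

The same functions should work on HTML fetched from a real site, for example the decoded body of a response.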

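====XPath====
[[XPath]] (XML Path Language) is a query language for selecting nodes in an XML document. In Python, the third-party lxml library can parse a page, and XPath expressions can then select content from it.

lxml can be installed with pip: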
 pip install lxml
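
As a rough sketch of how lxml and XPath fit together, the example below parses an inline HTML string and selects nodes with three XPath expressions. The sample markup and the helper name <code>select</code> are assumptions made for illustration; <code>lxml.html.fromstring</code> and the <code>xpath</code> method are part of lxml.

```python
from lxml import html

# Made-up sample page used only for this illustration.
SAMPLE_HTML = """
<html><head><title>Example page</title></head>
<body>
<ul>
<li class="item">Apple</li>
<li class="item">Banana</li>
<li>Other</li>
</ul>
<a href="/first">First</a>
</body></html>
"""


def select(page, expression):
    """Parse page as HTML and return the nodes matched by an XPath expression."""
    tree = html.fromstring(page)
    return tree.xpath(expression)


# text() selects text nodes; @href selects attribute values.
print(select(SAMPLE_HTML, '//title/text()'))
print(select(SAMPLE_HTML, '//li[@class="item"]/text()'))
print(select(SAMPLE_HTML, '//a/@href'))
```

Here <code>//</code> matches elements at any depth, and the predicate <code>[@class="item"]</code> keeps only elements whose class attribute equals item.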



==References==
*[https://zh.wikipedia.org/wiki/网络爬虫 Wikipedia (Chinese): Web crawler]
*[https://zh.wikipedia.org/wiki/网页抓取 Wikipedia (Chinese): Web scraping]
*[https://en.wikipedia.org/wiki/Web_crawler Wikipedia: Web crawler]
*[https://en.wikipedia.org/wiki/XPath Wikipedia: XPath]

[[分类:数据获取]]