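Web Crawler

A web crawler, also called a web spider, is a program or script that automatically fetches information from the Web. Search engines and some other websites use web crawlers to keep their content up to date.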

Introduction

Crawler Basics

The crawler examples below are written in Python 3.

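Fetching Pages

request

request (urllib.request) is a module in the urllib package of the Python standard library, so it can be used without installing anything. It provides a convenient way to fetch the content behind a URL.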

from urllib.request import urlopen

# urlopen returns a response object; read() returns the body as bytes
with urlopen('https://www.baidu.com') as response:
    print(response.read())
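In practice a crawler usually sets a User-Agent header and a timeout, and decodes the bytes into text. The following is a minimal sketch of that idea using only urllib.request. The helper names build_request and fetch_text, the User-Agent string, and the 10-second timeout are assumptions chosen for illustration, not part of the original example.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="example-crawler/0.1"):
    """Build a Request carrying a custom User-Agent header (hypothetical value)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body decoded to a string."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Use the charset declared in Content-Type, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_text('https://www.baidu.com') would then return the page source as a str rather than bytes.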

requests

requests is a third-party HTTP library for Python. Its goal is to make HTTP requests simpler and more human-friendly.

requests can be installed with pip:

pip install requests

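Sending a network request with Requests is very simple. For example, to fetch a web page, call the get function. It returns a Response object, from which we can read the status code, headers, and body of the reply.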

import requests

# get() returns a Response object; content is the raw body as bytes
response = requests.get('https://www.baidu.com')
print(response.content)
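To illustrate what can be read from a Response object, here is a small sketch. The helper names fetch and describe_response and the 10-second timeout are assumptions made for this example; the attributes used (status_code, headers, encoding, content) and raise_for_status() are part of the Requests API.

```python
import requests


def fetch(url, timeout=10):
    """Send a GET request and fail loudly on 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on error status
    return response


def describe_response(response):
    """Collect a few commonly used fields from a requests.Response."""
    return {
        "status": response.status_code,                         # e.g. 200
        "content_type": response.headers.get("Content-Type"),   # header lookup is case-insensitive
        "encoding": response.encoding,                          # encoding used by response.text
        "length": len(response.content),                        # body size in bytes
    }
```

For instance, describe_response(fetch('https://www.baidu.com')) would return a dictionary summarizing the reply. The decoded body itself is available as response.text.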

Learn more: Requests documentation, Requests source code

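Parsing Pages

BeautifulSoup

Beautiful Soup is a third-party Python library for parsing HTML documents, which makes it easy to extract the data you need.

It can be installed with pip: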

pip install beautifulsoup4

For example, to extract a page's title with BeautifulSoup:

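from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page, then parse it with Python's built-in HTML parser
html = urlopen('https://www.baidu.com')
bs = BeautifulSoup(html.read(), 'html.parser')

# Print the <title> element of the page
print(bs.title)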
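Beyond the title, a crawler typically needs the links on a page. The sketch below shows one way to collect them with find_all. It parses an inline HTML string so that it runs without network access; SAMPLE_HTML and the helper name extract_links are made up for illustration.

```python
from bs4 import BeautifulSoup

# A small, made-up page used only for this illustration
SAMPLE_HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/news">News</a>
    <a href="https://example.com/about">About</a>
    <a>No link here</a>
  </body>
</html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors without an href attribute
    for a in soup.find_all("a", href=True):
        links.append((a.get_text(strip=True), a["href"]))
    return links
```

Relative links such as /news would still need to be resolved against the page URL, for example with urllib.parse.urljoin, before they can be fetched.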

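XPath

XPath (XML Path Language) is a query language for selecting nodes in an XML document. In Python, the third-party library lxml can be used to parse a page, after which XPath expressions select the content you want.

lxml can be installed with pip: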

pip install lxml
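As a rough illustration of selecting content with XPath, the sketch below parses an inline HTML string with lxml.html and runs two XPath queries. SAMPLE_HTML and the helper name select_with_xpath are assumptions made for this example; fromstring and xpath are part of lxml.

```python
from lxml import html

# A small, made-up page used only for this illustration
SAMPLE_HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/news">News</a>
    <a href="https://example.com/about">About</a>
  </body>
</html>
"""


def select_with_xpath(page_source):
    """Return the title text nodes and all link targets found in the page."""
    tree = html.fromstring(page_source)
    titles = tree.xpath("//title/text()")  # text nodes inside <title>
    hrefs = tree.xpath("//a/@href")        # href attribute of every <a>
    return titles, hrefs
```

Here //title/text() selects text nodes and //a/@href selects attribute values, so both queries return lists of strings rather than elements.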

References