Scrapy: Difference between revisions

Latest revision as of 08:07, 10 November 2023 (Fri)

Scrapy is an open-source web crawling framework.

Introduction

Timeline

Quick start

Architecture

Learn more >> Scrapy documentation: Architecture overview

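Selectors

Scrapy supports selecting data with either XPath or CSS expressions. CSS selectors are translated to XPath under the hood. Use scrapy shell to test selectors interactively.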

Learn more >> Scrapy documentation: Selectors

Constructing selectors

Selectors can be nested: the result of one query can itself be queried.

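Name	Description	Example
response.selector.xpath()	Shorthand: `response.xpath()`. To search below a specific node, start the expression with `.` for the current node, as in `somenode.xpath('.//picture')`; `somenode.xpath('//picture')` still searches from the document root.	`response.xpath("//span/text()")` `response.css("img").xpath("@src")` selects the `src` attribute of every img element
response.selector.css()	Shorthand: `response.css()`	`response.css("span::text")`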

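Selector methods and attributes

Name	Description	Example
get()	Returns the data of the first match, or None if nothing matches. Equivalent to `extract_first()` in earlier versions.	`response.xpath("//title/text()").get()` returns the title, or None if there is none. `response.xpath("//title/text()").get(default="default value")` returns the title, or "default value" if there is none.
getall()	Returns a list with the data of all matched elements. Equivalent to `extract()` in earlier versions.	`response.css("img").xpath("@src").getall()`
attrib	Returns the attributes of the matched element; on a list of matches, returns the attributes of the first element.	`response.css("img").attrib["src"]`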

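Learn more >> Scrapy documentation: Selectors

Spiders

Item Pipeline

After a spider scrapes an item, the item is sent to the Item Pipeline. Each enabled pipeline component is assigned an integer value in the ITEM_PIPELINES setting, and items pass through the components in ascending order of that value. Typical uses of item pipelines include:

Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing scraped items in a database

Learn more >> Scrapy documentation: Item Pipeline

Writing an item pipeline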

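Image download

Built-in download pipeline

The built-in ImagesPipeline makes downloading images easy: it automatically downloads the image links listed in an item's image_urls field. ImagesPipeline requires the Pillow library to be installed.

1. Enable ImagesPipeline in the project settings.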

ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
# Directory where downloaded images are stored (absolute or relative path)
IMAGES_STORE = "images"

2. Scrape the image links and yield the item back to the engine.

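import scrapy

# These classes can live in items.py and be imported from there.
class Product(scrapy.Item):
    product_name = scrapy.Field()
    image_urls = scrapy.Field()  # list of image URLs read by ImagesPipeline
    # images = scrapy.Field()  # if declared, the pipeline fills it with the download results (URL, stored path, checksum)

class ProductSpider(scrapy.Spider):
    name = "sample"
    start_urls = ["https://example.com"]

    # Optionally set a custom storage location for this spider only.
    # custom_settings = {
    #     "IMAGES_STORE": "images/sample",
    # }

    def parse(self, response):
        item = Product()
        item["product_name"] = response.xpath('//h1[@class="title"]/text()').get()
        # image_urls must be a list, so use getall() rather than get();
        # urljoin() turns relative src values into absolute URLs.
        srcs = response.xpath('//img[@class="product-img"]/@src').getall()
        item["image_urls"] = [response.urljoin(src) for src in srcs]
        yield item


# item_completed() is a method of the pipeline, not of the spider. It is called
# once all image requests for an item have finished. To record the stored paths
# on the item yourself, override it in an ImagesPipeline subclass:
# def item_completed(self, results, item, info):
#     item["images"] = [
#         {"url": image_info["url"], "path": image_info["path"]}
#         for success, image_info in results
#         if success
#     ]
#     return item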

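3. Run the spider.

scrapy crawl sample

# Run the spider and save the items to sample.json
# scrapy crawl sample -o sample.json

Learn more >> Scrapy documentation: Downloading and processing files and images

Custom download pipeline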

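File download

Built-in download pipeline

Learn more >> Scrapy documentation: Downloading and processing files and images

Downloader

Testing with scripts

scrapy shell

Run it in a terminal: it fetches the given URL and opens an interactive session.

scrapy shell https://www.example.com

You can then type expressions to test, for example: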

response.xpath("//a[contains(@class,'item')]")

Learn more >> Scrapy documentation: Scrapy shell

In Jupyter

To test data extraction, fetch the page with requests and parse it with Scrapy's TextResponse.

import requests
from scrapy.http import TextResponse

url = "https://www.example.com"
r = requests.get(url)
response = TextResponse(r.url,body=r.text,encoding="utf-8")
response.xpath('//title')

Learn more >> Scrapy documentation: Requests and Responses

Running from a script

Spiders are usually run with the scrapy crawl command, but they can also be run from a script.

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


process = CrawlerProcess(
    settings={
        "FEEDS": {
            "items.json": {"format": "json"},
        },
    }
)

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

Learn more >> Scrapy documentation: Run Scrapy from a script

Resources

Official site