
Python Advanced 04: Mastering the Python Scrapy (2.12) Crawling Framework in One Article, with Hands-On Practice

Category: Industry · Author: admin


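1. Introduction

This installment of the Python Advanced series covers Scrapy 2.12, the latest release at the time of writing and considerably newer than the versions most existing tutorials use. Scrapy is one of the best-known frameworks in the crawling world. Before we start, think about two questions: what problems does Scrapy actually solve, and what kinds of sites can it crawl? And given Scrapy's limitations, how can we still put it to good use? Let's get started with today's small step forward.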

2. Installing Scrapy

# Install from the Douban mirror to speed up the download
pip install Scrapy -i http://pypi.doubanio/simple --trusted-host pypi.doubanio

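3. Scrapy documentation

The best way to learn any technology is still the official documentation, so here is the link first:

https://scrapy/

There is also a reasonably good Chinese translation of the Scrapy docs:

https://scrapy-chs.readthedocs.io/zh-cn/stable/intro/tutorial.html

Pick whichever suits you. The framework itself is quite approachable.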

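4. Creating a Scrapy project

Before going into theory, I will first walk through a simple spider and only then describe how Scrapy runs it, which makes the flow easier to understand. Create a new Scrapy project by running the following command in the VS Code terminal: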

scrapy startproject tutorial

The generated file structure:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

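Note: a Scrapy project is not limited to one spider. A single project can contain many spiders, and the name of each spider is what selects which one to run.

5. Creating our first spider

Create a file named quotes_spider.py in the tutorial/spiders directory.

You can also generate a spider with a command, but as a beginner it is worth creating the file by hand first. The result is the same.

scrapy genspider mydomain mydomain

The content of quotes_spider.py is as follows: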

#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)  # write the raw response bytes to a file
        self.log(f"Saved file {filename}")  # emit a log message in the terminal


# Alternative: declare start_urls instead of writing start_requests().
# Scrapy then builds the initial requests itself and sends each response to
# parse(), the default callback for requests without an explicit callback.
#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
        "https://quotes.toscrape/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)  # write the raw response bytes to a file
        self.log(f"Saved file {filename}")  # emit a log message in the terminal
        
        

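Notes:

name = "quotes": the name must be unique within a Scrapy project.

def start_requests(self): must return an iterable of requests (either return a list of requests or write a generator function). Scrapy starts crawling from these, and subsequent requests are generated successively from these initial ones.

def parse(self, response): the method called to handle the response downloaded for each request. The response argument is an instance of TextResponse, which holds the page content and offers further helper methods for working with it. parse() usually parses the response, extracts the scraped data as dicts, and finds new URLs to follow, creating new requests (Request) from them.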

5. Running our Scrapy project

Change into the tutorial project directory and run scrapy crawl quotes:

(my_venv) PS F:\开发源码\python_demo_06> cd tutorial
(my_venv) PS F:\开发源码\python_demo_06\tutorial> 
(my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl quotes

If the run succeeds, the log shows the spider finishing and two new files, quotes-1.html and quotes-2.html, appear in the tutorial folder. At this point we have a working Scrapy spider.
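As a quick check on where those two file names come from, the snippet below replays the `response.url.split("/")[-2]` step from `parse()` on plain strings. The `page_filename` helper and the `example.com` URLs are illustrative assumptions, not part of Scrapy.

```python
def page_filename(url: str) -> str:
    """Build the output file name the same way parse() does."""
    # "https://host/page/1/".split("/") ends with [..., "page", "1", ""],
    # so index -2 is the page number. This relies on the trailing slash.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


first = page_filename("https://example.com/page/1/")
second = page_filename("https://example.com/page/2/")
print(first, second)
```

Note that a URL without the trailing slash would shift the index, so this approach only suits sites with a consistent URL shape.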

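How it works:

1. When you run scrapy crawl quotes, Scrapy looks through spiders for the spider whose name is quotes and starts it.

2. It then iterates over start_requests, which yields one scrapy.Request(url=url, callback=self.parse) per URL, and Scrapy's built-in downloader fetches each of them. If a callback is specified, the response goes to that function. If none is specified, it goes to the default self.parse method. If the spider defines no parse method either, the response cannot be handled and the spider ends up closing without producing anything.

3. When self.parse receives the response, it performs the parsing.

A question to think about: why use yield rather than return? What would happen with return?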
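One way to answer the yield question is to compare the two in plain Python. In this minimal sketch the strings stand in for `scrapy.Request` objects (an assumption made so the snippet runs without Scrapy): `return` leaves the function at the first URL, while `yield` turns the function into a generator that hands back every request, one at a time, as the caller iterates over it.

```python
def with_return(urls):
    for url in urls:
        return f"Request({url})"  # exits the function on the first URL


def with_yield(urls):
    for url in urls:
        yield f"Request({url})"  # hands back one value, then resumes here


urls = ["/page/1/", "/page/2/"]
returned = with_return(urls)       # a single value, not an iterable of requests
yielded = list(with_yield(urls))   # every request, produced lazily
print(returned)
print(yielded)
```

Because the values are produced lazily, a generator also avoids building the whole request list in memory before the crawl starts.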

So far we have not parsed any HTML. Bear with me, the main course is still to come.

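6. Parsing data with Scrapy's built-in XPath and CSS selectors

Earlier articles in this series used BeautifulSoup and XPath to extract data. Scrapy ships with its own CSS and XPath selectors that can be used directly, although BeautifulSoup remains an option. Since this article is about Scrapy, we will use the built-in selectors. We start from the same code as above and only modify quotes_spider.py.

The modified code: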

#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # def parse(self, response):
    #     page = response.url.split("/")[-2]
    #     filename = f"quotes-{page}.html"
    #     Path(filename).write_bytes(response.body)
    #     self.log(f"Saved file {filename}")
    def parse(self, response):
        print("**************extraction start******************")
        print(response.css("title"))
        print("**************extraction end******************")

        '''
        Output:
        **************extraction start******************
        [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
        **************extraction end******************
        2024-11-20 22:57:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape/page/2/> (referer: None)
        **************extraction start******************
        [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
        **************extraction end******************
        '''

See it? The output is shown below (from here on I only show the key part of each extraction):

print(response.css("title"))
#[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
CSS selectors:
  1. Extract the matches as a list

    response.css("title::text").getall()
    #['Quotes to Scrape']
    

    Note the ::text suffix. What happens without it? You get the whole element, tags included, which is usually not what we want, so add ::text to get only the text:

    response.css("title").getall()
    #['<title>Quotes to Scrape</title>']
    
  2. Get only the first result

    response.css("title::text").get()
    # 'Quotes to Scrape'
    

    It can also be written like this:

    response.css("title::text")[0].get()
    #'Quotes to Scrape'
    

    **Note:** what is the difference between these two forms?

    response.css("title::text")[0].get(): if nothing matches, the index raises IndexError

    response.css("title::text").get(): if nothing matches, it returns None

  3. CSS selector plus regular expression

    response.css("title::text").re(r"Quotes.*")
    #['Quotes to Scrape']
    response.css("title::text").re(r"Q\w+")
    #['Quotes']
    response.css("title::text").re(r"(\w+) to (\w+)")
    #['Quotes', 'Scrape']
    
  4. Use a CSS selector on its own

    response.css("div.quote")
    '''
    Output:
    [<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    ...]
    '''
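The `.re()` results in item 3 can be reproduced offline with the standard `re` module. The sketch below is only an approximation of the selector method, written under the assumption that `.re()` returns every match as one flat list of strings and flattens capture groups; `flat_findall` is a hypothetical helper, not a Scrapy function.

```python
import re


def flat_findall(pattern: str, text: str) -> list[str]:
    """Return all matches as one flat list, flattening capture groups."""
    flat = []
    for match in re.findall(pattern, text):
        # re.findall yields tuples when the pattern has several groups
        if isinstance(match, tuple):
            flat.extend(match)
        else:
            flat.append(match)
    return flat


title = "Quotes to Scrape"
whole = flat_findall(r"Quotes.*", title)
word = flat_findall(r"Q\w+", title)
groups = flat_findall(r"(\w+) to (\w+)", title)
print(whole, word, groups)
```

Testing a pattern this way before putting it into a spider is often quicker than rerunning the crawl.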
    
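XPath selectors:
  1. Select elements

    response.xpath("//title")
    #[<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
    
  2. Extract text content

    response.xpath("//title/text()").get()
    #'Quotes to Scrape'
    
  3. Select elements whose text contains a given string ('下一章' is the "next chapter" link text on the target site)

    next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
    next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
    next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
    
  4. Extract all text from an element and its descendants

    textContent = response.xpath('//div[@id="content"]//text()').getall()
    
Using the SelectorGadget extension to obtain CSS and XPath selectors quickly:

I have prepared a download link for the latest version: https://download.csdn/download/Lookontime/90025172

Installation:

  1. Download and unzip the archive

  2. Open chrome://extensions/ in Google Chrome

In real projects, writing CSS and XPath selectors by hand is still somewhat slow. Is there a better way? There is, though it is not perfect: the SelectorGadget extension generates a selector quickly, and we only need to adjust it slightly.

Note: according to the official documentation, Scrapy converts CSS selectors to XPath under the hood, and the docs encourage learning XPath even if you already know CSS selectors. I wrote a separate article on XPath earlier, so have a look if you are interested. Readers with a front-end background should find it very easy.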

7. Verifying our selectors with scrapy shell 'https://quotes.toscrape'

The page structure of 'https://quotes.toscrape' looks like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's run this command:

scrapy shell 'https://quotes.toscrape'

Scrapy fetches the page and then opens an interactive prompt, where we can try our selectors:

>>> response.css("div.quote")
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]

>>> quote = response.css("div.quote")[0]

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']



>>> for quote in response.css("div.quote"):
...    text = quote.css("span.text::text").get()
...    author = quote.css("small.author::text").get()
...    tags = quote.css("div.tags a.tag::text").getall()
...    print(dict(text=text, author=author, tags=tags))
...
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
...

How do you exit scrapy shell?

quit()
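The same extraction loop can be rehearsed without network access. The sketch below is a stand-in, not Scrapy: it uses the standard library's `xml.etree.ElementTree`, which supports only a small subset of XPath and requires well-formed markup, and the `PAGE` string is a simplified, made-up version of the structure shown above.

```python
import xml.etree.ElementTree as ET

# Simplified sample markup (an assumption for this demo, not the real page).
PAGE = """
<html><body>
<div class="quote">
  <span class="text">Quote one</span>
  <span>by <small class="author">Albert Einstein</small></span>
  <div class="tags">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
<div class="quote">
  <span class="text">Quote two</span>
  <span>by <small class="author">J.K. Rowling</small></span>
  <div class="tags">
    <a class="tag" href="/tag/choices/page/1/">choices</a>
  </div>
</div>
</body></html>
"""


def extract_quotes(markup):
    """Yield one dict per quote block, mirroring the shell loop above."""
    root = ET.fromstring(markup.strip())
    for quote in root.iterfind(".//div[@class='quote']"):
        tag_links = quote.iterfind("div[@class='tags']/a[@class='tag']")
        yield {
            "text": quote.findtext("span[@class='text']"),
            "author": quote.findtext(".//small[@class='author']"),
            "tags": [link.text for link in tag_links],
        }


quotes = list(extract_quotes(PAGE))
for quote in quotes:
    print(quote)
```

Each selector is evaluated relative to the current quote element, which is the same idea as calling `quote.css(...)` rather than `response.css(...)` inside the loop.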

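8. Following the next page until a stop condition is met

Suppose we want to read a novel in peace without a screen full of ads. That gives us a crawling requirement: scrape the content of a page, find the next page, scrape it, look for the next page again, and stop when the condition is no longer met. How do we implement that?

One idea is to put every page URL into urls in def start_requests(self). That works, but it is exhausting: first, the URLs do not necessarily follow a regular incrementing pattern, and second, we would have to control the crawl order ourselves or find some other workaround.

So is there a way to supply only the start page, extract the next page's URL while parsing, and hand it back to parse so the cycle continues? There is.

Suppose the next-page link looks like this: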

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Build the spider:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

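As we just saw, we extract the next page's URL, turn it into an absolute address, and send the resulting request back to self.parse, repeating until there is no next link. That works well, but Scrapy offers simpler ways to do the same thing, so read on.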
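The step that turns `/page/2/` into a full address deserves a closer look. As far as I know, `response.urljoin` resolves a link against the URL of the current response in the same way as the standard library's `urljoin`, so the behaviour can be sketched offline. The `example.com` and `other.example` addresses are placeholders.

```python
from urllib.parse import urljoin

page_url = "https://example.com/page/1/"

root_relative = urljoin(page_url, "/page/2/")         # starts at the site root
dir_relative = urljoin(page_url, "next.html")         # relative to /page/1/
absolute = urljoin(page_url, "https://other.example/x")  # already absolute

print(root_relative)
print(dir_relative)
print(absolute)
```

An href that is already absolute passes through unchanged, so it is safe to run every extracted link through the same call.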

Approach 1: response.follow

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Approach 2: a for loop

for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)

Approach 3: pass the a element directly

for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)

Approach 4: response.follow_all

anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)

Approach 5: pass the selector expression directly

yield from response.follow_all(css="ul.pager a", callback=self.parse)
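The `ul.pager a` variants follow every pager link, which on a middle page includes a link back to the previous page. The crawl still terminates because Scrapy filters duplicate requests by default. The sketch below imitates that idea in plain Python with a `seen` set; the `SITE` mapping is a made-up three-page site, and the loop is a simplified model rather than Scrapy's actual scheduler.

```python
from collections import deque

# Hypothetical site: each page maps to the pager links found on it.
SITE = {
    "/page/1/": ["/page/2/"],
    "/page/2/": ["/page/1/", "/page/3/"],  # "previous" and "next"
    "/page/3/": ["/page/2/"],              # last page: only "previous"
}


def crawl(start):
    """Visit every reachable page once, skipping links already seen."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in SITE.get(url, []):
            if link not in seen:  # the duplicate filter
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/page/1/")
print(visited)
```

Without the `seen` check, pages 1 and 2 would keep requesting each other forever.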

A complete example:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["https://quotes.toscrape/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }

9. Passing arguments to a spider

To pass arguments when running a spider, use the -a option on the command line.

Command:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

Spider code:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://quotes.toscrape/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

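The command explained:

  1. scrapy crawl quotes
    • Runs the spider named quotes. quotes is the spider name defined in your Scrapy project, and the corresponding source file normally lives in the spiders folder.
  2. -O quotes-humor.json
    • -O specifies the output file, and quotes-humor.json is its name.
    • Scrapy saves the scraped data as a JSON file, overwriting any existing file with the same name.
  3. -a tag=humor
    • The -a option passes an argument named tag with the value humor to the spider.
    • In the spider code the argument is available as self.tag. Such arguments typically specify a filter, for example scraping only content tagged "humor".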
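The `getattr(self, "tag", None)` pattern can be tried without running a crawl. `MiniSpider` below is a hypothetical stand-in for `scrapy.Spider`, written on the assumption that keyword arguments given to the spider simply become instance attributes; the `example.com` base URL is a placeholder.

```python
class MiniSpider:
    """Stand-in for scrapy.Spider: keyword arguments become attributes."""

    base_url = "https://example.com/"

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def start_url(self):
        # getattr with a default avoids AttributeError when -a tag=... is absent
        tag = getattr(self, "tag", None)
        if tag is not None:
            return f"{self.base_url}tag/{tag}"
        return self.base_url


plain = MiniSpider().start_url()
humor = MiniSpider(tag="humor").start_url()
print(plain)
print(humor)
```

Values given with -a arrive as strings, so a numeric argument such as a page limit needs an explicit `int(...)` before use.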

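10. Scrapy data containers: Item and Field

By now Scrapy probably feels half understood. The open question is likely this: the spider runs and the data is parsed, but how is that data actually used? That is where Scrapy's data containers and pipelines come in. We start with the data containers.

Scrapy provides two classes, Item and Field. Import scrapy in items.py before using them. The items.py code looks like this: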

#items.py
import scrapy


# class TutorialItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass
# class DmozItem(scrapy.Item):
#     title = scrapy.Field()
#     link = scrapy.Field()
#     desc = scrapy.Field()

class QuoteItem(scrapy.Item):
    imgBase64 = scrapy.Field()
    file_name = scrapy.Field()  

class VideoItem(scrapy.Item):
    video_url = scrapy.Field()
    file_name = scrapy.Field()  


class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()  

**Item base class:** every custom data class must inherit from Item, for example class TextItem(scrapy.Item)

**Field class:** describes the fields a custom data class contains, such as title and Content

Create an Item object before using it:

import re

item = TextItem()
item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
textContent = response.xpath('//div[@id="content"]//text()').getall()
# Strip '\r' and '\xa0' (non-breaking space) characters
cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
item['Content'] = cleaned_list
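The cleaning step is easy to test on its own. This sketch applies the same `re.sub` pattern to a few made-up text nodes; dropping lines that end up empty is an optional extension of mine and not part of the code above.

```python
import re


def clean_lines(lines):
    """Remove '\\r' and non-breaking spaces, then drop lines left empty."""
    cleaned = [re.sub(r"[\r\xa0]+", "", text) for text in lines]
    # Optional extra step: trim whitespace and skip blank results.
    return [line.strip() for line in cleaned if line.strip()]


# Made-up sample resembling text nodes from a chapter page.
raw = [
    "\r\n\xa0\xa0\xa0\xa0First paragraph.",
    "\r\n",
    "\xa0\xa0Second paragraph.\r",
]
result = clean_lines(raw)
print(result)
```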

Read field values:

print(item['title'])
print(item['Content'])

List the names of the fields that are set:

item.keys()

Copy an Item:

item2 = item.copy()

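11. Scrapy item pipelines

We now have a data container, so how is it used? This is where item pipelines come in, and they are a key part of Scrapy: a pipeline automatically receives the items a spider yields and can do different things with them, such as storing them in a database, downloading images or files, or writing the data to JSON, Excel, or text files.

A pipeline must be registered before it is used. Once the spider is running, Scrapy passes each item through all registered pipelines, so different pipelines can handle different concerns.

Registering a pipeline: this is done in settings.py

# in settings.py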

ITEM_PIPELINES = {

#   'tutorial.pipelines.TutorialPipeline': 300,

#   'tutorial.save_Image_pipeline.SaveImagePipeline': 300,

#   'tutorial.video_download_pipeline.VideoDownloadPipeline': 500,

  'tutorial.text_download_pipeline.TextDownloadPipeline':300

}

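'tutorial.text_download_pipeline.TextDownloadPipeline': the import path of the pipeline class

300: the priority. Pipelines with lower numbers run first, and values are conventionally kept in the 0 to 1000 range.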
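To make the ordering rule concrete, the sketch below sorts a sample `ITEM_PIPELINES` dict by value, which is the order in which each item would pass through the pipelines. The value 800 for `TutorialPipeline` is an assumption chosen for the demo.

```python
# Sample registration; the numbers decide the run order, not the dict order.
ITEM_PIPELINES = {
    "tutorial.video_download_pipeline.VideoDownloadPipeline": 500,
    "tutorial.text_download_pipeline.TextDownloadPipeline": 300,
    "tutorial.pipelines.TutorialPipeline": 800,  # assumed value for the demo
}

# Lower numbers run first, so sort by the priority value.
run_order = [
    path for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
for position, path in enumerate(run_order, start=1):
    print(position, path)
```

Leaving gaps between the numbers makes it easy to slot another pipeline in between later.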

A complete pipeline example:

#text_download_pipeline.py
from itemadapter import ItemAdapter
import os


class TextDownloadPipeline():
    def __init__(self):
        # Target folder for the saved text
        self.target_folder = "DownLoadText"
        # Create the target folder if it does not exist
        if not os.path.exists(self.target_folder):
            os.makedirs(self.target_folder)
    def open_spider(self,spider):
        # Called when the spider is opened
        pass
    def close_spider(self,spider):
        # Called when the spider is closed
        pass
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', 'Unknown title')
        content = adapter.get('Content', [])
        text_name = '下载的文档内容.txt'
        file_path = os.path.join(self.target_folder, text_name)

        # Append so that successive items accumulate in the same file
        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"Title: {title}\n")
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # blank line between items
        # Return the item so that later pipelines still receive it
        return item

Note: def process_item(self, item, spider): is the one method every pipeline must implement.

This pipeline takes each Scrapy item and writes its contents line by line to a txt file.
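The pipeline above reopens the file for every item, which is simple but repeats work. A possible variant is sketched below: it opens the file once in `open_spider` and closes it in `close_spider`. `SingleFilePipeline`, its constructor parameters, and the `demo` function are hypothetical names for illustration, and plain dicts stand in for Scrapy items so the snippet runs without Scrapy installed.

```python
import os
import tempfile


class SingleFilePipeline:
    """Hypothetical variant of TextDownloadPipeline that keeps one file open."""

    def __init__(self, target_folder="DownLoadText", file_name="output.txt"):
        self.file_path = os.path.join(target_folder, file_name)
        self.file = None

    def open_spider(self, spider):
        # Create the folder and open the file once, when the crawl starts.
        os.makedirs(os.path.dirname(self.file_path), exist_ok=True)
        self.file = open(self.file_path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Release the file handle when the crawl ends.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        self.file.write(f"Title: {item.get('title', 'untitled')}\n")
        for line in item.get("Content", []):
            self.file.write(f"{line}\n")
        self.file.write("\n")  # blank line between items
        return item  # pass the item on to any later pipeline


def demo():
    """Drive the pipeline by hand in a temporary folder."""
    with tempfile.TemporaryDirectory() as folder:
        pipeline = SingleFilePipeline(target_folder=folder)
        pipeline.open_spider(spider=None)
        pipeline.process_item({"title": "Chapter 1", "Content": ["a", "b"]}, spider=None)
        pipeline.close_spider(spider=None)
        with open(pipeline.file_path, encoding="utf-8") as handle:
            return handle.read()


written = demo()
print(written)
```

Calling the three methods by hand like this is also a convenient way to unit-test a pipeline before registering it in settings.py.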

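Besides custom pipelines, Scrapy ships with two special pipelines for handling files and images: FilesPipeline and ImagesPipeline. The comparison below summarises them:

| | FilesPipeline | ImagesPipeline |
|---|---|---|
| Import path | scrapy.pipelines.files.FilesPipeline | scrapy.pipelines.images.ImagesPipeline |
| Item fields | file_urls, files | image_urls, images |
| Storage setting | FILES_STORE | IMAGES_STORE |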

FilesPipeline

  1. Configure settings.py

    import os
    # Register the pipeline
    ITEM_PIPELINES = {
      'scrapy.pipelines.files.FilesPipeline':300
    
    }
    # Directory where downloaded files are stored
    FILES_STORE="F:\\DownloadFiels"
    if not os.path.exists(FILES_STORE):
        os.makedirs(FILES_STORE)
    
  2. Define the item container in items.py

    class FileItem(scrapy.Item):
        file_urls = scrapy.Field()
        files = scrapy.Field()
    
  3. Implementation

    # Generate a new spider
    scrapy genspider dload_files model
    # or create the file by hand under spiders/
    
    #dload_files.py
    import scrapy
    from tutorial.items import FileItem
    
    # A plain scrapy.Spider is enough here because no CrawlSpider rules are used.
    class XbiquguSpider(scrapy.Spider):
        name = 'dload_files'
        allowed_domains = ['www.model']
        def start_requests(self):
            urls = [
                "https://www.model/html/265/265564/229802.html",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
        def parse(self, response):
            # Walk the nested list and follow each link to its detail page
            for line in response.xpath('//div[@class="bookname"]/ul/li'):
                for example in line.xpath('.//ul/li'):
                    url = example.xpath('.//a//@href').get()
                    url = response.urljoin(url)
                    yield scrapy.Request(url,callback=self.parse_files)
        def parse_files(self, response):
            # FilesPipeline downloads every URL listed in file_urls
            href = response.xpath('//a/@href').get()
            url = response.urljoin(href)
            fileItem = FileItem()
            fileItem['file_urls'] =[url]
            yield fileItem
    
  4. Run the spider

    (my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl dload_files -o myfiles.json
    

ImagesPipeline

  1. Configure settings.py

    import os
    # Register the pipeline
    ITEM_PIPELINES = {
      'scrapy.pipelines.images.ImagesPipeline':300
    
    }
    # Directory where downloaded images are stored
    IMAGES_STORE="F:\\ImageFiels"
    if not os.path.exists(IMAGES_STORE):
        os.makedirs(IMAGES_STORE)
    # Thumbnails to generate for every downloaded image
    IMAGES_THUMBS = {
        'small': (50, 50),
    #    'big': (270, 270),
    }
    # Skip images smaller than these dimensions
    #IMAGES_MIN_WIDTH = 50 # minimum width
    #IMAGES_MIN_HEIGHT = 50 # minimum height
    
  2. Define the item container in items.py

    import scrapy
    
    class MyImageItem(scrapy.Item):
        image_urls = scrapy.Field()  # 存放图片 URL 列表
        images = scrapy.Field()  # 存放下载后的图片信息
    
  3. Implementation

    # Generate a new spider
    scrapy genspider imagespider example
    # or create the file by hand under spiders/
    
    #image_files.py
    import scrapy
    from tutorial.items import MyImageItem
    
    class ImageSpider(scrapy.Spider):
        name = 'imagespider'
        start_urls = ['https://example']
    
        def parse(self, response):
            item = MyImageItem()
            # Collect image URLs and make them absolute so they can be downloaded
            item['image_urls'] = [
                response.urljoin(src)
                for src in response.css('img::attr(src)').getall()
            ]
            yield item
    
    
  4. Run the spider

    (my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl imagespider -o myimages.json
    

12. A complete example: downloading a novel with Scrapy

Create the xbiqugu.py spider in the spiders folder:

#xbiqugu.py
import re

import scrapy
from tutorial.items import TextItem

# A plain scrapy.Spider is enough here because no CrawlSpider rules are used.
class XbiquguSpider(scrapy.Spider):
    name = 'xbiqugu'
    allowed_domains = ['www.477zw3']
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # keep Scrapy's own initialisation
        self.count = 0
    def start_requests(self):
        urls = [
            "https://www.477zw3/html/265/265564/229802.html",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_item)
    def parse_item(self, response):
        self.count +=1
        print(f"Start crawling-----------{self.count}")
        item = TextItem()
        item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
        print(item['title'])
        textContent = response.xpath('//div[@id="content"]//text()').getall()
        # Strip '\r' and '\xa0' (non-breaking space) characters
        cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
        item['Content'] = cleaned_list
        # print(item['Content'])
        yield item
        # Follow the next chapter ('下一章' is the site's "next chapter" link text)
        # next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
        # next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
        next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
        print("Next page:",next_page) # Output: Next page: /html/265/265564/229803.html
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_item)
        

Create the Scrapy item container:

#items.py
import scrapy


class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()  

Create the pipeline text_download_pipeline.py:

# text_download_pipeline
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os


class TextDownloadPipeline():
    def __init__(self):
        # Target folder for the saved text
        self.target_folder = "DownLoadText"
        # Create the target folder if it does not exist
        if not os.path.exists(self.target_folder):
            os.makedirs(self.target_folder)
    def open_spider(self,spider):
        # Called when the spider is opened
        pass
    def close_spider(self,spider):
        # Called when the spider is closed
        pass
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', 'Unknown title')
        content = adapter.get('Content', [])
        text_name = '测试技术.txt'
        file_path = os.path.join(self.target_folder, text_name)

        # Append so that successive chapters accumulate in the same file
        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"Title: {title}\n")
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # blank line between items
        # Return the item so that later pipelines still receive it
        return item

Register the pipeline in settings.py:

#settings.py
import random

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0  # wait 0 seconds between requests

ITEM_PIPELINES = {
    'tutorial.text_download_pipeline.TextDownloadPipeline':300
}

USER_AGENT_LIST = [
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
      "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
USER_AGENT = random.choice(USER_AGENT_LIST)
# Logging: write INFO-level logs to a dated file under log/
import os
from datetime import datetime

LOG_LEVEL = "INFO"

LOG_DIR = "log"

if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)

today = datetime.now()

LOG_FILE = f"{LOG_DIR}/scrapy_{today.year}_{today.month}_{today.day}.log"
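One caveat with `USER_AGENT = random.choice(USER_AGENT_LIST)`: settings.py is evaluated once when the crawl starts, so every request in a run shares the same User-Agent. If a different value per request is wanted, one option is a small downloader middleware along the lines sketched below. `RandomUserAgentMiddleware` and `FakeRequest` are hypothetical names; the middleware would also have to be enabled under `DOWNLOADER_MIDDLEWARES` with a module path and priority of your choosing, which is not shown here.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is the custom setting defined in settings.py above.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request


class FakeRequest:
    """Minimal stand-in for scrapy.Request, used only for this demo."""

    def __init__(self):
        self.headers = {}


agents = ["agent-a", "agent-b"]
middleware = RandomUserAgentMiddleware(agents)
request = FakeRequest()
result = middleware.process_request(request, spider=None)
print(request.headers["User-Agent"])
```

The demo drives `process_request` by hand with a fake request, so it shows the selection logic only and makes no claim about how a real site responds.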

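Keep in mind that a spider will not keep working forever. Sites change, so the selectors and URLs need adjusting from time to time. Even so, Scrapy greatly reduces the amount of logic we have to write ourselves.

13. Summary

This Scrapy article took several evenings to write, and by now the underlying mechanics and everyday usage should be clear. Back to the opening question: what are Scrapy's limitations, and how do we work around them? Much of today's web content is generated dynamically by JavaScript, and Scrapy on its own does not execute JavaScript, so it cannot scrape such pages directly. The answer is dynamic crawling with Scrapy, which I will cover in a follow-up article. Stay tuned.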

python进阶-04-一篇带你掌握Python Scrapy(2.12)爬虫框架,附带实战

一.简介

在Python进阶系列我们来介绍Scrapy框架最新版本2.12,远超市面上的老版本,Scrapy框架在爬虫行业内鼎鼎大名,在学习之前我想请大家思考Scrapy究竟能解决什么问题?或者能爬哪一类型的网站!还有针对Scrapy的局限性我们如何依然使用好Scrapy!好,开始我们今天的日拱一卒!

二.安装Python Scrapy

#使用豆瓣源安装 提升安装速度
pip install Scrapy -i http://pypi.doubanio/simple --trusted-host pypi.doubanio

三.Scrapy 中文文档

学习任何一门技术最好的还是看官方文档,我先贴上

https://scrapy/

Scrapy也有比较不错的中文文档

https://scrapy-chs.readthedocs.io/zh-cn/stable/intro/tutorial.html

大家根据需要自己选择,这个框架很简单。。

四.创建Scrapy项目

在开始学习之前我先带大家实现一个简单的爬虫,再最后对Scrapy的运行流程进行介绍,这样大家才能更好的理解,我们来创建一个新的Scrapy项目,在vscode的终端中运行以下命令!

scrapy startproject tutorial

文件结构的介绍

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

注意:不是一个爬虫项目只能有一个爬虫,一个爬虫项目中可以创建很多爬虫任务,我们通过不同爬虫任务的name来指定运行哪个爬虫。

五.创建我们的第一个爬虫

tutorial/spiders目录下新建quotes_spider.py 文件

当然也可以使用命令创建一个爬虫,大家初学习的时候先手动创建吧!一样的。

scrapy genspider mydomain mydomain

quotes_spider.py文件内容如下:

#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body) # 写入文件 默认utf-8
        self.log(f"Saved file {filename}") #终端中输出log日志
        
        
# 也可以这样写
# parse()是Scrapy的默认回调方法,该方法用于没有显式分配回调的请求
#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body) # 写入文件 默认utf-8
        self.log(f"Saved file {filename}") #终端中输出log日志
        
        

注意:

name = "quotes":Scrapy项目中,name必须是唯一;

def start_requests(self): 必须返回一个可迭代的请求(可以返回一个请求列表或编写一个生成器函数),Scrapy将从它开始开始爬行。后续请求将从这些初始请求连续生成。

def parse(self, response):将被调用以处理为每个请求下载的响应的方法。response参数是TextResponse的一个实例,它保存页面内容,并有更多有用的方法来处理它。parse()方法通常解析响应,将抓取的数据提取为dict,并查找要跟踪的新URL并从中创建新请求(Request)。

五.启动我们的Scrapy项目

进入我们的Scrapy项目tutorial,执行scrapy crawl quotes

(my_venv) PS F:\开发源码\python_demo_06> cd tutorial
(my_venv) PS F:\开发源码\python_demo_06\tutorial> 
(my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl quotes

大家运行成功应该可以看到我们爬虫项目运行成功,并且我们tutorial 文件夹下多了2个文件quotes-1.html、quotes-2.html,这时候我们已经成功实现Scrapy框架;

执行原理:

1.Scrapy执行scrapy crawl quotes时会从spiders中找到name为quotes的爬虫,启动此爬虫;

2.接着执行start_requests 函数中的urls,请求地址,开始执行Scrapy中的内置请求,yield scrapy.Request(url=url, callback=self.parse) 如果我们指定了callback 就走callback对应的函数,如果没有指定则找默认的self.parse函数,如果啥都没有。。。爬虫关闭

3.self.parse接到请求返回后会执行解析。。

请大家思考一个问题 为啥用yield 而不用return?如果用return会出现什么情况?

截止目前我们还没有解析HTML,请稍等,好菜还没上!慢慢看。。

六.Scrapy解析数据,利用Scrapy自带的Xpath和css selectors

我们之前的文章介绍过BeautifulSoupXpath来提取数据,但是呢Scrapy很强大,自带css选择器和Xpath选择器,我们可以直接使用,当然也可以依然使用BeautifulSoupXpath来提取数据,既然我们今天介绍Scrapy,那么我们就用Scrapy自带的来提取数据,好还是上面的代码,只不过改一改quotes_spider.py文件

修改后的代码:

#quotes_spider.py
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # def parse(self, response):
    #     page = response.url.split("/")[-2]
    #     filename = f"quotes-{page}.html"
    #     Path(filename).write_bytes(response.body)
    #     self.log(f"Saved file {filename}")
    def parse(self, response):
        print("**************提取开始******************")
        print(response.css("title"))
        print("**************提取结束******************")

        '''
        输出:
        **************提取开始******************
        [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
        **************提取结束******************
        2024-11-20 22:57:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape/page/2/> (referer: None)
        **************提取开始******************
        [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
        **************提取结束******************
        '''

看到了没?输出如下(后面所有的提取 我只写关键部分:):

print(response.css("title"))
#[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
CSS选择器:
  1. CSS选择器提取内容为列表

    response.css("title::text").getall()
    #['Quotes to Scrape']
    

    这里注意::text如果不加::text会出现什么情况呢?可以发现节点标签不是我们想要的。。所以要加::text才能获取我们想要的内容

    response.css("title").getall()
    #['<title>Quotes to Scrape</title>']
    
  2. 只获取第一个结果

    response.css("title::text").get()
    # 'Quotes to Scrape'
    

    也可以这样写

    response.css("title::text")[0].get()
    #'Quotes to Scrape'
    

    **注意:**这2种写法有什么区别呢?

    response.css(“title::text”)[0].get():如果没有结果 索引会引发IndexError

    response.css(“title::text”).get():如果没有结果,返回None

  3. CSS selector + regular expression

    response.css("title::text").re(r"Quotes.*")
    #['Quotes to Scrape']
    response.css("title::text").re(r"Q\w+")
    #['Quotes']
    response.css("title::text").re(r"(\w+) to (\w+)")
    #['Quotes', 'Scrape']
    
  4. Using a CSS selector directly

    response.css("div.quote")
    '''
    Output:
    [<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    ...]
    '''
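The regular-expression variant in item 3 returns a flat list of strings, even when the pattern has several capture groups. The sketch below imitates that behaviour with the standard `re` module only, so it can be run without Scrapy; `re_flat` is a hypothetical helper written for this illustration, not part of Scrapy's API.

```python
import re

title = "Quotes to Scrape"


def re_flat(pattern, text):
    """Return all matches as one flat list of strings (hypothetical helper)."""
    flat = []
    for found in re.findall(pattern, text):
        if isinstance(found, tuple):
            # Several capture groups: each group becomes its own entry
            flat.extend(found)
        else:
            flat.append(found)
    return flat


print(re_flat(r"Quotes.*", title))        # ['Quotes to Scrape']
print(re_flat(r"Q\w+", title))            # ['Quotes']
print(re_flat(r"(\w+) to (\w+)", title))  # ['Quotes', 'Scrape']
```

The three results match the outputs listed in item 3, which is a convenient way to try a pattern before putting it into a spider.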
    
XPath selectors:
  1. Extract elements

    response.xpath("//title")
    #[<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
    
  2. Extract text content

    response.xpath("//title/text()").get()
    #'Quotes to Scrape'
    
  3. Select elements whose text contains a given string ('下一章' means "next chapter")

    next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
    next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
    next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
    
  4. Extract all text of a given element and its descendants

    textContent = response.xpath('//div[@id="content"]//text()').getall()
    
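Use the SelectorGadget plugin to get CSS and XPath selectors quickly:

I have prepared a download link for the latest version: https://download.csdn/download/Lookontime/90025172

Installation:

  1. Download and unzip it

  2. Open chrome://extensions/ in Google Chrome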

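In real projects, building CSS and XPath selectors by hand like this is still rather slow. Is there a better way? There is, although it is not perfect: the SelectorGadget plugin. It helps us build an XPath quickly, and we only need to adjust the result slightly.

Note: while reading the official documentation I found that Scrapy converts CSS selectors to XPath under the hood, and that the documentation encourages learning XPath. I previously wrote an article dedicated to XPath; have a look if you are interested. Readers with a front-end background should find it very easy.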

7. Use scrapy shell 'https://quotes.toscrape' to verify our parsing:

The page at 'https://quotes.toscrape' is structured as follows:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's run this command

scrapy shell 'https://quotes.toscrape'

Scrapy then fetches the page; once the shell prompt appears, we can run our selectors

>>> response.css("div.quote")
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]

>>> quote = response.css("div.quote")[0]

>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']



>>> for quote in response.css("div.quote"):
...    text = quote.css("span.text::text").get()
...    author = quote.css("small.author::text").get()
...    tags = quote.css("div.tags a.tag::text").getall()
...    print(dict(text=text, author=author, tags=tags))
...
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
...

How do you exit scrapy shell?

quit()

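8. How to keep parsing the next page until a stop condition is met

For example, say we want to read a novel in peace without a screen full of ads. That gives us a crawling requirement: scrape the content of a page, find the next page, scrape it, find the next page again, and so on until a stop condition is met. How do we implement that?

Some would say: put every page URL into urls in def start_requests(self), and the problem is solved. It would work, but it would wear you out. First, the URLs do not necessarily increase in a regular pattern. Second, we would still need to control the crawl order or find some other approach.

So is there a way to provide only the start page, parse the next page's URL out of it, and hand that back to parse so that parsing loops? Of course there is.

Suppose our next-page link looks like this: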

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Build our spider

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
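To see why the loop stops on its own, here is a minimal sketch that runs without Scrapy. It assumes a made-up three-page site (`FAKE_SITE`, on the placeholder domain example.com) and uses the standard library's `urljoin` to play the role of `response.urljoin`, resolving the relative href against the current page URL.

```python
from urllib.parse import urljoin

# Hypothetical site map: page URL -> relative href of its "next" link
# (None on the last page, where li.next is absent)
FAKE_SITE = {
    "https://example.com/page/1/": "/page/2/",
    "https://example.com/page/2/": "/page/3/",
    "https://example.com/page/3/": None,
}


def crawl(start_url):
    """Follow "next" links until a page has none; return the visited URLs."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        # Stands in for response.css("li.next a::attr(href)").get()
        next_href = FAKE_SITE[url]
        # Resolve the relative href against the page it was found on
        url = urljoin(url, next_href) if next_href is not None else None
    return visited


visited = crawl("https://example.com/page/1/")
print(visited)
```

The crawl ends on page 3 because the selector finds no next link there, which is the same `if next_page is not None` check the spider uses.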

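As we just saw, we parse out the next page's URL, build the absolute request address, and hand the response back to self.parse, repeating until no next page is found. Very nice! But Scrapy is more powerful than that and offers simpler ways, so read on.

Method 1: response.follow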

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Method 2: a for loop

for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)

Method 3: pass the a element directly

for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)

Method 4: response.follow_all

anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)

Method 5: pass the selector expression directly

yield from response.follow_all(css="ul.pager a", callback=self.parse)

A complete example

import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["https://quotes.toscrape/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }

9. Passing arguments to a spider

To pass arguments when running the spider, use the -a option on the command line

Command:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

Spider code:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://quotes.toscrape/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

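Command breakdown:

  1. scrapy crawl quotes
    • Runs the spider named quotes. quotes is the spider name defined in your Scrapy project; the corresponding code file is usually found in the spiders folder.
  2. -O quotes-humor.json
    • -O specifies the output file, and quotes-humor.json is the name of that file.
    • Scrapy saves the scraped data as a JSON file, overwriting any existing file with the same name (lowercase -o appends instead).
  3. -a tag=humor
    • The -a option passes the spider an argument named tag with the value humor.
    • In the spider code the argument is available as self.tag. Such arguments are typically used to give the spider a filter condition, for example to scrape only content tagged "humor".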
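The `getattr(self, "tag", None)` pattern in the spider can be tried without Scrapy. The sketch below assumes a stand-in class, `FakeSpider` (hypothetical), that copies keyword arguments onto the instance the way a spider receives its -a arguments, and uses the placeholder domain example.com.

```python
class FakeSpider:
    """Stand-in for a spider: keyword arguments become instance attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "https://example.com/"
        # The attribute only exists when the argument was passed
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(FakeSpider().start_url())             # no -a option given
print(FakeSpider(tag="humor").start_url())  # equivalent of -a tag=humor
```

Values passed with -a always arrive as strings, so a numeric argument such as a page limit has to be converted with int() inside the spider.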

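10. Scrapy item containers: Item and Field

By now you probably feel you understand Scrapy, yet not quite. You may be wondering: I have crawled pages and parsed data, but how do I actually use that data? This brings in two concepts, Scrapy item containers and Scrapy pipelines. Let's start with item containers.

Scrapy provides two classes, Item and Field. They are declared in items.py, which has to import scrapy first. The items.py code is as follows: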

#items.py
import scrapy


# class TutorialItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass
# class DmozItem(scrapy.Item):
#     title = scrapy.Field()
#     link = scrapy.Field()
#     desc = scrapy.Field()

class QuoteItem(scrapy.Item):
    imgBase64 = scrapy.Field()
    file_name = scrapy.Field()  

class VideoItem(scrapy.Item):
    video_url = scrapy.Field()
    file_name = scrapy.Field()  


class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()  

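**Item base class:** every custom item class must inherit from Item, e.g. class TextItem(scrapy.Item)

**Field class:** describes the fields a custom item class contains, such as title and Content

An Item object has to be created before use (the snippet below runs inside a spider callback, where response is available)

import re

item = TextItem()
item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
textContent = response.xpath('//div[@id="content"]//text()').getall()
# Strip '\r' and '\xa0'
cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
item['Content'] = cleaned_list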

Read field values:

print(item['title'])
print(item['Content'])

Get all field names

item.keys()

Copy an Item

item2 = item.copy()
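The cleaning step used when filling Content can be checked in isolation. The sketch below applies the same regular expression to a few made-up text nodes (`raw_nodes` is sample data invented for this illustration) and then, as an optional extra step that is not in the original code, drops entries that are empty after stripping.

```python
import re

# Made-up text nodes, similar to what //text() returns for a chapter body
raw_nodes = [
    "\r\n\xa0\xa0\xa0\xa0First paragraph.",
    "\r\n",
    "\xa0\xa0Second paragraph.\r",
]

# Same substitution as in the article: remove carriage returns and non-breaking spaces
cleaned = [re.sub(r"[\r\xa0]+", "", text) for text in raw_nodes]
print(cleaned)   # ['\nFirst paragraph.', '\n', 'Second paragraph.']

# Optional extra step: strip remaining whitespace and drop empty entries
stripped = [line.strip() for line in cleaned if line.strip()]
print(stripped)  # ['First paragraph.', 'Second paragraph.']
```

The regular expression alone leaves newline-only entries behind, so whether to add the second step depends on how the text should look in the output file.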

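11. Scrapy item pipelines

So far we have defined the Scrapy item containers. How are they used? This is where the Scrapy item pipeline comes in, and it is the key part: a pipeline automatically receives the items the spider yields and implements different functionality based on them, such as storing parsed items in a database, downloading images, downloading files, or writing data to JSON, Excel, or txt files.

To use a pipeline you must first register it. Once the spider starts, Scrapy automatically passes each item through all registered pipelines, so that different pipelines can handle different content.

Pipeline registration: register it in settings.py

# settings.py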

ITEM_PIPELINES = {
#   'tutorial.pipelines.TutorialPipeline': 300,
#   'tutorial.save_Image_pipeline.SaveImagePipeline': 300,
#   'tutorial.video_download_pipeline.VideoDownloadPipeline': 500,
    'tutorial.text_download_pipeline.TextDownloadPipeline': 300
}

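'tutorial.text_download_pipeline.TextDownloadPipeline': the import path of the pipeline class

300: the priority. By convention it is a value between 0 and 1000, and the smaller the number, the earlier the pipeline runs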
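The effect of the priority numbers can be sketched by sorting the dictionary by value. In the example below, ValidatePipeline and ExportPipeline are hypothetical class paths added only to have more than one entry to order.

```python
ITEM_PIPELINES = {
    "tutorial.pipelines.ExportPipeline": 800,    # hypothetical pipeline
    "tutorial.text_download_pipeline.TextDownloadPipeline": 300,
    "tutorial.pipelines.ValidatePipeline": 100,  # hypothetical pipeline
}

# Items pass through the pipelines from the lowest value to the highest
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

The position of an entry in the dictionary does not matter; only the number decides the order, which is why a validation step would normally get a lower value than an export step.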

A complete pipeline example

#text_download_pipeline.py
from itemadapter import ItemAdapter
import os


class TextDownloadPipeline:
    def __init__(self):
        # Target folder for the downloaded text
        self.target_folder = "DownLoadText"
        # Create the target folder if it does not exist
        os.makedirs(self.target_folder, exist_ok=True)

    def open_spider(self, spider):
        # Called when the spider starts
        pass

    def close_spider(self, spider):
        # Called when the spider finishes
        pass

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', '未知标题')  # fallback title: "unknown title"
        content = adapter.get('Content', [])
        text_name = '下载的文档内容.txt'  # output file name: "downloaded document content"
        file_path = os.path.join(self.target_folder, text_name)

        # Append, so that every item ends up in the same file
        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"标题: {title}\n")  # "标题" means "title"
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # Blank line between items
        # Return the item so that later pipelines can keep processing it
        return item

Note: def process_item(self, item, spider) is the one method every pipeline must implement.

This pipeline takes each Scrapy item and writes its content line by line into a txt file.
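The open_spider and close_spider hooks are left empty above. One common use for them is to open the output file once and close it at the end, instead of reopening it for every item. The sketch below is a variation under that assumption, not the article's pipeline: `TextFilePipeline`, the file name output.txt, and the plain-dict item are all made up for the demo, and the hooks are called by hand so that it runs without Scrapy.

```python
import os
import tempfile


class TextFilePipeline:
    """Pipeline-shaped class: open the file once, append every item, close at the end."""

    def __init__(self, target_folder="DownLoadText"):
        self.target_folder = target_folder
        self.file = None

    def open_spider(self, spider):
        os.makedirs(self.target_folder, exist_ok=True)
        path = os.path.join(self.target_folder, "output.txt")
        self.file = open(path, "a", encoding="utf-8")

    def close_spider(self, spider):
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        self.file.write(f"Title: {item.get('title', 'unknown')}\n")
        for line in item.get("Content", []):
            self.file.write(f"{line}\n")
        self.file.write("\n")  # Blank line between items
        return item  # Hand the item on to the next pipeline


# Call the hooks by hand, in the order Scrapy would call them
folder = tempfile.mkdtemp()
pipeline = TextFilePipeline(folder)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Chapter 1", "Content": ["line a", "line b"]}, spider=None)
pipeline.close_spider(spider=None)

with open(os.path.join(folder, "output.txt"), encoding="utf-8") as fh:
    written = fh.read()
print(written)
```

Inside a real project the item would normally be wrapped in ItemAdapter, as in the example above, so that the pipeline works with Item objects as well as dicts.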

Besides custom pipelines, Scrapy ships two special pipelines for handling files and images: FilesPipeline and ImagesPipeline. Let's get to know them:

                 FilesPipeline                          ImagesPipeline
Import path      scrapy.pipelines.files.FilesPipeline   scrapy.pipelines.images.ImagesPipeline
Item fields      file_urls, files                       image_urls, images
Storage setting  FILES_STORE                            IMAGES_STORE

FilesPipeline

  1. Configure settings.py

    import os
    # Register the pipeline
    ITEM_PIPELINES = {
      'scrapy.pipelines.files.FilesPipeline':300
    }
    # Where downloaded files are stored
    FILES_STORE="F:\\DownloadFiels"
    if not os.path.exists(FILES_STORE):
        os.makedirs(FILES_STORE)
    
  2. Define the item container in items.py

    class FileItem(scrapy.Item):
        file_urls = scrapy.Field()
        files = scrapy.Field()
    
  3. Implementation

    # Generate a new spider
    scrapy genspider dload_files model
    # Or create the file directly under spiders
    
    #dload_files.py
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from tutorial.items import FileItem, TextItem
    import re
    
    class XbiquguSpider(CrawlSpider):
        name = 'dload_files'
        allowed_domains = ['www.model']
        def start_requests(self):
            urls = [
                "https://www.model/html/265/265564/229802.html",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
        def parse(self, response):
            for line in response.xpath('//div[@class="bookname"]/ul/li'):
                for example in line.xpath('.//ul/li'):
                    url = example.xpath('.//a//@href').extract_first()
                    url = response.urljoin(url)
                    yield scrapy.Request(url,callback=self.parse_files)
        def parse_files(self, response):
            href = response.xpath('//a/@href').extract_first()
            url = response.urljoin(href)
            fileItem = FileItem()
            fileItem['file_urls'] =[url]
            return fileItem
    
  4. 运行爬虫

    (my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl dload_files -o myfiles.json
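After the crawl, the downloaded files sit in a full/ subfolder of FILES_STORE, and by default each file is named after the SHA-1 hash of its URL rather than its original name. The sketch below approximates that naming with the standard library so a downloaded file can be matched back to its URL; `sketch_stored_path` is a hypothetical helper, the extension handling is simplified, and the URL is a placeholder, so check the result against the path values in the files field of your own output.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def sketch_stored_path(url):
    """Approximate the relative path used for a downloaded URL (simplified)."""
    # The file name is the SHA-1 hex digest of the URL (40 characters)
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Keep the extension found in the URL path, if there is one
    extension = PurePosixPath(urlparse(url).path).suffix
    return f"full/{digest}{extension}"


path = sketch_stored_path("https://example.com/files/report.pdf")
print(path)
```

The files field of each exported item also records the url and the stored path for every download, which is usually the easier place to look them up.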
    

ImagesPipeline

  1. Configure settings.py

    import os
    # Register the pipeline
    ITEM_PIPELINES = {
      'scrapy.pipelines.images.ImagesPipeline':300
    }
    # Where downloaded images are stored
    IMAGES_STORE="F:\\ImageFiels"
    if not os.path.exists(IMAGES_STORE):
        os.makedirs(IMAGES_STORE)
    # Thumbnails to generate for each downloaded image
    IMAGES_THUMBS = {
        'small': (50, 50),
    #    'big': (270, 270),
    }
    # Skip images smaller than these dimensions
    #IMAGES_MIN_WIDTH = 50 # minimum width
    #IMAGES_MIN_HEIGHT = 50 # minimum height
    
  2. Define the item container in items.py

    import scrapy
    
    class MyImageItem(scrapy.Item):
        image_urls = scrapy.Field()  # List of image URLs to download
        images = scrapy.Field()  # Filled in with information about the downloaded images
    
  3. Implementation

    # Generate a new spider
    scrapy genspider dload_files model
    # Or create the file directly under spiders
    
    #image_files.py
    import scrapy
    from tutorial.items import MyImageItem
    
    class ImageSpider(scrapy.Spider):
        name = 'imagespider'
        start_urls = ['https://example']
    
        def parse(self, response):
            item = MyImageItem()
            # Collect image URLs; the pipeline needs absolute URLs, so resolve relative src values
            item['image_urls'] = [response.urljoin(src) for src in response.css('img::attr(src)').getall()]
            yield item
    
    
  4. 运行爬虫

    (my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl imagespider -o myimages.json
    

12. A complete case: downloading a novel with Scrapy

Create the xbiqugu.py spider in the spiders folder

#xbiqugu.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TextItem
import re

class XbiquguSpider(CrawlSpider):
    name = 'xbiqugu'
    allowed_domains = ['www.477zw3']

    def __init__(self, *args, **kwargs):
        # Let the base class finish its own setup before adding our counter
        super().__init__(*args, **kwargs)
        self.count = 0

    def start_requests(self):
        urls = [
            "https://www.477zw3/html/265/265564/229802.html",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        self.count += 1
        print(f"Start crawling-----------{self.count}")
        item = TextItem()
        item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
        print(item['title'])
        textContent = response.xpath('//div[@id="content"]//text()').getall()
        # Strip '\r' and '\xa0'
        cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
        item['Content'] = cleaned_list
        # print(item['Content'])
        yield item
        # Crawl the next page ('下一章' means "next chapter")
        # next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
        # next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
        next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
        print("Next page:", next_page)  # Output: Next page: /html/265/265564/229803.html
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_item)


Create the Scrapy item container

#items.py
import scrapy


class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()  

Create the pipeline text_download_pipeline.py

# text_download_pipeline
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os


class TextDownloadPipeline:
    def __init__(self):
        # Target folder for the downloaded text
        self.target_folder = "DownLoadText"
        # Create the target folder if it does not exist
        os.makedirs(self.target_folder, exist_ok=True)

    def open_spider(self, spider):
        # Called when the spider starts
        pass

    def close_spider(self, spider):
        # Called when the spider finishes
        pass

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', '未知标题')  # fallback title: "unknown title"
        content = adapter.get('Content', [])
        text_name = '测试技术.txt'  # output file name
        file_path = os.path.join(self.target_folder, text_name)

        # Append, so that every chapter ends up in the same file
        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"标题: {title}\n")  # "标题" means "title"
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # Blank line between items
        # Return the item so that later pipelines can keep processing it
        return item

Register the pipeline in settings.py

#settings.py
import random

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0  # Wait 0 seconds between requests

ITEM_PIPELINES = {
    'tutorial.text_download_pipeline.TextDownloadPipeline':300
}

USER_AGENT_LIST = [
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
      "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
USER_AGENT = random.choice(USER_AGENT_LIST)
# Logging
LOG_LEVEL = "INFO"

import os
from datetime import datetime

LOG_DIR = "log"

if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)

today = datetime.now()

LOG_FILE = f"{LOG_DIR}/scrapy_{today.year}_{today.month}_{today.day}.log"
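One detail about the settings above: USER_AGENT = random.choice(USER_AGENT_LIST) is evaluated once, when the settings module is loaded, so every request in a run shares the same User-Agent. If a different one per request is wanted, one possible approach is a small downloader middleware. The sketch below only shows the shape of such a class and exercises it with a stand-in request object; `RandomUserAgentMiddleware`, `FakeRequest`, and the two User-Agent strings are made up for the demo, and in a real project the class would still have to be registered under DOWNLOADER_MIDDLEWARES and given the real list.

```python
import random

# Placeholder strings; a project would reuse USER_AGENT_LIST from settings.py
USER_AGENT_LIST = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]


class RandomUserAgentMiddleware:
    """Downloader-middleware-shaped class: choose a User-Agent for each request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None means: continue handling this request as usual


class FakeRequest:
    """Stand-in for a request object; only the headers mapping is needed here."""

    def __init__(self):
        self.headers = {}


middleware = RandomUserAgentMiddleware(USER_AGENT_LIST)
request = FakeRequest()
result = middleware.process_request(request, spider=None)
print(request.headers["User-Agent"])
```

Because the choice happens inside process_request, it is made again for every request instead of once per run.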

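Note that a spider does not keep working forever; it needs adjusting as the target site changes. Still, Scrapy really does cut down the amount of logic we have to write for a crawler, which makes it very powerful!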

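13. Summary

I spent several evenings writing this Python Scrapy article, and by now you should understand how it works and how to use it. Let's return to the question from the beginning: what are Scrapy's limitations, and how do we get around them? Much of today's web content is generated dynamically by JavaScript, and Scrapy on its own cannot handle that. The solution involves dynamic crawling with Scrapy. Stay tuned for the follow-up, where I will show how to use Scrapy for dynamic crawling!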
