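Advanced Python 04: Master the Python Scrapy (2.12) Crawler Framework in One Article, with a Hands-On Project
1. Introduction
In this installment of the Advanced Python series we look at Scrapy 2.12, the latest release at the time of writing and considerably newer than the versions most existing tutorials cover. Scrapy is one of the best-known frameworks in the crawling world. Before we start, I'd like you to think about two questions: what problems does Scrapy actually solve, or put differently, what kinds of websites can it crawl? And given Scrapy's limitations, how can we still put it to good use? Right, let's get on with today's small step forward!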
2. Installing Python Scrapy
# Install from the Douban mirror to speed up the download
pip install Scrapy -i http://pypi.doubanio/simple --trusted-host pypi.doubanio
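3. Scrapy Chinese Documentation
The best way to learn any technology is still to read the official documentation, so here it is first:
https://scrapy/
Scrapy also has a fairly good Chinese translation of the docs:
https://scrapy-chs.readthedocs.io/zh-cn/stable/intro/tutorial.html
Pick whichever suits you. The framework itself is not hard to learn.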
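4. Creating a Scrapy Project
Before any theory, I'll first walk you through a simple spider and only afterwards describe how Scrapy runs it, which makes the flow easier to understand. Let's create a new Scrapy project by running the following command in the VS Code terminal:
scrapy startproject tutorial
Project file structure: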
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Note: a Scrapy project is not limited to a single spider. One project can hold many spiders, and we choose which one to run by its name.
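5. Creating Our First Spider
Create a new file named quotes_spider.py in the tutorial/spiders directory.
You can also generate a spider from the command line, but while you are learning I suggest creating the file by hand first. The result is the same.
scrapy genspider mydomain mydomain
The contents of quotes_spider.py are as follows: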
# quotes_spider.py
from pathlib import Path

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)  # write the raw response bytes to a file
        self.log(f"Saved file {filename}")  # write a log message to the terminal
# It can also be written like this.
# parse() is Scrapy's default callback; it handles requests that have no explicitly assigned callback.
# quotes_spider.py
from pathlib import Path

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # With start_urls, Scrapy builds the initial requests itself
    # and sends each response to parse()
    start_urls = [
        "https://quotes.toscrape/page/1/",
        "https://quotes.toscrape/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)  # write the raw response bytes to a file
        self.log(f"Saved file {filename}")  # write a log message to the terminal
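Notes:
name = "quotes":
The spider name must be unique within a Scrapy project.
def start_requests(self):
Must return an iterable of requests (either a list of requests or a generator function). Scrapy starts crawling from these, and subsequent requests are generated successively from these initial ones.
def parse(self, response):
The method called to handle the response downloaded for each request. The response argument is an instance of TextResponse, which holds the page content and offers further helpful methods for working with it. parse() usually parses the response, extracts the scraped data as dicts, and finds new URLs to follow, creating new requests (Request) from them.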
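To make the response.url.split("/")[-2] step in parse() concrete, here is a small sketch in plain Python that runs outside Scrapy. The helper name page_filename is made up for illustration and is not part of Scrapy.

```python
def page_filename(url: str) -> str:
    """Build the output file name the same way parse() does."""
    # "https://host/page/1/".split("/") ends with [..., "page", "1", ""],
    # so index -2 is the page number just before the trailing slash.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(page_filename("https://quotes.toscrape/page/1/"))  # quotes-1.html
print(page_filename("https://quotes.toscrape/page/2/"))  # quotes-2.html
```

Note that this relies on the trailing slash: without it, index -2 would be the "page" segment rather than the number.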
5. Running Our Scrapy Project
Change into the tutorial project directory and run scrapy crawl quotes:
(my_venv) PS F:\开发源码\python_demo_06> cd tutorial
(my_venv) PS F:\开发源码\python_demo_06\tutorial>
(my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl quotes
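If the run succeeds, you will see the crawl finish in the terminal and two new files, quotes-1.html and quotes-2.html, in the tutorial folder. At this point we have a working Scrapy spider.
How it runs:
1. When you execute scrapy crawl quotes, Scrapy looks through spiders for the spider whose name is quotes and starts it.
2. Scrapy then calls start_requests, which yields a scrapy.Request(url=url, callback=self.parse) for each URL, and Scrapy's built-in downloader fetches them. If a callback is given, the response goes to that function; if not, it goes to the default self.parse. If the spider defines neither, the base class's parse raises NotImplementedError, the error is logged, and the spider closes once nothing is left to do.
3. When the response comes back, self.parse receives it and does the parsing.
A question to think about: why use yield instead of return? What would happen if we used return?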
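One way to explore the yield versus return question is a sketch in plain Python. The strings below are only stand-ins for scrapy.Request objects, and the function names are invented for this illustration.

```python
def with_yield(urls):
    """Generator: hands out one request per URL, one at a time."""
    for url in urls:
        yield f"Request({url})"


def with_return(urls):
    """Plain function: return exits during the first loop iteration."""
    for url in urls:
        return f"Request({url})"


urls = [
    "https://quotes.toscrape/page/1/",
    "https://quotes.toscrape/page/2/",
]
print(list(with_yield(urls)))  # both requests
print(with_return(urls))       # only the first one
```

With return inside the loop, only the first URL would ever be requested, and Scrapy would receive a single object rather than an iterable of requests. Returning a complete list of requests would also work; yield simply lets Scrapy pull requests lazily.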
So far we have not parsed any HTML. Bear with me, the main course is still to come!
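6. Parsing Data with Scrapy's Built-in XPath and CSS Selectors
Earlier articles in this series used BeautifulSoup and XPath to extract data. Scrapy ships with its own CSS and XPath selectors, which we can use directly, although BeautifulSoup and standalone XPath libraries still work too. Since today's topic is Scrapy, we'll use its built-in selectors. We keep the code above and only change quotes_spider.py.
The modified code: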
# quotes_spider.py
from pathlib import Path

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape/page/1/",
            "https://quotes.toscrape/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # def parse(self, response):
    #     page = response.url.split("/")[-2]
    #     filename = f"quotes-{page}.html"
    #     Path(filename).write_bytes(response.body)
    #     self.log(f"Saved file {filename}")

    def parse(self, response):
        print("************** extraction start ******************")
        print(response.css("title"))
        print("************** extraction end ******************")

'''
Output:
************** extraction start ******************
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
************** extraction end ******************
2024-11-20 22:57:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape/page/2/> (referer: None)
************** extraction start ******************
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
************** extraction end ******************
'''
See that? The output is as follows (from here on I'll show only the key part of each extraction):
print(response.css("title"))
#[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
CSS selectors:
- Extract the matches as a list
response.css("title::text").getall()  # ['Quotes to Scrape']
Note the ::text here. What happens without it? We get the whole element, tags included, which is not what we want, so ::text is needed to get just the text:
response.css("title").getall()  # ['<title>Quotes to Scrape</title>']
- Get only the first result
response.css("title::text").get()  # 'Quotes to Scrape'
This can also be written as
response.css("title::text")[0].get()  # 'Quotes to Scrape'
**Note:** what is the difference between the two?
response.css("title::text")[0].get(): if nothing matches, the index raises IndexError
response.css("title::text").get(): if nothing matches, it returns None
- CSS selector plus regular expression
response.css("title::text").re(r"Quotes.*")  # ['Quotes to Scrape']
response.css("title::text").re(r"Q\w+")  # ['Quotes']
response.css("title::text").re(r"(\w+) to (\w+)")  # ['Quotes', 'Scrape']
- Using a CSS selector directly
response.css("div.quote")
'''
Output:
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
 ...]
'''
XPath selectors:
- Extract elements
response.xpath("//title")  # [<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
- Extract text content
response.xpath("//title/text()").get()  # 'Quotes to Scrape'
- Extract elements whose text contains a given string ('下一章' is the "next chapter" link text on the target site)
next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
- Extract all text of an element and its descendants
textContent = response.xpath('//div[@id="content"]//text()').getall()
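Using the SelectorGadget extension to get CSS and XPath selectors quickly:
I have prepared a download of the latest version: https://download.csdn/download/Lookontime/90025172
Installation:
- Download and unzip it
- Enter chrome://extensions/ in the Chrome address bar
In real projects, building CSS and XPath selectors by hand is still rather slow. Is there a better way? There is, though it is not perfect: the SelectorGadget extension. It quickly generates an XPath for us, which we then only need to adjust slightly.
Note: the official documentation says that Scrapy converts CSS selectors to XPath under the hood, and it encourages learning XPath. I wrote a dedicated article on XPath earlier, so take a look if you are interested. Readers with a front-end background should find this very easy.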
7. Verifying Our Selectors with scrapy shell 'https://quotes.toscrape'
The page at 'https://quotes.toscrape' is structured as follows:
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
Let's run this command:
scrapy shell 'https://quotes.toscrape'
The shell fetches the page and then opens an interactive prompt, where we can try out our selectors:
>>> response.css("div.quote")
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]
>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
...
How do we exit the scrapy shell?
quit()
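8. Following the Next Page Until No More Pages Remain
Say we want to read a novel in peace without a screen full of ads. That gives us a crawling task: scrape the content of a page, find the next page, scrape it, look for the next page again, and stop once the condition no longer holds. How do we implement this?
One suggestion is to put every page URL into the urls list in def start_requests(self). That works, but it would wear you out, because the URLs do not necessarily increase in a regular pattern. We would also need to control the crawl order, or find another approach.
So can we supply only the starting page, extract the next-page URL from each page, and hand it back to parse in a loop? Of course we can.
Suppose the next-page link looks like this:
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>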
Build our spider:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
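As we just saw, we extract the next-page URL, turn it into an absolute URL, and yield a new request whose response is passed back to self.parse, until there is no next page. Nice! But Scrapy offers simpler ways to do this, so read on.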
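To see what the urljoin step does with the relative href, here is a rough sketch using the standard library's urllib.parse.urljoin. As far as I know, response.urljoin resolves a link against the response's own URL in much the same way, but treat that equivalence as an assumption; the snippet below does not touch Scrapy at all.

```python
from urllib.parse import urljoin

# The page we are currently on (stand-in for response.url)
page_url = "https://quotes.toscrape/page/1/"

# A root-relative href, like the one in the pager markup
print(urljoin(page_url, "/page/2/"))  # https://quotes.toscrape/page/2/

# A path-relative href is resolved against the current directory
print(urljoin(page_url, "page/2/"))   # https://quotes.toscrape/page/1/page/2/

# An absolute URL is returned unchanged
print(urljoin(page_url, "https://other.example/x"))
```

This is why a bare href such as /page/2/ cannot be requested directly: it only becomes a usable URL once it is joined with the address of the page it came from.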
Option 1: response.follow
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            # response.follow accepts relative URLs, so no urljoin is needed
            yield response.follow(next_page, callback=self.parse)
Option 2: a for loop
for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)
Option 3: pass the a element directly
for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)
Option 4: response.follow_all
anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)
Option 5: pass the selector expression directly
yield from response.follow_all(css="ul.pager a", callback=self.parse)
A complete example
import scrapy

class AuthorSpider(scrapy.Spider):
    name = "author"
    start_urls = ["https://quotes.toscrape/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)
        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }
9. Passing Arguments to a Spider
To pass arguments when running a spider, use the -a option on the command line.
Command:
scrapy crawl quotes -O quotes-humor.json -a tag=humor
Spider code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://quotes.toscrape/"
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
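What the command does:
scrapy crawl quotes
- Runs the spider named quotes. quotes is the spider name defined in your Scrapy project; the corresponding code file is normally found in the spiders folder.
-O quotes-humor.json
- -O specifies the output file, and quotes-humor.json is its name.
- Scrapy saves the scraped data as a JSON file, overwriting any existing file with the same name.
-a tag=humor
- The -a option passes the spider an argument named tag with the value humor.
- In the spider code, this argument is available as self.tag. Such arguments are typically used to give the spider a filter, for example to scrape only quotes tagged "humor".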
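The way the tag argument shapes the start URL can be sketched without running Scrapy. FakeSpider below is a made-up stand-in that only mimics how keyword arguments end up as attributes on the spider, and build_start_url is a hypothetical helper that repeats the logic of start_requests above.

```python
class FakeSpider:
    """Stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def build_start_url(spider, base="https://quotes.toscrape/"):
    """Same branching as start_requests: add the tag path only if a tag was given."""
    tag = getattr(spider, "tag", None)
    if tag is not None:
        return base + "tag/" + tag
    return base


print(build_start_url(FakeSpider(tag="humor")))  # https://quotes.toscrape/tag/humor
print(build_start_url(FakeSpider()))             # https://quotes.toscrape/
```

Using getattr with a default is what keeps the spider usable when -a tag=... is omitted. Keep in mind that values passed with -a arrive as strings, so numeric arguments need converting by hand.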
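10. Scrapy Data Containers: Item and Field
By now you probably feel you half understand Scrapy. One question may remain: I have a spider that parses data, but how do I actually use that data? This is where Scrapy's data containers and pipelines come in. One step at a time: let's start with data containers.
Scrapy provides two classes, Item and Field. Import scrapy in items.py before using them. items.py looks like this: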
# items.py
import scrapy

# class TutorialItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass

# class DmozItem(scrapy.Item):
#     title = scrapy.Field()
#     link = scrapy.Field()
#     desc = scrapy.Field()

class QuoteItem(scrapy.Item):
    imgBase64 = scrapy.Field()
    file_name = scrapy.Field()

class VideoItem(scrapy.Item):
    video_url = scrapy.Field()
    file_name = scrapy.Field()

class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()
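**Item base class:** every custom data class must inherit from Item, e.g. class TextItem(scrapy.Item)
**Field class:** describes the fields the custom data class contains, e.g. title and Content
Create an Item object before use: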
# Inside a spider callback; needs "import re" at the top of the module
item = TextItem()
item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
textContent = response.xpath('//div[@id="content"]//text()').getall()
# Strip '\r' and '\xa0' (non-breaking space)
cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
item['Content'] = cleaned_list
Read field values:
print(item['title'])
print(item['Content'])
Get the names of all populated fields:
item.keys()
Copy an Item (shallow copy):
item2 = item.copy()
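The cleaning step applied to Content can be tried on its own. This is a minimal sketch with invented sample fragments; the extra pass that drops empty lines is my own addition and not part of the spider code in this article.

```python
import re


def clean_lines(fragments):
    """Remove '\\r' and non-breaking spaces, then drop fragments that end up empty."""
    cleaned = [re.sub(r"[\r\xa0]+", "", text) for text in fragments]
    # Extra step (not in the original code): skip whitespace-only fragments
    return [line.strip() for line in cleaned if line.strip()]


# Invented sample, shaped like what //text() tends to return for novel pages
fragments = [
    "\xa0\xa0\xa0\xa0First paragraph.\r",
    "\r\n",
    "\xa0\xa0Second paragraph.",
]
print(clean_lines(fragments))  # ['First paragraph.', 'Second paragraph.']
```

The character class [\r\xa0]+ matches runs of carriage returns and non-breaking spaces, which is what HTML indentation written with &nbsp; turns into after extraction.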
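11. Scrapy Item Pipelines
We now have data containers, so how do we use them? Through item pipelines. This is a key part of Scrapy: every item a spider yields is passed to the pipelines automatically, and each pipeline can do something different with it, such as storing it in a database, downloading images or files, or writing it to JSON, Excel, or a text file.
A pipeline must be registered before it is used. Once the spider is running, Scrapy passes each item through all registered pipelines in order, so different pipelines can handle different concerns.
Registering a pipeline: in settings.py
# in settings.py
ITEM_PIPELINES = {
    # 'tutorial.pipelines.TutorialPipeline': 300,
    # 'tutorial.save_Image_pipeline.SaveImagePipeline': 300,
    # 'tutorial.video_download_pipeline.VideoDownloadPipeline': 500,
    'tutorial.text_download_pipeline.TextDownloadPipeline': 300
}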
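'tutorial.text_download_pipeline.TextDownloadPipeline': the import path of the pipeline class
300: the priority. Pipelines with lower numbers run first; values are conventionally kept in the 0-1000 range.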
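To picture what "lower numbers run first" means, the sketch below simply sorts a settings-style dict by its values. The priority numbers here are illustrative only (the 100 for TutorialPipeline is invented), and this is not how Scrapy is implemented internally, just the ordering rule made visible.

```python
# Illustrative priorities; only the relative order matters
ITEM_PIPELINES = {
    "tutorial.video_download_pipeline.VideoDownloadPipeline": 500,
    "tutorial.text_download_pipeline.TextDownloadPipeline": 300,
    "tutorial.pipelines.TutorialPipeline": 100,
}

# Sort the pipeline paths by their priority value, smallest first
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

An item would pass through TutorialPipeline first, then TextDownloadPipeline, then VideoDownloadPipeline, with each stage receiving whatever the previous one returned.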
A complete pipeline example
# text_download_pipeline.py
import os

from itemadapter import ItemAdapter

class TextDownloadPipeline():
    def __init__(self):
        # Target folder for the saved text
        self.target_folder = "DownLoadText"
        # Create the target folder if it does not exist
        os.makedirs(self.target_folder, exist_ok=True)

    def open_spider(self, spider):
        # Called when the spider starts
        pass

    def close_spider(self, spider):
        # Called when the spider finishes
        pass

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', 'Unknown title')
        content = adapter.get('Content', [])
        text_name = '下载的文档内容.txt'
        file_path = os.path.join(self.target_folder, text_name)
        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"Title: {title}\n")
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # blank line between items
        return item
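Note: def process_item(self, item, spider): is the one method every pipeline must implement. It should return the item, or raise DropItem to discard it.
This pipeline takes each item and writes its contents to a txt file line by line.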
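The empty open_spider and close_spider hooks are a natural place to open the output file once instead of on every item. Below is a rough variant under that assumption, written with only the standard library so it can be driven by hand. The class name, the constructor arguments, and the plain dict used as an item are all invented for this sketch.

```python
import os
import tempfile


class OpenOnceTextPipeline:
    """Hypothetical variant: open the file in open_spider, close it in close_spider."""

    def __init__(self, target_folder="DownLoadText", file_name="output.txt"):
        self.target_folder = target_folder
        self.file_path = os.path.join(target_folder, file_name)
        self.file = None

    def open_spider(self, spider):
        os.makedirs(self.target_folder, exist_ok=True)
        self.file = open(self.file_path, "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(f"Title: {item.get('title', 'Unknown title')}\n")
        for line in item.get("Content", []):
            self.file.write(f"{line}\n")
        self.file.write("\n")  # blank line between items
        return item


# Drive the pipeline by hand, the way Scrapy would call the hooks
with tempfile.TemporaryDirectory() as folder:
    pipeline = OpenOnceTextPipeline(target_folder=folder)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"title": "Chapter 1", "Content": ["line a", "line b"]}, spider=None)
    pipeline.close_spider(spider=None)
    with open(pipeline.file_path, encoding="utf-8") as fh:
        written = fh.read()

print(written)
```

Calling the hooks manually like this is also a handy way to test a pipeline without starting a crawl.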
Besides custom pipelines, Scrapy ships two special pipelines for handling files and images: FilesPipeline and ImagesPipeline. Let's go through them:
Property | FilesPipeline | ImagesPipeline |
---|---|---|
Import path | scrapy.pipelines.files.FilesPipeline | scrapy.pipelines.images.ImagesPipeline |
Item fields | file_urls, files | image_urls, images |
Storage path setting | FILES_STORE | IMAGES_STORE |
FilesPipeline
- Configure settings.py
import os

# Register the pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 300
}
# Directory where downloaded files are stored
FILES_STORE = "F:\\DownloadFiels"
os.makedirs(FILES_STORE, exist_ok=True)
- Define the data container in items.py
class FileItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
- Implement the spider
# Create a new spider
scrapy genspider dload_files model
# or create the file directly under spiders
# dload_files.py
import scrapy

from tutorial.items import FileItem

# No crawl rules are used here, so a plain scrapy.Spider is enough
class XbiquguSpider(scrapy.Spider):
    name = 'dload_files'
    allowed_domains = ['www.model']

    def start_requests(self):
        urls = [
            "https://www.model/html/265/265564/229802.html",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for line in response.xpath('//div[@class="bookname"]/ul/li'):
            for example in line.xpath('.//ul/li'):
                url = example.xpath('.//a//@href').get()
                url = response.urljoin(url)
                yield scrapy.Request(url, callback=self.parse_files)

    def parse_files(self, response):
        href = response.xpath('//a/@href').get()
        url = response.urljoin(href)
        fileItem = FileItem()
        fileItem['file_urls'] = [url]  # FilesPipeline downloads every URL in this list
        return fileItem
- Run the spider
(my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl dload_files -o myfiles.json
ImagesPipeline
- Configure settings.py
import os

# Register the pipeline (ImagesPipeline requires Pillow to be installed)
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 300
}
# Directory where downloaded images are stored
IMAGES_STORE = "F:\\ImageFiels"
os.makedirs(IMAGES_STORE, exist_ok=True)
# Thumbnail sizes to generate for each downloaded image
IMAGES_THUMBS = {
    'small': (50, 50),
    # 'big': (270, 270),
}
# Skip images smaller than these dimensions
# IMAGES_MIN_WIDTH = 50   # minimum width
# IMAGES_MIN_HEIGHT = 50  # minimum height
- Define the data container in items.py
import scrapy

class MyImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # list of image URLs
    images = scrapy.Field()      # information about the downloaded images
- Implement the spider
# Create a new spider with scrapy genspider, or create the file directly under spiders
# image_files.py
import scrapy

from tutorial.items import MyImageItem

class ImageSpider(scrapy.Spider):
    name = 'imagespider'
    start_urls = ['https://example']

    def parse(self, response):
        item = MyImageItem()
        # Collect image URLs and make them absolute so the pipeline can download them
        item['image_urls'] = [response.urljoin(src) for src in response.css('img::attr(src)').getall()]
        yield item
- Run the spider
(my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl imagespider -o myimages.json
12. A Complete Example: Downloading a Novel with Scrapy
Create the spider xbiqugu.py in the spiders folder
# xbiqugu.py
import re

import scrapy

from tutorial.items import TextItem

# No crawl rules are used here, so a plain scrapy.Spider is enough
class XbiquguSpider(scrapy.Spider):
    name = 'xbiqugu'
    allowed_domains = ['www.477zw3']

    def __init__(self, *args, **kwargs):
        # Call the base initializer so -a arguments and Scrapy's own setup keep working
        super().__init__(*args, **kwargs)
        self.count = 0

    def start_requests(self):
        urls = [
            "https://www.477zw3/html/265/265564/229802.html",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        self.count += 1
        print(f"Start crawling ----------- {self.count}")
        item = TextItem()
        item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
        print(item['title'])
        textContent = response.xpath('//div[@id="content"]//text()').getall()
        # Strip '\r' and '\xa0' (non-breaking space)
        cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
        item['Content'] = cleaned_list
        # print(item['Content'])
        yield item
        # Follow the next page ('下一章' is the "next chapter" link text on the site)
        # next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
        # next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
        next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
        print("Next page:", next_page)  # e.g. Next page: /html/265/265564/229803.html
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_item)
Create the Scrapy item container
# items.py
import scrapy

class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()
Create the pipeline text_download_pipeline.py
# text_download_pipeline
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
import os

from itemadapter import ItemAdapter

class TextDownloadPipeline():
    def __init__(self):
        # Target folder for the saved text
        self.target_folder = "DownLoadText"
        # Create the target folder if it does not exist
        os.makedirs(self.target_folder, exist_ok=True)

    def open_spider(self, spider):
        # Called when the spider starts
        pass

    def close_spider(self, spider):
        # Called when the spider finishes
        pass

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', 'Unknown title')
        content = adapter.get('Content', [])
        text_name = '测试技术.txt'
        file_path = os.path.join(self.target_folder, text_name)
        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"Title: {title}\n")
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # blank line between items
        return item
Register the pipeline in settings.py
# settings.py
import random
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0  # wait 0 seconds between requests
ITEM_PIPELINES = {
    'tutorial.text_download_pipeline.TextDownloadPipeline': 300
}
USER_AGENT_LIST = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
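# Chosen once when the settings load, so every request in a run shares the same User-Agent
USER_AGENT = random.choice(USER_AGENT_LIST)
# Logging
LOG_LEVEL = "INFO"
import os  # needed for the log directory handling below
from datetime import datetime

LOG_DIR = "log"
os.makedirs(LOG_DIR, exist_ok=True)
today = datetime.now()
LOG_FILE = f"{LOG_DIR}/scrapy_{today.year}_{today.month}_{today.day}.log"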
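Keep in mind that a spider will not keep working forever; it has to be adjusted as the target site changes. Still, Scrapy really does cut down the logic we have to write ourselves, and it is very powerful.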
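One thing to be aware of in the settings above: random.choice runs once when settings.py is loaded, so a whole crawl uses a single User-Agent. If a different one per request is wanted, a downloader middleware is one possible route. The sketch below shows only the core idea; the class name RandomUserAgentMiddleware, the placeholder agent strings, and FakeRequest are all invented for this illustration, and the wiring a real project would need (a from_crawler method that reads the settings, plus an entry in DOWNLOADER_MIDDLEWARES) is left out.

```python
import random

# Placeholder values; a real project would reuse USER_AGENT_LIST from settings.py
USER_AGENT_LIST = ["agent-a", "agent-b", "agent-c"]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: pick a User-Agent for every request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to keep processing the request normally


class FakeRequest:
    """Minimal stand-in for scrapy.Request, just enough to hold headers."""

    def __init__(self):
        self.headers = {}


middleware = RandomUserAgentMiddleware(USER_AGENT_LIST)
seen = set()
for _ in range(50):
    request = FakeRequest()
    middleware.process_request(request, spider=None)
    seen.add(request.headers["User-Agent"])

print(sorted(seen))  # the agents that were picked across 50 requests
```

Whether rotating the User-Agent is appropriate depends on the target site's terms, so treat this as a technique to understand rather than a default.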
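13. Summary
I spent several evenings writing this Scrapy article, and by now you should understand both how it works and how to use it. Back to the opening question: what are Scrapy's limitations, and how do we work around them? Much of today's web content is generated dynamically by JavaScript, and Scrapy on its own cannot handle that, because it does not execute JavaScript. The answer is dynamic crawling with Scrapy. Stay tuned: in a follow-up I will show how to do dynamic crawling with Scrapy.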
python进阶-04-一篇带你掌握Python Scrapy(2.12)爬虫框架,附带实战
一.简介
在Python进阶系列我们来介绍Scrapy框架最新版本2.12,远超市面上的老版本,Scrapy框架在爬虫行业内鼎鼎大名,在学习之前我想请大家思考Scrapy究竟能解决什么问题?或者能爬哪一类型的网站!还有针对Scrapy的局限性我们如何依然使用好Scrapy!好,开始我们今天的日拱一卒!
二.安装Python Scrapy
#使用豆瓣源安装 提升安装速度
pip install Scrapy -i http://pypi.doubanio/simple --trusted-host pypi.doubanio
三.Scrapy 中文文档
学习任何一门技术最好的还是看官方文档,我先贴上
https://scrapy/
Scrapy也有比较不错的中文文档
https://scrapy-chs.readthedocs.io/zh-cn/stable/intro/tutorial.html
大家根据需要自己选择,这个框架很简单。。
四.创建Scrapy项目
在开始学习之前我先带大家实现一个简单的爬虫,再最后对Scrapy的运行流程进行介绍,这样大家才能更好的理解,我们来创建一个新的Scrapy项目,在vscode的终端中运行以下命令!
scrapy startproject tutorial
文件结构的介绍
tutorial/
scrapy.cfg # deploy configuration file
tutorial/ # project's Python module, you'll import your code from here
__init__.py
items.py # project items definition file
middlewares.py # project middlewares file
pipelines.py # project pipelines file
settings.py # project settings file
spiders/ # a directory where you'll later put your spiders
__init__.py
注意:不是一个爬虫项目只能有一个爬虫,一个爬虫项目中可以创建很多爬虫任务,我们通过不同爬虫任务的name来指定运行哪个爬虫。
五.创建我们的第一个爬虫
在tutorial/spiders
目录下新建quotes_spider.py 文件
当然也可以使用命令创建一个爬虫,大家初学习的时候先手动创建吧!一样的。
scrapy genspider mydomain mydomain
quotes_spider.py文件内容如下:
#quotes_spider.py
from pathlib import Path
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
def start_requests(self):
urls = [
"https://quotes.toscrape/page/1/",
"https://quotes.toscrape/page/2/",
]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
page = response.url.split("/")[-2]
filename = f"quotes-{page}.html"
Path(filename).write_bytes(response.body) # 写入文件 默认utf-8
self.log(f"Saved file {filename}") #终端中输出log日志
# 也可以这样写
# parse()是Scrapy的默认回调方法,该方法用于没有显式分配回调的请求
#quotes_spider.py
from pathlib import Path
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
def start_requests(self):
urls = [
"https://quotes.toscrape/page/1/",
"https://quotes.toscrape/page/2/",
]
def parse(self, response):
page = response.url.split("/")[-2]
filename = f"quotes-{page}.html"
Path(filename).write_bytes(response.body) # 写入文件 默认utf-8
self.log(f"Saved file {filename}") #终端中输出log日志
注意:
name = "quotes":
Scrapy项目中,name必须是唯一;
def start_requests(self):
必须返回一个可迭代的请求(可以返回一个请求列表或编写一个生成器函数),Scrapy将从它开始开始爬行。后续请求将从这些初始请求连续生成。
def parse(self, response):
将被调用以处理为每个请求下载的响应的方法。response参数是TextResponse的一个实例,它保存页面内容,并有更多有用的方法来处理它。parse()方法通常解析响应,将抓取的数据提取为dict,并查找要跟踪的新URL并从中创建新请求(Request)。
五.启动我们的Scrapy项目
进入我们的Scrapy项目tutorial,执行scrapy crawl quotes
(my_venv) PS F:\开发源码\python_demo_06> cd tutorial
(my_venv) PS F:\开发源码\python_demo_06\tutorial>
(my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl quotes
大家运行成功应该可以看到我们爬虫项目运行成功,并且我们tutorial 文件夹下多了2个文件quotes-1.html、quotes-2.html,这时候我们已经成功实现Scrapy框架;
执行原理:
1.Scrapy执行scrapy crawl quotes
时会从spiders中找到name为quotes的爬虫,启动此爬虫;
2.接着执行start_requests 函数中的urls,请求地址,开始执行Scrapy中的内置请求,yield scrapy.Request(url=url, callback=self.parse) 如果我们指定了callback 就走callback对应的函数,如果没有指定则找默认的self.parse函数,如果啥都没有。。。爬虫关闭
3.self.parse接到请求返回后会执行解析。。
请大家思考一个问题 为啥用yield 而不用return?如果用return会出现什么情况?
截止目前我们还没有解析HTML,请稍等,好菜还没上!慢慢看。。
六.Scrapy解析数据,利用Scrapy自带的Xpath和css selectors
我们之前的文章介绍过BeautifulSoup 和Xpath来提取数据,但是呢Scrapy很强大,自带css选择器和Xpath选择器,我们可以直接使用,当然也可以依然使用BeautifulSoup 和Xpath来提取数据,既然我们今天介绍Scrapy,那么我们就用Scrapy自带的来提取数据,好还是上面的代码,只不过改一改quotes_spider.py文件
修改后的代码:
#quotes_spider.py
from pathlib import Path
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
def start_requests(self):
urls = [
"https://quotes.toscrape/page/1/",
"https://quotes.toscrape/page/2/",
]
for url in urls:
yield scrapy.Request(url=url, callback=self.parse)
# def parse(self, response):
# page = response.url.split("/")[-2]
# filename = f"quotes-{page}.html"
# Path(filename).write_bytes(response.body)
# self.log(f"Saved file {filename}")
def parse(self, response):
print("**************提取开始******************")
print(response.css("title"))
print("**************提取结束******************")
'''
输出:
**************提取开始******************
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
**************提取结束******************
2024-11-20 22:57:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape/page/2/> (referer: None)
**************提取开始******************
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
**************提取结束******************
'''
看到了没?输出如下(后面所有的提取 我只写关键部分:):
print(response.css("title"))
#[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
CSS选择器:
-
CSS选择器提取内容为列表
response.css("title::text").getall() #['Quotes to Scrape']
这里注意
::text
如果不加::text
会出现什么情况呢?可以发现节点标签不是我们想要的。。所以要加::text
才能获取我们想要的内容response.css("title").getall() #['<title>Quotes to Scrape</title>']
-
只获取第一个结果
response.css("title::text").get() # 'Quotes to Scrape'
也可以这样写
response.css("title::text")[0].get() #'Quotes to Scrape'
**注意:**这2种写法有什么区别呢?
response.css(“title::text”)[0].get():如果没有结果 索引会引发IndexError
response.css(“title::text”).get():如果没有结果,返回None
-
CSS选择器+正则表达式
response.css("title::text").re(r"Quotes.*") #['Quotes to Scrape'] response.css("title::text").re(r"Q\w+") #['Quotes'] response.css("title::text").re(r"(\w+) to (\w+)") #['Quotes', 'Scrape']
-
直接使用CSS选择器
response.css("div.quote") ''' 输出: [<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, ...] '''
Xpath选择器:
-
提取内容
response.xpath("//title") #[<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
-
提取文字内容
response.xpath("//title/text()").get() #'Quotes to Scrape'
-
提取标签包含指定文字的标签
next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get() next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get() next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
-
提取指定标签及其子标签的全部内容
textContent = response.xpath('//div[@id="content"]//text()').getall()
使用插件SelectorGadget 帮我们快速获取css选择器和Xpath选择器:
我已经给大家准备好最新版下载地址:https://download.csdn/download/Lookontime/90025172
安装方式:
- 下载后解压
-
在谷歌浏览器中输入chrome://extensions/
在我们真实项目中,这样构造CSS选择器和Xpath选择器,效率还有有点慢!有没有更好的办法,还真有,但不是很完美,就是使用SelectorGadget 插件!可以帮我们快速构建Xpath,我们只要稍做修改即可!
注意,我在阅读官方的时候,说CSS选择器在Scrapy引擎下实际是被转换为Xpath,而且官方建议使用Xpath,这里我之前写过一篇专门介绍Xpath的文章,有兴趣的可以去看我之前的文章,有前端基础的小伙伴看这个应该超级简单。。。
七.使用scrapy shell 'https://quotes.toscrape’来验证我们的解析:
'https://quotes.toscrape’的页面结构如下:
<div class="quote">
<span class="text">“The world as we have created it is a process of our
thinking. It cannot be changed without changing our thinking.”</span>
<span>
by <small class="author">Albert Einstein</small>
<a href="/author/Albert-Einstein">(about)</a>
</span>
<div class="tags">
Tags:
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
<a class="tag" href="/tag/thinking/page/1/">thinking</a>
<a class="tag" href="/tag/world/page/1/">world</a>
</div>
</div>
让我们执行这个命令
scrapy shell 'https://quotes.toscrape'
接着会进行等待页面,我们执行我们的选择器
>>> response.css("div.quote")
[<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
...]
>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
>>> for quote in response.css("div.quote"):
... text = quote.css("span.text::text").get()
... author = quote.css("small.author::text").get()
... tags = quote.css("div.tags a.tag::text").getall()
... print(dict(text=text, author=author, tags=tags))
...
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
...
如何退出 scrapy shell?
quit()
八.如何实现解析下一页直到不满足条件时停止
举个例子,当我们想安安静静看本小说又不想被满屏的广告打扰,这个时候我们就有一个爬虫需求,爬取网页中的内容,让后找到下一页,继续爬取,继续找寻下一页,直到不满足条件时停止,这个时候我们怎么实现?
有人说,我们把所有的页面url全部放到def start_requests(self)函数 urls中,这样不就可以了?可以是可以,你估计得累死。。因为第一url不可能是规律的递增变化。还有就是爬取的顺序我们需要控制或者才有其他办法。
那么有没有办法我们只给起始页面,页面解析下一页的url让后返回给parse来进行循环解析呢?当然有
举例我们的下一页如下:
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>
Build our spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        # Yield one dict per quote on the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Look for the "Next" link; get() returns None on the last page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            # The href is relative, so turn it into an absolute URL first
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
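The response.urljoin call in the spider above resolves a relative href against the URL of the current page. The standard library's urllib.parse.urljoin behaves in much the same way, so it can be tried in isolation. This is only a minimal sketch; the example.com and example.org URLs and the resolve helper are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed (illustration only).
page_url = "https://example.com/page/1/"


def resolve(href: str) -> str:
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


print(resolve("/page/2/"))               # root-relative link
print(resolve("../2/"))                  # link relative to the current path
print(resolve("https://example.org/x"))  # already absolute, left unchanged
```

Whatever form the href takes, the result is a full URL that can be passed to scrapy.Request.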
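As we just saw, we extract the next-page URL, build an absolute request URL from it, and send the response back to self.parse, repeating until there is no next page. Great! But Scrapy offers simpler ways to do this, so read on.
Method 1: response.follow (unlike scrapy.Request, it accepts relative URLs directly, so the urljoin call is no longer needed)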
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
Method 2: loop over the href selectors
for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)
Method 3: pass the a elements directly (response.follow uses their href attribute automatically)
for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)
Method 4: use response.follow_all
anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)
Method 5: pass the CSS selector straight to follow_all
yield from response.follow_all(css="ul.pager a", callback=self.parse)
A complete example:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"
    start_urls = ["https://quotes.toscrape/"]

    def parse(self, response):
        # Follow every author link on the page
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        # Follow the pagination link and parse the next page the same way
        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # Return the first match stripped of whitespace, or "" if nothing matches
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }
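Several of the approaches above follow every link they find (all links in ul.pager, or every author link on every page), so the same URL is requested more than once. The crawl still terminates because Scrapy, by default, filters out requests for URLs it has already seen. The sketch below imitates that behaviour in plain Python; FAKE_SITE and the crawl function are hypothetical stand-ins, not Scrapy APIs.

```python
from collections import deque

# Hypothetical site: each URL maps to the links found on that page.
FAKE_SITE = {
    "/page/1/": ["/page/2/"],               # only "Next"
    "/page/2/": ["/page/1/", "/page/3/"],   # "Previous" and "Next"
    "/page/3/": ["/page/2/"],               # last page: only "Previous"
}


def crawl(start: str) -> list[str]:
    """Visit pages breadth-first, skipping URLs that were already requested."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in FAKE_SITE.get(url, []):
            if link not in seen:  # duplicate filter
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/page/1/"))
```

Each page is visited exactly once, and the loop stops when no unseen link remains, which is the "stop condition" this section is about.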
9. Passing arguments to a spider
To pass arguments to a spider at run time, use the -a option on the command line.
Command:
scrapy crawl quotes -O quotes-humor.json -a tag=humor
Spider code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://quotes.toscrape/"
        # The value passed with -a tag=... is available as self.tag
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
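Command breakdown:
- scrapy crawl quotes
  Runs the spider named quotes. quotes is the spider name defined in your Scrapy project; the corresponding code file is usually found in the spiders folder.
- -O quotes-humor.json
  -O sets the output file, and quotes-humor.json is the output file name. Scrapy saves the scraped data as a JSON file, overwriting any existing file with the same name.
- -a tag=humor
  The -a option passes the spider an argument named tag with the value humor. In the spider code it can be read as self.tag. Such arguments are typically used to give the spider a filter condition, for example to scrape only content tagged "humor".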
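How a -a argument ends up as self.tag can be pictured with a small stand-in class. Roughly speaking, the base Spider initializer copies the keyword arguments it receives onto the instance, and command-line values always arrive as strings. MiniSpider and the example.com URL below are hypothetical and only mimic that behaviour.

```python
class MiniSpider:
    """Hypothetical stand-in for scrapy.Spider, for illustration only."""

    def __init__(self, **kwargs):
        # Each -a name=value pair arrives as a keyword argument (a string)
        # and is stored as an instance attribute.
        self.__dict__.update(kwargs)

    def start_url(self) -> str:
        url = "https://example.com/"
        tag = getattr(self, "tag", None)  # None when -a tag=... was not given
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(MiniSpider().start_url())
print(MiniSpider(tag="humor").start_url())
```

Using getattr with a default keeps the spider runnable when the argument is omitted.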
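10. Scrapy data containers: Item and Field
By now Scrapy probably half makes sense, but one question remains: I have written a spider and parsed the data, so how do I actually use that data? This brings in two concepts, Scrapy data containers and Scrapy pipelines. One step at a time: let's start with data containers.
Scrapy provides two classes, Item and Field. Item classes are defined in items.py, which must import scrapy first. items.py looks like this: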
# items.py
import scrapy

# class TutorialItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass

# class DmozItem(scrapy.Item):
#     title = scrapy.Field()
#     link = scrapy.Field()
#     desc = scrapy.Field()


class QuoteItem(scrapy.Item):
    imgBase64 = scrapy.Field()
    file_name = scrapy.Field()


class VideoItem(scrapy.Item):
    video_url = scrapy.Field()
    file_name = scrapy.Field()


class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()
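**Item base class:** every custom data class must inherit from Item, e.g. class TextItem(scrapy.Item)
**Field class:** describes the fields a custom data class contains, e.g. title and Content
Create an Item object before using it: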
# Inside a spider callback; the module needs "import re" and the TextItem import
item = TextItem()
item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
textContent = response.xpath('//div[@id="content"]//text()').getall()
# Remove '\r' and '\xa0'
cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
item['Content'] = cleaned_list
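Reading field values:
print(item['title'])
print(item['Content'])
Listing the names of the fields that have been set (item.fields lists every declared field):
item.keys()
Copying an Item: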
item2 = item.copy()
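The cleaning step used when filling item['Content'] can be tried on its own. The sketch below applies the same regular expression and then, as an extra step that is not in the original snippet, drops entries that end up blank. raw_nodes is made-up sample data, and clean_nodes is a hypothetical helper name.

```python
import re

# Hypothetical text nodes, as //text() might return them from a chapter page.
raw_nodes = [
    "\r\n\xa0\xa0\xa0\xa0First paragraph.",
    "\r\n",
    "\xa0\xa0Second paragraph.\r",
]


def clean_nodes(nodes):
    """Remove carriage returns and non-breaking spaces, then drop blank entries."""
    cleaned = [re.sub(r"[\r\xa0]+", "", text) for text in nodes]
    # Extra step (an assumption, not in the original): skip whitespace-only nodes
    return [text.strip() for text in cleaned if text.strip()]


print(clean_nodes(raw_nodes))
```

Without the final filter, whitespace-only nodes would later be written to the output file as empty lines.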
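11. Scrapy item pipelines
So far we have data containers; how do we use them? That is the job of Scrapy item pipelines, and this is the key part: a pipeline automatically receives the items a spider yields and can do different things with them, such as storing them in a database, downloading images or files, or saving the data as JSON, Excel, or txt.
To use a pipeline you must first register it. Once the crawl starts, Scrapy automatically passes each item through every registered pipeline, so different pipelines can handle different content.
Registering a pipeline: add it to settings.py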
# settings.py
ITEM_PIPELINES = {
    # 'tutorial.pipelines.TutorialPipeline': 300,
    # 'tutorial.save_Image_pipeline.SaveImagePipeline': 300,
    # 'tutorial.video_download_pipeline.VideoDownloadPipeline': 500,
    'tutorial.text_download_pipeline.TextDownloadPipeline': 300,
}
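'tutorial.text_download_pipeline.TextDownloadPipeline': the import path of the pipeline class
300: the priority; the lower the number, the earlier the pipeline runs (values are conventionally kept in the 0-1000 range)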
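The effect of the priority numbers can be sketched by sorting the setting by value: items pass through the pipeline with the lowest number first. The three entries below reuse class paths from this article, but the numbers assigned to them here are assumptions chosen only to show the ordering, and run_order is a hypothetical helper.

```python
# Priorities here are illustrative assumptions, not the values used elsewhere.
ITEM_PIPELINES = {
    "tutorial.save_Image_pipeline.SaveImagePipeline": 300,
    "tutorial.video_download_pipeline.VideoDownloadPipeline": 500,
    "tutorial.text_download_pipeline.TextDownloadPipeline": 100,
}


def run_order(pipelines: dict) -> list:
    """Return pipeline class names in the order items would pass through them."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path.rsplit(".", 1)[-1] for path, _priority in ordered]


print(run_order(ITEM_PIPELINES))
```

Because each pipeline returns the item for the next one, a cleaning pipeline would normally get a lower number than a storage pipeline.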
A complete pipeline example:
# text_download_pipeline.py
from itemadapter import ItemAdapter
import os


class TextDownloadPipeline():
    def __init__(self):
        # Target folder for the downloaded text
        self.target_folder = "DownLoadText"
        # Create the target folder if it does not exist
        if not os.path.exists(self.target_folder):
            os.makedirs(self.target_folder)

    def open_spider(self, spider):
        # Called when the spider starts
        pass

    def close_spider(self, spider):
        # Called when the spider finishes
        pass

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', '未知标题')  # default means "unknown title"
        content = adapter.get('Content', [])
        text_name = '下载的文档内容.txt'  # output file name
        file_path = os.path.join(self.target_folder, text_name)
        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"标题: {title}\n")  # "标题" means "title"
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # blank line between items
        return item
Note: def process_item(self, item, spider): is the one method every pipeline must implement.
This pipeline takes each item and writes its contents to a txt file, line by line.
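The write logic in process_item can be tried without running a crawl. The sketch below is a simplified stand-in: it takes a plain dict instead of an ItemAdapter, writes into a temporary folder, and uses English labels. The append_item function and the downloaded.txt file name are assumptions made for illustration.

```python
import os
import tempfile


def append_item(folder: str, item: dict, file_name: str = "downloaded.txt") -> str:
    """Append one item to a text file in `folder` and return the file path."""
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, file_name)
    with open(file_path, "a", encoding="utf-8") as file:
        file.write(f"Title: {item.get('title', 'Unknown title')}\n")
        for line in item.get("Content", []):
            file.write(f"{line}\n")
        file.write("\n")  # blank line between items
    return file_path


with tempfile.TemporaryDirectory() as tmp:
    path = append_item(tmp, {"title": "Chapter 1", "Content": ["line a", "line b"]})
    append_item(tmp, {"Content": ["orphan line"]})  # no title: falls back to default
    with open(path, encoding="utf-8") as fh:
        print(fh.read())
```

Opening the file in append mode is what lets successive items, one per chapter, accumulate in a single file.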
Besides custom pipelines, Scrapy ships two special pipelines for handling files and images: FilesPipeline and ImagesPipeline. Let's get to know them:
| | FilesPipeline | ImagesPipeline |
|---|---|---|
| Import path | scrapy.pipelines.files.FilesPipeline | scrapy.pipelines.images.ImagesPipeline |
| Item fields | file_urls, files | image_urls, images |
| Storage path setting | FILES_STORE | IMAGES_STORE |
FilesPipeline
- Configure settings.py
    import os

    # Register the pipeline
    ITEM_PIPELINES = {
        'scrapy.pipelines.files.FilesPipeline': 300,
    }
    # Directory where downloaded files are stored
    FILES_STORE = "F:\\DownloadFiels"
    if not os.path.exists(FILES_STORE):
        os.makedirs(FILES_STORE)
- Define the data container in items.py
    class FileItem(scrapy.Item):
        file_urls = scrapy.Field()  # list of URLs to download
        files = scrapy.Field()      # filled in by the pipeline after download
- Write the spider
    # Create a new spider (or create the file directly under spiders/)
    scrapy genspider dload_files model

    # dload_files.py
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from tutorial.items import FileItem, TextItem
    import re


    class XbiquguSpider(CrawlSpider):
        name = 'dload_files'
        allowed_domains = ['www.model']

        def start_requests(self):
            urls = [
                "https://www.model/html/265/265564/229802.html",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # Walk the nested list and request every linked detail page
            for line in response.xpath('//div[@class="bookname"]/ul/li'):
                for example in line.xpath('.//ul/li'):
                    url = example.xpath('.//a//@href').extract_first()
                    url = response.urljoin(url)
                    yield scrapy.Request(url, callback=self.parse_files)

        def parse_files(self, response):
            # Hand the file URL to FilesPipeline through the file_urls field
            href = response.xpath('//a/@href').extract_first()
            url = response.urljoin(href)
            fileItem = FileItem()
            fileItem['file_urls'] = [url]
            return fileItem
- Run the spider
    (my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl dload_files -o myfiles.json
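Where do the downloaded files end up? To my understanding, FilesPipeline by default stores each file under FILES_STORE in a full/ subfolder, named after the SHA-1 hash of its URL plus the original extension. The sketch below is a simplified approximation of that naming scheme, not Scrapy's actual code (which also handles unknown extensions); default_file_path and the example.com URL are assumptions.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlparse


def default_file_path(url: str) -> str:
    """Approximate the relative path used for a downloaded file (simplified)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()  # 40 hex characters
    ext = PurePosixPath(urlparse(url).path).suffix          # e.g. ".pdf"
    return f"full/{digest}{ext}"


# Hypothetical file URL, for illustration only.
print(default_file_path("https://example.com/files/report.pdf"))
```

Hash-based names mean the same URL always maps to the same file. The files field of the item records the actual path, so there is normally no need to compute it yourself.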
ImagesPipeline
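- Configure settings.py (ImagesPipeline requires the Pillow library to be installed)
    import os

    # Register the pipeline
    ITEM_PIPELINES = {
        'scrapy.pipelines.images.ImagesPipeline': 300,
    }
    # Directory where downloaded images are stored
    IMAGES_STORE = "F:\\ImageFiels"
    if not os.path.exists(IMAGES_STORE):
        os.makedirs(IMAGES_STORE)
    # Thumbnails to generate for each downloaded image: name -> (width, height)
    IMAGES_THUMBS = {
        'small': (50, 50),
        # 'big': (270, 270),
    }
    # Skip images smaller than these dimensions
    # IMAGES_MIN_WIDTH = 50   # minimum width
    # IMAGES_MIN_HEIGHT = 50  # minimum height
- Define the data container in items.py
    import scrapy


    class MyImageItem(scrapy.Item):
        image_urls = scrapy.Field()  # list of image URLs
        images = scrapy.Field()      # information about the downloaded images
- Write the spider
    # Create a new spider (or create the file directly under spiders/)
    scrapy genspider dload_files model

    # image_files.py
    import scrapy
    from tutorial.items import MyImageItem  # the original imported from myproject.items; use your own project name


    class ImageSpider(scrapy.Spider):
        name = 'imagespider'
        start_urls = ['https://example']

        def parse(self, response):
            item = MyImageItem()
            # Extract image URLs and make them absolute so the pipeline can fetch them
            item['image_urls'] = [
                response.urljoin(src)
                for src in response.css('img::attr(src)').extract()
            ]
            yield item
- Run the spider
    (my_venv) PS F:\开发源码\python_demo_06\tutorial> scrapy crawl imagespider -o myimages.json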
12. A complete example: downloading a novel with Scrapy
Create the spider xbiqugu.py in the spiders folder:
# xbiqugu.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import TextItem
import re


class XbiquguSpider(CrawlSpider):
    name = 'xbiqugu'
    allowed_domains = ['www.477zw3']

    def __init__(self, *args, **kwargs):
        # Call the base initializer so Scrapy can finish setting up the spider
        super().__init__(*args, **kwargs)
        self.count = 0  # number of chapters crawled so far

    def start_requests(self):
        urls = [
            "https://www.477zw3/html/265/265564/229802.html",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        self.count += 1
        print(f"开始爬取-----------{self.count}")  # "start crawling"
        item = TextItem()
        item['title'] = response.xpath('//div[@class="bookname"]/h1/text()').get()
        print(item['title'])
        textContent = response.xpath('//div[@id="content"]//text()').getall()
        # Remove '\r' and '\xa0'
        cleaned_list = [re.sub(r'[\r\xa0]+', '', text) for text in textContent]
        item['Content'] = cleaned_list
        # print(item['Content'])
        yield item

        # Follow the next chapter ("下一章" is the link text for "next chapter")
        # next_page = response.xpath("//div[@class='bottem2']/a[contains(text(), '下一章')]/@href").get()
        # next_page = response.xpath("substring-after(//a[contains(text(), '下一章')]/@href, '/html')").get()
        next_page = response.xpath("//a[contains(text(), '下一章')]/@href").get()
        print("下一页:", next_page)  # prints e.g.: 下一页: /html/265/265564/229803.html
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse_item)
Create the Scrapy data container:
# items.py
import scrapy


class TextItem(scrapy.Item):
    title = scrapy.Field()
    Content = scrapy.Field()
Create the pipeline text_download_pipeline.py:
# text_download_pipeline.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os


class TextDownloadPipeline():
    def __init__(self):
        # Target folder for the downloaded text
        self.target_folder = "DownLoadText"
        # Create the target folder if it does not exist
        if not os.path.exists(self.target_folder):
            os.makedirs(self.target_folder)

    def open_spider(self, spider):
        # Called when the spider starts
        pass

    def close_spider(self, spider):
        # Called when the spider finishes
        pass

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        title = adapter.get('title', '未知标题')  # default means "unknown title"
        content = adapter.get('Content', [])
        text_name = '测试技术.txt'  # output file name
        file_path = os.path.join(self.target_folder, text_name)
        with open(file_path, 'a', encoding='utf-8') as file:
            file.write(f"标题: {title}\n")  # "标题" means "title"
            for line in content:
                file.write(f"{line}\n")
            file.write("\n")  # blank line between items
        return item
Register the pipeline in settings.py:
# settings.py
import random

BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0  # wait 0 seconds between requests
ITEM_PIPELINES = {
    'tutorial.text_download_pipeline.TextDownloadPipeline': 300,
}
USER_AGENT_LIST = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
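# Evaluated once when the settings load, so one User-Agent is used for the whole run
USER_AGENT = random.choice(USER_AGENT_LIST)
# Logging
LOG_LEVEL = "INFO"
import os  # needed for the log directory check below
from datetime import datetime
LOG_DIR = "log"
if not os.path.exists(LOG_DIR):
    os.makedirs(LOG_DIR)
today = datetime.now()
LOG_FILE = f"{LOG_DIR}/scrapy_{today.year}_{today.month}_{today.day}.log"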
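Because random.choice in settings.py runs only once, every request in a crawl shares one User-Agent. One common way to rotate it per request is a small downloader middleware. The sketch below is an assumption-laden illustration rather than part of the original project: the class name RandomUserAgentMiddleware, its location in tutorial/middlewares.py, and the FakeRequest stand-in used for the dry run are all hypothetical.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: choose a User-Agent for every request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Read the USER_AGENT_LIST setting defined in settings.py
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, used only for this dry run."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
middleware = RandomUserAgentMiddleware(["agent-a", "agent-b"])
middleware.process_request(request, spider=None)
print(request.headers["User-Agent"])
```

To try it in the project, the class would be registered in settings.py, for example DOWNLOADER_MIDDLEWARES = {'tutorial.middlewares.RandomUserAgentMiddleware': 400}, assuming the class is placed in tutorial/middlewares.py.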
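Keep in mind that a spider will not keep working forever; it has to be adjusted as the target site changes. Even so, the Scrapy framework really does cut down the logic we have to write for a crawler, and it is very powerful!
13. Summary
I spent several evenings writing up Python Scrapy, and by now the principles and usage should be clear. Let's return to the question from the beginning: what are Scrapy's limitations, and how do we work around them? Much of today's web content is generated dynamically by JavaScript, and Scrapy on its own cannot handle that. The answer is dynamic crawling with Scrapy. Stay tuned: in a follow-up post I will show how to do dynamic crawling with Scrapy!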