Crawlee-python: a web scraping and browser automation library

GitHubStore · WeChat official account · 2024-07-19 15:45

Project overview

Crawlee covers your crawling and scraping end to end and helps you build reliable scrapers, fast.

🚀 Crawlee for Python is open to early adopters!

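Even with the default configuration, your crawlers will appear almost human-like and fly under the radar of modern bot protections. Crawlee gives you the tools to crawl the web for links, scrape data, and store it persistently in machine-readable formats, without having to worry about the technical details. Thanks to its rich configuration options, you can adjust almost any aspect of Crawlee if the default settings do not meet your project's needs.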

👉 View the full documentation, guides, and examples on the Crawlee project website 👈

We also have a TypeScript implementation of Crawlee, which you can explore and use in your projects. Visit our GitHub repository for more information on Crawlee for JS/TS.

Installation

We recommend visiting the Introduction tutorial in the Crawlee documentation for more information.

Crawlee is available as the crawlee PyPI package.

pip install crawlee

Additional optional dependencies that unlock more features are shipped as package extras.

If you plan to parse HTML and use CSS selectors, install crawlee with the beautifulsoup extra:

pip install 'crawlee[beautifulsoup]'

If you plan to use a (headless) browser, install crawlee with the playwright extra:

pip install 'crawlee[playwright]'

Then, install the Playwright dependencies:

playwright install

You can install multiple extras at once by separating them with commas:

pip install 'crawlee[beautifulsoup,playwright]'

Verify that Crawlee is installed successfully:

python -c 'import crawlee; print(crawlee.__version__)'
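
Because the extras are optional, it can help to confirm which of them are importable before choosing a crawler class. The following is a minimal sketch using only the standard library. The helper name check_crawlee_extras is hypothetical, and the sketch assumes that the beautifulsoup extra provides the bs4 module and that the playwright extra provides the playwright module.

```python
from importlib.util import find_spec


def check_crawlee_extras() -> dict[str, bool]:
    """Report which Crawlee-related top-level modules are importable."""
    # Keys are the names used in this article; values are the assumed
    # top-level import names of the corresponding packages.
    modules = {
        'crawlee': 'crawlee',
        'beautifulsoup': 'bs4',
        'playwright': 'playwright',
    }
    return {name: find_spec(module) is not None for name, module in modules.items()}


if __name__ == '__main__':
    for name, available in check_crawlee_extras().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

This only checks that the Python packages can be found; it does not check whether the browsers fetched by playwright install are present.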

Using the Crawlee CLI

The quickest way to get started with Crawlee is to use the Crawlee CLI and select one of the prepared templates. First, make sure you have pipx installed:

pipx --help

Then, run the CLI and choose from the available templates:

pipx run crawlee create my-crawler

If you already have crawlee installed, you can start the CLI directly by running:

crawlee create my-crawler
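
The choice between the two commands can also be scripted, for example in a project bootstrap script. The sketch below is only an illustration: pick_create_command is a hypothetical helper, and it does nothing more than check which executables are on PATH and return one of the two commands shown in this section.

```python
import shutil
from typing import Callable, Optional


def pick_create_command(
    project_name: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[list[str]]:
    """Return the argv for creating a project, or None if no tool is found."""
    if which('crawlee'):
        # crawlee is already installed, so call its CLI directly.
        return ['crawlee', 'create', project_name]
    if which('pipx'):
        # Otherwise fall back to running the CLI through pipx.
        return ['pipx', 'run', 'crawlee', 'create', project_name]
    return None


if __name__ == '__main__':
    command = pick_create_command('my-crawler')
    print(' '.join(command) if command else 'Install pipx or crawlee first.')
```

The which parameter exists only so that the lookup can be replaced in tests; in normal use the default shutil.which is enough.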

Examples

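Here are some practical examples to help you get started with the different types of crawlers in Crawlee. Each example demonstrates how to set up and run a crawler for a specific use case, whether you need to handle simple HTML pages or interact with JavaScript-heavy sites. A crawler run creates a storage/ directory in your current working directory.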
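
To inspect what a run left behind, the storage/ directory can be read with the standard library alone. This is a rough sketch, not part of Crawlee's API. It assumes the default dataset is written as one JSON file per record under storage/datasets/default/, a layout that may differ between versions, and the helper name load_dataset_items is hypothetical.

```python
import json
from pathlib import Path


def load_dataset_items(dataset_dir: str = 'storage/datasets/default') -> list[dict]:
    """Load every JSON record found in the (assumed) default dataset directory."""
    directory = Path(dataset_dir)
    if not directory.is_dir():
        # No run has happened yet, or the layout differs from the assumption.
        return []
    items = []
    for path in sorted(directory.glob('*.json')):
        # Skip bookkeeping files such as __metadata__.json (assumed naming).
        if path.name.startswith('__'):
            continue
        items.append(json.loads(path.read_text(encoding='utf-8')))
    return items


if __name__ == '__main__':
    for item in load_dataset_items():
        print(item)
```

If the directory does not exist, the function returns an empty list rather than raising, so it is safe to call before the first crawl.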

BeautifulSoupCrawler

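BeautifulSoupCrawler downloads web pages using an HTTP library and provides the HTML-parsed content to the user. It uses HTTPX for HTTP communication and BeautifulSoup for parsing HTML. It is ideal for projects that need to extract data efficiently from HTML content. Because it does not use a browser, this crawler performs very well. However, if you need to execute client-side JavaScript to get your content, it will not be enough and you will need PlaywrightCrawler instead. To use this crawler, make sure you install crawlee with the beautifulsoup extra.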

import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
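
To make the handler's two steps concrete (pull out the title, then collect links to follow), here is a small offline sketch that uses only Python's built-in html.parser. It illustrates the idea and is not how Crawlee or BeautifulSoup work internally. The class name TitleAndLinkParser and the sample HTML are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndLinkParser(HTMLParser):
    """Collect the page title and absolute link URLs from an HTML string."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.title = None
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or '') + data


# Invented sample page, so the sketch runs without network access.
SAMPLE_HTML = (
    '<html><head><title>Crawlee</title></head>'
    '<body><a href="/docs">Docs</a><a href="https://example.com/">Ext</a></body></html>'
)

parser = TitleAndLinkParser('https://crawlee.dev')
parser.feed(SAMPLE_HTML)

# Same record shape as the handler above builds before pushing it to the dataset.
data = {'url': parser.base_url, 'title': parser.title}
print(data)
print(parser.links)
```

In the real crawler, context.soup and context.enqueue_links() take care of both steps, along with request queuing and storage.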

PlaywrightCrawler

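PlaywrightCrawler uses a headless browser to download web pages and provides an API for data extraction. It is built on Playwright, an automation library designed for managing headless browsers. It excels at retrieving pages that rely on client-side JavaScript to generate content, and at tasks that require interacting with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using BeautifulSoupCrawler. To use this crawler, make sure you install crawlee with the playwright extra.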

import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
