Crawl4AI: A Web Crawler and Data Extraction Tool for AI Applications

GitHubStore · WeChat Official Account · 2024-07-13 09:08


Project Overview


Crawl4AI is an open-source Python library that simplifies web crawling and data extraction, making it well suited to LLMs and AI applications.

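Main Features ✨

🆓 Completely free and open source

🤖 LLM-friendly output formats (JSON, HTML, Markdown)

🌍 Supports crawling multiple URLs simultaneously

🎨 Extracts and returns all media tags (images, audio, and video)

🔗 Extracts all external and internal links

📚 Extracts page metadata

🔄 Custom hooks for authentication, setting headers, and modifying the page before crawling

🕵️ Custom user agent

🖼️ Takes screenshots of the page

📜 Executes multiple custom JavaScript scripts before crawling

📚 Various chunking strategies: topic-based, regex-based, sentence splitting, and more

🧠 Advanced extraction strategies: cosine clustering, LLM, and more

🎯 CSS selector support

📝 Accepts instructions/keywords to refine the extraction process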
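To make the chunking strategies in the list above more concrete, here is a minimal standard-library sketch of two of them: regex-based splitting and naive sentence splitting. The function names `regex_chunks` and `sentence_chunks` are hypothetical and are not Crawl4AI's API; the library's own strategies may behave differently.

```python
from __future__ import annotations

import re


def regex_chunks(text: str, pattern: str = r"\n\s*\n") -> list[str]:
    """Split text wherever `pattern` matches (default: blank lines)."""
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def sentence_chunks(text: str, max_sentences: int = 2) -> list[str]:
    """Group sentences into chunks of at most `max_sentences` each.

    Sentence boundaries are approximated by '.', '!' or '?' followed by
    whitespace, which is an illustrative simplification.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i : i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


if __name__ == "__main__":
    sample = "First paragraph. It has two sentences.\n\nSecond paragraph! Short one?"
    print(regex_chunks(sample))
    print(sentence_chunks(sample, max_sentences=3))
```

Smaller chunks keep each piece within an LLM's context budget, at the cost of losing some cross-sentence context; the chunk size here is an arbitrary choice for the example.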

Quick Start

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://www.nbcnews.com/business")

# Print the extracted content
print(result.markdown)
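The feature list mentions crawling several URLs at once. One rough way to do that on top of the `run(url=...)` call shown above is a thread pool. In the sketch below, `crawl_many` is a hypothetical helper, not part of Crawl4AI, and it assumes that a single crawler instance can safely be shared across threads, which the source does not confirm. `StubCrawler` only stands in for `WebCrawler` so the snippet runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


def crawl_many(crawler, urls, max_workers=4):
    """Call crawler.run(url=...) for each URL in parallel; return {url: markdown}.

    Assumption: `crawler` exposes run(url=...) returning an object with a
    `.markdown` attribute, as in the quick-start snippet.
    """
    urls = list(urls)  # allow any iterable, and iterate it twice safely
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result
        results = pool.map(lambda u: crawler.run(url=u), urls)
        return {url: result.markdown for url, result in zip(urls, results)}


class StubCrawler:
    """Stand-in for WebCrawler so the example needs no network access."""

    def run(self, url):
        return SimpleNamespace(markdown=f"# content of {url}")


if __name__ == "__main__":
    pages = crawl_many(StubCrawler(), ["https://example.com/a", "https://example.com/b"])
    for url, markdown in pages.items():
        print(url, "->", markdown)
```

With the real library, the stub would be replaced by a `WebCrawler()` instance that has already been warmed up; whether that is thread-safe should be checked against the project's documentation first.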

Installation

```bash
virtualenv venv
source venv/bin/activate
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
```
### Speed-First Design 🚀
Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing.
```python
import time
from crawl4ai.web_crawler import WebCrawler

crawler = WebCrawler()
crawler.warmup()

start = time.time()
url = r"https://www.nbcnews.com/business"
result = crawler.run(url, word_count_threshold=10, bypass_cache=True)
end = time.time()
print(f"Time taken: {end - start}")
```

Let's look at the timing for the code snippet above:

[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
Time taken: 1.439958095550537

Fetching the page content took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
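To put those numbers in proportion, the following sketch parses the log lines shown above with a regular expression and computes how much of the total run was spent fetching the page. The log text is copied from the output above; the helper name `parse_timings` and the `"total"` label are made up for this illustration.

```python
from __future__ import annotations

import re

LOG = """\
[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
Time taken: 1.439958095550537
"""

# Optional "[LOG] <emoji> <label>, ..." prefix, then the "time taken" value.
LINE = re.compile(
    r"(?:\[LOG\] \S+ (?P<label>[^,]+), .*?)?[Tt]ime taken: (?P<secs>\d+\.\d+)"
)


def parse_timings(log: str) -> dict[str, float]:
    """Map each log label to its 'time taken' value in seconds.

    The final line has no [LOG] label, so it is stored under "total".
    """
    timings = {}
    for line in log.splitlines():
        match = LINE.match(line)
        if match:
            timings[match.group("label") or "total"] = float(match.group("secs"))
    return timings


if __name__ == "__main__":
    t = parse_timings(LOG)
    fetch_share = t["Crawling done"] / t["total"]
    other = t["total"] - t["Crawling done"] - t["Extraction"]
    print(f"fetch share of total: {fetch_share:.1%}")
    print(f"time outside fetch and extraction: {other:.4f} s")
```

With the figures above, fetching accounts for roughly 95% of the total, so in this run network latency rather than extraction dominates.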

Extracting Structured Data from Web Pages 📊

Crawl all OpenAI models and their fees from the official pricing page.

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),
        # ... (the remaining arguments are cut off in the source)
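Once an extraction like this returns, the records usually need to be parsed and checked before use. The sketch below assumes the extracted content arrives as a JSON string holding a list of objects with the three fields of `OpenAIModelFee`; that output shape is an assumption, since the source does not show it. The sample payload is a made-up placeholder, not real pricing data, and `ModelFee` and `parse_fees` are hypothetical names that use only the standard library.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, fields


@dataclass
class ModelFee:
    """Plain-dataclass mirror of the three OpenAIModelFee fields."""

    model_name: str
    input_fee: str
    output_fee: str


def parse_fees(extracted_content: str) -> list[ModelFee]:
    """Parse a JSON list of records, skipping entries with missing fields."""
    required = {f.name for f in fields(ModelFee)}
    records = []
    for item in json.loads(extracted_content):
        # Keep only dict entries that carry every required field.
        if isinstance(item, dict) and required.issubset(item):
            records.append(ModelFee(**{name: str(item[name]) for name in required}))
    return records


if __name__ == "__main__":
    # Placeholder payload for illustration only; not real pricing data.
    payload = json.dumps([
        {"model_name": "example-model", "input_fee": "<input fee>", "output_fee": "<output fee>"},
        {"model_name": "incomplete-entry"},
    ])
    for fee in parse_fees(payload):
        print(fee)
```

Dropping incomplete entries silently is one possible policy; raising an error instead may be preferable when every model on the page must be captured.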
