## Project Introduction
Crawl4AI is an open-source Python library that simplifies web crawling and data extraction, making it ready for LLMs and AI applications.
## Key Features ✨
- 🆓 Completely free and open source
- 🤖 LLM-friendly output formats (JSON, HTML, Markdown)
- 🌍 Crawl multiple URLs concurrently
- 🎨 Extract and return all media tags (images, audio, and video)
- 🔗 Extract all external and internal links
- 📚 Extract page metadata
- 🔄 Custom hooks for authentication, setting headers, and modifying the page before crawling
- 🕵️ User-agent customization
- 🖼️ Take screenshots of pages
- 📜 Execute multiple custom JavaScript scripts before crawling
- 📚 Various chunking strategies: topic-based, regex, sentence splitting, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
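To illustrate what the chunking strategies above do conceptually, here is a minimal, self-contained sketch of regex-based and sentence-based splitting in plain Python. This is not Crawl4AI's own implementation (the library ships its own strategy classes); it only shows the idea of each splitting rule:

```python
import re

def regex_chunks(text: str, pattern: str = r"\n\n") -> list[str]:
    # Split text on a regex boundary (here: blank lines), dropping empty pieces
    return [c.strip() for c in re.split(pattern, text) if c.strip()]

def sentence_chunks(text: str) -> list[str]:
    # Naive sentence splitter: break after ., !, or ? followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

doc = "Crawl4AI is fast. It extracts clean Markdown!\n\nIt also supports chunking."
print(regex_chunks(doc))     # two paragraph-level chunks
print(sentence_chunks(doc))  # three sentence-level chunks
```

The library's strategies follow the same pattern: a rule decides where chunk boundaries fall, and downstream extraction (clustering, LLM prompts) operates on the resulting chunks.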
## Quick Start

```python
from crawl4ai import WebCrawler

# Create the crawler and warm it up (loads the required models)
crawler = WebCrawler()
crawler.warmup()

# Crawl a URL and print the extracted Markdown
result = crawler.run(url="https://www.nbcnews.com/business")
print(result.markdown)
```
## Installation

```shell
# Create and activate a virtual environment, then install from GitHub
virtualenv venv
source venv/bin/activate
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
```
Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing.
```python
import time
from crawl4ai.web_crawler import WebCrawler

crawler = WebCrawler()
crawler.warmup()

# Time a single crawl; bypass_cache forces a fresh fetch
start = time.time()
url = "https://www.nbcnews.com/business"
result = crawler.run(url, word_count_threshold=10, bypass_cache=True)
end = time.time()
print(f"Time taken: {end - start}")
```
Let's look at the timing reported by the snippet above:
```
[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
Time taken: 1.439958095550537
```
Fetching the page took 1.3623 seconds, while extracting the content took only 0.0575 seconds. 🚀
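A quick sanity check on the numbers above confirms that the network fetch dominates the end-to-end time:

```python
crawl_s = 1.3623387813568115    # page fetch (from the log above)
extract_s = 0.05750393867492676  # content extraction (from the log above)
total_s = 1.439958095550537      # end-to-end time reported by the script

fetch_share = crawl_s / total_s
print(f"Fetch share of total: {fetch_share:.1%}")  # roughly 95%
```

In other words, almost all of the wall-clock time is spent on the network, which is why pairing the crawler with a fast LLM backend pays off.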
## Extracting Structured Data from Web Pages 📊
Crawl all OpenAI models and their fees from the official pricing page.
```python
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

# Pydantic schema describing the structured data we want to extract
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
        schema=OpenAIModelFee.schema(),