PyMuPDF4LLM：多模态PDF 解析神器！

AGI Hunt · 公众号 · · 2024-10-07 00:00

正文

从现在起，PDF 不再是你 AI 应用的拦路虎！

PyMuPDF4LLM，这个新鲜出炉的开源库，正改变着 PDF 处理的游戏规则。它不仅能轻松提取文本和图像，还能为 LLM 和 RAG 应用提供结构化的数据，让你的 AI 项目如虎添翼。

文本提取：从混沌到有序

PyMuPDF4LLM 的 to_markdown() 函数就像一把锋利的手术刀，能够精准地从 PDF 中剖析出文本内容。

#### Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.\n\n## Attention Is All You Need\n\n\n**Ashish Vaswani[∗]**\nGoogle Brain\n```\[email protected]\n\n```\n**Llion Jones[∗]**\nGoogle Research\n```\n [email protected]\n\n```\n\n**Noam Shazeer[∗]**\nGoogle Brain\n```\[email protected]\n\n```\n\n

它不仅仅是简单地复制粘贴，而是将文本转换成结构良好的 Markdown 格式。这意味着你可以轻松保留原文的标题、段落和列表结构，为后续的 NLP 任务打下坚实基础。

元数据：PDF 的"隐藏宝藏"

但 PyMuPDF4LLM 的魔力远不止于此。它还能挖掘出 PDF 中的各种元数据，如文档创建日期、文件路径、图像坐标，甚至目录结构。

{'metadata': {'format': 'PDF 1.5',   'title': '',   'author': '',   'subject': '',   'keywords': '',   'creator': 'LaTeX with hyperref',   'producer': 'pdfTeX-1.40.25',   'creationDate': 'D:20240410211143Z',   'modDate': 'D:20240410211143Z',   'trapped': '',   'encryption': None,   'file_path': '/content/document.pdf',   'page_count': 15,   'page': 3},  'toc_items': [[2, 'Encoder and Decoder Stacks', 3], [2, 'Attention', 3]],  'tables': [],  'images': [{'number': 0,    'bbox': (196.5590057373047,     72.00198364257812,     415.43902587890625,     394.4179992675781),    'transform': (218.8800048828125,     0.0,     -0.0,     322.416015625,     196.5590057373047,     72.00198364257812),    'width': 1520,    'height': 2239,    'colorspace': 3,    'cs-name': 'DeviceRGB',    'xres': 96,    'yres': 96,    'bpc': 8,    'size': 264957}],  'graphics': [],  'text': '![](/content/images/document.pdf-2-0.jpg)\n\nFigure 1: The Transformer - model architecture.\n\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n\n**3.1** **Encoder and Decoder Stacks**\n\n          **Encoder:** The encoder is composed of a stack of N = 6 identical'  'words': []}]

这些信息就像给 AI 模型装上了"透视眼"，让它能够更全面地理解文档的结构和内容。

图像处理：不再是可有可无

在多模态 AI 的时代，图像信息变得越来越重要。

PyMuPDF4LLM 不仅能提取图像，还允许你控制图像的大小、分辨率和格式。

更妙的是，它可以将图像直接嵌入到 Markdown 文本中，为你的多模态应用提供完整的素材。

表格识别：结构化数据的福音

对于那些充满表格的 PDF 文档，PyMuPDF4LLM 也有妙招。它能精确定位表格的位置，并提供行数和列数信息。这为后续的表格数据提取和分析铺平了道路。

词语提取：精细到每个字

如果你需要更细粒度的文本分析，PyMuPDF4LLM 的 extract_words 功能堪称神器。

它不仅能提取每个单词，还能给出它们在页面上的精确坐标。

这对于需要保留原文排版信息的应用来说，简直是雪中送炭。

输出的文字序列：

'graphics': [],  'text': 'Table 1: Maximum path lengths,'  'words': [(107.69100189208984,    71.19241333007812,    129.12155151367188,    81.05488586425781,    'Table',    0,    0,    0),   (131.31829833984375,    71.19241333007812,    138.9141845703125,    81.05488586425781,    '1:',    0,    0,    1),   (144.78195190429688,    71.19241333007812,    185.4658203125,    81.05488586425781,    'Maximum',    0,    0,    2),   (187.65281677246094,    71.19241333007812,    204.46530151367188,    81.05488586425781,    'path',    0,    0,    3),

实战应用：多模态 RAG 系统

PyMuPDF4LLM 的强大之处，在于它能无缝集成到现有的 AI 工作流中。

PyMuPDF4LLM：多模态PDF 解析神器！

正文

请到「今天看啥」查看全文