Convert any text into a knowledge graph using a given ontology!

GitHubStore · WeChat Official Account · 2024-05-18 11:33

Project overview

A Python tool that converts any text into a knowledge graph that follows a given ontology.

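What is a knowledge graph?

A knowledge graph, also known as a semantic network, represents a network of real-world entities (objects, events, situations, or concepts) and illustrates the relationships between them. This information is usually stored in a graph database and visualized as a graph structure, which is where the term knowledge "graph" comes from.

Source: https://www.ibm.com/topics/knowledge-graph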

Why Graph?

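A KG can serve many purposes. We can run graph algorithms and calculate the centrality of any node to understand how important a concept (node) is to the body of work. We can compute communities to group related concepts together and analyse the text more effectively. We can also uncover connections between seemingly unrelated concepts.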
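As a rough illustration of the two analyses mentioned above, here is a minimal sketch that uses only the Python standard library. The edge list is made up for the example, degree centrality is just one of several centrality measures, and connected components are used as a very simple stand-in for real community detection.

```python
from collections import defaultdict

# Hypothetical toy edges: (node_1, relationship, node_2).
edges = [
    ("Alice", "owns", "Bakery"),
    ("Bob", "works for", "Alice"),
    ("Carol", "advises", "Alice"),
    ("Dave", "supplies", "Bakery"),
    ("Harbour", "is next to", "Market"),
]

# Build an undirected adjacency map from the edge list.
adjacency = defaultdict(set)
for source, _, target in edges:
    adjacency[source].add(target)
    adjacency[target].add(source)


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    others = max(len(adjacency) - 1, 1)
    return {node: len(neighbours) / others for node, neighbours in adjacency.items()}


def connected_components(adjacency):
    """Group nodes that can reach one another (iterative depth-first search)."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


centrality = degree_centrality(adjacency)
most_central = max(centrality, key=centrality.get)
groups = connected_components(adjacency)
print(most_central, centrality[most_central])
print(groups)
```

For larger graphs a dedicated graph library is likely a better fit than hand-rolled functions like these.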

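Best of all, we can implement Graph Retrieval Augmented Generation (GRAG) and chat with our text in a much deeper way by using the graph as the retriever. GRAG is a newer variant of Retrieval Augmented Generation (RAG); in conventional RAG, a vector database serves as the retriever for chatting with our documents.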
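To make the idea of "the graph as a retriever" a little more concrete, here is a deliberately naive sketch. It is not how this project or any particular GRAG system works: the edges are invented, and the hypothetical `retrieve_context` function simply returns every edge whose endpoint name appears in the question, so that the result could be pasted into an LLM prompt as context.

```python
# Hypothetical toy edges: (node_1, relationship, node_2).
edges = [
    ("Alice", "owns", "Bakery"),
    ("Bob", "works for", "Alice"),
    ("Dave", "supplies", "Bakery"),
    ("Harbour", "is next to", "Market"),
]


def retrieve_context(question, edges):
    """Return one sentence per edge whose endpoint is mentioned in the question."""
    lowered = question.lower()
    return [
        f"{node_1} {relationship} {node_2}"
        for node_1, relationship, node_2 in edges
        if node_1.lower() in lowered or node_2.lower() in lowered
    ]


context = retrieve_context("Who is connected to Alice?", edges)
print(context)
```

A real retriever would need entity linking and multi-hop traversal; plain substring matching is used here only to keep the sketch short.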

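This project

This is a Python library that creates a knowledge graph from any text using a given ontology. The graphs it produces are fairly consistent, and the library is fairly resilient to faulty responses generated by the LLM.

Here are the steps to create a knowledge graph.

To set up this project, you need Poetry.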

# Project-local setting: tell Poetry not to create a separate virtualenv
$ poetry config virtualenvs.create false --local
# Install dependencies.
$ poetry install

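1. Define the ontology of the graph

The library understands the following schema for the ontology. Behind the scenes, Ontology is a pydantic model.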

from graph_maker import Ontology

ontology = Ontology(
    # Labels of the entities to be extracted. Each can be a string or an object, like the following.
    labels=[
        {"Person": "Person name without any adjectives, Remember a person may be references by their name or using a pronoun"},
        {"Object": "Do not add the definite article 'the' in the object name"},
        {"Event": "Event event involving multiple people. Do not include qualifiers or verbs like gives, leaves, works etc."},
        "Place",
        "Document",
        "Organisation",
        "Action",
        {"Miscellanous": "Any important concept can not be categorised with any other given label"},
    ],
    # Relationships that are important for your application.
    # These are more like instructions for the LLM to nudge it to focus on specific relationships.
    # There is no guarantee that only these relationships will be extracted,
    # but some models do a good job overall at sticking to these relations.
    relationships=[
        "Relation between any pair of Entities",
    ],
)
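The schema above mixes two kinds of labels: bare strings and single-entry dictionaries that map a label to its description. The sketch below shows one way such a structure could be modelled and flattened. `OntologySketch` and `label_descriptions` are hypothetical names built on standard-library dataclasses for illustration; they are not the library's actual pydantic class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class OntologySketch:
    """Hypothetical stand-in; the real Ontology is a pydantic model."""

    labels: List[Union[str, Dict[str, str]]]
    relationships: List[str] = field(default_factory=list)

    def label_descriptions(self) -> Dict[str, str]:
        """Flatten labels into {name: description}; bare strings get an empty description."""
        flat: Dict[str, str] = {}
        for label in self.labels:
            if isinstance(label, str):
                flat[label] = ""
            else:
                flat.update(label)
        return flat


sketch = OntologySketch(
    labels=[{"Person": "Person name without any adjectives"}, "Place"],
    relationships=["Relation between any pair of Entities"],
)
print(sketch.label_descriptions())
```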

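I have tuned the prompts to yield results that are consistent with the given ontology. I think they do a pretty good job of it. However, the output is still not 100% accurate. The accuracy depends on the model we choose for generating the graph, the application, the ontology, and the quality of the data.

2. Split the text into chunks.

We can use as large a corpus of text as we like to create large knowledge graphs. However, LLMs currently have a finite context window. So we need to chunk the text appropriately and create the graph one chunk at a time. The chunk size to use depends on the model's context window. The prompts used in this project consume around 500 tokens. The rest of the context can be divided between the input text and the output graph. In my experience, chunks of 800 to 1200 tokens work well.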
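A minimal chunking sketch follows. It assumes that whitespace-separated words are an acceptable rough proxy for tokens, which is only an approximation since real tokenizers usually produce more tokens than words. `chunk_text` is a hypothetical helper, not part of the library.

```python
def chunk_text(text, max_tokens=1000):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# A made-up corpus of 2500 words, split into chunks of up to 1000 words.
corpus = "word " * 2500
chunks = chunk_text(corpus, max_tokens=1000)
print([len(chunk.split()) for chunk in chunks])
```

Splitting on sentence or paragraph boundaries would likely give the model more coherent input than this fixed-size split.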

3. Convert these chunks into Documents.

Document is a pydantic model with the following schema

from pydantic import BaseModel

## Pydantic document model
class Document(BaseModel):
    text: str
    metadata: dict

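The metadata we add to a document here is tagged onto every relationship extracted from that document. We can put the context of the relationship into the metadata, for example the page number, chapter, or article name. More often than not, a pair of nodes has several relationships with each other across multiple documents. The metadata helps put these relationships in context.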
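The conversion from chunks to documents can be sketched as below. To keep the example runnable without third-party packages, `DocumentSketch` is a standard-library dataclass standing in for the pydantic Document model, and the metadata keys (`source`, `chunk_index`) are illustrative choices rather than anything the library requires.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentSketch:
    """Stand-in with the same two fields as the Document model: text and metadata."""

    text: str
    metadata: dict = field(default_factory=dict)


# Hypothetical chunks, e.g. the output of a chunking step.
chunks = ["First chunk of text.", "Second chunk of text."]

docs = [
    DocumentSketch(text=chunk, metadata={"source": "example.txt", "chunk_index": index})
    for index, chunk in enumerate(chunks)
]
print(docs[1].metadata)
```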

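4. Run the Graph Maker.

The Graph Maker takes a list of documents directly and iterates over them, creating one subgraph per document. The final output is the complete graph of all the documents.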
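The iteration described above can be pictured with the following sketch. It is not the library's implementation: `build_graph` is a hypothetical function, `fake_extractor` replaces the LLM call with a trivial rule, documents and edges are plain dictionaries, and using the document index as `order` is an assumption made for the example.

```python
def build_graph(docs, extract_edges):
    """Create one subgraph per document and concatenate them into a single edge list."""
    graph = []
    for order, doc in enumerate(docs):
        for edge in extract_edges(doc["text"]):
            # Tag each edge with the metadata of the document it came from.
            graph.append({**edge, "metadata": doc["metadata"], "order": order})
    return graph


def fake_extractor(text):
    """Placeholder for the LLM call: links the first and last word of the text."""
    words = text.split()
    return [{"node_1": words[0], "node_2": words[-1], "relationship": "mentioned with"}]


docs = [
    {"text": "Alice visited Paris", "metadata": {"page": 1}},
    {"text": "Bob founded Acme", "metadata": {"page": 2}},
]
graph = build_graph(docs, fake_extractor)
print(len(graph), graph[0])
```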

Here is a simple example

from graph_maker import GraphMaker, Ontology, GroqClient
from graph_maker import Document

## Select a groq supported model
## model = "mixtral-8x7b-32768"
model = "llama3-8b-8192"
## model = "llama3-70b-8192"
## model = "gemma-7b-it"  ## This is probably the fastest of all models, though a tad inaccurate.

llm = GroqClient(model=model, temperature=0.1, top_p=0.5)
graph_maker = GraphMaker(ontology=ontology, llm_client=llm, verbose=False)

## create a graph out of a list of Documents.
graph = graph_maker.from_documents(
    list(docs),
    delay_s_between=10,  ## delay_s_between because otherwise groq api maxes out pretty fast.
)
## result -> a list of Edges.

print("Total number of Edges", len(graph))
## 1503

The output is the final graph as a list of edges, where every edge is a pydantic model like the following.

from typing import Union

from pydantic import BaseModel


class Node(BaseModel):
    label: str
    name: str


class Edge(BaseModel):
    node_1: Node
    node_2: Node
    relationship: str
    metadata: dict = {}
    order: Union[int, None] = None
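Because the same pair of nodes can appear in several edges, it is often handy to group the edge list by node pair and keep the metadata next to each relationship. The sketch below does that with standard-library dataclasses; `NodeSketch`, `EdgeSketch`, and `group_by_pair` are hypothetical stand-ins for illustration, not the library's pydantic models or API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class NodeSketch:
    """Frozen so that nodes are hashable and can be used as dictionary keys."""

    label: str
    name: str


@dataclass
class EdgeSketch:
    node_1: NodeSketch
    node_2: NodeSketch
    relationship: str
    metadata: dict = field(default_factory=dict)


def group_by_pair(edges):
    """Collect (relationship, metadata) tuples for every (node_1, node_2) pair."""
    grouped = defaultdict(list)
    for edge in edges:
        grouped[(edge.node_1, edge.node_2)].append((edge.relationship, edge.metadata))
    return dict(grouped)


alice = NodeSketch(label="Person", name="Alice")
paris = NodeSketch(label="Place", name="Paris")
edges = [
    EdgeSketch(alice, paris, "visited", {"page": 1}),
    EdgeSketch(alice, paris, "moved to", {"page": 7}),
]
grouped = group_by_pair(edges)
print(grouped[(alice, paris)])
```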