Why Graph?
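A knowledge graph (KG) can serve many purposes. We can run graph algorithms and calculate the centrality of any node to understand how important a concept (node) is to the body of work. We can compute communities to group concepts together for better analysis of the text. We can understand the connections between seemingly unrelated concepts.
This is a Python library that can create knowledge graphs out of any text using a given ontology. The library creates graphs that are fairly consistent and have good resilience to faulty responses generated by the LLM.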
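To make the centrality idea concrete, here is a minimal sketch of degree centrality computed over a plain list of concept pairs. The helper name `degree_centrality` and the toy edge list are illustrative assumptions, not part of the library, and degree centrality is only one of several centrality measures one might use.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# Toy concept graph; the names are made up for illustration.
edges = [
    ("Frodo", "Ring"),
    ("Frodo", "Sam"),
    ("Frodo", "Gandalf"),
    ("Gandalf", "Ring"),
]
scores = degree_centrality(edges)
most_central = max(scores, key=scores.get)
```

A node connected to every other node scores 1.0; a graph library would typically offer this and richer measures out of the box.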
# Create a local environment
$ poetry config virtualenvs.create false --local
# Install dependencies.
$ poetry install
1. Define the ontology of your graph
ontology = Ontology(
    # Labels of the entities to be extracted. Each can be a string or an object, like the following.
    labels=[
        {"Person": "Person name without any adjectives. Remember a person may be referenced by their name or using a pronoun"},
        {"Object": "Do not add the definite article 'the' in the object name"},
        {"Event": "Event involving multiple people. Do not include qualifiers or verbs like gives, leaves, works etc."},
        {"Miscellanous": "Any important concept that cannot be categorised with any other given label"},
    ],
    # Relationships that are important for your application.
    # These are more like instructions for the LLM to nudge it to focus on specific relationships.
    # There is no guarantee that only these relationships will be extracted, but some models do a good job overall at sticking to these relations.
    relationships=[
        "Relation between any pair of Entities",
    ],
)
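I have tuned the prompts to yield results that are consistent with the given ontology. I think it does a good job of this. However, it is still not 100% accurate. The accuracy depends on the model we choose to generate the graph, the application, the ontology, and the quality of the data.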
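Because extraction is not fully accurate, it can help to check which extracted edges use labels outside the ontology. The following is a minimal sketch of such a check, assuming labels are given as plain strings or single-key dicts and that each edge is available as a plain dict with `node_1` and `node_2` entries carrying a `label`. The helpers `allowed_labels` and `split_by_ontology` are hypothetical and not part of the library.

```python
def allowed_labels(labels):
    """Collect label names from a mix of plain strings and {name: description} dicts."""
    names = set()
    for label in labels:
        if isinstance(label, dict):
            names.update(label.keys())
        else:
            names.add(label)
    return names

def split_by_ontology(edges, labels):
    """Partition edges into (conforming, off_ontology) by their node labels."""
    allowed = allowed_labels(labels)
    conforming, off_ontology = [], []
    for edge in edges:
        ok = (edge["node_1"]["label"] in allowed
              and edge["node_2"]["label"] in allowed)
        (conforming if ok else off_ontology).append(edge)
    return conforming, off_ontology

# Illustrative data only; these edges were not produced by any model.
labels = [{"Person": "Person name without any adjectives"}, "Object"]
edges = [
    {"node_1": {"label": "Person", "name": "Alice"},
     "node_2": {"label": "Object", "name": "Book"},
     "relationship": "Alice reads the book"},
    {"node_1": {"label": "Person", "name": "Alice"},
     "node_2": {"label": "Animal", "name": "Cat"},
     "relationship": "Alice owns a cat"},
]
conforming, off_ontology = split_by_ontology(edges, labels)
```

Whether to drop, relabel, or simply log the off-ontology edges depends on the application.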
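2. Split the text into chunks.
We can use as large a corpus of text as we want to create large knowledge graphs. However, LLMs currently have a finite context window, so we need to chunk the text appropriately and create the graph one chunk at a time. The chunk size we should use depends on the model's context window. The prompts used in this project consume around 500 tokens. The rest of the context can be divided between the input text and the output graph. In my experience, chunks of 800 to 1200 tokens work well.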
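As a rough illustration of the budgeting and chunking described above, the sketch below splits text on whitespace with a small overlap between chunks. Word counts are only a crude proxy for tokens, so the limit is set conservatively; a real tokenizer would be more accurate. The function `chunk_by_words`, the overlap, and the 8192-token window are assumptions made for this example.

```python
def chunk_by_words(text, max_words=600, overlap=50):
    """Split text into chunks of at most max_words words, overlapping by `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Budget: an assumed 8192-token window minus roughly 500 prompt tokens
# leaves the remainder to be shared by the input chunk and the output graph.
context_window = 8192
prompt_tokens = 500
remaining_tokens = context_window - prompt_tokens

sample = " ".join(f"w{i}" for i in range(2000))
chunks = chunk_by_words(sample, max_words=600, overlap=50)
```

The overlap is a design choice: it reduces the chance that a relationship spanning a chunk boundary is lost, at the cost of some repeated work.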
3. Convert these chunks into Documents.
Document is a pydantic model with the following schema:
## Pydantic document model
class Document(BaseModel):
    text: str
    metadata: dict
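Turning chunks into Documents can be sketched as follows. To keep the snippet runnable without the library, it uses a stand-in dataclass that mirrors the schema above; with the library installed, the real `Document` would be imported from `graph_maker` instead. The helper `chunks_to_documents` and the metadata keys (`source`, `chunk_index`, `generated_at`) are assumptions chosen for illustration.

```python
import datetime
from dataclasses import dataclass, field

@dataclass
class Document:
    """Stand-in mirroring the schema above; not the library's pydantic class."""
    text: str
    metadata: dict = field(default_factory=dict)

def chunks_to_documents(chunks, source="example"):
    """Wrap each text chunk in a Document with some bookkeeping metadata."""
    generated_at = datetime.datetime.now().isoformat()
    return [
        Document(
            text=chunk,
            metadata={
                "source": source,        # hypothetical key
                "chunk_index": index,    # hypothetical key
                "generated_at": generated_at,
            },
        )
        for index, chunk in enumerate(chunks)
    ]

docs = chunks_to_documents(["First chunk of text.", "Second chunk of text."])
```

Metadata attached here is a convenient place to record where each chunk came from, which helps when tracing an extracted relationship back to its source text.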
4. Run the Graph Maker.
from graph_maker import GraphMaker, Ontology, GroqClient
from graph_maker import Document

## Select a groq supported model
## model = "mixtral-8x7b-32768"
model = "llama3-8b-8192"
## model = "llama3-70b-8192"
## model = "gemma-7b-it"  ## This is probably the fastest of all models, though a tad inaccurate.

llm = GroqClient(model=model, temperature=0.1, top_p=0.5)
graph_maker = GraphMaker(ontology=ontology, llm_client=llm, verbose=False)

## create a graph out of a list of Documents.
graph = graph_maker.from_documents(
    list(docs),  ## docs: the Documents built in step 3 (variable name assumed here)
    delay_s_between=10,  ## delay between documents, because otherwise the groq api maxes out pretty fast.
)
## result -> a list of Edges.

print("Total number of Edges", len(graph))
## 1503
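The idea behind a delay between documents can be sketched generically: call the extraction step once per document and pause between consecutive calls to stay under a rate limit. This is not the library's implementation; `process_with_delay` and `fake_extract` are hypothetical names, and the sleep function is injected so the example runs instantly.

```python
import time

def process_with_delay(items, process, delay_s_between=0.0, sleep=time.sleep):
    """Call process(item) for each item, pausing between consecutive calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0 and delay_s_between > 0:
            sleep(delay_s_between)  # no pause before the first item
        results.extend(process(item))
    return results

def fake_extract(text):
    """Stands in for one LLM call per document; returns a list of 'edges'."""
    return [f"edge from {text}"]

pauses = []  # records requested pauses instead of actually sleeping
all_edges = process_with_delay(
    ["doc A", "doc B", "doc C"],
    fake_extract,
    delay_s_between=10,
    sleep=pauses.append,
)
```

With three documents there are two pauses, so total waiting time grows linearly with the number of chunks; that is worth keeping in mind when choosing the chunk size.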
The output is the final graph as a list of edges, where every edge is a pydantic model like the following.
from typing import Union
from pydantic import BaseModel

class Node(BaseModel):
    label: str
    name: str

class Edge(BaseModel):
    node_1: Node
    node_2: Node
    relationship: str
    metadata: dict = {}
    order: Union[int, None] = None
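Once the edges are available, they can be grouped into connected clusters of concepts. The sketch below assumes each edge has been converted to a plain dict with the same fields as the schema above, and uses connected components found by breadth-first search. This is a much cruder grouping than a proper community-detection algorithm and is shown only to illustrate working with the edge list; `node_key` and `connected_components` are hypothetical helpers.

```python
from collections import defaultdict, deque

def node_key(node):
    """Identify a node by its (label, name) pair."""
    return (node["label"], node["name"])

def connected_components(edges):
    """Group nodes into sets that are reachable from one another."""
    adjacency = defaultdict(set)
    for edge in edges:
        a, b = node_key(edge["node_1"]), node_key(edge["node_2"])
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            current = queue.popleft()
            component.add(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

# Illustrative edges only; not output from any model.
edges = [
    {"node_1": {"label": "Person", "name": "Alice"},
     "node_2": {"label": "Object", "name": "Book"},
     "relationship": "Alice reads the book"},
    {"node_1": {"label": "Object", "name": "Book"},
     "node_2": {"label": "Object", "name": "Library"},
     "relationship": "Book belongs to the library"},
    {"node_1": {"label": "Person", "name": "Bob"},
     "node_2": {"label": "Event", "name": "Party"},
     "relationship": "Bob attends the party"},
]
components = connected_components(edges)
```

Keying nodes by both label and name keeps two entities with the same name but different labels apart.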