CGE: A Code Embedding Model Based on Causal LLMs

深度学习自然语言处理 (Deep Learning for Natural Language Processing) · WeChat official account · 2024-09-21 17:42

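Source: CodeFuse

The CodeFuse-CGE project was recently exhibited at the Bund Conference (外滩大会), where it drew many visitors from technical and product roles. Some of them said that searching for code with natural-language text felt refreshingly new, and that they look forward to the model's future progress.

The following introduces the open-source release of CodeFuse-CGE. If you are interested, please visit our project homepage https://github.com/codefuse-ai/CodeFuse-CGE and give us a star to support the project.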

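Introduction

Code embedding is a technique that converts code snippets into vector representations. These representations let machine learning models understand and process code more effectively, and they play an important role in automated program analysis, code search, code completion, and automated testing. Large language models (LLMs), pretrained on large volumes of language data, acquire the ability to represent fine-grained semantics. Recently, LLMs have also performed very well on tasks such as code generation and code completion.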
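To make the idea of searching code through vectors concrete, here is a minimal sketch of embedding-based retrieval. It assumes the embeddings already exist; the three-dimensional vectors and snippet names are invented for illustration and are not produced by any CGE model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, code_vecs):
    """Return snippet names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in code_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy vectors, invented for illustration only.
code_vecs = {
    "write_bool": [0.9, 0.1, 0.0],
    "sort_list": [0.0, 0.2, 0.9],
    "open_file": [0.3, 0.9, 0.1],
}
# Hypothetical embedding of a query such as "write a boolean to a stream".
query_vec = [1.0, 0.2, 0.0]

ranking = search(query_vec, code_vecs)
print(ranking)
```

In a real system the query and the code snippets would be embedded by the same model, so that both live in one vector space and similarity scores are comparable.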

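Current code embedding models are mainly based on an encoder architecture, such as CodeBERT and UniXcoder, or on an encoder-decoder architecture, such as CodeT5 and CodeT5+. Constrained by their architecture and model size, however, they struggle to obtain richer semantic representations.

We use CodeQwen1.5-7B-Chat and Phi-3.5-mini-instruct as base models and extract the embedding of the input sequence through a cross-attention module, projecting text and code representations into the same space. Our method draws out the base models' strong semantic representation ability for both code and text. Experiments show that it surpasses the previous SOTA on two NL2Code benchmarks, CSN and AdvTest. We are open-sourcing models in two sizes: CGE-Large and CGE-Small.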
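The article does not spell out the internals of the cross-attention module, so the following is only a sketch of the general idea of attention-based pooling: a single query vector (which would be a learned parameter in practice) scores every token's hidden state, and the embedding is the softmax-weighted sum of those states. The single query, the absence of projection matrices, and all numeric values are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(query, hidden_states):
    """Pool per-token hidden states into one vector with a single query.

    query: list of d floats (assumed to be learned during training).
    hidden_states: list of per-token vectors, each with d floats.
    Returns the pooled vector and the attention weights.
    """
    d = len(query)
    # Scaled dot-product score between the query and every token.
    scores = [sum(q * h for q, h in zip(query, token)) / math.sqrt(d)
              for token in hidden_states]
    weights = softmax(scores)
    # Weighted sum of the token vectors, one output dimension at a time.
    pooled = [sum(w * token[i] for w, token in zip(weights, hidden_states))
              for i in range(d)]
    return pooled, weights


# Toy hidden states for a three-token sequence (d = 2).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

embedding, weights = cross_attention_pool(query, hidden_states)
print(embedding, weights)
```

Unlike taking only the last token's hidden state, this kind of pooling lets every position contribute to the final vector, with the contribution decided by the attention weights.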

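TL;DR

CGE stands for Code General Embedding. We propose an LLM-based approach to obtaining embeddings: LoRA fine-tuning draws on the LLM's semantic capabilities and elicits its semantic representation ability, reaching SOTA performance on two NL2Code benchmarks.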
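For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update it relies on: the frozen weight W is combined with a trainable product B·A scaled by alpha / r. The matrices and the alpha value are toy numbers chosen for illustration; the article does not state the rank, scaling, or target layers used for CGE.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (rows of A)."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Frozen 2x3 base weight (toy values).
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
# Rank-1 adapter: A has shape r x d_in, B has shape d_out x r.
a = [[1.0, 2.0, 3.0]]
b_zero = [[0.0], [0.0]]      # B starts at zero, so training begins at the base model
b_trained = [[0.5], [0.0]]   # hypothetical values after some training

w_start = lora_effective_weight(w, a, b_zero, alpha=2.0)
w_tuned = lora_effective_weight(w, a, b_trained, alpha=2.0)
print(w_start)
print(w_tuned)
```

Only A and B are trained, which is why LoRA keeps the number of trainable parameters small even for a 7B base model.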

🏡 Homepage:

https://github.com/codefuse-ai/CodeFuse-CGE

(Please give us your support with a Star🌟 + Fork🚀 + Watch👀)

Open-Source Models

CodeFuse-CGE-Large

Hugging Face URL

https://huggingface.co/codefuse-ai/CodeFuse-CGE-Large

Model Configuration

Base Model: CodeQwen1.5-7B-Chat
Model Size: 7B
Embedding Dimension: 1024

Requirements

flash_attn==2.4.2
torch==2.1.0
accelerate==0.28.0
transformers==4.39.2
vllm==0.5.3

CodeFuse-CGE-Small

Hugging Face URL

https://huggingface.co/codefuse-ai/CodeFuse-CGE-Small

Model Configuration

Base Model: Phi-3.5-mini-instruct
Model Size: 3.8B
Embedding Dimension: 1024

Requirements

flash_attn==2.4.2
torch==2.1.0
accelerate==0.28.0
transformers>=4.43.0

How to Use?

Transformers

import torch
from transformers import AutoTokenizer, AutoModel

# Load the model and tokenizer; trust_remote_code is needed for the custom encode() method.
model_name_or_path = "codefuse-ai/CodeFuse-CGE-Large"
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, truncation_side='right', padding_side='right')

# Use the GPU when one is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

# Instruction strings for queries and code passages, keyed by programming language.
prefix_dict = {
    'python': {'query': 'Retrieve the Python code that solves the following query:', 'passage': 'Python code:'},
    'java': {'query': 'Retrieve the Java code that solves the following query:', 'passage': 'Java code:'},
    'go': {'query': 'Retrieve the Go code that solves the following query:', 'passage': 'Go code:'},
    'c++': {'query': 'Retrieve the C++ code that solves the following query:', 'passage': 'C++ code:'},
    'javascript': {'query': 'Retrieve the Javascript code that solves the following query:', 'passage': 'Javascript code:'},
    'php': {'query': 'Retrieve the PHP code that solves the following query:', 'passage': 'PHP code:'},
    'ruby': {'query': 'Retrieve the Ruby code that solves the following query:', 'passage': 'Ruby code:'},
    'default': {'query': 'Retrieve the code that solves the following query:', 'passage': 'Code:'},
}

# text[0] is the natural-language query, text[1] is the candidate code.
text = ["Writes a Boolean to the stream.",
        "def writeBoolean(self, n): t = TYPE_BOOL_TRUE if n is False: t = TYPE_BOOL_FALSE self.stream.write(t)"]
# The instruction string is appended to each input, exactly as in the original example.
text[0] += prefix_dict['python']['query']
text[1] += prefix_dict['python']['passage']

# Embed both inputs and score the pair with a dot product.
embed = model.encode(tokenizer, text)
score = embed[0] @ embed[1].T
print("score", score)
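The example above hard-codes the 'python' entry. As a small illustrative sketch, the helper below picks the instruction string by language and falls back to the 'default' entry for languages that are not listed. The function name `add_instruction` and the fallback behaviour are assumptions, not part of the CodeFuse-CGE API; the dictionary is abbreviated to two entries, and the string is appended to mirror the `+=` used in the original example.

```python
# Abbreviated copy of the instruction table; the full table has more languages.
PREFIXES = {
    "python": {
        "query": "Retrieve the Python code that solves the following query:",
        "passage": "Python code:",
    },
    "default": {
        "query": "Retrieve the code that solves the following query:",
        "passage": "Code:",
    },
}


def add_instruction(text, language, role):
    """Attach the instruction string for `language` and `role` to `text`.

    role must be 'query' or 'passage'. Languages without an entry use the
    'default' entry (an assumed fallback). The string is appended, mirroring
    the `+=` in the original example.
    """
    if role not in ("query", "passage"):
        raise ValueError("role must be 'query' or 'passage'")
    entry = PREFIXES.get(language.lower(), PREFIXES["default"])
    return text + entry[role]


query_input = add_instruction("Writes a Boolean to the stream.", "Python", "query")
code_input = add_instruction("fn main() {}", "rust", "passage")
print(query_input)
print(code_input)
```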

vLLM

We have also adapted the model for vLLM deployment to reduce serving latency. See our GitHub repository: https://github.com/codefuse-ai/CodeFuse-CGE

from utils.vllm_codefuse_cge_large import CodeFuse_CGE_Large
from vllm.model_executor.models import ModelRegistry
from vllm import LLM


# Override vLLM's check so that the custom architecture is treated as an embedding model.
def always_true_is_embedding_model(model_arch: str) -> bool:
    return True


ModelRegistry.is_embedding_model = always_true_is_embedding_model
ModelRegistry.register_model("CodeFuse_CGE_Large", CodeFuse_CGE_Large)

model_name_or_path = "codefuse-ai/CodeFuse-CGE-Large"
model = LLM(model=model_name_or_path, trust_remote_code=True, enforce_eager=True, enable_chunked_prefill=False)

# Instruction strings for queries and code passages, keyed by programming language.
prefix_dict = {
    'python': {'query': 'Retrieve the Python code that solves the following query:', 'passage': 'Python code:'},
    'java': {'query': 'Retrieve the Java code that solves the following query:', 'passage': 'Java code:'},
    'go': {'query': 'Retrieve the Go code that solves the following query:', 'passage': 'Go code:'},
    'c++': {'query': 'Retrieve the C++ code that solves the following query:', 'passage': 'C++ code:'},
    'javascript': {'query': 'Retrieve the Javascript code that solves the following query:', 'passage': 'Javascript code:'},
    'php': {'query': 'Retrieve the PHP code that solves the following query:', 'passage': 'PHP code:'},
    'ruby': {'query': 'Retrieve the Ruby code that solves the following query:', 'passage': 'Ruby code:'},
    'default': {'query': 'Retrieve the code that solves the following query:', 'passage': 'Code:'},
}

text = ["Return the best fit based on rsquared",
        "def find_best_rsquared ( list_of_fits ) : res = sorted ( list_of_fits , key = lambda x : x . rsquared ) return res [ - 1 ]"]
text[0] += prefix_dict['python']['query']
text[1] += prefix_dict['python']['passage']

# Encode one input per call (batch size 1).
embed_0 = model.encode([text[0]])[0].outputs.embedding
embed_1 = model.encode([text[1]])[0].outputs.embedding

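Notes:

1. After the vLLM adaptation, the model input must use a batch size of 1; otherwise an array overflow error occurs.

2. Only CodeFuse-CGE-Large has been adapted so far; support for CodeFuse-CGE-Small will follow soon.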
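Because of the batch-size restriction in note 1, a caller with several texts has to submit them one at a time. The sketch below shows one way to wrap that loop. `encode_one_by_one` and `fake_encode` are hypothetical names introduced here; `fake_encode` is a stand-in that only imitates the restriction and does not call vLLM.

```python
def encode_one_by_one(encode_fn, texts):
    """Call `encode_fn` once per text so that every request has batch size 1.

    encode_fn: hypothetical callable that takes a one-element list of strings
    and returns a one-element list of embeddings.
    """
    embeddings = []
    for text in texts:
        result = encode_fn([text])
        embeddings.append(result[0])
    return embeddings


def fake_encode(batch):
    """Stand-in encoder for demonstration; rejects batches larger than 1."""
    if len(batch) != 1:
        raise ValueError("batch size must be 1")
    # Return a one-dimensional dummy "embedding": the text length.
    return [[float(len(batch[0]))]]


vectors = encode_one_by_one(fake_encode, ["query text", "def f(): pass"])
print(vectors)
```

With the vLLM object from the example above, the callable could plausibly be written as `lambda batch: [model.encode(batch)[0].outputs.embedding]`, which follows the same access pattern as the original snippet.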

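Experiments

We use CodeQwen1.5-7B-Chat as the base model and validate the method mainly on the Text2Code retrieval task. The evaluation datasets are AdvTest, CSN, and CosQA, and the main evaluation metric is MRR (mean reciprocal rank). Training ran on 32 A100 GPUs with 80 GB of memory each.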
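For reference, MRR averages the reciprocal of the rank at which the correct code snippet appears for each query, so a model that always ranks the right snippet first scores 1.0. The sketch below computes it on toy rankings; the candidate ids are invented for illustration, and the treatment of a missing gold answer as 0 is a common convention rather than something stated in the article.

```python
def mean_reciprocal_rank(rankings, gold):
    """Compute MRR over a set of queries.

    rankings: one ranked list of candidate ids per query (best first).
    gold: the correct candidate id for each query.
    A query whose gold id is absent from its ranking contributes 0.
    """
    if len(rankings) != len(gold):
        raise ValueError("rankings and gold must have the same length")
    if not rankings:
        raise ValueError("at least one query is required")
    total = 0.0
    for ranked, target in zip(rankings, gold):
        if target in ranked:
            # list.index is 0-based, ranks are 1-based.
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)


# Three toy queries whose correct answer "a" is ranked 1st, 2nd, and 3rd.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]

mrr = mean_reciprocal_rank(rankings, gold)
print(round(mrr, 4))
```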

Experimental Results