nm-vllm: A High-Throughput and Memory-Efficient Inference and Serving Engine for LLMs

GitHubStore · WeChat Official Account · 2024-03-18 07:46


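Project Overview

vLLM is a fast and easy-to-use library for LLM inference and serving, and Neural Magic regularly contributes improvements upstream. This fork, nm-vllm, is our opinionated take on it, focused on incorporating the latest LLM optimizations, such as quantization and sparsity, to improve performance.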

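Installation

The nm-vllm PyPI package includes precompiled binaries for CUDA (version 12.1) kernels, which simplifies installation. For other PyTorch or CUDA versions, compile the package from source.

Install it using pip: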
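Before choosing between the prebuilt package and a source build, it can help to check which CUDA version the local PyTorch was built against. The sketch below is only an illustration: the helper names `detect_torch_cuda` and `matches_prebuilt` are hypothetical, and the comparison assumes that matching the major.minor CUDA version is what matters for the prebuilt kernels.

```python
from typing import Optional

PREBUILT_CUDA = "12.1"  # CUDA version the prebuilt package targets, per the text above


def detect_torch_cuda() -> Optional[str]:
    """Return the CUDA version PyTorch was built against, or None if unavailable."""
    try:
        import torch  # optional dependency: only needed for the live check
    except ImportError:
        return None
    return torch.version.cuda  # None for CPU-only PyTorch builds


def matches_prebuilt(cuda_version: Optional[str]) -> bool:
    """True when the given CUDA version has the same major.minor as the prebuilt target."""
    if cuda_version is None:
        return False
    return ".".join(cuda_version.split(".")[:2]) == PREBUILT_CUDA


if __name__ == "__main__":
    found = detect_torch_cuda()
    if found is None:
        print("No CUDA-enabled PyTorch detected")
    elif matches_prebuilt(found):
        print(f"PyTorch CUDA {found} matches the prebuilt package")
    else:
        print(f"PyTorch CUDA {found} differs from {PREBUILT_CUDA}; consider building from source")
```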

pip install nm-vllm

To use the weight-sparsity kernels, for example through sparsity="sparse_w16a16", extend the installation with the optional sparsity extras:

pip install nm-vllm[sparse]

You can also build and install nm-vllm from source (this takes about 10 minutes):

git clone https://github.com/neuralmagic/nm-vllm.git
cd nm-vllm
pip install -e .

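Quickstart

Neural Magic maintains a variety of sparse models on our Hugging Face organization profiles, neuralmagic and nm-testing.

A collection of ready-to-use SparseGPT and GPTQ models in the inference-optimized Marlin format is available on Hugging Face.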

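Model Inference with Marlin (4-bit Quantization)

Marlin is a highly optimized FP16xINT4 matmul kernel designed for LLM inference. It can deliver close to ideal (4x) speedups at batch sizes of up to 16-32 tokens. To use Marlin within nm-vllm, simply pass the Marlin-quantized model directly to the engine; the quantization is detected from the model's config.

Here is a demo with a 4-bit quantized OpenHermes Mistral model: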
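One way to read the "ideal (4x)" figure, assuming that small-batch decoding is limited mainly by how fast weights are read from memory: INT4 weights occupy a quarter of the bits of FP16 weights, so at best a quarter of the data has to move. The sketch below works through that arithmetic. The function names are hypothetical, and the group size of 128 with FP16 scales is an illustrative assumption rather than a statement about how any particular checkpoint is quantized.

```python
from typing import Optional


def effective_bits(weight_bits: int = 4,
                   group_size: Optional[int] = None,
                   scale_bits: int = 16) -> float:
    """Bits stored per weight, including per-group scale overhead if any."""
    overhead = scale_bits / group_size if group_size else 0.0
    return weight_bits + overhead


def ideal_speedup(dense_bits: int = 16, **quant_kwargs) -> float:
    """Upper bound on speedup when runtime is dominated by reading weights."""
    return dense_bits / effective_bits(**quant_kwargs)


def weight_gib(num_params: float, bits: float) -> float:
    """Weight storage in GiB for a model with num_params parameters."""
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    n = 7e9  # illustrative parameter count for a 7B-class model
    print(f"FP16 weights: {weight_gib(n, 16):.1f} GiB")
    print(f"INT4 weights: {weight_gib(n, 4):.1f} GiB")
    print(f"Ideal speedup, no scales: {ideal_speedup():.2f}x")
    print(f"Ideal speedup, group size 128: {ideal_speedup(group_size=128):.2f}x")
```

At larger batch sizes the computation itself starts to dominate, which is consistent with the text limiting the claim to batches of up to 16-32 tokens.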

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-marlin"
model = LLM(model_id, max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)

messages = [
    {"role": "user", "content": "What is synthetic data in machine learning?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
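To make the apply_chat_template step less opaque, here is a rough sketch of what the formatted prompt looks like, assuming the model uses the ChatML layout. The function `render_chatml` is a hypothetical stand-in written for illustration; the authoritative template is whatever ships in the model's tokenizer configuration, which may add further tokens.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in ChatML form (assumed here to be the model's chat template)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model answers rather than continuing the user text.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    demo = [{"role": "user", "content": "What is synthetic data in machine learning?"}]
    print(render_chatml(demo))
```

With tokenize=False the tokenizer returns a string like this rather than token IDs, which is why it can be passed straight to model.generate.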

Model Inference with Weight Sparsity

For a quick demonstration, here is how to run a small 50% sparse llama2-110M model trained on storytelling:

from vllm import LLM, SamplingParams

model = LLM(
    "neuralmagic/llama2.c-stories110M-pruned50",
    sparsity="sparse_w16a16",  # If left off, model will be loaded as dense
)

sampling_params = SamplingParams(max_tokens=100, temperature=0)
outputs = model.generate("Hello my name is", sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
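For readers new to the term, "50% sparse" means that half of the weights are exactly zero, which is what lets a sparsity-aware kernel skip them. The sketch below shows the idea on a toy weight list using simple magnitude pruning. This is a deliberately simplified stand-in: it is not SparseGPT and not how the published models were produced, and both function names are hypothetical.

```python
from typing import List


def sparsity(weights: List[float]) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


def magnitude_prune(weights: List[float], target: float = 0.5) -> List[float]:
    """Zero the `target` fraction of entries with the smallest absolute value."""
    n_prune = int(len(weights) * target)
    # Indices sorted by magnitude; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    toy = [0.9, -0.1, 0.05, -0.7, 0.3, 0.02, -0.4, 0.6]
    pruned = magnitude_prune(toy)
    print(pruned, sparsity(pruned))
```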

Here is a more realistic example that runs a 50% sparse OpenHermes 2.5 Mistral 7B model fine-tuned for instruction following:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50"
model = LLM(model_id, sparsity="sparse_w16a16", max_model_len=4096)
tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(max_tokens=100, temperature=0.8, top_p=0.95)

messages = [
    {"role": "user", "content": "What is sparsity in deep learning?"},
]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = model.generate(formatted_prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Semi-structured 2:4 sparsity is also supported, using the sparsity="semi_structured_sparse_w16a16" argument:

from vllm import LLM, SamplingParams
model = LLM("neuralmagic/llama2.c-stories110M-pruned2.4", sparsity="semi_structured_sparse_w16a16")
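The 2:4 pattern is stricter than plain 50% sparsity: within every aligned group of four consecutive weights, at least two must be zero. The sketch below checks and enforces that pattern on a flat list of weights. It is only an illustration of the constraint; the function names are hypothetical, and keeping the two largest magnitudes per group is a simple assumption, not the method used to produce the published checkpoint.

```python
from typing import List


def is_2_4_sparse(row: List[float]) -> bool:
    """True if every aligned group of 4 values holds at most 2 non-zeros."""
    if len(row) % 4 != 0:
        return False
    return all(
        sum(1 for v in row[i:i + 4] if v != 0.0) <= 2
        for i in range(0, len(row), 4)
    )


def prune_2_4(row: List[float]) -> List[float]:
    """Keep the 2 largest-magnitude values in each group of 4 and zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out: List[float] = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        ranked = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)
        keep = set(ranked[:2])
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out


if __name__ == "__main__":
    toy = [0.9, -0.1, 0.05, -0.7, 0.3, 0.02, -0.4, 0.6]
    print(prune_2_4(toy), is_2_4_sparse(prune_2_4(toy)))
```

Note that a row can be 50% sparse overall without satisfying 2:4, for example when all of its zeros fall in the same group of four.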
