MLNLP
社区是国内外知名的机器学习与自然语言处理社区,受众覆盖国内外NLP硕博生、高校老师以及企业研究人员。
社区的愿景
是促进国内外自然语言处理,机器学习学术界、产业界和广大爱好者之间的交流和进步,特别是初学者同学们的进步。
十天前的 Meta Connect 2024 大会上,开源领域迎来了可在边缘和移动设备上的运行的轻量级模型 Llama 3.2 1B 和 3B。两个版本都是纯文本模型,但也具备多语言文本生成和工具调用能力。Meta 表示,这些模型可让开发者构建个性化的、在设备本地上运行的通用应用 —— 这类应用将具备很强的隐私性,因为数据无需离开设备。
近日,机器学习研究员 Sebastian Raschka 光速发布长篇教程《Converting Llama 2 to Llama 3.2 From Scratch》。
-
博文链接:https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb
本文是《 Converting a From-Scratch GPT Architecture to Llama 2》的后续,更新的内容是如何将 Meta 的 Llama 2 架构模型逐步转换为 Llama 3、Llama 3.1 和 Llama 3.2。为了避免不必要的冗长,本文特意将解释部分缩至最短,并将重点放在主代码上。
1 逐步转换 Llama 模型实现
如果你是初次实施 LLM 架构,建议从《Build a Large Language Model From Scratch》(
https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch04/01_main-chapter-code/ch04.ipynb
)的第 4 章开始,那部分内容将逐步指导你实施原始 GPT 架构。
然后可参考《Converting a From-Scratch GPT Architecture to Llama 2》(
https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb
),将实现 Llama 特有的组件,如 RMSNorm 层、SiLU 和 SwiGLU 激活、RoPE(旋转位置嵌入)和 SentencePiece tokenizer。
本笔记本采用 Llama 2 架构,并通过以下方式将其转换为 Llama 3 架构:
随后,我们将 Meta 共享的原始 Llama 3 权重加载到架构中:
Llama 2 实际上与 Llama 3 非常相似,如上文所述和本文开头的图片所示。
这意味着我们可以使用以下代码从 Llama 2 笔记本中导入多个构建模块:
import os
import sys
import io
import nbformat
import types
def import_from_notebook():
def import_definitions_from_notebook(fullname, names):
current_dir = os.getcwd()
path = os.path.join(current_dir, fullname + ".ipynb")
path = os.path.normpath(path)
# Load the notebook
if not os.path.exists(path):
raise FileNotFoundError(f"Notebook file not found at: {path}")
with io.open(path, "r", encoding="utf-8") as f:
nb = nbformat.read(f, as_version=4)
# Create a module to store the imported functions and classes
mod = types.ModuleType(fullname)
sys.modules[fullname] = mod
# Go through the notebook cells and only execute function or class definitions
for cell in nb.cells:
if cell.cell_type == "code":
cell_code = cell.source
for name in names:
# Check for function or class definitions
if f"def {name}" in cell_code or f"class {name}" in cell_code:
exec(cell_code, mod.__dict__)
return mod
fullname = "converting-gpt-to-llama2"
names = ["precompute_rope_params", "compute_rope", "SiLU", "FeedForward", "RMSNorm", "MultiHeadAttention"]
return import_definitions_from_notebook(fullname, names)
imported_module = import_from_notebook()
compute_rope = getattr(imported_module, "compute_rope", None)
SiLU = getattr(imported_module, "SiLU", None)
FeedForward = getattr(imported_module, "FeedForward", None)
RMSNorm = getattr(imported_module, "RMSNorm", None)
MultiHeadAttention = getattr(imported_module, "MultiHeadAttention", None)
Llama 3 使用的 RoPE 与 Llama 2 相似,可参阅 RoPE 论文(https://arxiv.org/abs/2104.09864)。
不过,二者 RoPE 设置有一些细微差别。Llama 3 现在支持多达 8192 个 token,是 Llama 2(4096)的两倍。
RoPE 的基础值(见下文公式),从 10000(Llama 2)增加到 50000(Llama 3),公式如下(改编自 RoPE 论文):
这些值是一组预定义的参数,用于确定旋转矩阵中的旋转角度,其中的维数是嵌入空间的维数。
将基数从 10000 增加到 50000,频率(或旋转角度)在各维度上的衰减速度会更慢,这意味着维度越高,角度越大(本质上,这是对频率的解压缩)。
此外,我们还在下面的代码中引入了一个 freq_config 部分,用于调整频率;不过,在 Llama 3(只有 Llama 3.1 和 Llama 3.2)中并不需要它,所以稍后会重新访问这个 freq_config(默认设置为「无」并被忽略)。
import torch
def precompute_rope_params(head_dim, theta_base=10000, context_length=4096, freq_config=None):
assert head_dim % 2 == 0, "Embedding dimension must be even"
# Compute the inverse frequencies
inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))
################################ NEW ###############################################
# Frequency adjustments
if freq_config is not None:
low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]
high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]
wavelen = 2 * torch.pi / inv_freq
inv_freq_llama = torch.where(
wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq
)
smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (
freq_config["high_freq_factor"] - freq_config["low_freq_factor"]
)
smoothed_inv_freq = (
(1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq
)
is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
inv_freq = inv_freq_llama
####################################################################################
# Generate position indices
positions = torch.arange(context_length)
# Compute the angles
angles = positions[:, None] * inv_freq[None, :] # Shape: (context_length, head_dim // 2)
# Expand angles to match the head_dim
angles = torch.cat([angles, angles], dim=1) # Shape: (context_length, head_dim)
# Precompute sine and cosine
cos = torch.cos(angles)
sin = torch.sin(angles)
return cos, sin
总之,与 Llama 2 相比,Llama 3 的新功能是「上下文长度」和 theta 基底参数:
# Instantiate RoPE parameters
llama_2_context_len = 4096
llama_3_context_len = 8192
llama_2_theta_base = 10_000
llama_3_theta_base = 50_000
# Settings
batch_size = 2
num_heads = 4
head_dim = 16
# Instantiate RoPE parameters
cos, sin = precompute_rope_params(
head_dim=head_dim,
theta_base=llama_3_theta_base,
context_length=llama_3_context_len
)
# Dummy query and key tensors
torch.manual_seed(123)
queries = torch.randn(batch_size, llama_3_context_len, num_heads, head_dim)
keys = torch.randn(batch_size, llama_3_context_len, num_heads, head_dim)
# Apply rotary position embeddings
queries_rot = compute_rope(queries, cos, sin)
keys_rot = compute_rope(keys, cos, sin)
本节将用一种名为分组查询注意力(GQA)的替代机制来取代多头注意力(MHA)。简而言之,可以将 GQA 视为计算和参数效率更高的 MHA 版本。
在 GQA 中,通过在多个注意力头之间共享来减少键和值投影的数量,每个注意力头仍有其独特的查询,但这些查询关注同一组键和值。
下面是具有 2 个 key-value 组的 GQA 示例:
GQA 的主要思想是减少与键值对相关的唯一查询组的数量,从而在不显著降低建模性能的情况下,减少 MHA 中某些矩阵乘法的大小和参数的数量。
简而言之,GQA 的主要变化是每个查询组都需要重复,以匹配与之相关的头数量,具体实现如下:
import torch.nn as nn
class GroupedQueryAttention(nn.Module):
def __init__(
self, d_in, d_out, context_length, num_heads,
num_kv_groups, # NEW
rope_base=10_000, # NEW
rope_config=None, # NEW
dtype=None
):
super().__init__()
assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
self.d_out = d_out
self.num_heads = num_heads
self.head_dim = d_out // num_heads
############################# NEW #############################
# self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
# self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
self.num_kv_groups = num_kv_groups
self.group_size = num_heads // num_kv_groups
################################################################
self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)
self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))
cos, sin = precompute_rope_params(
head_dim=self.head_dim,
theta_base=rope_base, # NEW
freq_config=rope_config, # NEW
context_length=8192
)
self.register_buffer("cos", cos)
self.register_buffer("sin", sin)
def forward(self, x):
b, num_tokens, d_in = x.shape
queries = self.W_query(x) # Shape: (b, num_tokens, d_out)
keys = self.W_key(x) # Shape: (b, num_tokens, num_kv_groups * head_dim)
values = self.W_value(x) # Shape: (b, num_tokens, num_kv_groups * head_dim)
# Reshape queries, keys, and values
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
##################### NEW #####################
# keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
# values = values.view(b, num_tokens, self.num_heads, self.head_dim)
keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)
values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)
################################################
# Transpose keys, values, and queries
keys = keys.transpose(1, 2) # Shape: (b, num_heads, num_tokens, head_dim)
values = values.transpose(1, 2) # Shape: (b, num_heads, num_tokens, head_dim)
queries = queries.transpose(1, 2) # Shape: (b, num_query_groups, num_tokens, head_dim)
# Apply RoPE
keys = compute_rope(keys, self.cos, self.sin)
queries = compute_rope(queries, self.cos, self.sin)
##################### NEW #####################
# Expand keys and values to match the number of heads
# Shape: (b, num_heads, num_tokens, head_dim)
keys = keys.repeat_interleave(self.group_size, dim=1) # Shape: (b, num_heads, num_tokens, head_dim)
values = values.repeat_interleave(self.group_size, dim=1) # Shape: (b, num_heads, num_tokens, head_dim)
# For example, before repeat_interleave along dim=1 (query groups):
# [K1, K2]
# After repeat_interleave (each query group is repeated group_size times):
# [K1, K1, K2, K2]
# If we used regular repeat instead of repeat_interleave, we'd get:
# [K1, K2, K1, K2]
################################################
# Compute scaled dot-product attention (aka self-attention) with a causal mask
# Shape: (b, num_heads, num_tokens, num_tokens)
attn_scores = queries @ keys.transpose(2, 3) # Dot product for each head
# Original mask truncated to the number of tokens and converted to boolean
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
# Use the mask to fill attention scores
attn_scores.masked_fill_(mask_bool, -torch.inf)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
assert keys.shape[-1] == self.head_dim
# Shape: (b, num_tokens, num_heads, head_dim)
context_vec = (attn_weights @ values).transpose(1, 2)
# Combine heads, where self.d_out = self.num_heads * self.head_dim
context_vec = context_vec.reshape(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec) # optional projection
return context_vec
参数节省的情况,请参考以下来自 GPT 和 Llama 2 代码的多头注意力示例:
batch_size = 1
context_len = 3000
max_context_len = 8192
embed_dim = 4096
num_heads = 32
example_batch = torch.randn((batch_size, context_len, embed_dim))
mha = MultiHeadAttention(
d_in=embed_dim,
d_out=embed_dim,
context_length=max_context_len,
num_heads=num_heads
)
mha(example_batch)
print("W_key:", mha.W_key.weight.shape)
print("W_value:", mha.W_value.weight.shape)
print("W_query:", mha.W_query.weight.shape)
W_key: torch.Size([4096, 4096])
W_value: torch.Size([4096, 4096])
W_query: torch.Size([4096, 4096])
现在,如果改用分组查询注意力,并使用 8 个 kv 组(Llama 3 8B 使用了 8 个 kv 组),可以看到 key 和 value 矩阵的行数减少了 4 倍(因为 32 个注意力头除以 8 个 kv 组就是 4):
gqa = GroupedQueryAttention(
d_in=embed_dim,
d_out=embed_dim,
context_length=max_context_len,
num_heads=num_heads,
num_kv_groups=8,
rope_base=llama_3_theta_base
)
gqa(example_batch)
print("W_key:", gqa.W_key.weight.shape)
print("W_value:", gqa.W_value.weight.shape)
print("W_query:", gqa.W_query.weight.shape)
W_key: torch.Size([1024, 4096])
W_value: torch.Size([1024, 4096])
W_query: torch.Size([4096, 4096])
顺便提一下,为了使分组查询注意力等同于标准的多头注意力,可以将查询组的数量(num_kv_groups)设置为与头的数量(num_heads)相等。
print("Total number of parameters:")
mha_total_params = sum(p.numel() for p in mha.parameters())
print(f"MHA: {mha_total_params:,}")
gqa_total_params = sum(p.numel() for p in gqa.parameters())
print(f"GQA: {gqa_total_params:,}")
Total number of parameters:
MHA: 67,108,864
GQA: 41,943,040
# Free up memory:
del mha
del gqa
1.4 更新 TransformerBlock 模块
接下来,更新 Transformer 块。在这里,只需将 MultiHeadAttention 与 GroupedQueryAttention 互换,并添加新的 RoPE 设置:
class TransformerBlock(nn.Module):
def __init__(self, cfg):
super().__init__()
self.att = GroupedQueryAttention( # MultiHeadAttention(
d_in=cfg["emb_dim"],
d_out=cfg["emb_dim"],
context_length=cfg["context_length"],
num_heads=cfg["n_heads"],
num_kv_groups=cfg["n_kv_groups"], # NEW
rope_base=cfg["rope_base"], # NEW
rope_config=cfg["rope_freq"], # NEW
dtype=cfg["dtype"]
)
self.ff = FeedForward(cfg)
self.norm1 = RMSNorm(cfg["emb_dim"], eps=1e-5)
self.norm2 = RMSNorm(cfg["emb_dim"], eps=1e-5)
def forward(self, x):
# Shortcut connection for attention block
shortcut = x
x = self.norm1(x)
x = self.att(x.to(torch.bfloat16)) # Shape [batch_size, num_tokens, emb_size]
x = x + shortcut # Add the original input back
# Shortcut connection for feed-forward block
shortcut = x
x = self.norm2(x)
x = self.ff(x.to(torch.bfloat16))
x = x + shortcut # Add the original input back
return x
幸运的是,在设置模型类时,我们不需要做太多,只需将名称更新为 Llama3Model
# class Llama2Model(nn.Module):
class Llama3Model(nn.Module):
def __init__(self, cfg):
super().__init__()
self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
self.trf_blocks = nn.Sequential(
*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
self.final_norm = RMSNorm(cfg["emb_dim"], eps=1e-5)
self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])
def forward(self, in_idx):
batch_size, seq_len = in_idx.shape
tok_embeds = self.tok_emb(in_idx)
x = tok_embeds
x = self.trf_blocks(x)
x = self.final_norm(x)
logits = self.out_head(x.to(torch.bfloat16))
return logits
2 初始化模型
现在,我们可以定义一个 Llama 3 配置文件(为便于比较,显示的是 Llama 2 配置文件):
LLAMA2_CONFIG_7B = {
"vocab_size"
: 32_000,
"context_length": 4096,
"emb_dim": 4096,
"n_heads": 32,
"n_layers": 32,
"hidden_dim": 11_008,
"dtype": torch.bfloat16
}
LLAMA3_CONFIG_8B = {
"vocab_size": 128_256, # NEW: Larger vocabulary size
"context_length": 8192, # NEW: Larger context length
"emb_dim": 4096, # Embedding dimension
"n_heads": 32, # Number of attention heads
"n_layers": 32, # Number of layers
"hidden_dim": 14_336, # NEW: Larger size of the intermediate dimension in FeedForward
"n_kv_groups": 8, # NEW: Key-Value groups for grouped-query attention
"rope_base": 50_000, # NEW: The base in RoPE's "theta" was increased to 50_000
"rope_freq": None, # NEW: Additional configuration for adjusting the RoPE frequencies
"dtype": torch.bfloat16 # Lower-precision dtype to save memory
}
使用这些设置,我们现在可以初始化 Llama 3 8B 模型。
请注意,这需要约 34 GB 内存(作为对比,Llama 2 7B 需要约 26 GB 内存)
model = Llama3Model(LLAMA3_CONFIG_8B)
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params:,}")
Total number of parameters: 8,030,261,248
如上图所示,模型包含 80 亿个参数。此外,我们还可以使用下面的代码计算该模型的内存需求:
def model_memory_size(model, input_dtype=torch.float32):
total_params = 0
total_grads = 0
for param in model.parameters():
# Calculate total number of elements per parameter
param_size = param.numel()
total_params += param_size
# Check if gradients are stored for this parameter
if param.requires_grad:
total_grads += param_size
# Calculate buffer size (non-parameters that require memory)
total_buffers = sum(buf.numel() for buf in model.buffers())
# Size in bytes = (Number of elements) * (Size of each element in bytes)
# We assume parameters and gradients are stored in the same type as input dtype
element_size = torch.tensor(0, dtype=input_dtype).element_size()
total_memory_bytes = (total_params + total_grads + total_buffers) * element_size
# Convert bytes to gigabytes
total_memory_gb = total_memory_bytes / (1024**3)
return total_memory_gb
print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")
print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")
float32 (PyTorch default): 68.08 GB
bfloat16: 34.04 GB
最后,如果适用,我们还可以将模型转移到 NVIDIA 或 Apple Silicon GPU 上:
if torch.cuda.is_available():
device = torch.device("cuda")
elif torch.backends.mps.is_available():
device = torch.device("mps")
else:
device = torch.device("cpu")
model.to(device);
3 加载 tokenizer
Llama 2 使用了谷歌的 SentencePiece tokenizer ,而不是 OpenAI 基于 Tiktoken 库的 BPE tokenizer 。然而,Llama 3 恢复使用 Tiktoken 的 BPE tokenizer;具体来说,它使用的是具有扩展词汇的 GPT-4 tokenizer。我们可以在 Meta AI 的官方 Llama 3 存储库中找到最初的 Tiktoken 适配程序。
下面重写了 tokenizer 的代码,使其更易读,更适合本笔记本使用(但表现应该是相似的):
import os
from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
class Tokenizer:
def __init__(self, model_path):
assert os.path.isfile(model_path), f"Model file {model_path} not found"
mergeable_ranks = load_tiktoken_bpe(model_path)
num_base_tokens = len(mergeable_ranks)
self.special_tokens = {
"": 128000,
"": 128001,
"": 128006,
"": 128007,
"": 128009,
}
self.special_tokens.update({
f"": 128002 + i for i in range(256) if (128002 + i) not in self.special_tokens.values()
})
self.model = tiktoken.Encoding(
name=Path(model_path).name,
pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
mergeable_ranks=mergeable_ranks,
special_tokens=self.special_tokens
)
def encode(self, text, bos=False, eos=False, allowed_special=set(), disallowed_special=()):
if bos:
tokens = [self.special_tokens[""]]
else:
tokens = []
tokens += self.model.encode(text, allowed_special=allowed_special, disallowed_special=disallowed_special)
if eos:
tokens.append(self.special_tokens[""])
return tokens
def decode(self, tokens):
return self.model.decode(tokens)
Meta AI 在 Hugging Face Hub 上共享了 Llama 3 模型的原始权重和 tokenizer 词库。
我们将首先从 Hub 下载 tokenizer 词库,并将其加载到上述代码中。请注意,Meta AI 要求你在下载文件前接受 Llama 3 许可条款;为此必须创建一个 Hugging Face Hub 账户,并访问 meta-llama/Meta-Llama-3-8B 存储库以接受条款。
接下来,需要创建一个访问 token;要生成一个具有「读取」权限的访问 token,请点击右上角的个人资料图片,然后点击「设置」。
然后,创建并复制访问 token,以便复制并粘贴到下一个代码单元中:
from huggingface_hub import login
import json
with open("config.json", "r") as config_file:
config = json.load(config_file)
access_token = config["HF_ACCESS_TOKEN"]
login(token=access_token)
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful
通过访问 token 登录(这是验证我们是否接受 Llama 3 许可条款所必需的)后,就可以下载 tokenizer 词库了:
from huggingface_hub import hf_hub_download
tokenizer_file_path = hf_hub_download(
repo_id="meta-llama/Meta-Llama-3-8B",
filename="original/tokenizer.model",
local_dir="llama3-files"
)
请注意,在使用 Llama 3 文件时,我们可能需要 blobfile 软件包,它用于处理存储在云存储解决方案(如 Google Cloud Storage (GCS)、Azure Blob Storage 或 Amazon S3)中的数据集或模型。
可以通过取消注释并执行下面的 pip 命令来安装此依赖包:
tokenizer = Tokenizer(tokenizer_file_path)
现在,我们可以使用生成函数让 Llama 3 模型生成新文本:
from previous_chapters import generate, text_to_token_ids, token_ids_to_text
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort", tokenizer).to(device),
max_new_tokens=30,
context_size=LLAMA3_CONFIG_8B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort_dead aeros Ingredients başında.extension clangmissions.esp 사진 Ek Pars til DoctorsDaoеньostivan normal Ekized � Ekized � Ek rdr tık%,orgen>',
当然,正如我们在上面看到的,这段文字是毫无意义的,因为我们还没有训练过 Llama 3 模型。在下一节中,我们将从 Meta AI 中加载预训练的权重,而不是自己训练模型,因为这将花费数万至数十万美元。
4 加载预训练权重
我们将加载下面的「meta-llama/Meta-Llama-3-8B 」base 模型,它是微调前的简单文本补全模型。
或者,你也可以加载经过指令微调和对齐的「meta-llama/Meta-Llama-3-8B-Instruct」模型,方法是相应修改下一个代码单元中的字符串。加起来,权重文件大约有 16 GB 大。
from safetensors.torch import load_file
combined_weights = {}
for i in range(1, 5):
weights_file = hf_hub_download(
repo_id="meta-llama/Meta-Llama-3-8B",
filename=f"model-0000{i}-of-00004.safetensors",
local_dir="llama3-files"
)
current_weights = load_file(weights_file)
combined_weights.update(current_weights)
model-00001-of-00004.safetensors: 0%| | 0.00/4.98G [00:00, ?B/s]
model-00002-of-00004.safetensors: 0%| | 0.00/5.00G [00:00, ?B/s]
model-00003-of-00004.safetensors: 0%| | 0.00/4.92G [00:00, ?B/s]
model-00004-of-00004.safetensors: 0%| | 0.00/1.17G [00:00, ?B/s]
权重包含以下张量(为简单起见,只显示前 15 个张量):
list(combined_weights.keys())[:15]
['model.embed_tokens.weight',
'model.layers.0.input_layernorm.weight',
'model.layers.0.mlp.down_proj.weight',
'model.layers.0.mlp.gate_proj.weight',
'model.layers.0.mlp.up_proj.weight',
'model.layers.0.post_attention_layernorm.weight',
'model.layers.0.self_attn.k_proj.weight',
'model.layers.0.self_attn.o_proj.weight',
'model.layers.0.self_attn.q_proj.weight',
'model.layers.0.self_attn.v_proj.weight',
'model.layers.1.input_layernorm.weight',
'model.layers.1.mlp.down_proj.weight',
'model.layers.1.mlp.gate_proj.weight',
'model.layers.1.mlp.up_proj.weight',
'model.layers.1.post_attention_layernorm.weight']
下面的函数仿照《Build a Large Language Model From Scratch》第 5 章(
https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/01_main-chapter-code/ch05.ipynb
)中的 load_weights_into_gpt 函数,将预训练好的权重加载到 Llama 3 模型中:
def assign(left, right, tensor_name="unknown"):
if left.shape != right.shape:
raise ValueError(f"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}")
if isinstance(right, torch.Tensor):
return torch.nn.Parameter(right.clone().detach())
else:
return torch.nn.Parameter(torch.tensor(right))
def load_weights_into_llama(model, param_config, params):
model.tok_emb.weight = assign(model.tok_emb.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
for l in range(param_config["n_layers"]):
# Load attention weights
model.trf_blocks[l].att.W_query.weight = assign(
model.trf_blocks[l].att.W_query.weight,
params[f"model.layers.{l}.self_attn.q_proj.weight"],
f"model.layers.{l}.self_attn.q_proj.weight"
)
model.trf_blocks[l].att.W_key.weight = assign(
model.trf_blocks[l].att.W_key.weight,
params[f"model.layers.{l}.self_attn.k_proj.weight"],
f"model.layers.{l}.self_attn.k_proj.weight"
)
model.trf_blocks[l].att.W_value.weight = assign(
model.trf_blocks[l].att.W_value.weight,
params[f"model.layers.{l}.self_attn.v_proj.weight"],
f"model.layers.{l}.self_attn.v_proj.weight"
)
model.trf_blocks[l].att.out_proj.weight = assign(
model.trf_blocks[l].att.out_proj.weight,
params[f"model.layers.{l}.self_attn.o_proj.weight"],
f"model.layers.{l}.self_attn.o_proj.weight"
)
model.trf_blocks[l].norm1.weight = assign(
model.trf_blocks[l].norm1.weight,
params[f"model.layers.{l}.input_layernorm.weight"],
f"model.layers.{l}.input_layernorm.weight"
)
# Load FeedForward weights
model.trf_blocks[l].ff.fc1.weight = assign(
model.trf_blocks[l].ff.fc1.weight,
params[f"model.layers.{l}.mlp.gate_proj.weight"],
f"model.layers.{l}.mlp.gate_proj.weight"
)
model.trf_blocks[l].ff.fc2.weight = assign(
model.trf_blocks[l].ff.fc2.weight,
params[f"model.layers.{l}.mlp.up_proj.weight"],
f"model.layers.{l}.mlp.up_proj.weight"
)
model.trf_blocks[l].ff.fc3.weight = assign(
model.trf_blocks[l].ff.fc3.weight,
params[f"model.layers.{l}.mlp.down_proj.weight"],
f"model.layers.{l}.mlp.down_proj.weight"
)
model.trf_blocks[l].norm2.weight = assign(
model.trf_blocks[l].norm2.weight,
params[f"model.layers.{l}.post_attention_layernorm.weight"],
f"model.layers.{l}.post_attention_layernorm.weight"
)
# Load output layer weights
model.final_norm.weight = assign(model.final_norm.weight, params["model.norm.weight"], "model.norm.weight")
if "lm_head.weight" in params.keys():
model.out_head.weight = assign(model.out_head.weight, params["lm_head.weight"], "lm_head.weight")
else:
model.out_head.weight = assign(model.out_head.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
print("Model uses weight tying.")
load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)
model.to(device);
del combined_weights # free up memory
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("Every effort", tokenizer).to(device),
max_new_tokens=25,
context_size=LLAMA3_CONFIG_8B["context_length"],
top_k=1,
temperature=0.
)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))
Output text:
Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any
5 使用指令微调模型
上面我们使用的是经过预训练的基础模型,如果你想使用一个能够遵循指令的模型,请使用「meta-llama/Llama-3-8b-Instruct」模型,如下所示:
# to free up memory
import gc
del model
gc.collect() # Run Python garbage collector
if torch.cuda.is_available():
torch.cuda.empty_cache()
combined_weights = {}
for i in range(1, 5):
weights_file = hf_hub_download(
repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
filename=f"model-0000{i}-of-00004.safetensors",
local_dir="llama3-files"
)
current_weights = load_file(weights_file)
combined_weights.update(current_weights)
model = Llama3Model(LLAMA3_CONFIG_8B)
load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)
model.to(device)
del combined_weights
model-00001-of-00004.safetensors: 0%| | 0.00/4.98G [00:00, ?B/s]
model-00002-of-00004.safetensors: 0%| | 0.00/5.00G [00:00, ?B/s]
model-00003-of-00004.safetensors: 0%| | 0.00/4.92G [00:00, ?B/s]
model-00004-of-00004.safetensors: 0%| | 0.00/1.17G [00:00, ?B/s]
请注意,Llama 3 模型最好与微调时使用的正确提示模板一起使用。
下面是一个基于 Meta AI 的 Llama 3 专用 ChatFormat 代码的 tokenizer wrapper 类,用于构建提示模板:
class ChatFormat:
def __init__(self, tokenizer):
self.tokenizer = tokenizer
def encode_header(self, message):
tokens = []
tokens.append(self.tokenizer.special_tokens[""])
tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))
tokens.append(self.tokenizer.special_tokens[""])
tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))
return tokens
def encode(self, text):
message = {
"role": "user",
"content": text
}
tokens = self.encode_header(message)
tokens.extend(
self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)
)
tokens.append(self.tokenizer.special_tokens[""])
return tokens
def decode(self, token_ids):
return self.tokenizer.decode(token_ids)
chat_tokenizer = ChatFormat(tokenizer)
token_ids = chat_tokenizer.encode("Hello World!")
print(token_ids)
[128006, 882, 128007, 271, 9906, 4435, 0, 128009]
tokenizer.decode(token_ids)
现在,让我们来看看 Llama 3 教学模式的实际应用:
import re
torch.manual_seed(123)
token_ids = generate(
model=model,
idx=text_to_token_ids("What do llamas eat?", chat_tokenizer).to(device),
max_new_tokens=150,
context_size=LLAMA3_CONFIG_8B["context_length"],
top_k=1,
temperature=0.
)
output_text = token_ids_to_text(token_ids, tokenizer)
def clean_text(text, header_end="assistant\n\n"):
# Find the index of the first occurrence of ""
index = text.find(header_end)
if index != -1:
# Return the substring starting after ""
return text[index + len(header_end):].strip() # Strip removes leading/trailing whitespace
else:
# If the token is not found, return the original text
return text
print("Output text:\n", clean_text(output_text))
Output text:
Llamas are herbivores, which means they primarily eat plants and plant-based foods. Here are some of the things llamas like to eat:
1. Grass: Llamas love to graze on grass, especially in the spring and summer months.
2. Hay: Hay is a staple in a llama's diet. They like to eat timothy hay, alfalfa hay, and other types of hay.
3. Grains: Llamas may also be fed grains like oats, barley, and corn. However, grains should not make up more than 10% of a llama's diet.
4. Fruits and vegetables: Llamas may enjoy fruits and vegetables as treats, such as apples,
Llama 3.1 8B
在 Llama 3 发布几个月后,Meta AI 又推出了 Llama 3.1 模型套件(详见 Llama 3.1 官方介绍)。
方便的是,我们可以重复使用之前的 Llama 3 代码来实现 Llama 3.1 8B:
结构完全相同,唯一的变化是重新调整了 RoPE 频率,如下配置文件所示:
LLAMA3_CONFIG_8B = {
"vocab_size": 128_256,
"context_length": 8192,
"emb_dim": 4096,
"n_heads": 32,
"n_layers": 32,
"hidden_dim": 14_336,
"n_kv_groups": 8,
"rope_base": 50_000,
"rope_freq": None,
"dtype": torch.bfloat16
}
LLAMA31_CONFIG_8B = {
"vocab_size": 128_256,
"context_length": 8192,
"emb_dim": 4096,
"n_heads": 32,
"n_layers"