专栏名称: 机器学习算法与自然语言处理

一个有情怀的公众号。机器学习、自然语言处理、算法等知识集中营、期待与你相遇~

Sebastian Raschka最新博客：从头开始，用Llama 2构建Llama 3.2

机器学习算法与自然语言处理 · 公众号 · · 2024-10-08 00:00

正文

MLNLP 社区是国内外知名的机器学习与自然语言处理社区，受众覆盖国内外NLP硕博生、高校老师以及企业研究人员。

社区的愿景 是促进国内外自然语言处理，机器学习学术界、产业界和广大爱好者之间的交流和进步，特别是初学者同学们的进步。

转载自 | 机器之心

编辑 | 蛋酱

十天前的 Meta Connect 2024 大会上，开源领域迎来了可在边缘和移动设备上的运行的轻量级模型 Llama 3.2 1B 和 3B。两个版本都是纯文本模型，但也具备多语言文本生成和工具调用能力。Meta 表示，这些模型可让开发者构建个性化的、在设备本地上运行的通用应用 —— 这类应用将具备很强的隐私性，因为数据无需离开设备。

近日，机器学习研究员 Sebastian Raschka 光速发布长篇教程《Converting Llama 2 to Llama 3.2 From Scratch》。

博文链接：https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-llama2-to-llama3.ipynb

本文是《 Converting a From-Scratch GPT Architecture to Llama 2》的后续，更新的内容是如何将 Meta 的 Llama 2 架构模型逐步转换为 Llama 3、Llama 3.1 和 Llama 3.2。为了避免不必要的冗长，本文特意将解释部分缩至最短，并将重点放在主代码上。

机器之心对文章内容进行了不改变原意的编译：

1 逐步转换 Llama 模型实现

如果你是初次实施 LLM 架构，建议从《Build a Large Language Model From Scratch》（ https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch04/01_main-chapter-code/ch04.ipynb ）的第 4 章开始，那部分内容将逐步指导你实施原始 GPT 架构。

然后可参考《Converting a From-Scratch GPT Architecture to Llama 2》（ https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb ），将实现 Llama 特有的组件，如 RMSNorm 层、SiLU 和 SwiGLU 激活、RoPE（旋转位置嵌入）和 SentencePiece tokenizer。

本笔记本采用 Llama 2 架构，并通过以下方式将其转换为 Llama 3 架构：

修改旋转嵌入
实现分组查询注意力
使用定制版的 GPT-4 tokenizer

随后，我们将 Meta 共享的原始 Llama 3 权重加载到架构中：

1.1 复用 Llama 2 的组件

Llama 2 实际上与 Llama 3 非常相似，如上文所述和本文开头的图片所示。

这意味着我们可以使用以下代码从 Llama 2 笔记本中导入多个构建模块：

import osimport sysimport ioimport nbformatimport typesdef import_from_notebook():def import_definitions_from_notebook(fullname, names):current_dir = os.getcwd()path = os.path.join(current_dir, fullname + ".ipynb")path = os.path.normpath(path)# Load the notebookif not os.path.exists(path):raise FileNotFoundError(f"Notebook file not found at: {path}")with io.open(path, "r", encoding="utf-8") as f:nb = nbformat.read(f, as_version=4)# Create a module to store the imported functions and classesmod = types.ModuleType(fullname)sys.modules[fullname] = mod# Go through the notebook cells and only execute function or class definitionsfor cell in nb.cells:if cell.cell_type == "code":cell_code = cell.sourcefor name in names:# Check for function or class definitionsif f"def {name}" in cell_code or f"class {name}" in cell_code:exec(cell_code, mod.__dict__)return modfullname = "converting-gpt-to-llama2"names = ["precompute_rope_params", "compute_rope", "SiLU", "FeedForward", "RMSNorm", "MultiHeadAttention"]return import_definitions_from_notebook(fullname, names)

imported_module = import_from_notebook()# We need to redefine precompute_rope_params# precompute_rope_params = getattr(imported_module, "precompute_rope_params", None)compute_rope = getattr(imported_module, "compute_rope", None)SiLU = getattr(imported_module, "SiLU", None)FeedForward = getattr(imported_module, "FeedForward", None)RMSNorm = getattr(imported_module, "RMSNorm", None)# MultiHeadAttention only for comparison purposesMultiHeadAttention = getattr(imported_module, "MultiHeadAttention", None)

1.2 修改后的 RoPE

Llama 3 使用的 RoPE 与 Llama 2 相似，可参阅 RoPE 论文（https://arxiv.org/abs/2104.09864）。

不过，二者 RoPE 设置有一些细微差别。Llama 3 现在支持多达 8192 个 token，是 Llama 2（4096）的两倍。

RoPE 的基础值（见下文公式），从 10000（Llama 2）增加到 50000（Llama 3），公式如下（改编自 RoPE 论文）：

这些值是一组预定义的参数，用于确定旋转矩阵中的旋转角度，其中的维数是嵌入空间的维数。

将基数从 10000 增加到 50000，频率（或旋转角度）在各维度上的衰减速度会更慢，这意味着维度越高，角度越大（本质上，这是对频率的解压缩）。

此外，我们还在下面的代码中引入了一个 freq_config 部分，用于调整频率；不过，在 Llama 3（只有 Llama 3.1 和 Llama 3.2）中并不需要它，所以稍后会重新访问这个 freq_config（默认设置为「无」并被忽略）。

import torchdef precompute_rope_params(head_dim, theta_base=10000, context_length=4096, freq_config=None):assert head_dim % 2 == 0, "Embedding dimension must be even"# Compute the inverse frequenciesinv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim // 2) / (head_dim // 2)))################################ NEW ################################################ Frequency adjustmentsif freq_config is not None:low_freq_wavelen = freq_config["original_context_length"] / freq_config["low_freq_factor"]high_freq_wavelen = freq_config["original_context_length"] / freq_config["high_freq_factor"]wavelen = 2 * torch.pi / inv_freqinv_freq_llama = torch.where(wavelen > low_freq_wavelen, inv_freq / freq_config["factor"], inv_freq)smooth_factor = (freq_config["original_context_length"] / wavelen - freq_config["low_freq_factor"]) / (freq_config["high_freq_factor"] - freq_config["low_freq_factor"])smoothed_inv_freq = ((1 - smooth_factor) * (inv_freq / freq_config["factor"]) + smooth_factor * inv_freq)is_medium_freq = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)inv_freq = inv_freq_llama##################################################################################### Generate position indicespositions = torch.arange(context_length)# Compute the anglesangles = positions[:, None] * inv_freq[None, :]  # Shape: (context_length, head_dim // 2)# Expand angles to match the head_dimangles = torch.cat([angles, angles], dim=1)  # Shape: (context_length, head_dim)# Precompute sine and cosinecos = torch.cos(angles)sin = torch.sin(angles)return cos, sin

总之，与 Llama 2 相比，Llama 3 的新功能是「上下文长度」和 theta 基底参数：

# Instantiate RoPE parametersllama_2_context_len = 4096llama_3_context_len = 8192llama_2_theta_base = 10_000llama_3_theta_base = 50_000

在 Llama 2 中，用法与以前相同：

# Settingsbatch_size = 2num_heads = 4head_dim = 16# Instantiate RoPE parameterscos, sin = precompute_rope_params(head_dim=head_dim,theta_base=llama_3_theta_base,context_length=llama_3_context_len)# Dummy query and key tensorstorch.manual_seed(123)queries = torch.randn(batch_size, llama_3_context_len, num_heads, head_dim)keys = torch.randn(batch_size, llama_3_context_len, num_heads, head_dim)# Apply rotary position embeddingsqueries_rot = compute_rope(queries, cos, sin)keys_rot = compute_rope(keys, cos, sin)

1.3 分组查询注意力

本节将用一种名为分组查询注意力（GQA）的替代机制来取代多头注意力（MHA）。简而言之，可以将 GQA 视为计算和参数效率更高的 MHA 版本。

在 GQA 中，通过在多个注意力头之间共享来减少键和值投影的数量，每个注意力头仍有其独特的查询，但这些查询关注同一组键和值。

下面是具有 2 个 key-value 组的 GQA 示例：

GQA 的主要思想是减少与键值对相关的唯一查询组的数量，从而在不显著降低建模性能的情况下，减少 MHA 中某些矩阵乘法的大小和参数的数量。

简而言之，GQA 的主要变化是每个查询组都需要重复，以匹配与之相关的头数量，具体实现如下：

import torch.nn as nn
class GroupedQueryAttention(nn.Module):    def __init__(            self, d_in, d_out, context_length, num_heads,            num_kv_groups,       # NEW            rope_base=10_000,    # NEW            rope_config=None,    # NEW            dtype=None        ):        super().__init__()        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
        self.d_out = d_out        self.num_heads = num_heads        self.head_dim = d_out // num_heads
        ############################# NEW  #############################        # self.W_key = nn.Linear(d_in, d_out, bias=False, dtype=dtype)        # self.W_value = nn.Linear(d_in, d_out, bias=False, dtype=dtype)        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)        self.num_kv_groups = num_kv_groups        self.group_size = num_heads // num_kv_groups        ################################################################
        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))        cos, sin = precompute_rope_params(            head_dim=self.head_dim,            theta_base=rope_base,      # NEW            freq_config=rope_config,   # NEW            context_length=8192        )        self.register_buffer("cos", cos)        self.register_buffer("sin", sin)
    def forward(self, x):        b, num_tokens, d_in = x.shape
        queries = self.W_query(x)  # Shape: (b, num_tokens, d_out)        keys = self.W_key(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)        values = self.W_value(x)  # Shape: (b, num_tokens, num_kv_groups * head_dim)
        # Reshape queries, keys, and values        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
        ##################### NEW  #####################        # keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)        # values = values.view(b, num_tokens, self.num_heads, self.head_dim)        keys = keys.view(b, num_tokens, self.num_kv_groups, self.head_dim)        values = values.view(b, num_tokens, self.num_kv_groups, self.head_dim)        ################################################
        # Transpose keys, values, and queries        keys = keys.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)        values = values.transpose(1, 2)  # Shape: (b, num_heads, num_tokens, head_dim)        queries = queries.transpose(1, 2)  # Shape: (b, num_query_groups, num_tokens, head_dim)
        # Apply RoPE        keys = compute_rope(keys, self.cos, self.sin)        queries = compute_rope(queries, self.cos, self.sin)
        ##################### NEW  #####################        # Expand keys and values to match the number of heads        # Shape: (b, num_heads, num_tokens, head_dim)
        keys = keys.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)




    
        values = values.repeat_interleave(self.group_size, dim=1)  # Shape: (b, num_heads, num_tokens, head_dim)        # For example, before repeat_interleave along dim=1 (query groups):        #   [K1, K2]        # After repeat_interleave (each query group is repeated group_size times):        #   [K1, K1, K2, K2]        # If we used regular repeat instead of repeat_interleave, we'd get:        #   [K1, K2, K1, K2]        ################################################
        # Compute scaled dot-product attention (aka self-attention) with a causal mask        # Shape: (b, num_heads, num_tokens, num_tokens)        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head
        # Original mask truncated to the number of tokens and converted to boolean        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        # Use the mask to fill attention scores        attn_scores.masked_fill_(mask_bool, -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)        assert keys.shape[-1] == self.head_dim
        # Shape: (b, num_tokens, num_heads, head_dim)        context_vec = (attn_weights @ values).transpose(1, 2)
        # Combine heads, where self.d_out = self.num_heads * self.head_dim        context_vec = context_vec.reshape(b, num_tokens, self.d_out)        context_vec = self.out_proj(context_vec)  # optional projection
        return context_vec

参数节省的情况，请参考以下来自 GPT 和 Llama 2 代码的多头注意力示例：

# Settingsbatch_size = 1context_len = 3000max_context_len = 8192embed_dim = 4096num_heads = 32example_batch = torch.randn((batch_size, context_len, embed_dim))mha = MultiHeadAttention(d_in=embed_dim,d_out=embed_dim,context_length=max_context_len,num_heads=num_heads)mha(example_batch)print("W_key:", mha.W_key.weight.shape)print("W_value:", mha.W_value.weight.shape)print("W_query:", mha.W_query.weight.shape)

W_key: torch.Size([4096, 4096])W_value: torch.Size([4096, 4096])W_query: torch.Size([4096, 4096])

现在，如果改用分组查询注意力，并使用 8 个 kv 组（Llama 3 8B 使用了 8 个 kv 组），可以看到 key 和 value 矩阵的行数减少了 4 倍（因为 32 个注意力头除以 8 个 kv 组就是 4）：

gqa = GroupedQueryAttention(d_in=embed_dim,d_out=embed_dim,context_length=max_context_len,num_heads=num_heads,num_kv_groups=8,rope_base=llama_3_theta_base)gqa(example_batch)print("W_key:", gqa.W_key.weight.shape)print("W_value:", gqa.W_value.weight.shape)print("W_query:", gqa.W_query.weight.shape)

W_key: torch.Size([1024, 4096])W_value: torch.Size([1024, 4096])W_query: torch.Size([4096, 4096])

顺便提一下，为了使分组查询注意力等同于标准的多头注意力，可以将查询组的数量（num_kv_groups）设置为与头的数量（num_heads）相等。

最后，比较一下下面的参数数量：

print("Total number of parameters:")mha_total_params = sum(p.numel() for p in mha.parameters())print(f"MHA: {mha_total_params:,}")gqa_total_params = sum(p.numel() for p in gqa.parameters())print(f"GQA: {gqa_total_params:,}")

Total number of parameters:MHA: 67,108,864GQA: 41,943,040

# Free up memory:del mhadel gqa

1.4 更新 TransformerBlock 模块

接下来，更新 Transformer 块。在这里，只需将 MultiHeadAttention 与 GroupedQueryAttention 互换，并添加新的 RoPE 设置：

class TransformerBlock(nn.Module):    def __init__(self, cfg):        super().__init__()        self.att =  GroupedQueryAttention(  # MultiHeadAttention(            d_in=cfg["emb_dim"],            d_out=cfg["emb_dim"],            context_length=cfg["context_length"],            num_heads=cfg["n_heads"],            num_kv_groups=cfg["n_kv_groups"],  # NEW            rope_base=cfg["rope_base"],        # NEW            rope_config=cfg["rope_freq"],      # NEW            dtype=cfg["dtype"]        )        self.ff = FeedForward(cfg)        self.norm1 = RMSNorm(cfg["emb_dim"], eps=1e-5)        self.norm2 = RMSNorm(cfg["emb_dim"], eps=1e-5)
    def forward(self, x):        # Shortcut connection for attention block        shortcut = x        x = self.norm1(x)        x = self.att(x.to(torch.bfloat16))   # Shape [batch_size, num_tokens, emb_size]        x = x + shortcut  # Add the original input back
        # Shortcut connection for feed-forward block        shortcut = x        x = self.norm2(x)        x = self.ff(x.to(torch.bfloat16))        x = x + shortcut  # Add the original input back
        return x

1.5 定义模型类

幸运的是，在设置模型类时，我们不需要做太多，只需将名称更新为 Llama3Model

# class Llama2Model(nn.Module):class Llama3Model(nn.Module):    def __init__(self, cfg):        super().__init__()        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"], dtype=cfg["dtype"])
        self.trf_blocks = nn.Sequential(            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        self.final_norm = RMSNorm(cfg["emb_dim"], eps=1e-5)        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False, dtype=cfg["dtype"])
    def forward(self, in_idx):        batch_size, seq_len = in_idx.shape        tok_embeds = self.tok_emb(in_idx)        x = tok_embeds        x = self.trf_blocks(x)        x = self.final_norm(x)        logits = self.out_head(x.to(torch.bfloat16))        return logits

2 初始化模型

现在，我们可以定义一个 Llama 3 配置文件（为便于比较，显示的是 Llama 2 配置文件）：

LLAMA2_CONFIG_7B = {"vocab_size"




    
: 32_000,    # Vocabulary size"context_length": 4096,  # Context length"emb_dim": 4096,         # Embedding dimension"n_heads": 32,           # Number of attention heads"n_layers": 32,          # Number of layers"hidden_dim": 11_008,    # Size of the intermediate dimension in FeedForward"dtype": torch.bfloat16  # Lower-precision dtype to save memory}

LLAMA3_CONFIG_8B = {"vocab_size": 128_256,   # NEW: Larger vocabulary size"context_length": 8192,  # NEW: Larger context length"emb_dim": 4096,         # Embedding dimension"n_heads": 32,           # Number of attention heads"n_layers": 32,          # Number of layers"hidden_dim": 14_336,    # NEW: Larger size of the intermediate dimension in FeedForward"n_kv_groups": 8,        # NEW: Key-Value groups for grouped-query attention"rope_base": 50_000,     # NEW: The base in RoPE's "theta" was increased to 50_000"rope_freq": None,       # NEW: Additional configuration for adjusting the RoPE frequencies"dtype": torch.bfloat16  # Lower-precision dtype to save memory}

使用这些设置，我们现在可以初始化 Llama 3 8B 模型。

请注意，这需要约 34 GB 内存（作为对比，Llama 2 7B 需要约 26 GB 内存）

model = Llama3Model(LLAMA3_CONFIG_8B)

total_params = sum(p.numel() for p in model.parameters())print(f"Total number of parameters: {total_params:,}")

Total number of parameters: 8,030,261,248

如上图所示，模型包含 80 亿个参数。此外，我们还可以使用下面的代码计算该模型的内存需求：

def model_memory_size(model, input_dtype=torch.float32):    total_params = 0    total_grads = 0    for param in model.parameters():        # Calculate total number of elements per parameter        param_size = param.numel()        total_params += param_size        # Check if gradients are stored for this parameter        if param.requires_grad:            total_grads += param_size
    # Calculate buffer size (non-parameters that require memory)    total_buffers = sum(buf.numel() for buf in model.buffers())
    # Size in bytes = (Number of elements) * (Size of each element in bytes)    # We assume parameters and gradients are stored in the same type as input dtype    element_size = torch.tensor(0, dtype=input_dtype).element_size()    total_memory_bytes = (total_params + total_grads + total_buffers) * element_size
    # Convert bytes to gigabytes    total_memory_gb = total_memory_bytes / (1024**3)
    return total_memory_gb
print(f"float32 (PyTorch default): {model_memory_size(model, input_dtype=torch.float32):.2f} GB")print(f"bfloat16: {model_memory_size(model, input_dtype=torch.bfloat16):.2f} GB")float32 (PyTorch default): 68.08 GBbfloat16: 34.04 GB

最后，如果适用，我们还可以将模型转移到 NVIDIA 或 Apple Silicon GPU 上：

if torch.cuda.is_available():    device = torch.device("cuda")elif torch.backends.mps.is_available():    device = torch.device("mps")else:    device = torch.device("cpu")
model.to(device);

3 加载 tokenizer

在本节中，我们将为模型加载 tokenizer。

Llama 2 使用了谷歌的 SentencePiece tokenizer ，而不是 OpenAI 基于 Tiktoken 库的 BPE tokenizer 。然而，Llama 3 恢复使用 Tiktoken 的 BPE tokenizer；具体来说，它使用的是具有扩展词汇的 GPT-4 tokenizer。我们可以在 Meta AI 的官方 Llama 3 存储库中找到最初的 Tiktoken 适配程序。

下面重写了 tokenizer 的代码，使其更易读，更适合本笔记本使用（但表现应该是相似的）：

import osfrom pathlib import Path
import tiktokenfrom tiktoken.load import load_tiktoken_bpe

class Tokenizer:    def __init__(self, model_path):        assert os.path.isfile(model_path), f"Model file {model_path} not found"        mergeable_ranks = load_tiktoken_bpe(model_path)        num_base_tokens = len(mergeable_ranks)
        self.special_tokens = {            "": 128000,            "": 128001,            "": 128006,            "": 128007,            "": 128009,        }        self.special_tokens.update({            f"": 128002 + i for i in range(256) if (128002 + i) not in self.special_tokens.values()        })
        self.model = tiktoken.Encoding(            name=Path(model_path).name,            pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",            mergeable_ranks=mergeable_ranks,            special_tokens=self.special_tokens        )

    def encode(self, text, bos=False, eos=False, allowed_special=set(), disallowed_special=()):        if bos:            tokens = [self.special_tokens[""]]        else:            tokens = []
        tokens += self.model.encode(text, allowed_special=allowed_special, disallowed_special=disallowed_special)
        if eos:            tokens.append(self.special_tokens[""])        return tokens
    def decode(self, tokens):        return self.model.decode(tokens)

Meta AI 在 Hugging Face Hub 上共享了 Llama 3 模型的原始权重和 tokenizer 词库。

我们将首先从 Hub 下载 tokenizer 词库，并将其加载到上述代码中。请注意，Meta AI 要求你在下载文件前接受 Llama 3 许可条款；为此必须创建一个 Hugging Face Hub 账户，并访问 meta-llama/Meta-Llama-3-8B 存储库以接受条款。

接下来，需要创建一个访问 token；要生成一个具有「读取」权限的访问 token，请点击右上角的个人资料图片，然后点击「设置」。

然后，创建并复制访问 token，以便复制并粘贴到下一个代码单元中：

from huggingface_hub import loginimport jsonwith open("config.json", "r") as config_file:config = json.load(config_file)access_token = config["HF_ACCESS_TOKEN"]login(token=access_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.




    
Token is valid (permission: read).Your token has been saved to /root/.cache/huggingface/tokenLogin successful

通过访问 token 登录（这是验证我们是否接受 Llama 3 许可条款所必需的）后，就可以下载 tokenizer 词库了：

from huggingface_hub import hf_hub_downloadtokenizer_file_path = hf_hub_download(repo_id="meta-llama/Meta-Llama-3-8B",filename="original/tokenizer.model",local_dir="llama3-files")

请注意，在使用 Llama 3 文件时，我们可能需要 blobfile 软件包，它用于处理存储在云存储解决方案（如 Google Cloud Storage (GCS)、Azure Blob Storage 或 Amazon S3）中的数据集或模型。

可以通过取消注释并执行下面的 pip 命令来安装此依赖包：

# pip install blobfile

tokenizer = Tokenizer(tokenizer_file_path)

现在，我们可以使用生成函数让 Llama 3 模型生成新文本：

from previous_chapters import generate, text_to_token_ids, token_ids_to_texttorch.manual_seed(123)token_ids = generate(model=model,idx=text_to_token_ids("Every effort", tokenizer).to(device),max_new_tokens=30,context_size=LLAMA3_CONFIG_8B["context_length"],top_k=1,temperature=0.)print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text: Every effort_dead aeros Ingredients başında.extension clangmissions.esp 사진 Ek Pars til DoctorsDaoеньostivan normal Ekized � Ekized � Ek rdr tık%,orgen>',

当然，正如我们在上面看到的，这段文字是毫无意义的，因为我们还没有训练过 Llama 3 模型。在下一节中，我们将从 Meta AI 中加载预训练的权重，而不是自己训练模型，因为这将花费数万至数十万美元。

4 加载预训练权重

我们将加载下面的「meta-llama/Meta-Llama-3-8B 」base 模型，它是微调前的简单文本补全模型。

或者，你也可以加载经过指令微调和对齐的「meta-llama/Meta-Llama-3-8B-Instruct」模型，方法是相应修改下一个代码单元中的字符串。加起来，权重文件大约有 16 GB 大。

from safetensors.torch import load_file
combined_weights = {}
for i in range(1, 5):    weights_file = hf_hub_download(        repo_id="meta-llama/Meta-Llama-3-8B",        filename=f"model-0000{i}-of-00004.safetensors",        local_dir="llama3-files"    )    current_weights = load_file(weights_file)    combined_weights.update(current_weights)model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00, ?B/s]model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00, ?B/s]model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00, ?B/s]model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00, ?B/s]

权重包含以下张量（为简单起见，只显示前 15 个张量）：

list(combined_weights.keys())[:15]

['model.embed_tokens.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.1.input_layernorm.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.1.post_attention_layernorm.weight']

下面的函数仿照《Build a Large Language Model From Scratch》第 5 章（ https://github.com/rasbt/LLMs-from-scratch/blob/0972ded5309c25dc5eecc98b62897d677c6c36c4/ch05/01_main-chapter-code/ch05.ipynb ）中的 load_weights_into_gpt 函数，将预训练好的权重加载到 Llama 3 模型中：

def assign(left, right, tensor_name="unknown"):    if left.shape != right.shape:        raise ValueError(f"Shape mismatch in tensor '{tensor_name}'. Left: {left.shape}, Right: {right.shape}")
    if isinstance(right, torch.Tensor):        return torch.nn.Parameter(right.clone().detach())    else:        return torch.nn.Parameter(torch.tensor(right))

def load_weights_into_llama(model, param_config, params):    model.tok_emb.weight = assign(model.tok_emb.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")
    for l in range(param_config["n_layers"]):
        # Load attention weights        model.trf_blocks[l].att.W_query.weight = assign(            model.trf_blocks[l].att.W_query.weight,            params[f"model.layers.{l}.self_attn.q_proj.weight"],            f"model.layers.{l}.self_attn.q_proj.weight"        )        model.trf_blocks[l].att.W_key.weight = assign(            model.trf_blocks[l].att.W_key.weight,            params[f"model.layers.{l}.self_attn.k_proj.weight"],            f"model.layers.{l}.self_attn.k_proj.weight"        )        model.trf_blocks[l].att.W_value.weight = assign(            model.trf_blocks[l].att.W_value.weight,            params[f"model.layers.{l}.self_attn.v_proj.weight"],            f"model.layers.{l}.self_attn.v_proj.weight"        )        model.trf_blocks[l].att.out_proj.weight = assign(            model.trf_blocks[l].att.out_proj.weight,            params[f"model.layers.{l}.self_attn.o_proj.weight"],            f"model.layers.{l}.self_attn.o_proj.weight"        )        model.trf_blocks[l].norm1.weight = assign(            model.trf_blocks[l].norm1.weight,            params[f"model.layers.{l}.input_layernorm.weight"],            f"model.layers.{l}.input_layernorm.weight"        )
        # Load FeedForward weights        model.trf_blocks[l].ff.fc1.weight = assign(            model.trf_blocks[l].ff.fc1.weight,            params[f"model.layers.{l}.mlp.gate_proj.weight"],            f"model.layers.{l}.mlp.gate_proj.weight"        )        model.trf_blocks[l].ff.fc2.weight = assign(            model.trf_blocks[l].ff.fc2.weight,            params[f"model.layers.{l}.mlp.up_proj.weight"],            f"model.layers.{l}.mlp.up_proj.weight"        )        model.trf_blocks[l].ff.fc3.weight = assign(            model.trf_blocks[l].ff.fc3.weight,            params[f"model.layers.{l}.mlp.down_proj.weight"],            f"model.layers.{l}.mlp.down_proj.weight"        )        model.trf_blocks[l].norm2.weight = assign(            model.trf_blocks[l].norm2.weight,            params[f"model.layers.{l}.post_attention_layernorm.weight"],            f"model.layers.{l}.post_attention_layernorm.weight"        )
    # Load output layer weights    model.final_norm.weight = assign(model.final_norm.weight, params["model.norm.weight"], "model.norm.weight")
    if "lm_head.weight" in params.keys():        model.out_head.weight = assign(model.out_head.weight, params["lm_head.weight"], "lm_head.weight")




    
    else:        model.out_head.weight = assign(model.out_head.weight, params["model.embed_tokens.weight"], "model.embed_tokens.weight")        print("Model uses weight tying.")

load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)model.to(device);del combined_weights  # free up memory

接下来，我们就可以使用该模型生成文本了：

torch.manual_seed(123)
token_ids = generate(    model=model,    idx=text_to_token_ids("Every effort", tokenizer).to(device),    max_new_tokens=25,    context_size=LLAMA3_CONFIG_8B["context_length"],    top_k=1,    temperature=0.)
print("Output text:\n", token_ids_to_text(token_ids, tokenizer))Output text: Every effort has been made to trace copyright holders and to obtain their permission for the use of copyright material. The publisher apologizes for any

5 使用指令微调模型

上面我们使用的是经过预训练的基础模型，如果你想使用一个能够遵循指令的模型，请使用「meta-llama/Llama-3-8b-Instruct」模型，如下所示：

# to free up memoryimport gcdel modelgc.collect()  # Run Python garbage collectorif torch.cuda.is_available():torch.cuda.empty_cache()

combined_weights = {}for i in range(1, 5):weights_file = hf_hub_download(repo_id="meta-llama/Meta-Llama-3-8B-Instruct",filename=f"model-0000{i}-of-00004.safetensors",local_dir="llama3-files")current_weights = load_file(weights_file)combined_weights.update(current_weights)model = Llama3Model(LLAMA3_CONFIG_8B)load_weights_into_llama(model, LLAMA3_CONFIG_8B, combined_weights)model.to(device)del combined_weights  # free up memory

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00, ?B/s]

请注意，Llama 3 模型最好与微调时使用的正确提示模板一起使用。

下面是一个基于 Meta AI 的 Llama 3 专用 ChatFormat 代码的 tokenizer wrapper 类，用于构建提示模板：

class ChatFormat:    def __init__(self, tokenizer):        self.tokenizer = tokenizer
    def encode_header(self, message):        tokens = []        tokens.append(self.tokenizer.special_tokens[""])        tokens.extend(self.tokenizer.encode(message["role"], bos=False, eos=False))        tokens.append(self.tokenizer.special_tokens[""])        tokens.extend(self.tokenizer.encode("\n\n", bos=False, eos=False))        return tokens
    def encode(self, text):        message = {            "role": "user",            "content": text        }
        tokens = self.encode_header(message)        tokens.extend(            self.tokenizer.encode(message["content"].strip(), bos=False, eos=False)        )        tokens.append(self.tokenizer.special_tokens[""])        return tokens
    def decode(self, token_ids):        return self.tokenizer.decode(token_ids)

chat_tokenizer = ChatFormat(tokenizer)

用法如下：

token_ids = chat_tokenizer.encode("Hello World!")print(token_ids)

[128006, 882, 128007, 271, 9906, 4435, 0, 128009]

tokenizer.decode(token_ids)

'user\n\nHello World!'

现在，让我们来看看 Llama 3 教学模式的实际应用：

import retorch.manual_seed(123)token_ids = generate(model=model,idx=text_to_token_ids("What do llamas eat?", chat_tokenizer).to(device),max_new_tokens=150,context_size=LLAMA3_CONFIG_8B["context_length"],top_k=1,temperature=0.)output_text = token_ids_to_text(token_ids, tokenizer)def clean_text(text, header_end="assistant\n\n"):# Find the index of the first occurrence of ""index = text.find(header_end)if index != -1:# Return the substring starting after ""return text[index + len(header_end):].strip()  # Strip removes leading/trailing whitespaceelse:# If the token is not found, return the original textreturn textprint("Output text:\n", clean_text(output_text))

Output text: Llamas are herbivores, which means they primarily eat plants and plant-based foods. Here are some of the things llamas like to eat:
1. Grass: Llamas love to graze on grass, especially in the spring and summer months.2. Hay: Hay is a staple in a llama's diet. They like to eat timothy hay, alfalfa hay, and other types of hay.3. Grains: Llamas may also be fed grains like oats, barley, and corn. However, grains should not make up more than 10% of a llama's diet.4. Fruits and vegetables: Llamas may enjoy fruits and vegetables as treats, such as apples,

Llama 3.1 8B

在 Llama 3 发布几个月后，Meta AI 又推出了 Llama 3.1 模型套件（详见 Llama 3.1 官方介绍）。

方便的是，我们可以重复使用之前的 Llama 3 代码来实现 Llama 3.1 8B：

结构完全相同，唯一的变化是重新调整了 RoPE 频率，如下配置文件所示：

LLAMA3_CONFIG_8B = {"vocab_size": 128_256,   # Vocabulary size"context_length": 8192,  # Context length"emb_dim": 4096,         # Embedding dimension"n_heads": 32,           # Number of attention heads"n_layers": 32,          # Number of layers"hidden_dim": 14_336,    # Size of the intermediate dimension in FeedForward"n_kv_groups": 8,        # Key-Value groups for grouped-query attention"rope_base": 50_000,     # The base in RoPE's "theta""rope_freq": None,       # Additional configuration for adjusting the RoPE frequencies"dtype": torch.bfloat16  # Lower-precision dtype to save memory}LLAMA31_CONFIG_8B = {"vocab_size": 128_256,    # Vocabulary size"context_length": 8192,   # Context length"emb_dim": 4096,          # Embedding dimension"n_heads": 32,            # Number of attention heads"n_layers"