Author: LeonYi. Link: https://www.zhihu.com/question/632473480/answer/75664255663
Using Qwen2ForSequenceClassification for text classification.
I. Experimental results and conclusions

Over the past few months I have run many experiments on LLM-based classification and accumulated a bit of experience.
1. Short text

1) For query sentiment classification, an LLM is generally no better than BERT.
PS: this is broadly consistent with the conclusions at https://segmentfault.com/a/1190000044485544#item-13.

2. Long text

1) For long ASR transcripts of phone calls, BERT truncated to 512 tokens is worse than the LLM.
The LLM input was not truncated (if both were truncated to 512, results might be similar).

2) Base vs. Instruct
With little data, fine-tuning the Base model is worse than fine-tuning the Instruct model (the Instruct model pays an alignment tax, but with a small fine-tuning set it still beats a Base model that has never seen instruction-tuned samples).

3) Full fine-tuning (SFT) vs. LoRA
With little data (under roughly 10K samples in total; the per-label requirement depends on the task), full SFT is worse than LoRA (and SFT also costs more in hyperparameter tuning).

3. Ways to improve classification performance
1) Specific to generative fine-tuning

Mixing in data from other business lines in the same domain with a similar data type can add a few points.

- The data distributions must not differ too much, especially in text length; otherwise mixing the data in actually hurts (e.g. one set averaging 1.2K characters, the other 5K).
- Mixing order: I used random sampling and did not verify whether training on the two sets sequentially, in either order, makes a difference.
- Optimize the prompt: add a concise description of each label to the prompt; for short text, try few-shot examples (see the prompt sketch after this list).

2) Classification-head fine-tuning and generative fine-tuning

- With plenty of data (over 10K samples, with enough samples per label on average), try fine-tuning the Base model rather than the Instruct model.
- Data augmentation: try pseudo-labeled samples produced on unlabeled data (labels extracted by prompting plus labels extracted by the fine-tuned model).
- When fine-tuning with LoRA, also include the LLM's embedding layer (not verified).
- Try distilling a larger model into a smaller one (pain points: the large model is harder to tune and more expensive to train, and the deployed model still has to be small).
- Try hyperparameter tuning (I tried automatic search with optuna without much success; usually just lr, epochs, and LoRA rank).

3) Heavyweight

Continue pre-training the Base model on domain data, then instruction-tune it.

This approach is still to be verified (if it works, the benefit is that the resulting base model lifts fine-tuning results across domain tasks). I did try a model that had already been instruction-tuned on domain data on top of Qwen2-7B-Instruct, and fine-tuning it was actually somewhat worse than fine-tuning Qwen2-7B-Instruct directly; since I don't know the details of how that model was trained, the reason remains unclear.

The first priority is still the data. Only after that is it worth adding or removing the various tricks above.
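As a concrete illustration of the prompt-optimization point under 1) above: a minimal sketch of a classification prompt that embeds a short description of each label plus optional few-shot examples. The label names, descriptions, and wording here are illustrative only, not the production prompt.

```python
# Minimal sketch: classification prompt with per-label descriptions and optional few-shot examples.
LABEL_DESCRIPTIONS = {
    "正向": "用户表达满意、认可或推荐",
    "负向": "用户表达不满、抱怨或投诉",
}

def build_prompt(text, few_shot=None):
    label_block = "\n".join(f"- {name}: {desc}" for name, desc in LABEL_DESCRIPTIONS.items())
    shot_block = ""
    if few_shot:  # few_shot: list of (text, label) pairs
        shot_block = "\n".join(f"文本:{t}\n标签:{l}" for t, l in few_shot) + "\n"
    return (
        "请将下面的文本分类到以下标签之一,只输出标签名。\n"
        f"{label_block}\n{shot_block}文本:{text}\n标签:"
    )

print(build_prompt("这基金表现也太差了吧", few_shot=[("这基金真的稳啊", "正向")]))
```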
4. Things to watch out for

- Label noise: mislabeled samples need to be identified, removed, or corrected during error analysis.
- Classification business rules: in complex scenarios, settle complete annotation rules up front to avoid rework (which cases the model should handle and which it should not).

Areas for improvement:
II. Text classification: from BERT to LLM

Qwen2ForSequenceClassification, LlamaForSequenceClassification, and BertForSequenceClassification can all be used for text classification.
All of them can be loaded automatically through the AutoModelForSequenceClassification class in Hugging Face transformers, which resolves to the appropriate model class for sequence classification.
Qwen2ForSequenceClassification and BertForSequenceClassification follow the same logic: both put a Linear layer on top of the model's output to perform classification.
Everything previously built on top of BERT can therefore be ported to an LLM, e.g. BERT-CRF or BERT-SUM.
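A minimal sketch of this auto-loading behavior (the checkpoint names are just examples; any BERT or Qwen2 checkpoint would do):

```python
from transformers import AutoModelForSequenceClassification

# The same Auto class resolves to BertForSequenceClassification or
# Qwen2ForSequenceClassification depending on each checkpoint's config.
for name in ["bert-base-chinese", "Qwen/Qwen2-0.5B-Instruct"]:  # example checkpoints
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    print(type(model).__name__)
# BertForSequenceClassification
# Qwen2ForSequenceClassification
```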
2.1 BertForSequenceClassification

BertForSequenceClassification is a ready-made class for text classification that inherits from BertPreTrainedModel. The number of classes is passed via num_labels. From the constructor you can see the class has three main parts: a BertModel, a Dropout, and a Linear layer that serves as the classifier.
```python
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)

        self.init_weights()
```
BertModel extracts the text features (embeddings), Dropout guards against overfitting, and the Linear layer is a weak classifier that produces the predictions. If you need a more complex network for classification, you can use this class as a template and rewrite it, as sketched below.
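For instance, a minimal sketch (assuming you want a small MLP head in place of the single Linear layer) that keeps the same overall structure:

```python
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertWithMLPHead(BertPreTrainedModel):
    """Sketch: BertForSequenceClassification with a small MLP classifier head."""

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size),
            nn.Tanh(),
            nn.Dropout(config.hidden_dropout_prob),
            nn.Linear(config.hidden_size, config.num_labels),
        )
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        pooled = self.dropout(outputs.pooler_output)  # [CLS] pooled representation
        logits = self.classifier(pooled)
        loss = None
        if labels is not None:
            loss = nn.CrossEntropyLoss()(logits.view(-1, self.num_labels), labels.view(-1))
        return (loss, logits) if loss is not None else (logits,)
```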
The loss function is already defined inside forward(), so you don't need to implement it yourself for training. The return value contains four items:
```python
def forward(...):
    ...
    if labels is not None:
        if self.num_labels == 1:
            # We are doing regression
            loss_fct = MSELoss()
            loss = loss_fct(logits.view(-1), labels.view(-1))
        else:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        outputs = (loss,) + outputs
    return outputs  # (loss), logits, (hidden_states), (attentions)
```
2.2 Qwen2ForSequenceClassification

Next, let's look at Qwen2ForSequenceClassification. Its module structure:
```
Qwen2ForSequenceClassification(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1024, padding_idx=151643)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
          (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
          (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm()
        (post_attention_layernorm): Qwen2RMSNorm()
      )
    )
    (norm): Qwen2RMSNorm()
  )
  (score): Linear(in_features=1024, out_features=3, bias=False)
)
```
The official Qwen2 implementation supports three problem types:

- single_label_classification: labels are integer class indices; the loss is cross-entropy.
- multi_label_classification: labels are multi-hot vectors; a sigmoid is applied to each logit and the per-label binary cross-entropy terms are summed (BCEWithLogitsLoss).
- regression: defaults to single-dimensional regression (a regression head can serve as a reward model that predicts a score).

The three modes expect labels in different formats, as illustrated below.
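A quick sketch of the label formats each problem_type expects (batch size 2 and num_labels = 3 are just for illustration); problem_type can also be set explicitly on the config rather than inferred from the label dtype:

```python
import torch

num_labels = 3  # illustration only

# single_label_classification: integer class indices, CrossEntropyLoss
single_labels = torch.tensor([0, 2])                       # shape (batch,), dtype long

# multi_label_classification: multi-hot float targets, BCEWithLogitsLoss
multi_labels = torch.tensor([[1., 0., 1.], [0., 1., 0.]])  # shape (batch, num_labels)

# regression: float targets, MSELoss (num_labels == 1 for a single score / reward)
reg_labels = torch.tensor([[0.7], [0.1]])                  # shape (batch, 1)

# model.config.problem_type = "multi_label_classification"  # optional explicit setting
```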
```python
class Qwen2ForSequenceClassification(Qwen2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = Qwen2Model(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    @add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in
            `[0, ..., config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed
            (Mean-Square loss), If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
                sequence_lengths = sequence_lengths.to(logits.device)
            else:
                sequence_lengths = -1

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
```
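Note that, unlike BERT, there is no [CLS] token: the forward pass above pools the logits at the last non-padding token of each sequence. A toy reproduction of that index computation (toy values, pad_token_id = 0 assumed):

```python
import torch

pad_token_id = 0
input_ids = torch.tensor([
    [5, 6, 7, pad_token_id, pad_token_id],  # right-padded sequence with 3 real tokens
    [8, 9, 10, 11, 12],                     # full-length sequence, no padding
])
seq_len = input_ids.shape[-1]

# Same logic as in the forward pass above
sequence_lengths = torch.eq(input_ids, pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % seq_len
print(sequence_lengths)  # tensor([2, 4]) -> index of the last real token in each row
```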
III. LoRA fine-tuning of Qwen2ForSequenceClassification

After LoRA fine-tuning, the LoRA weights are merged into the base model and the full model is saved, because the current PEFT code does not store the parameters of the classification-head Linear layer; the LoRA weights alone are not enough to reproduce the trained Qwen2ForSequenceClassification model.
If needed, this can be worked around with a small code change.
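One possible small change (a sketch, assuming a recent peft version) is to register the classification head in modules_to_save, so that saving the PEFT model also stores the score layer's weights alongside the LoRA adapter:

```python
from peft import LoraConfig, TaskType

# Sketch: explicitly mark the score head so its trained weights are saved
# with the adapter checkpoint instead of being lost.
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["score"],  # name of Qwen2ForSequenceClassification's classifier layer
)
```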
I tested the code in the environment provided by ModelScope.
```python
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "qwen/Qwen2.5-3B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "不想学习怎么办?有兴趣,但是拖延症犯了"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

```shell
Downloading [config.json]: 100%|██████████| 661/661 [00:00<00:00, 998B/s]
Downloading [configuration.json]: 100%|██████████| 2.00/2.00 [00:00<00:00, 2.30B/s]
Downloading [generation_config.json]: 100%|██████████| 242/242 [00:00<00:00, 557B/s]
Downloading [LICENSE]: 100%|██████████| 7.21k/7.21k [00:00<00:00, 11.7kB/s]
Downloading [merges.txt]: 100%|██████████| 1.59M/1.59M [00:00<00:00, 3.01MB/s]
Downloading [model-00001-of-00002.safetensors]: 100%|██████████| 3.70G/3.70G [00:10<00:00, 373MB/s]
Downloading [model-00002-of-00002.safetensors]: 100%|██████████| 2.05G/2.05G [00:06<00:00, 332MB/s]
Downloading [model.safetensors.index.json]: 100%|██████████| 34.7k/34.7k [00:00<00:00, 56.8kB/s]
Downloading [README.md]: 100%|██████████| 4.79k/4.79k [00:00<00:00, 10.3kB/s]
Downloading [tokenizer.json]: 100%|██████████| 6.71M/6.71M [00:00<00:00, 8.58MB/s]
Downloading [tokenizer_config.json]: 100%|██████████| 7.13k/7.13k [00:00<00:00, 13.8kB/s]
Downloading [vocab.json]: 100%|██████████| 2.65M/2.65M [00:00<00:00, 5.09MB/s]
/usr/local/lib/python3.10/site-packages/accelerate/utils/modeling.py:1405: UserWarning: Current model requires 234882816 bytes of buffer for offloaded layers, which seems does not fit any GPU's remaining memory. If you are experiencing a OOM later, please consider using offload_buffers=True.
  warnings.warn(
```
The model's response:

Facing the tension between interest and procrastination can indeed be frustrating. Here are some suggestions that may help you overcome procrastination and stick with studying:

- Set small goals: break a big goal into a series of small ones. Completing each small goal is a small win that builds motivation and a sense of achievement.
- Make a plan: draw up a detailed study plan and follow it as closely as you can. Remember to leave room for breaks and keep a good balance between work and rest.
- Stay positive: be patient and kind to yourself, and don't give up over temporary difficulties. Remember, making progress is part of growing.

Since the ModelScope loader does not support LoRA, I checked the local model path:
```python
print(model.model_dir)
# /mnt/workspace/.cache/modelscope/hub/qwen/Qwen2___5-3B-Instruct
```
List the files in that directory:
```shell
config.json             merges.txt                        README.md
configuration.json      model-00001-of-00002.safetensors  tokenizer_config.json
generation_config.json  model-00002-of-00002.safetensors  tokenizer.json
LICENSE                 model.safetensors.index.json      vocab.json
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
```
Fine-tuning code:
```python
### Initial setup and random seeds
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
import numpy as np
import pandas as pd
import random

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
Twenty samples were constructed with an LLM from a prompt.
```python
import json

x = '''这基金表现也太差了吧,买了半年了还亏着呢。
管理费收得比别的基金都高,感觉就是在给基金公司打工。
想查查具体投了啥,结果发现透明度低得要命,啥也看不清楚。
基金经理换来换去的,都不知道到底谁在管我的钱。
客服电话打过去半天才有人接,问个问题还得等上好几天才有回复。
市场稍微有点风吹草动,这基金就跌得比谁都快。
投资组合里全是同一行业的股票,风险大得让人睡不着觉。
长期持有也没见赚多少钱,还不如存银行定期。
分红政策一会儿一个样,根本没法做财务规划。
当初宣传时说得好听,实际操作起来完全不是那么回事。'''
x_samples = x.split("\n")

y = '''这基金真的稳啊,买了之后收益一直挺不错的,感觉很靠谱!
管理团队超级专业,每次市场波动都能及时调整策略,让人放心。
透明度很高,随时都能查到投资组合的情况,心里有数。
基金经理经验老道,看准了几个大机会,赚了不少。
客服态度特别好,有问题总能很快得到解答,服务真是没得说。
即使在市场不好的时候,这基金的表现也比大多数同类产品强。
分散投资做得很好,风险控制得很到位,睡个安稳觉没问题。
长期持有的话,回报率真的非常可观,值得信赖。
分红政策明确而且稳定,每年都能按时收到分红,计划财务很方便。
宣传时承诺的那些好处都实现了,真心觉得选对了这只基金。'''
y_samples = y.split("\n")

# Build one dict per sample (x: negative reviews, y: positive reviews)
x_data = [{"content": i, "label": 1, "标注类别": "负向"} for i in x_samples]
y_data = [{"content": i, "label": 0, "标注类别": "正向"} for i in y_samples]

def save_json(path, data):
    # Dump the list of dicts as a JSON file
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

save_json('data/classify_train.json', x_data[:6] + y_data[:6])
save_json('data/classify_valid.json', x_data[6:8] + y_data[6:8])
save_json('data/classify_test.json', x_data[8:] + y_data[8:])
```
Data loading:
```python
import json
from tqdm import tqdm
from loguru import logger
from datasets import Dataset, load_dataset

def get_dataset_from_json(json_path, cols):
    with open(json_path, "r") as file:
        data = json.load(file)
    df = pd.DataFrame(data)
    dataset = Dataset.from_pandas(df[cols], split='train')
    return dataset

# load_dataset is too slow when reading JSON, so load the files directly
cols = ['content', 'label', '标注类别']
train_ds = get_dataset_from_json('data/classify_train.json', cols)
logger.info(f"TrainData num: {len(train_ds)}")
valid_ds = get_dataset_from_json('data/classify_valid.json', cols)
logger.info(f"ValidData num: {len(valid_ds)}")
test_ds = get_dataset_from_json('data/classify_test.json', cols)
logger.info(f"TestData num: {len(test_ds)}")
```
```python
print(train_ds[0])
# {'content': '这基金表现也太差了吧,买了半年了还亏着呢。', 'label': 1, '标注类别': '负向'}
```
Prepare the dataset (simple truncation and padding; no dynamic padding):
```python
id2label = {0: "正向", 1: "负向"}
label2id = {v: k for k, v in id2label.items()}

from transformers import AutoTokenizer, DataCollatorWithPadding
# from modelscope import AutoTokenizer, DataCollatorWithPadding

model_name_or_path = "/mnt/workspace/.cache/modelscope/hub/qwen/Qwen2___5-3B-Instruct"
model_name = model_name_or_path.split("/")[-1]
print(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side='left')
tokenizer.add_special_tokens({'pad_token': '<|endoftext|>'})  # make sure a pad token is defined; '<|endoftext|>' is a Qwen special token
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

MAX_LEN = 24
txt_colname = 'content'

def preprocess_function(examples):
    # Padding here is not very efficient; dynamic per-batch padding would be better
    return tokenizer(examples[txt_colname], max_length=MAX_LEN, padding=True, truncation=True)

tokenized_train = train_ds.map(preprocess_function, num_proc=64, batched=True)
tokenized_valid = valid_ds.map(preprocess_function, num_proc=64, batched=True)
```
Evaluation code with scikit-learn:
```python
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    f1_score, precision_score, recall_score
)

def evals(test_ds, model):
    k_list = [x[txt_colname] for x in test_ds]
    model.eval()
    k_result = []
    for idx, txt in tqdm(enumerate(k_list)):
        model_inputs = tokenizer([txt], max_length=MAX_LEN, truncation=True, return_tensors="pt").to(model.device)
        logits = model(**model_inputs).logits
        res = int(torch.argmax(logits, axis=1).cpu())
        k_result.append(id2label.get(res))
    y_true = np.array(test_ds['label'])
    y_pred = np.array([label2id.get(x) for x in k_result])
    return y_true, y_pred

def compute_metrics(eval_pred):
    predictions, label = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"f1": f1_score(y_true=label, y_pred=predictions, average='weighted')}

def compute_valid_metrics(eval_pred):
    predictions, label = eval_pred
    y_true, y_pred = label, predictions
    accuracy = accuracy_score(y_true, y_pred)
    print(f'Accuracy: {accuracy}')
    metrics = {'accuracy': accuracy}
    for metric_type in ['micro', 'macro', 'weighted']:
        precision = precision_score(y_true, y_pred, average=metric_type)
        recall = recall_score(y_true, y_pred, average=metric_type)
        f1 = f1_score(y_true, y_pred, average=metric_type)
        print(f'{metric_type} Precision: {precision}')
        print(f'{metric_type} Recall: {recall}')
        print(f'{metric_type} F1 Score: {f1}')
        metrics[f'{metric_type}_f1'] = f1
    return metrics  # return a dict so the caller can log it
```
Load the model and train with Trainer:
```python
import torch
from transformers import AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType

rank = 64
alpha = rank * 2

training_args = TrainingArguments(
    output_dir=f"./output/{model_name}/sequence_classify/",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    inference_mode=False,
    r=rank,
    lora_alpha=alpha,
    lora_dropout=0.1
)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name_or_path,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)
model.config.pad_token_id = tokenizer.pad_token_id
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

logger.info(f"Start training, rank: {rank}")
trainer.train()

logger.info(f"Valid Set, rank: {rank}")
y_true, y_pred = evals(valid_ds, model)
metrics = compute_valid_metrics((y_pred, y_true))
logger.info(metrics)

logger.info(f"Test Set, rank: {rank}")
y_true, y_pred = evals(test_ds, model)
metrics = compute_valid_metrics((y_pred, y_true))
logger.info(metrics)

# Merge the LoRA weights into the base model and save the full model
saved_model = model.merge_and_unload()
saved_model.save_pretrained('/model/qwen2-3b/seqcls')
```
Remove the LoraConfig and the get_peft_model call, and you get the full-parameter SFT version of this code.
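For inference, the merged checkpoint saved above can be reloaded as a plain Qwen2ForSequenceClassification model (a sketch; the path and example text are illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reload the merged classifier written by save_pretrained above
clf = AutoModelForSequenceClassification.from_pretrained(
    '/model/qwen2-3b/seqcls', torch_dtype=torch.bfloat16, device_map="auto"
)
clf_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side='left')

inputs = clf_tokenizer(["这基金表现也太差了吧"], return_tensors="pt").to(clf.device)
with torch.no_grad():
    pred_id = clf(**inputs).logits.argmax(-1).item()
print(id2label[pred_id])  # predicted label name
```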
The structure of the PEFT-wrapped model:
```
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): Qwen2ForSequenceClassification(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 2048)
        (layers): ModuleList(
          (0-35): 36 x Qwen2DecoderLayer(
            (self_attn): Qwen2SdpaAttention(
              (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
              (k_proj): Linear(in_features=2048, out_features=256, bias=True)
              (v_proj): Linear(in_features=2048, out_features=256, bias=True)
              (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
              (rotary_emb): Qwen2RotaryEmbedding()
            )
            (mlp): Qwen2MLP(
              (gate_proj): Linear(in_features=2048, out_features=11008, bias=False)
              (up_proj): Linear(in_features=2048, out_features=11008, bias=False)
              (down_proj): Linear(in_features=11008, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
            (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
```