Optimizing Small Language Models on a Free T4 GPU

Original article: towardsdatascience.com/optimizing-small-language-models-on-a-free-t4-gpu-008c37700d57

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/89c20ff6b5fa89c36d5f78bb9d4cea28.png

Photo by Donald Wu on Unsplash

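“Small” large language models (LLMs) are rapidly becoming game changers in the field of AI.

Unlike traditional LLMs, which demand substantial computational resources, these models are much smaller and more efficient. Their performance rivals that of much larger models on many tasks, yet they run comfortably on standard devices and can even be deployed at the edge. This also means they are easy to customize and integrate for use on your own datasets.

In this article, I first explain the fundamentals and inner workings of model fine-tuning and alignment. I then walk you through fine-tuning Phi-2, a small LLM with 2.7 billion parameters, using a recent method called Direct Preference Optimization (DPO).

Thanks to the model's small size and to optimization techniques such as quantization and QLoRA, we can run the whole process on a free T4 GPU in Google Colab! Doing so requires adapting the settings and hyperparameters that Hugging Face used to train its Zephyr 7B model.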

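1. Why we need fine-tuning and the mechanics of Direct Preference Optimization (DPO)
1.1. Why do we need to fine-tune LLMs?
1.2. What is DPO, and how does it compare with RLHF?
1.3. Why use DPO?
1.4. How do we implement DPO?
2. Overview of the key components in the DPO process
2.1. Hugging Face Transformer Reinforcement Learning (TRL) library
2.2. Preparing the dataset
2.3. Microsoft's Phi-2 model
3. Step-by-step guide to fine-tuning Phi-2 on a T4 GPU
4. Closing thoughts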

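Why We Need Fine-Tuning and the Mechanics of Direct Preference Optimization

Why do we need to fine-tune LLMs?

Powerful as they are, large language models (LLMs) have limitations, especially when it comes to up-to-date or domain-specific knowledge captured in a company's repositories. To address this, we have two options:

Fine-tuning, which involves further training the model on a specific domain.
Retrieval-augmented generation (RAG), which integrates data from an external database into the LLM prompt, making responses better grounded and more up to date.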
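
To make the RAG option concrete, here is a minimal sketch, assuming a toy in-memory document list and keyword-overlap retrieval. `retrieve` and `build_prompt` are hypothetical helpers written for this illustration; real systems typically use embedding search over a vector database.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by how many query words they share (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents, top_k=1):
    """Prepend the retrieved passages to the user question."""
    context = "\n".join(retrieve(query, documents, top_k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up documents standing in for a company knowledge base
docs = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
]
prompt = build_prompt("What is the refund policy?", docs)
print(prompt)
```

The model's weights stay untouched here; only the prompt changes, which is why RAG is cheaper to set up than fine-tuning.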

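Fine-tuning is more complex and resource-intensive than RAG, but it offers several benefits, such as enhanced data privacy, better task completion and accuracy, and greater control and transparency.

A typical fine-tuning process involves three key steps:

1. Instruction dataset preparation: prepare an instruction dataset tailored to your specific use case.
2. Supervised fine-tuning (SFT): this step teaches the language model to follow instructions by adjusting the weights of the pretrained model on a smaller set of labeled data.
3. Alignment: this step aligns the model with human preferences. Augmenting SFT (supervised fine-tuning) with human (or AI) preferences can yield significant gains in helpfulness and safety.

Selecting the desired responses and behavior from the model's very broad knowledge and abilities is crucial to building AI systems that are safe, performant, and controllable.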

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/79429775c103aa5cd79f25b86a71664b.png

A high-level overview of building an advanced LLM. Figure by the author.

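What is DPO, and how does it compare with RLHF?

Traditional methods for fine-tuning unsupervised LMs, such as RLHF (Reinforcement Learning from Human Feedback), are complex and involve training multiple language models.

Direct Preference Optimization (DPO) is a simpler alignment method that optimizes the same objective as common RLHF methods. It trains the language model directly on human preferences, using a reward that is defined implicitly by the model being trained and a frozen reference model rather than by a separately trained reward model.

DPO vs. RLHF

RLHF typically follows these steps:

1. Manual preference labeling: human annotators (or, in some cases, another LLM) review two answers to the same question and pick the one that better matches their preferences. They typically pick the answer that is more helpful and less toxic.
2. Training a reward model: the preference dataset is used to train a reward model, which assigns higher rewards to the responses people prefer.
3. Reinforcement learning: the LLM is then improved with reinforcement learning so that it generates the answers that earn the highest reward.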
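
The reward-model step is commonly trained with a pairwise (Bradley-Terry style) loss. The sketch below assumes that setup and works on plain scalar scores; `pairwise_reward_loss` is a name chosen for this illustration, not a library function.

```python
import math


def pairwise_reward_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen answer outranks the rejected one.

    Bradley-Terry form: P(chosen > rejected) = sigmoid(score_chosen - score_rejected).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)), written so that exp() never overflows
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the reward model scores the preferred answer higher
print(pairwise_reward_loss(2.0, 0.0))
print(pairwise_reward_loss(0.0, 2.0))
```

A reward model that cannot tell the two answers apart (equal scores) gets a loss of log 2, and the loss falls toward zero as the margin in favor of the chosen answer grows.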

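While effective, this process is complex and resource-intensive.

In contrast, DPO offers a more streamlined approach. It trains the LLM directly on the specified preferences, eliminating the need for a separate reward model. This direct approach simplifies fine-tuning and can be just as effective in some scenarios.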
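
For reference, the objective that DPO minimizes, as given in the DPO paper listed in the references, can be written as follows. Here $x$ is the prompt, $y_w$ the preferred response, $y_l$ the rejected response, $\pi_\theta$ the model being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $\sigma$ the logistic function, and $\beta$ a temperature-like hyperparameter:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Each $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ term acts as an implicit reward, which is why no separate reward model is needed.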

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/0de561f5ebf5bf530af2d53ca29c98c1.png

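Why use DPO?

Simplicity and stability: DPO offers a stable, well-performing way to align LMs with human preferences. It is simpler, more efficient, and computationally lighter than traditional RLHF methods.
Training directly from preferences: unlike RLHF, DPO trains the LLM to satisfy human preferences directly, using a simple binary cross-entropy (classification) loss, which simplifies preference learning.
Comparable performance: in the experiments reported by the DPO authors, DPO performs as well as or better than existing RLHF algorithms, including PPO-based methods, particularly at controlling the sentiment of generated text and improving response quality on certain tasks.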
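
To show how simple that loss is, here is a minimal sketch for a single preference pair, written in plain Python so it runs anywhere. It assumes you already have the summed token log-probabilities of each response under the trained model and under the reference model; `dpo_loss` and its argument names are chosen for this illustration and are not TRL's API.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Binary cross-entropy on the implicit reward margin of one preference pair.

    All four inputs are sequence log-probabilities (sums of token log-probs).
    """
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written so that exp() never overflows
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Before training, the policy equals the reference, so the loss is log(2)
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the chosen response and lowering the rejected one reduces the loss
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

Note that the loss depends only on how the policy has moved relative to the reference model, and `beta` controls how strongly that movement is rewarded.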

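How do we implement DPO?

To implement the DPO algorithm, we can draw inspiration from the way Hugging Face trained Zephyr 7B Beta from Mistral 7B. It is a three-step process:

1. Supervised fine-tuning (SFT) on an instruction dataset generated by other large models.
2. Labeling the data with preferences: a state-of-the-art LLM scores or ranks the outputs of several LLMs.
3. DPO training of the model obtained in step 1 on the data obtained in step 2.

See also: “Fine-tune Llama 2 with DPO” by Kashif Rasul, Younes Belkada, and Leandro von Werra.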
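
Step 2 can be sketched as follows, assuming each prompt comes with several candidate responses that a judge LLM has already scored. The rule used here (highest score becomes `chosen`, a random one of the rest becomes `rejected`) is one reasonable choice and is stated as an assumption; `make_preference_pair` is a hypothetical helper.

```python
import random


def make_preference_pair(prompt, scored_responses, rng=None):
    """Turn one prompt's scored responses into a prompt/chosen/rejected record.

    scored_responses: list of (response_text, score) tuples from the judge LLM.
    """
    if len(scored_responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    rng = rng or random.Random(0)  # fixed seed keeps the example reproducible
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    chosen = ranked[0][0]
    rejected = rng.choice(ranked[1:])[0]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Made-up scores for illustration only
pair = make_preference_pair(
    "Summarize the report in one sentence.",
    [("Draft A", 3), ("Draft B", 9), ("Draft C", 5)],
)
print(pair)
```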

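Overview of the Key Components in the DPO Process

Hugging Face TRL library

The Hugging Face Transformer Reinforcement Learning (TRL) library is a full-stack library that provides a set of tools for training transformer language models with reinforcement learning, from the supervised fine-tuning (SFT) step and the reward modeling (RM) step to the Proximal Policy Optimization (PPO) step.

TRL also provides a DPO trainer (DPOTrainer) for training language models from preference data.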

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/8bb6f5c67fa14f4c29388aafaea1f775.png

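Preparing the dataset

The DPO trainer expects the preference dataset to follow a very specific format.

It should contain three entries:
prompt: the context prompt given to the model at inference time for text generation
chosen: the preferred generated response to the corresponding prompt
rejected: a response that is not preferred, or that should not be sampled for the given prompt
(From the Hugging Face Transformer Reinforcement Learning documentation.)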
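
As an illustration of that format, here is one made-up record together with a small sanity check you could run before training. `validate_preference_record` is a hypothetical helper written for this sketch, not part of TRL.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_record(record):
    """Return a list of problems found in one preference record (empty if fine)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("'chosen' and 'rejected' are identical")
    return problems


# Made-up example record in the prompt/chosen/rejected layout
example = {
    "prompt": "Explain what a stock dividend is.",
    "chosen": "A stock dividend is a payment to shareholders made in additional shares.",
    "rejected": "I don't know.",
}
print(validate_preference_record(example))
```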

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/372562840cd3c67d9f61284a77c48257.png

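The Intel orca_dpo_pairs preference dataset

Microsoft's Phi-2 model

Phi-2, built by Microsoft, is a 2.7-billion-parameter language model. It demonstrates outstanding reasoning and language-understanding capabilities, with state-of-the-art performance among base language models of fewer than 13 billion parameters.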

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/c501a3f984f8c7addf052fcce2c7dd43.png

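Averaged performance on grouped benchmarks compared with popular open-source SLMs. Source.

I chose Phi-2 for our experiment because it is compact yet still performs well. Notably, Phi-2 has not been fine-tuned with RLHF. That makes it an ideal candidate for exploring preference fine-tuning on a free Google Colab T4 instance.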
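
A rough back-of-the-envelope calculation shows why the model size matters on a T4. This sketch counts the weights only and ignores activations, optimizer state, LoRA adapters, and quantization overhead, so treat the numbers as lower bounds; the 16 GB figure is the T4's nominal memory, and Colab exposes slightly less.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PHI2_PARAMS = 2.7e9   # parameter count of Phi-2
T4_MEMORY_GB = 16     # nominal T4 memory

fp16_gb = weight_memory_gb(PHI2_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(PHI2_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights: {fp16_gb:.2f} GB, 4-bit weights: {nf4_gb:.2f} GB")
print(f"Headroom with 4-bit weights: {T4_MEMORY_GB - nf4_gb:.2f} GB")
```

With 4-bit weights, most of the GPU memory is left for activations and the small set of trainable adapter parameters, which is what makes training on a T4 feasible.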

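Step-by-Step Guide to Fine-Tuning Phi-2 on a T4 GPU

The first step, as always, is to train your supervised fine-tuned (SFT) model.

In this guide, we focus on preference tuning Phi-2 on a T4 GPU in Google Colab, so that the model is aligned with human preferences.

We assume that you already have an SFT-trained model and a Google Colab account, and we start directly from the alignment step. For a detailed look at the SFT process and an explanation of optimization techniques such as quantization and QLoRA, be sure to check out the article below.

Unleash Mistral 7B's Power: How to Efficiently Fine-tune an LLM on Your Own Data

Step 1: Install dependencies and import packages

In this step, we set up the environment by installing the necessary dependencies and importing the relevant packages. We install Hugging Face's transformers and peft libraries directly from their GitHub repositories to make sure we are using the latest versions.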

    # Install necessary libraries
    !pip install -q datasets trl bitsandbytes sentencepiece
    !pip install -q -U git+https://github.com/huggingface/transformers.git
    !pip install -q -U git+https://github.com/huggingface/peft.git

    # Importing packages
    import os
    import gc
    import torch
    import transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
    from datasets import load_dataset
    from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
    from trl import DPOTrainer
    import bitsandbytes as bnb

    # Define model names and tokens
    hf_token = "[YOUR_HF_TOKEN]"  # Replace [YOUR_HF_TOKEN] with your Hugging Face token
    peft_model_name = "Ronal999/phi2_finance_SFT"  # The model obtained after the SFT step
    new_model = "phi2_DPO"  # The name of the DPO-trained model

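Step 2: Prepare the preference dataset

For demonstration purposes, we use an existing preference dataset from Hugging Face: the “orca_dpo_pairs” dataset provided by Intel.

Preference dataset: huggingface.co/datasets/Intel/orca_dpo_pairs

We map each dataset entry through the helper function chatml_format, which returns the required dictionary, and drop all the original columns.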

    # Tokenizer setup
    tokenizer = AutoTokenizer.from_pretrained(peft_model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    # Helper function to format the dataset
    def chatml_format(example):
        # Formatting system response
        if len(example['system']) > 0:
            message = {"role": "system", "content": example['system']}
            system = tokenizer.apply_chat_template([message], tokenize=False)
        else:
            system = ""

        # Formatting user instruction
        message = {"role": "user", "content": example['question']}
        prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)

        # Formatting the chosen answer
        chosen = example['chosen'] + "\n"

        # Formatting the rejected answer
        rejected = example['rejected'] + "\n"

        return {
            "prompt": system + prompt,
            "chosen": chosen,
            "rejected": rejected,
        }

    # Loading the dataset
    dataset = load_dataset("Intel/orca_dpo_pairs")['train']

    # Saving original columns for removal
    original_columns = dataset.column_names

    # Applying formatting to the dataset
    dataset = dataset.map(chatml_format, remove_columns=original_columns)

    # Displaying a sample from the dataset
    print(dataset[1])
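
To see what this mapping produces without downloading a tokenizer, here is a pure-Python analogue. It assumes a ChatML-style template (`<|im_start|>role ... <|im_end|>`); the real output depends on the chat template stored with your tokenizer, so treat `chatml_message` and `format_example` as illustrative stand-ins.

```python
def chatml_message(role, content):
    """Render one message in ChatML-style markup (assumed template)."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"


def format_example(example):
    """Build the prompt/chosen/rejected strings for one dataset row."""
    system = chatml_message("system", example["system"]) if example["system"] else ""
    # The trailing assistant header plays the role of add_generation_prompt=True
    prompt = chatml_message("user", example["question"]) + "<|im_start|>assistant\n"
    return {
        "prompt": system + prompt,
        "chosen": example["chosen"] + "\n",
        "rejected": example["rejected"] + "\n",
    }


# Made-up row using the same column names as the dataset
row = {"system": "", "question": "What is 2 + 2?", "chosen": "4", "rejected": "5"}
formatted = format_example(row)
print(formatted["prompt"])
```

The important point is that the prompt ends where the assistant's turn begins, so that `chosen` and `rejected` are scored as two alternative continuations of the same context.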

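Step 3: Train the model with DPO

This step is the actual DPO training of the Phi-2 model. We set up the DPOTrainer with the appropriate configuration and launch the training process.

To do so, we need to initialize the DPOTrainer with the model we want to train and with a reference model (ref_model), which is used to compute the implicit rewards of the chosen and rejected responses.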
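
The role of the reference model can be sketched in a few lines. The implicit reward of a response is `beta` times the difference between its log-probability under the trained model and under the reference model; the batch statistics below are similar in spirit to the reward margins and accuracies that a DPO training run reports. The function names and the numbers are made up for illustration.

```python
def implicit_reward(policy_logprob, ref_logprob, beta=0.1):
    """DPO's implicit reward: beta * log(pi_policy(y|x) / pi_ref(y|x))."""
    return beta * (policy_logprob - ref_logprob)


def reward_stats(batch, beta=0.1):
    """Mean reward margin and accuracy over a batch of preference pairs.

    Each item holds sequence log-probs under the policy and the reference model.
    """
    margins = []
    for item in batch:
        r_chosen = implicit_reward(item["policy_chosen"], item["ref_chosen"], beta)
        r_rejected = implicit_reward(item["policy_rejected"], item["ref_rejected"], beta)
        margins.append(r_chosen - r_rejected)
    accuracy = sum(m > 0 for m in margins) / len(margins)
    return sum(margins) / len(margins), accuracy


# Two made-up pairs: the policy prefers the chosen answer in the first one only
batch = [
    {"policy_chosen": -10.0, "ref_chosen": -12.0, "policy_rejected": -14.0, "ref_rejected": -12.0},
    {"policy_chosen": -9.0, "ref_chosen": -8.0, "policy_rejected": -7.0, "ref_rejected": -8.0},
]
mean_margin, accuracy = reward_stats(batch)
print(mean_margin, accuracy)
```

Because the reward is always measured relative to the reference model, the reference must stay frozen while the policy is trained.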

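Hugging Face suggests three main options for initializing the reference model:

1. Create two instances of the model, each loading your adapter. This works fine but is quite inefficient.
2. Merge the adapter into the base model, create another adapter on top, and leave the ref_model parameter set to None. In that case, DPOTrainer unloads the adapter for reference inference. This is more efficient but can degrade performance.
3. Load the fine-tuned adapter twice. To mitigate the downsides of option 2, at the cost of slightly higher VRAM usage, you can load the fine-tuned adapter into the model twice under different names and set the corresponding model and reference adapter names in DPOTrainer.

In this guide we use option 3, which offers a good balance between performance and compute requirements.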

    # LoRA configuration
    peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=['k_proj', 'v_proj', 'q_proj', 'dense'],
    )

    # Load the base model with BitsAndBytes configuration
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        # Note: the T4 has no native bfloat16 support; torch.float16 is an alternative here
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )
    model = AutoPeftModelForCausalLM.from_pretrained(
        peft_model_name,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        quantization_config=bnb_config,
        is_trainable=True,
    )
    model.config.use_cache = False

    # Load the SFT adapter twice under different names: one to train, one as reference
    model.load_adapter(peft_model_name, adapter_name="training2")
    model.load_adapter(peft_model_name, adapter_name="reference")

    # Initialize Training arguments
    training_args = TrainingArguments(
        per_device_train_batch_size=2,
        max_steps=100,  # we set max_steps to 100 for demo purposes
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        learning_rate=5e-5,
        lr_scheduler_type="cosine",
        save_strategy="no",
        logging_steps=1,
        output_dir=new_model,
        optim="paged_adamw_32bit",
        warmup_steps=5,
        remove_unused_columns=False,
    )

    # Initialize DPO Trainer
    # beta is the hyperparameter of the implicit reward, normally set between 0.1 and 0.5.
    # As beta tends to zero, the reference model is increasingly ignored.
    # Argument names follow the TRL release used when this guide was written;
    # newer releases may expect beta and the length limits in a DPOConfig instead.
    dpo_trainer = DPOTrainer(
        model,
        model_adapter_name="training2",
        ref_adapter_name="reference",
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
        beta=0.1,
        max_prompt_length=512,
        max_length=1024,
    )

    # Start Fine-tuning with DPO
    dpo_trainer.train()
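
To get a feel for what these training arguments imply, the sketch below works out the effective batch size and the learning rate at a given step. It assumes a single GPU and a schedule of linear warmup followed by cosine decay to zero, which is how I read `lr_scheduler_type="cosine"` with `warmup_steps=5`; `lr_at_step` is a hypothetical helper, not the library's scheduler.

```python
import math


def lr_at_step(step, base_lr=5e-5, warmup_steps=5, total_steps=100):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


per_device_batch = 2
grad_accum_steps = 4
max_steps = 100

effective_batch = per_device_batch * grad_accum_steps  # pairs per optimizer step
pairs_seen = effective_batch * max_steps               # pairs used in the demo run

print(effective_batch, pairs_seen)
for step in (0, 5, 50, 100):
    print(step, lr_at_step(step))
```

With these settings, the demo run updates the weights 100 times and sees 800 preference pairs in total, a small fraction of the dataset, so expect to raise `max_steps` for a real run.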

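Step 4: Save and upload the model

After training the Phi-2 model with DPO, the next step is to save the fine-tuned model and upload it for future use or sharing.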

    # Saving the fine-tuned model and tokenizer
    dpo_trainer.model.save_pretrained("final_checkpoint")
    tokenizer.save_pretrained("final_checkpoint")

    # Releasing memory resources
    del dpo_trainer, model
    gc.collect()
    torch.cuda.empty_cache()

    # Loading the base model and tokenizer
    base_model = AutoPeftModelForCausalLM.from_pretrained(
        peft_model_name,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        return_dict=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(peft_model_name)

    # Merging the base model with the adapter and unloading
    model = PeftModel.from_pretrained(base_model, "final_checkpoint")
    model = model.merge_and_unload()

    # Saving the merged model and tokenizer
    model.save_pretrained(new_model)
    tokenizer.save_pretrained(new_model)

    # Pushing the model and tokenizer to Hugging Face Hub
    model.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
    tokenizer.push_to_hub(new_model, use_temp_dir=False, token=hf_token)
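
Conceptually, merging a LoRA adapter folds its low-rank update into the dense weight: W' = W + (lora_alpha / r) · B · A. The toy example below illustrates that arithmetic with plain Python lists; it is a sketch of the idea, not the library's implementation, and `merge_lora` is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, r, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A, the merged dense weight."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter (r=1, lora_alpha=1 gives scaling 1.0)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (2, r)
A = [[0.5, 0.5]]     # shape (r, 2)
merged = merge_lora(W, B, A, r=1, lora_alpha=1)
print(merged)
```

With `r=16` and `lora_alpha=16` as configured in this guide, the scaling factor is also 1.0. After merging, the model no longer needs the peft adapter files at inference time.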

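Step 5: Run inference

Now let's wrap up by testing our new DPO fine-tuned Phi-2 model and running inference. I want to demo it quickly with Gradio so that I can easily share the results with friends and colleagues.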

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/db0ef59eb3335875894090c0ef93c214.png

Demo of the DPO-tuned Phi-2 model using Gradio.

Here is the code snippet for quickly building the chatbot interface.

    # Install Gradio for creating an interface
    !pip install -q gradio

    import gradio as gr
    import torch
    from peft import AutoPeftModelForCausalLM  # this class lives in peft, not transformers
    from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
    from threading import Thread

    # Load the fine-tuned model and tokenizer
    new_model = "Ronal999/phi2_DPO"
    model = AutoPeftModelForCausalLM.from_pretrained(
        new_model,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        load_in_4bit=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(new_model)
    model = model.to('cuda:0')

    # Define stopping criteria
    class StopOnTokens(StoppingCriteria):
        def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
            stop_ids = [29, 0]  # Token IDs to stop the generation
            for stop_id in stop_ids:
                if input_ids[0][-1] == stop_id:
                    return True
            return False

    # Define the prediction function
    def predict(message, history):
        # Transform history into the required format
        history_transformer_format = history + [[message, ""]]
        stop = StopOnTokens()

        # Format messages for the model
        messages = "".join(
            ["".join(["\n<human>:" + item[0], "\n<bot>:" + item[1]])
             for item in history_transformer_format]
        )
        model_inputs = tokenizer([messages], return_tensors="pt").to("cuda")

        # Set up the streamer and generate responses
        streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
        generate_kwargs = dict(
            model_inputs,
            streamer=streamer,
            max_new_tokens=1024,
            do_sample=True,
            top_p=0.95,
            top_k=1000,
            temperature=1.0,
            num_beams=1,
            stopping_criteria=StoppingCriteriaList([stop]),
        )
        t = Thread(target=model.generate, kwargs=generate_kwargs)
        t.start()

        # Yield partial messages as they are generated
        partial_message = ""
        for new_token in streamer:
            if new_token != '<':
                partial_message += new_token
                yield partial_message

    # Launch Gradio Chat Interface
    gr.ChatInterface(predict).queue().launch(debug=True)
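
The prompt-building part of `predict` is easy to check in isolation. The sketch below reproduces the same `<human>` / `<bot>` layout without loading a model; `format_history` is a hypothetical helper extracted for illustration.

```python
def format_history(message, history):
    """Flatten the chat history plus the new message into one prompt string."""
    turns = history + [[message, ""]]
    return "".join(f"\n<human>:{user}\n<bot>:{bot}" for user, bot in turns)


# One earlier exchange followed by a new user message
prompt = format_history("And of Japan?", [["What is the capital of France?", "Paris."]])
print(prompt)
```

The prompt ends with an empty `<bot>:` slot, which is where the model starts generating its reply.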

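Closing thoughts

RLHF is a key building block of state-of-the-art LLMs.

In this article, we explored Direct Preference Optimization (DPO), a powerful alternative to RLHF that greatly simplifies the alignment process and paves the way for safer, more reliable LLMs.

We showed how to use DPO to fine-tune a small language model, specifically Microsoft's Phi-2, on a free T4 GPU. Small LLMs like Phi-2 have huge potential, especially when fine-tuned with efficient methods such as DPO.

In addition, for those interested, tools such as Weights & Biases (wandb) can be very helpful for tracking experiments and evaluating model performance.

As always, you can find my Google Colab notebook here.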

Article by Yanli Liu.

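References

“Direct Preference Optimization: Your Language Model is Secretly a Reward Model” by Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn.
“Fine-tune Llama 2 with DPO” by Kashif Rasul, Younes Belkada, and Leandro von Werra.
“Preference Tuning LLMs with Direct Preference Optimization Methods” by Kashif Rasul, Edward Beeching, Lewis Tunstall, Leandro von Werra, and Omar Sanseviero.
Hugging Face Transformer Reinforcement Learning documentation.
“Supervised Fine-Tuning and Direct Preference Optimization on Intel Gaudi2”.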