Qwen2.5-32B-Instruct in Natural Language Processing: Text Classification in Practice

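I've recently been working on a content moderation project that needs to sort user-submitted text into dozens of categories quickly. We started with traditional machine learning methods, but the results were never quite good enough: either the classification was inaccurate, or the model was helpless when new categories appeared. Then someone on the team suggested trying a large language model, and we turned to Qwen2.5-32B-Instruct.

After using it for a while, I found that the model is genuinely capable at text classification. It understands complex semantics and adapts its output format to the instructions we give it. In this post I'll walk through, with real code, how to use Qwen2.5-32B-Instruct for text classification, in the hope that it's a useful reference for anyone working on a similar project.

1. Why choose Qwen2.5-32B-Instruct for text classification?

During technology selection we compared several models. We settled on Qwen2.5-32B-Instruct mainly because of a few characteristics.

The first is strong instruction following. Classification often requires output in a specific format, for example "output JSON with two fields, category and confidence". Qwen2.5-32B-Instruct is stable here: it follows the requested format in almost all cases and rarely changes it on its own initiative.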
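Even with a well-behaved model, it pays to validate the reply on the application side. Below is a minimal sketch of how a flat JSON reply with category and confidence fields might be checked. The function name parse_classification_json and the "Unknown category" fallback are my own illustrative choices, not part of any library, and the sketch assumes the model may wrap the JSON in some extra prose.

```python
import json
import re

def parse_classification_json(response, categories, fallback="Unknown category"):
    """Extract {"category": ..., "confidence": ...} from a model reply.

    Hypothetical helper: returns (category, confidence) with confidence in [0, 1],
    or (fallback, 0.0) when the reply cannot be parsed or names an unknown category.
    """
    # Tolerate surrounding prose by grabbing the first flat {...} span
    match = re.search(r"\{.*?\}", response, flags=re.DOTALL)
    if match is None:
        return fallback, 0.0
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback, 0.0

    category = data.get("category")
    if category not in categories:
        return fallback, 0.0

    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    # Clamp to [0, 1] in case the model reports something out of range
    return category, max(0.0, min(1.0, confidence))
```

Rejecting any category that is not in the allowed list keeps a stray label from leaking into downstream systems.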

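The second is sufficient context length. A 32K-token context is more than enough for most classification scenarios. The texts we process are usually a few hundred to a few thousand characters long, well within its range. When we need to pass background information or classification rules along with the text, there is still plenty of room.

Another important point is multilingual support. Our users come from different countries and submit text in Chinese, English, and other languages. Qwen2.5-32B-Instruct supports more than 29 languages, which covers most of our needs. Handling multiple languages with one model saves a lot of trouble.

Of course, 32B parameters is not small, and the model places real demands on hardware. Cloud computing keeps getting cheaper, though, and renting a suitable GPU instance is within an acceptable budget. Compared with hiring a manual labeling team or maintaining several specialized classifiers, handling everything with one large language model may well be more cost-effective in the long run.

2. Environment setup and model loading

Before starting, set up a Python environment. I recommend Python 3.9 or later for better compatibility.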

    # Install the required libraries
    pip install transformers torch accelerate

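If you have enough GPU memory, you can load the full model directly. Note that at half precision the weights alone take roughly 65 GB (about 32.5 billion parameters at 2 bytes each), so a single 24 GB card is not enough: you need an 80 GB GPU, or several GPUs with device_map="auto" spreading the layers across them. With less memory, use a quantized version (4-bit weights come to roughly 16 GB plus overhead) or fall back to CPU inference, which is much slower.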
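A quick back-of-the-envelope calculation helps when picking hardware. The sketch below estimates memory for the weights only; activations and the KV cache add more on top. The parameter count of about 32.5 billion is taken as an assumption for the estimate, and the helper name is hypothetical.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

QWEN_32B_PARAMS = 32.5e9  # approximate parameter count, an assumption for this estimate

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: ~{estimate_weight_memory_gb(QWEN_32B_PARAMS, bits):.1f} GB")
```

By this estimate, half precision needs about 65 GB for weights alone, 8-bit about half of that, and 4-bit about a quarter, which is why a single 24 GB card only becomes realistic with 4-bit quantization.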

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    def load_model_and_tokenizer(model_name="Qwen/Qwen2.5-32B-Instruct", device_map="auto"):
        """
        Load the Qwen2.5-32B-Instruct model and tokenizer.

        Args:
            model_name: model name; defaults to the instruction-tuned version
            device_map: device placement; "auto" distributes layers automatically
        """
        print(f"Loading model: {model_name}")

        # Load the tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True
        )

        # Load the model
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,  # half precision reduces memory use
            device_map=device_map,
            trust_remote_code=True
        )

        print("Model loaded!")
        return model, tokenizer

    # Usage example
    model, tokenizer = load_model_and_tokenizer()

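If you're short on GPU memory, consider 4-bit or 8-bit quantization (this needs the bitsandbytes package, installed with pip install bitsandbytes):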

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Configure 4-bit quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-32B-Instruct",
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )

After loading the model, run a quick test to make sure everything works:

    def test_model_response():
        """Check that the model responds normally."""
        prompt = "Please introduce yourself in one sentence."

        messages = [
            {"role": "user", "content": prompt}
        ]

        # Format the input with the chat template
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = tokenizer(text, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=100,
                temperature=0.7,
                do_sample=True
            )

        # Decode only the newly generated tokens, not the prompt
        response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
        print(f"Model reply: {response}")
        return response

    # Run the test
    test_model_response()

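If the model replies normally, the environment is set up correctly and you can move on.

3. Basic approaches to text classification

Text classification with a large language model works differently from traditional methods. The traditional route is to label a large dataset and then train a classifier. With Qwen2.5-32B-Instruct, we can instead design a suitable prompt and have the model classify directly.

3.1 Single-label classification

In single-label classification each text belongs to exactly one category. This is the most common scenario.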

    def single_label_classification(text, categories, model, tokenizer):
        """
        Single-label text classification.

        Args:
            text: the text to classify
            categories: list of categories
            model: the loaded model
            tokenizer: the tokenizer
        """
        # Build the classification instruction
        categories_str = "\n".join([f"{i+1}. {cat}" for i, cat in enumerate(categories)])

        system_prompt = "You are a professional text classification assistant. Based on the text, choose the single most suitable category from the given list."

        user_prompt = f"""Please classify the following text:

    Text:
    {text}

    Available categories:
    {categories_str}

    Output only the category number and name, in the format: number. category name
    For example: 3. Tech news

    Your classification:"""

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

        # Format the input
        text_input = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = tokenizer(text_input, return_tensors="pt").to(model.device)

        # Generate the result. Greedy decoding keeps the output consistent;
        # temperature has no effect when sampling is off, so it is not passed.
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False
            )

        response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

        # Parse the result: look for a "number. name" line that mentions a known
        # category, and return the canonical name from the list
        for line in response.strip().split('\n'):
            if '.' in line:
                name_part = line.split('.', 1)[1].lower()
                for cat in categories:
                    if cat.lower() in name_part:
                        return cat

        # If parsing fails, fall back to matching a category anywhere in the reply
        for category in categories:
            if category.lower() in response.lower():
                return category

        return "Unknown category"

    # Usage example
    categories = ["Sports news", "Tech updates", "Financial reports", "Entertainment gossip", "Current affairs"]
    sample_text = "Apple today released its new-generation iPhone, with a more powerful A-series chip and an upgraded camera system."

    result = single_label_classification(sample_text, categories, model, tokenizer)
    print(f"Classification result: {result}")  # should print: Tech updates
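One fragile spot in name-based parsing is that the model may paraphrase or abbreviate a label. Since the prompt numbers the categories, a reply can also be mapped back through its number. The following is a minimal sketch of that idea; parse_numbered_label is a hypothetical helper, and it assumes the reply puts the number at the start of a line.

```python
import re

def parse_numbered_label(response, categories):
    """Map a reply such as '2. Tech updates' to a category via its number.

    Hypothetical helper: returns the category at the 1-based index found at the
    start of any line, or None if no valid number is present.
    """
    for line in response.strip().splitlines():
        # Accept "2.", "2)" or "2、" as the number marker
        match = re.match(r"\s*(\d+)\s*[.、)]", line)
        if match:
            index = int(match.group(1)) - 1
            if 0 <= index < len(categories):
                return categories[index]
    return None
```

Returning None rather than a default label lets the caller decide whether to fall back to name matching or to flag the item for review.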

3.2 Multi-label classification

Some texts belong to several categories at once, which calls for multi-label classification.

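    def multi_label_classification(text, categories, model, tokenizer, threshold=0.5):
        """
        Multi-label text classification.

        Args:
            text: the text to classify
            categories: list of categories
            model: the loaded model
            tokenizer: the tokenizer
            threshold: confidence threshold (estimated from repeated sampling)
        """
        system_prompt = "You are a professional text classification assistant. Analyze the text and decide which categories it may belong to."

        categories_str = "\n".join(categories)

        user_prompt = f"""Analyze the following text and decide which categories it may belong to:

    Text:
    {text}

    Available categories:
    {categories_str}

    Please answer in the following format:
    1. First, analyze the main content of the text
    2. Then, list all relevant categories
    3. Finally, give the final multi-label result in the format: category 1, category 2, category 3

    Begin:"""

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

        text_input = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = tokenizer(text_input, return_tensors="pt").to(model.device)

        # Sample several times to get a more stable result
        all_responses = []
        for _ in range(3):  # 3 samples
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=200,
                    temperature=0.7,
                    do_sample=True
                )

            response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
            all_responses.append(response)

        # Count, per category, how many samples mention it.
        # Note: a mention anywhere in the reply counts, including the analysis part.
        category_counts = {}
        for response in all_responses:
            for category in categories:
                if category.lower() in response.lower():
                    category_counts[category] = category_counts.get(category, 0) + 1

        # Keep the categories whose share of samples reaches the threshold
        selected_categories = []
        for category, count in category_counts.items():
            confidence = count / len(all_responses)
            if confidence >= threshold:
                selected_categories.append(category)

        return selected_categories

    # Usage example
    categories = ["Artificial intelligence", "Machine learning", "Deep learning", "Natural language processing", "Computer vision", "Data science"]
    sample_text = "This article covers recent progress in text classification with Transformer models, including the use of pretrained models such as BERT and GPT."

    results = multi_label_classification(sample_text, categories, model, tokenizer)
    print(f"Multi-label result: {results}")  # possible output: ['Natural language processing', 'Machine learning', 'Deep learning']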
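The function above counts a category whenever its name appears anywhere in a reply, so a label the model mentions only to rule it out still gets a vote. One possible refinement, sketched below, is to vote on the last non-empty line only, where the prompt asks for the final result. The helper name vote_on_final_lines is hypothetical, and the approach assumes the reply was not cut off by max_new_tokens before reaching that line.

```python
from collections import Counter

def vote_on_final_lines(responses, categories, threshold=0.5):
    """Majority-vote multi-label results using only each reply's last non-empty line."""
    counts = Counter()
    for response in responses:
        lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
        if not lines:
            continue
        final_line = lines[-1].lower()
        for category in categories:
            if category.lower() in final_line:
                counts[category] += 1

    total = len(responses)
    if total == 0:
        return []
    # Keep the input order of categories so the output is stable
    return [c for c in categories if counts[c] / total >= threshold]
```

With three samples and the default threshold of 0.5, a category needs to appear in the final line of at least two replies to be kept.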

3.3 Classification with confidence

In practice we often want to know not only the result but also how sure the model is about it.

    import re
    from collections import Counter

    def classification_with_confidence(text, categories, model, tokenizer, num_samples=5):
        """
        Text classification with a confidence score.

        Args:
            text: the text to classify
            categories: list of categories
            model: the loaded model
            tokenizer: the tokenizer
            num_samples: number of samples used to compute the confidence
        """
        system_prompt = "You are a professional text classification assistant. Classify the text and state your confidence."

        categories_str = "\n".join([f"{i+1}. {cat}" for i, cat in enumerate(categories)])

        user_prompt = f"""Please classify the following text:

    Text:
    {text}

    Available categories:
    {categories_str}

    Use the output format: category name (confidence percentage)
    For example: Tech news (85%)

    Your classification:"""

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

        text_input = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = tokenizer(text_input, return_tensors="pt").to(model.device)

        # Sample several times to collect statistics
        predictions = []
        for _ in range(num_samples):
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=50,
                    temperature=0.7,
                    do_sample=True
                )

            response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

            # Parse the reply: extract the category and the self-reported confidence
            for category in categories:
                if category.lower() in response.lower():
                    confidence_match = re.search(r'\((\d+)%\)', response)
                    # Default to 50 when no percentage is found
                    confidence = int(confidence_match.group(1)) if confidence_match else 50
                    predictions.append((category, confidence))
                    break

        if not predictions:
            return "Unknown category", 0

        # Find the most frequently predicted category
        category_counter = Counter(cat for cat, _ in predictions)
        final_category, count = category_counter.most_common(1)[0]
        frequency = count / num_samples  # share of samples that agree

        # Average self-reported confidence (0-100) over the agreeing samples
        avg_confidence = sum(conf for cat, conf in predictions if cat == final_category) / count

        # Weight by the agreement rate; the result stays in the 0-100 range
        return final_category, round(avg_confidence * frequency)

    # Usage example
    categories = ["Positive review", "Negative review", "Neutral review"]
    sample_text = "The product performs very well and completely exceeded my expectations, but the price is a little high."

    category, confidence = classification_with_confidence(sample_text, categories, model, tokenizer)
    print(f"Classification result: {category}, confidence: {confidence}%")
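To see what the final score means, the aggregation step can be isolated as a pure function and checked by hand. This is a sketch with made-up sample values; aggregate_confidence is a hypothetical name. The score is the mean self-reported confidence of the winning category multiplied by the fraction of all samples that voted for it, so it stays between 0 and 100.

```python
from collections import Counter

def aggregate_confidence(predictions, num_samples):
    """Combine sampled (category, reported_percent) pairs into one result.

    Returns (category, score) where score = mean reported confidence of the
    winning category * fraction of all samples that voted for it (0-100).
    """
    if not predictions or num_samples <= 0:
        return "Unknown category", 0
    winner, votes = Counter(cat for cat, _ in predictions).most_common(1)[0]
    mean_reported = sum(conf for cat, conf in predictions if cat == winner) / votes
    agreement = votes / num_samples
    return winner, round(mean_reported * agreement)

# Illustrative values only: 4 of 5 samples agree, reporting 90, 80, 85 and 95
samples = [("Positive review", 90), ("Positive review", 80),
           ("Positive review", 85), ("Positive review", 95),
           ("Neutral review", 60)]
print(aggregate_confidence(samples, num_samples=5))
```

Here the mean reported confidence is 87.5 and the agreement rate is 0.8, so the combined score is 70. A model that is individually confident but inconsistent across samples ends up with a visibly lower score.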

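4. Real-world scenarios and optimization tips

In real projects we ran into all kinds of situations and collected a few optimization tips along the way.

4.1 Handling long text

Qwen2.5-32B-Instruct supports a 32K context, but sometimes the text we need to process is even longer. That calls for a few tricks.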

    from collections import Counter

    def classify_long_text(long_text, categories, model, tokenizer, chunk_size=4000):
        """
        Classify a long text. Always returns (category, consistency).

        Args:
            long_text: the long text
            categories: list of categories
            model: the loaded model
            tokenizer: the tokenizer
            chunk_size: chunk size in characters
        """
        # If the text is short, classify it directly; a single vote is fully consistent
        if len(long_text) <= chunk_size:
            return single_label_classification(long_text, categories, model, tokenizer), 1.0

        # Split the long text into chunks
        chunks = [long_text[i:i + chunk_size] for i in range(0, len(long_text), chunk_size)]

        print(f"Text split into {len(chunks)} chunks")

        # Classify each chunk
        chunk_results = []
        for i, chunk in enumerate(chunks):
            print(f"Processing chunk {i+1}/{len(chunks)}...")
            result = single_label_classification(chunk, categories, model, tokenizer)
            chunk_results.append(result)

        # Aggregate: pick the category that occurs most often
        result_counter = Counter(chunk_results)
        final_result = result_counter.most_common(1)[0][0]

        # Consistency: the share of chunks that agree with the final result
        confidence = result_counter[final_result] / len(chunks)
        return final_result, confidence

    # Usage example
    long_text = """
    This is the long text content... (the real content may run to tens of thousands of characters)
    """

    result, confidence = classify_long_text(long_text, categories, model, tokenizer)
    print(f"Long-text classification result: {result}, consistency: {confidence:.2%}")
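Cutting at fixed character offsets can split a sentence, or a key phrase, across two chunks. A common mitigation is to let neighbouring chunks overlap a little. The sketch below shows one way to do that; chunk_text_with_overlap and the overlap default of 200 characters are assumptions for illustration, and the function still counts characters rather than tokens.

```python
def chunk_text_with_overlap(text, chunk_size=4000, overlap=200):
    """Split text into character chunks where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if len(text) <= chunk_size:
        return [text]

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

The resulting list can be fed to the same per-chunk voting as before. Overlap means slightly more model calls for very long inputs, so it is a trade-off between cost and robustness at chunk boundaries.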

4.2 Few-shot learning

Sometimes we have only a little labeled data. Few-shot learning fits that case.

    def few_shot_classification(text, categories, examples, model, tokenizer):
        """
        Few-shot text classification.

        Args:
            text: the text to classify
            categories: list of categories
            examples: list of examples, each a (text, category) tuple
            model: the loaded model
            tokenizer: the tokenizer
        """
        system_prompt = "You are a text classification expert. Use the given examples to classify the new text."

        # Build the examples section
        examples_text = "Examples:\n"
        for i, (example_text, example_category) in enumerate(examples, 1):
            examples_text += f"\nExample {i}:\nText: {example_text}\nCategory: {example_category}\n"

        categories_str = "\n".join([f"- {cat}" for cat in categories])

        user_prompt = f"""{examples_text}

    Based on the examples above, classify the following new text:

    New text:
    {text}

    Available categories:
    {categories_str}

    Output only the category name."""

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]

        text_input = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = tokenizer(text_input, return_tensors="pt").to(model.device)

        # Greedy decoding; temperature has no effect when sampling is off
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=20,
                do_sample=False
            )

        response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

        # Check whether the reply contains a known category
        for category in categories:
            if category.lower() in response.lower():
                return category

        return "Unknown category"

    # Usage example
    categories = ["Product inquiry", "Technical support", "Complaints and suggestions", "After-sales service"]

    examples = [
        ("My phone won't turn on. What should I do?", "Technical support"),
        ("When will your products go on sale?", "Product inquiry"),
        ("The customer service attitude was terrible, I want to file a complaint", "Complaints and suggestions"),
        ("The problem came back after the last repair", "After-sales service")
    ]

    new_text = "I'd like to know the specs and price of your latest laptop"

    result = few_shot_classification(new_text, categories, examples, model, tokenizer)
    print(f"Few-shot classification result: {result}")
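An alternative way to present examples to a chat model is as earlier turns of the conversation: each example text becomes a user message and its label the assistant's reply. Whether this beats packing everything into one prompt depends on the task, so treat the sketch below as something to experiment with. The helper build_few_shot_messages is hypothetical; its output can be passed to tokenizer.apply_chat_template in place of the two-message list used above.

```python
def build_few_shot_messages(text, categories, examples):
    """Build a chat message list where each example is a user/assistant turn pair."""
    category_list = ", ".join(categories)
    system_prompt = (
        "You are a text classification expert. "
        f"Reply with exactly one of these categories: {category_list}."
    )
    messages = [{"role": "system", "content": system_prompt}]
    for example_text, example_category in examples:
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": example_category})
    # The new text goes last, so the model answers in the same pattern
    messages.append({"role": "user", "content": text})
    return messages
```

Because every assistant turn in the history is a bare category name, the model has a strong cue to answer the final message the same way.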

4.3 Batch processing

In production we often need to process large volumes of text, and batching can improve throughput significantly.

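    def batch_classification(texts, categories, model, tokenizer, batch_size=4):
        """
        Batch text classification.

        Args:
            texts: list of texts
            categories: list of categories
            model: the loaded model
            tokenizer: the tokenizer
            batch_size: batch size
        """
        system_prompt = "You are an efficient text classification assistant. Classify the text quickly and accurately."

        categories_str = ", ".join(categories)

        # Decoder-only models need left padding for batched generation,
        # so that every prompt ends right where generation begins
        tokenizer.padding_side = "left"

        results = []

        # Process in batches
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            print(f"Processing batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")

            # Build the prompts for this batch
            batch_prompts = []
            for text in batch_texts:
                user_prompt = f"""Text: {text}

    Choose the most suitable category from the following: {categories_str}

    Output only the category name."""

                messages = [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ]

                text_input = tokenizer.apply_chat_template(
                    messages,
                    tokenize=False,
                    add_generation_prompt=True
                )
                batch_prompts.append(text_input)

            # Encode the batch (no truncation, which could cut off the generation prompt)
            batch_inputs = tokenizer(
                batch_prompts,
                padding=True,
                return_tensors="pt"
            ).to(model.device)

            # Generate for the whole batch with greedy decoding
            with torch.no_grad():
                batch_outputs = model.generate(
                    **batch_inputs,
                    max_new_tokens=20,
                    do_sample=False
                )

            # With left padding, every prompt occupies the same number of positions
            prompt_len = batch_inputs.input_ids.shape[1]

            # Decode the results
            for j in range(len(batch_texts)):
                output_ids = batch_outputs[j][prompt_len:]
                response = tokenizer.decode(output_ids, skip_special_tokens=True)

                # Match a category
                predicted_category = "Unknown category"
                for category in categories:
                    if category.lower() in response.lower():
                        predicted_category = category
                        break

                results.append(predicted_category)

        return results

    # Usage example
    texts_to_classify = [
        "The stock market surged today, with tech stocks doing especially well",
        "Last night's football match was thrilling, with both sides evenly matched",
        "A new smartphone has been released with a much improved camera",
        "With international tensions rising, world leaders are holding emergency talks"
    ]

    categories = ["Finance", "Sports", "Technology", "Current affairs"]

    batch_results = batch_classification(texts_to_classify, categories, model, tokenizer)

    for text, result in zip(texts_to_classify, batch_results):
        print(f"Text: {text[:30]}... -> Category: {result}")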
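When texts in a batch differ a lot in length, the short ones are padded up to the longest, and that padding is wasted compute. A simple mitigation is to group texts of similar length and restore the original order afterwards. The sketch below uses character length as a cheap proxy for token count; both helper names are hypothetical.

```python
def length_sorted_batches(texts, batch_size):
    """Yield (original_indices, batch_texts) with similar-length texts grouped together."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Sort indices by text length; sorted() is stable, so ties keep input order
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        yield indices, [texts[i] for i in indices]

def restore_order(batches_with_results, total):
    """Put per-batch results back into the original input order."""
    results = [None] * total
    for indices, batch_results in batches_with_results:
        for i, result in zip(indices, batch_results):
            results[i] = result
    return results
```

In use, each batch from length_sorted_batches would go through the tokenizer and model.generate as in the function above, and the (indices, results) pairs would be collected and passed to restore_order.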

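5. Performance optimization and deployment advice

When deploying, we need to balance performance and cost. Here are some practical suggestions.

5.1 Accelerating inference with vLLM

vLLM is a high-performance inference engine that can raise throughput significantly.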

    # Install vLLM
    # pip install vllm

    from vllm import LLM, SamplingParams

    def classify_with_vllm(texts, categories, model_name="Qwen/Qwen2.5-32B-Instruct"):
        """
        Batch classification with vLLM.
        """
        # Initialize vLLM
        llm = LLM(
            model=model_name,
            tensor_parallel_size=1,  # adjust to the number of GPUs
            gpu_memory_utilization=0.9,
            max_model_len=32768
        )

        # Format prompts with the model's own chat template,
        # using the tokenizer that vLLM has already loaded
        tokenizer = llm.get_tokenizer()

        # Build the prompts
        prompts = []
        for text in texts:
            system_prompt = "You are a text classification assistant."
            user_prompt = f"""Text: {text}

    Choose the most suitable category from the following: {', '.join(categories)}

    Output only the category name."""

            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]

            prompt = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )
            prompts.append(prompt)

        # Set the sampling parameters
        sampling_params = SamplingParams(
            temperature=0.1,
            max_tokens=20,
            stop=["\n"]  # stop at the first newline
        )

        # Batch inference
        outputs = llm.generate(prompts, sampling_params)

        # Parse the results
        results = []
        for output in outputs:
            response = output.outputs[0].text.strip()

            predicted_category = "Unknown category"
            for category in categories:
                if category.lower() in response.lower():
                    predicted_category = category
                    break

            results.append(predicted_category)

        return results

5.2 Quantized deployment to reduce resource use

If resources are limited, consider a quantized model.

    def load_quantized_model():
        """
        Load a quantized version of the model.
        """
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

        # Configure 8-bit quantization
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False
        )

        model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen2.5-32B-Instruct",
            quantization_config=bnb_config,
            device_map="auto",
            trust_remote_code=True
        )

        tokenizer = AutoTokenizer.from_pretrained(
            "Qwen/Qwen2.5-32B-Instruct",
            trust_remote_code=True
        )

        return model, tokenizer

    # Alternatively, use a quantized model in GGUF format
    def load_gguf_model():
        """
        Load a quantized model in GGUF format (requires llama-cpp-python).
        """
        from llama_cpp import Llama

        # Download the GGUF model file first;
        # it is available from Hugging Face or official channels
        llm = Llama(
            model_path="Qwen2.5-32B-Instruct-Q4_K_M.gguf",
            n_ctx=32768,  # context length
            n_gpu_layers=-1,  # put all layers on the GPU
            verbose=False
        )

        return llm

5.3 Caching

For repeated queries, a cache can improve response time.

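    import hashlib
    from collections import OrderedDict

    class CachedClassifier:
        def __init__(self, model, tokenizer, categories, maxsize=1000):
            self.model = model
            self.tokenizer = tokenizer
            self.categories = categories
            self.maxsize = maxsize
            # Per-instance LRU cache: key -> result, least recently used first.
            # (functools.lru_cache on a method would hold a reference to self
            # and would leave _get_cache_key unused.)
            self._cache = OrderedDict()

        def _get_cache_key(self, text, method):
            """Build the cache key."""
            content = f"{text}_{method}_{'_'.join(self.categories)}"
            return hashlib.md5(content.encode()).hexdigest()

        def classify_cached(self, text, method="single"):
            """
            Classification with caching.
            """
            key = self._get_cache_key(text, method)
            if key in self._cache:
                self._cache.move_to_end(key)  # mark as recently used
                return self._cache[key]

            if method == "single":
                result = single_label_classification(text, self.categories, self.model, self.tokenizer)
            elif method == "multi":
                result = multi_label_classification(text, self.categories, self.model, self.tokenizer)
            else:
                raise ValueError(f"Unknown classification method: {method}")

            self._cache[key] = result
            if len(self._cache) > self.maxsize:
                self._cache.popitem(last=False)  # evict the least recently used entry
            return result

        def batch_classify_cached(self, texts, method="single"):
            """Batch classification that makes use of the cache."""
            results = []
            for text in texts:
                result = self.classify_cached(text, method)
                results.append(result)
            return results

    # Usage example
    classifier = CachedClassifier(model, tokenizer, categories)

    # The first call runs actual inference
    result1 = classifier.classify_cached("The weather is lovely today")
    print(f"First result: {result1}")

    # The same text is then served from the cache
    result2 = classifier.classify_cached("The weather is lovely today")
    print(f"Second result (from cache): {result2}")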
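A cache keyed on the exact string misses near-duplicates: the same sentence with different casing, a trailing space, or full-width characters produces a different key. If such variants should share a result in your application, the text can be normalized before hashing. The sketch below is one possible way to do it; normalized_cache_key is a hypothetical helper, and whether case folding is safe depends on the task, since casing occasionally carries meaning.

```python
import hashlib
import unicodedata

def normalized_cache_key(text, method, categories):
    """Cache key that ignores case, Unicode width variants and extra whitespace."""
    # NFKC folds full-width and compatibility characters into their plain forms
    normalized = unicodedata.normalize("NFKC", text).casefold()
    # Collapse runs of whitespace and trim the ends
    normalized = " ".join(normalized.split())
    # Newline is a safe separator here: none remains in the text after collapsing
    content = "\n".join([normalized, method, *categories])
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
```

The category list stays part of the key in its original order, so changing the label set naturally invalidates old entries.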

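6. Summary

After using Qwen2.5-32B-Instruct for text classification for this long, my impression is that it really is a powerful tool. Its biggest advantage over traditional classification methods is flexibility: no large labeled dataset, no dedicated model to train, just a well-designed prompt, and it can handle all sorts of complex classification tasks.

A few points turned out to be especially important in practice. The first is prompt design: be as clear and explicit as possible, and tell the model exactly what output format you want. The second is that for important classification tasks it's best to add a confidence estimate, so we know which results are reliable and which may need human review.

On performance, if the volume is large, I recommend an inference engine such as vLLM, which is much faster. If resources are limited, a quantized version is also a good choice: there is some loss of accuracy, but it is perfectly adequate in many scenarios.

Of course, it is not a cure-all. For highly specialized domains, or where accuracy requirements are extremely high, you may still need to combine it with traditional machine learning methods or fine-tune it specifically. But for rapid prototyping or for diverse text classification needs, Qwen2.5-32B-Instruct is definitely worth trying.

If you're new to using large language models for text classification, I suggest starting with simple single-label classification to get familiar with the basic workflow and prompt design. Once you have the basics, move on to more advanced techniques such as multi-label classification and few-shot learning. In a real project, choose the strategy that fits your needs and balance quality, speed, and cost to get the most value.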
