Qwen2.5-7B API Wrapper Tutorial: FastAPI Integration and Deployment in Practice

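1. Introduction

1.1 Model Background and Use Cases

Tongyi Qianwen 2.5-7B-Instruct (Qwen2.5-7B-Instruct) is an instruction-tuned language model with roughly 7 billion parameters, released by Alibaba in September 2024 as part of the Qwen2.5 series. It is positioned as a mid-sized, general-purpose, commercially usable open-weight model. With strong results across a range of benchmarks and broad multilingual support, it is a practical choice for small and mid-sized companies and for developers building intelligent applications. It is a text-only model; multimodal tasks are covered by separate vision-language models in the Qwen family.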

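As large models see wider use in customer service, coding assistance, and content generation, exposing a locally deployed model through a standardized API has become a key step in getting it into production. This article covers local deployment and API wrapping of Qwen2.5-7B-Instruct, using the FastAPI framework to build a RESTful service that is quick to integrate. The baseline implementation handles one generation at a time per process; section 4 covers the optimizations that matter for higher concurrency and lower latency in production.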

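1.2 Goals and Prerequisites

This tutorial aims to help you:

Load the Qwen2.5-7B-Instruct model locally
Build a stable, efficient inference endpoint with FastAPI
Implement request validation, asynchronous responses, and streaming output
Package the service in a container and apply basic performance optimizations

Prerequisites:

Python basics (familiarity with async/await)
Experience with FastAPI or a Flask-style web framework
Basic use of the Hugging Face Transformers library
Basic GPU environment setup (CUDA/cuDNN)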

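2. Environment Setup and Model Loading

2.1 Installing Dependencies

First create an isolated virtual environment and install the core dependencies: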

python -m venv qwen-env
source qwen-env/bin/activate        # Linux/Mac
# or: qwen-env\Scripts\activate     # Windows
pip install --upgrade pip
pip install torch==2.3.0+cu121 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.40.0 accelerate==0.28.0 fastapi==0.110.0 uvicorn==0.27.0 pydantic==2.7.0

Note: a CUDA 12.1 build of PyTorch is recommended for inference performance. To run on CPU only, install the CPU-only build instead.
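Before moving on, it can help to confirm that the pinned packages actually landed in the active environment. The following is a minimal sketch using only the standard library; the helper name check_pins and the PINS mapping are illustrative, with the versions copied from the install commands above.

```python
from importlib import metadata

# Versions copied from the pip commands above (illustrative helper, not part of any library)
PINS = {
    "torch": "2.3.0",
    "transformers": "4.40.0",
    "accelerate": "0.28.0",
    "fastapi": "0.110.0",
    "uvicorn": "0.27.0",
    "pydantic": "2.7.0",
}

def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu121" are ignored when comparing
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches

for name, (expected, installed) in check_pins(PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If the loop prints nothing, every pinned package is installed at the expected version.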

2.2 Downloading and Loading the Model

Download Qwen2.5-7B-Instruct from the Hugging Face Hub. The repository is public, so no login or access request is needed:

git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct ./models/qwen2.5-7b-instruct

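Load the model as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_path = "./models/qwen2.5-7b-instruct"

# Qwen2.5 uses the Qwen2 architecture built into transformers (>= 4.37),
# so trust_remote_code is not needed.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Create the generation pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=2048,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)

Tip: trust_remote_code=True is not required here. Qwen2.5 uses the Qwen2 architecture that ships with transformers 4.37 and later; the flag was only needed for the first-generation Qwen checkpoints, which relied on custom modeling code.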

3. FastAPI Interface Design and Implementation

3.1 Request and Response Schemas

We define explicit input and output schemas with Pydantic to make the interface more robust:

from pydantic import BaseModel
from typing import List, Optional

class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    messages: List[Message]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 2048
    stream: Optional[bool] = False

class ChatCompletionResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    model: str = "qwen2.5-7b-instruct"
    choices: List[dict]
    usage: dict
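For reference, a request body that satisfies this schema can be assembled with the standard library alone. This is only a sketch: the URL assumes the service from this tutorial is listening on localhost:8000, and build_chat_request is a hypothetical helper name. The request is constructed here but not sent.

```python
import json
import urllib.request

def build_chat_request(messages, url="http://localhost:8000/v1/chat/completions",
                       temperature=0.7, max_tokens=2048, stream=False):
    """Build (but do not send) a POST request matching ChatCompletionRequest."""
    payload = {
        "messages": messages,          # list of {"role": ..., "content": ...}
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize FastAPI in one sentence."},
])
# Once the server is running: urllib.request.urlopen(req).read()
```

Omitting temperature, max_tokens, or stream from the payload is also valid, since the schema gives each of them a default.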

3.2 Core Inference Endpoint

Next, implement the /v1/chat/completions endpoint, which follows the OpenAI-compatible request and response format:

from fastapi import FastAPI, HTTPException
from datetime import datetime
import uuid
import asyncio

app = FastAPI(title="Qwen2.5-7B-Instruct API", version="1.0")

@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def chat_completions(request: ChatCompletionRequest):
    try:
        # Build the prompt with the tokenizer's own chat template (ChatML for
        # Qwen); hand-written role tags would not match the training format.
        messages = [m.model_dump() for m in request.messages]
        conversation = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

        # Run the blocking generation call in a worker thread so the
        # event loop stays free to accept other requests.
        outputs = await asyncio.to_thread(
            generator,
            conversation,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            return_full_text=False,
        )
        generated_text = outputs[0]["generated_text"]

        # Assemble the response
        prompt_tokens = len(tokenizer.encode(conversation))
        completion_tokens = len(tokenizer.encode(generated_text))
        return {
            "id": str(uuid.uuid4()),
            "created": int(datetime.now().timestamp()),
            "choices": [
                {
                    "index": 0,
                    "message": {"role": "assistant", "content": generated_text},
                    "finish_reason": "stop",
                }
            ],
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
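The prompt passed to the model has to match the format used during training. Qwen2.5's chat template follows the ChatML layout, in which each turn is wrapped in <|im_start|> and <|im_end|> markers. The sketch below reproduces only that basic layout in plain Python to show what the model expects; to_chatml is an illustrative name, and the tokenizer's own apply_chat_template remains the authoritative source, since it also handles details such as a default system prompt and tool calls.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the basic ChatML layout."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
])
print(prompt)
```

In the service itself, calling tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) is the safer route, because the template travels with the model files.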

3.3 Streaming Output (SSE)

To improve the user experience, the service can return text as it is generated using Server-Sent Events (SSE):

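from fastapi.responses import StreamingResponse
from threading import Thread
from transformers import TextIteratorStreamer
import json

def generate_stream(messages, temperature, max_tokens):
    # Build the prompt with the model's chat template
    conversation = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(conversation, return_tensors="pt").to(model.device)

    # Run a single generate() call in a background thread; the streamer yields
    # decoded text as it is produced and ends at EOS or max_new_tokens.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(
        inputs, streamer=streamer, max_new_tokens=max_tokens,
        temperature=temperature, top_p=0.9, do_sample=True,
    )
    Thread(target=model.generate, kwargs=generation_kwargs).start()

    chunk_id = str(uuid.uuid4())  # one id shared by every chunk of this response
    for new_text in streamer:
        if not new_text:
            continue
        chunk = {
            "id": chunk_id,
            "object": "chat.completion.chunk",
            "created": int(datetime.now().timestamp()),
            "model": "qwen2.5-7b-instruct",
            "choices": [{"index": 0, "delta": {"content": new_text}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"

def stream_chat(request: ChatCompletionRequest):
    # StreamingResponse iterates a plain (sync) generator in a thread pool
    return StreamingResponse(
        generate_stream(
            [m.model_dump() for m in request.messages],
            request.temperature,
            request.max_tokens,
        ),
        media_type="text/event-stream",
    )

# FastAPI dispatches to the first handler registered for a path, so a second
# @app.post("/v1/chat/completions") would never run. Instead, branch at the
# top of chat_completions from section 3.2:
#     if request.stream:
#         return stream_chat(request)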
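On the wire, each SSE event is a data: line followed by a blank line, and OpenAI-style streams conventionally end with a data: [DONE] sentinel. The following sketch shows how a client might reassemble the text from such a stream. Both sse_event and collect_stream_text are hypothetical helpers, the chunk layout assumed is the chat.completion.chunk shape used in this section, and the parser tolerates a stream that ends without the sentinel.

```python
import json

def sse_event(payload):
    """Frame one JSON-serializable payload as an SSE event."""
    return f"data: {json.dumps(payload)}\n\n"

def collect_stream_text(lines):
    """Concatenate delta.content from an iterable of SSE lines."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separators and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":              # end-of-stream sentinel, if the server sends one
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        pieces.append(delta.get("content", ""))
    return "".join(pieces)

# Simulated stream: two content chunks, then the sentinel
stream = (
    sse_event({"choices": [{"delta": {"content": "Hel"}, "finish_reason": None}]})
    + sse_event({"choices": [{"delta": {"content": "lo"}, "finish_reason": None}]})
    + "data: [DONE]\n\n"
)
text = collect_stream_text(stream.splitlines())
```

Against a live server, the same function can be fed the decoded lines of the HTTP response body as they arrive.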

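4. Deployment Optimization and Engineering Practice

4.1 Performance Tuning

With 4-bit quantization, Qwen2.5-7B can run on a 12 GB card such as the RTX 3060 (the FP16 weights alone take about 14 GB), but a production deployment still calls for further optimization:

Accelerate inference with vLLM: replacing transformers.pipeline with vLLM's LLM class can raise throughput by a reported 3-5x; the actual gain depends on batch size, sequence lengths, and hardware.
Enable FlashAttention-2: on supported GPUs (Ampere or newer), FA2 reduces attention memory use and speeds up generation, most noticeably with long contexts. In Transformers it is selected by passing attn_implementation="flash_attention_2" when loading the model, and it requires the flash-attn package.
Quantize the weights: 4-bit quantization with AWQ (for vLLM or Transformers) or GGUF (for llama.cpp-based runtimes) cuts the memory requirement from about 14 GB to around 6 GB.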
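The memory figures quoted here follow from simple arithmetic: weight memory is roughly the parameter count times the bytes stored per parameter. The sketch below assumes about 7.6 billion parameters for Qwen2.5-7B and counts weights only; the KV cache, activations, and quantization metadata come on top, which is why a 4-bit deployment lands nearer 6 GB than 4 GB in practice.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

PARAMS = 7.6e9  # assumption: approximate parameter count of Qwen2.5-7B

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(PARAMS, bits):.1f} GiB")
```

This prints about 14.2 GiB for FP16, 7.1 GiB for INT8, and 3.5 GiB for 4-bit weights.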

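Example (vLLM integration):

from vllm import LLM, SamplingParams

llm = LLM(model="./models/qwen2.5-7b-instruct", gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=2048)

# Format the conversation with the model's chat template before generation
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}], tokenize=False, add_generation_prompt=True
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)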

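4.2 Docker Containerization

Write a Dockerfile for one-command deployment:

FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
WORKDIR /app
RUN apt-get update && apt-get install -y python3-pip git-lfs && rm -rf /var/lib/apt/lists/*
RUN pip install torch==2.3.0+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
# Install dependencies before copying the source so this layer stays cached
COPY requirements.txt .
RUN pip install -r requirements.txt
# List ./models in .dockerignore; the weights are mounted at run time
COPY . .
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]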

Build and run (the --gpus flag requires the NVIDIA Container Toolkit on the host):

docker build -t qwen-api .
docker run --gpus all -p 8000:8000 -v "$(pwd)/models:/app/models" qwen-api
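Model loading can take a while after the container starts, so a deployment script may want to wait until the service answers before sending traffic. Here is a minimal sketch, assuming the /health route from section 4.3 is exposed on the published port; wait_until_healthy, http_ok, and the fetch and sleep parameters are hypothetical names, made injectable so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request

def http_ok(url):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_healthy(url, timeout=300.0, interval=5.0, fetch=http_ok, sleep=time.sleep):
    """Poll url until fetch(url) is True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if fetch(url):
            return True
        if time.monotonic() + interval > deadline:
            return False
        sleep(interval)
```

A deploy script could call wait_until_healthy("http://localhost:8000/health") right after docker run and abort if it returns False.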

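4.3 API Security and Rate Limiting

Add basic authentication and rate limiting:

from fastapi import Request
from fastapi.security import HTTPBearer
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Add Depends(security) to a route to require an "Authorization: Bearer <token>"
# header; the token value itself must still be checked by the application.
security = HTTPBearer()
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Return HTTP 429 when a client exceeds its limit
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/health")
@limiter.limit("10/minute")
def health_check(request: Request):  # slowapi requires a `request` parameter
    return {"status": "healthy"}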
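HTTPBearer only extracts the token from the Authorization header; deciding whether that token is acceptable is left to the application. The sketch below shows one way the comparison might look in plain Python. The name is_authorized is hypothetical, and how the expected token is stored (environment variable, secrets manager, and so on) is an assumption left open here.

```python
import secrets

def is_authorized(authorization_header, expected_token):
    """Check an 'Authorization: Bearer <token>' header value against expected_token."""
    if not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest runs in constant time, which avoids leaking the token via timing
    return secrets.compare_digest(token.strip().encode(), expected_token.encode())
```

Inside a FastAPI dependency, the same constant-time comparison would be applied to the credentials value that HTTPBearer returns, raising a 401 HTTPException when it fails.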