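Intelligent Document Parsing and Vectorization with Spring Boot: Tika, MinerU, OCR, and Real-Time Progress over SSE
Use cases: AI knowledge bases, RAG systems, intelligent document management, enterprise search
Tech stack: Spring Boot 3 + Apache Tika + MinerU + PaddleOCR + JodConverter (implied) + SSE + a vector database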
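I. Why intelligent document parsing?
When building an AI application on retrieval-augmented generation (RAG), the first step is usually to turn unstructured documents (PDF, Word, PPT, images, and so on) into searchable text vectors. This involves two key steps:
- Document parsing: extract the raw text, tables, images, and other content;
- Vectorization (embedding): send the text to an embedding model and store the resulting vectors in a vector database.
Different document formats call for different parsing strategies: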
- .docx / .xlsx → Apache Tika
- High-fidelity PDF (with formulas and charts) → MinerU
- Images and scanned PDFs → OCR (for example PaddleOCR)
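The mapping above can be sketched as a small routing function. This is a minimal illustration, not the article's implementation: `ParseEngine`, `EngineRouter`, and the `highFidelity` flag are hypothetical names, and a file extension alone cannot tell a scanned PDF from a text PDF, so the caller has to supply that hint.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical engine names used only in this sketch. */
enum ParseEngine { TIKA, MINERU, OCR }

final class EngineRouter {
    private static final Set<String> IMAGE_EXTENSIONS = Set.of("jpg", "jpeg", "png");

    private EngineRouter() {}

    /** Returns the lower-cased extension, or "" when the name has none. */
    static String extensionOf(String filename) {
        if (filename == null) {
            return "";
        }
        int dot = filename.lastIndexOf('.');
        // No dot, or a trailing dot, means there is no usable extension.
        if (dot < 0 || dot == filename.length() - 1) {
            return "";
        }
        return filename.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Picks an engine from the extension. highFidelity is an assumed caller
     * flag that requests MinerU for PDFs with formulas or charts.
     */
    static ParseEngine route(String filename, boolean highFidelity) {
        String ext = extensionOf(filename);
        if (IMAGE_EXTENSIONS.contains(ext)) {
            return ParseEngine.OCR;
        }
        if (ext.equals("pdf") && highFidelity) {
            return ParseEngine.MINERU;
        }
        return ParseEngine.TIKA;
    }
}
```

Lower-casing the extension makes `scan.PNG` route the same way as `scan.png`, and a name without a dot falls through to Tika instead of being treated as its own extension.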
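This article walks through a document parsing and vectorization service with a single entry point, multiple parsing engines, and real-time progress feedback.
II. Overall architecture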
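```
                  +-------------------------+
                  | DocumentParseController |
                  +------------+------------+
                               |
         +---------------------+---------------------+
         |                     |                     |
  [Tika parsing]       [MinerU parsing]       [OCR on images]
         |                     |                     |
         v                     v                     v
+-----------------+  +-------------------+ +------------------+
| Text extraction |  | Structured output | | Text recognition |
+-----------------+  +-------------------+ +------------------+
         |                     |                     |
         +---------------------+---------------------+
                               |  Vectorization (embedding)
                               v
                  +-------------------------+
                  | Vector database storage |
                  +-------------------------+
```
Two kinds of endpoint are provided:
- Synchronous endpoint: for small files; returns the result directly;
- Asynchronous SSE streaming endpoint: for large files; the front end can show "parsing → vectorizing → done" in real time.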
III. Core code walkthrough
1. Controller layer: one entry point, automatic routing
```java
import io.swagger.v3.oas.annotations.Operation;
import io.swagger.v3.oas.annotations.Parameter;
import io.swagger.v3.oas.annotations.Parameters;
import io.swagger.v3.oas.annotations.tags.Tag;
import lombok.AllArgsConstructor;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;
import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;
// Project classes (AssertUtil, R, Result, the *Vo types, the services) are imported from the project itself.

@RestController
@RequestMapping("${spring.application.name}/document/parse")
@AllArgsConstructor
@Tag(name = "DocumentParseController", description = "Document parsing API")
public class DocumentParseController {

    private final DocumentParseService documentParseService;
    private final EmbeddingFileService embeddingFileService;

    @PostMapping(value = "/tikaParseEmbeddingFileSse", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    @Operation(summary = "Parse a file with Tika and vectorize it (SSE)")
    public SseEmitter tikaParseEmbeddingFileSse(@RequestPart("file") MultipartFile file) {
        // Images (jpg/png/jpeg) go to OCR; everything else goes through Tika
        if (isImage(file)) {
            return embeddingFileService.ocrImage(file);
        }
        return embeddingFileService.tikaParseEmbeddingFileSse(file);
    }

    @PostMapping(value = "/tikaParseEmbeddingFile", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    @Operation(summary = "Parse a file with Tika and vectorize it")
    public Result<ParseResultVo> tikaParseEmbeddingFile(@RequestPart("file") MultipartFile file) {
        ParseResultVo result = new ParseResultVo();
        // Images (jpg/png/jpeg) go to OCR; everything else goes through Tika
        if (isImage(file)) {
            result.setText(documentParseService.ocr(file));
        } else {
            result = embeddingFileService.tikaParseEmbeddingFile(file);
        }
        return R.successWithData(result);
    }

    @PostMapping(value = "/tika", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    @Operation(summary = "Parse a file with Tika")
    public Result<FileInfoVo> tika(@RequestPart("file") MultipartFile file) {
        ParseResultVo result = documentParseService.tikaParse(file);
        return R.successWithData(result.getInfo());
    }

    @PostMapping(value = "/minerU", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    @Operation(summary = "Parse a file with MinerU")
    @Parameters({
            @Parameter(name = "pageStartNo", description = "First page to parse"),
            @Parameter(name = "pageEndNo", description = "Last page to parse")})
    public Result<FileInfoVo> minerU(@RequestPart("file") MultipartFile file,
                                     @RequestParam(value = "pageStartNo", required = false) Integer pageStartNo,
                                     @RequestParam(value = "pageEndNo", required = false) Integer pageEndNo) {
        ParseResultVo result = documentParseService.minerUParse(file, pageStartNo, pageEndNo);
        return R.successWithData(result.getInfo());
    }

    @PostMapping(value = "/ocr", consumes = MediaType.MULTIPART_FORM_DATA_VALUE)
    @Operation(summary = "Extract text from an image")
    public Result<String> ocr(@RequestPart("file") MultipartFile file) {
        return R.successWithData(documentParseService.ocr(file));
    }

    /** Validates the upload and reports whether its suffix is jpg, png, or jpeg (case-insensitive). */
    private static boolean isImage(MultipartFile file) {
        AssertUtil.notNull(file, "Please upload a file");
        String filename = file.getOriginalFilename();
        AssertUtil.notNull(filename, "Please upload a file");
        String fileSuffix = filename.substring(filename.lastIndexOf('.') + 1);
        return "jpg".equalsIgnoreCase(fileSuffix)
                || "png".equalsIgnoreCase(fileSuffix)
                || "jpeg".equalsIgnoreCase(fileSuffix);
    }
}
```
✅ Highlight: the two combined endpoints choose between OCR and Tika from the file suffix, so the front end does not need to know which engine runs. MinerU is exposed through its own /minerU endpoint.
2. DocumentParseService: multi-engine parsing
(1) Apache Tika: general-purpose document parsing
```java
import cn.hutool.core.io.FileUtil;
import cn.hutool.core.util.StrUtil;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

@Override
public ParseResultVo tikaParse(MultipartFile file) {
    if (file == null || file.isEmpty()) {
        throw new ServiceException("File is empty");
    }
    // Per-request temp directory
    String relativeTempPath = ossProperties.getLocal().getPath() + File.separator
            + FileTypeEnum.TEMP.getFolder() + File.separator + IdGeneratorUtil.getSnowflakeNextIdStr();
    try {
        // Save the upload to a temp file
        String tempPath = relativeTempPath + File.separator + file.getOriginalFilename();
        File tempFile = FileUtil.writeFromStream(file.getInputStream(), tempPath);
        return this.tikaParse(tempFile, file);
    } catch (IOException e) {
        log.error(e.getMessage(), e);
        throw new ServiceException(BaseExceptionEnum.SERVER_ERROR);
    } finally {
        FileUtil.del(relativeTempPath);
    }
}

@Override
public ParseResultVo tikaParse(File file, MultipartFile multipartFile) {
    // Reject missing, directory, or empty files
    if (file == null || file.isDirectory() || file.length() <= 0) {
        throw new ServiceException("Please pass a valid file!");
    }
    // Temp directory and file that hold the parse result before upload
    String resultDir = ossProperties.getLocal().getPath() + File.separator
            + FileTypeEnum.TEMP.getFolder() + File.separator + IdGeneratorUtil.getSnowflakeNextIdStr();
    String sourcePath = resultDir + File.separator + multipartFile.getOriginalFilename();
    // Parse the file content with Tika
    try (InputStream inputStream = FileUtil.getInputStream(file)) {
        // Cap the extracted text at 10,000,000 characters
        BodyContentHandler handler = new BodyContentHandler(10000000);
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        parser.parse(inputStream, handler, metadata);
        // Fail when no body text was extracted (checked before metadata is appended)
        if (StrUtil.isBlank(handler.toString())) {
            log.info(metadata.toString());
            throw new ServiceException("The parse result is empty!");
        }
        StringBuilder resultString = new StringBuilder(handler.toString());
        // Append the document's own metadata, skipping Tika's internal X-TIKA entries
        for (String name : metadata.names()) {
            if (name.startsWith("X-TIKA")) {
                continue;
            }
            resultString.append(name).append("\t").append(metadata.get(name)).append("\n");
        }
        // File-level metadata returned alongside the text
        Map<String, Object> fileMetadata = new HashMap<>();
        fileMetadata.put("file_name", file.getName());
        fileMetadata.put("file_size", file.length());
        fileMetadata.put("file_type", FileUtil.getSuffix(file));
        fileMetadata.put("parse_time", LocalDateTime.now().toString());
        fileMetadata.put("document_id", IdGeneratorUtil.getSnowflakeNextIdStr());
        // Write the result to a temp file, then upload it to the file server
        File result = FileUtil.writeUtf8String(resultString.toString(), sourcePath);
        FileInfoVo fileInfo = ossService.save(result, FileTypeEnum.OTHER);
        return new ParseResultVo(resultString.toString(), fileMetadata, fileInfo);
    } catch (ServiceException ex) {
        throw ex;
    } catch (Exception ex) {
        log.error(ex.getMessage(), ex);
        throw new ServiceException(BaseExceptionEnum.SERVER_ERROR);
    } finally {
        // Remove the whole temp directory, not just the result file
        FileUtil.del(resultDir);
    }
}
```
⚠️ Note: Tika has limited support for complex PDFs (scans, formulas); use MinerU in those cases.
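The save-to-temp, parse, delete-in-finally pattern used above can also be packaged as an `AutoCloseable`. The sketch below uses only the JDK; `TempWorkspace` and its methods are hypothetical names, not part of the article's project. It also keeps only the last path segment of the client-supplied file name, on the assumption that names such as `../../x` should never escape the scratch directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** A per-request scratch directory that is deleted recursively on close. */
final class TempWorkspace implements AutoCloseable {
    private final Path dir;

    TempWorkspace(String prefix) {
        try {
            this.dir = Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Path dir() {
        return dir;
    }

    /** Writes content under the workspace, using only the last segment of an untrusted name. */
    Path write(String untrustedName, String content) {
        Path leaf = Path.of(untrustedName).getFileName();
        Path target = leaf == null ? dir : dir.resolve(leaf).normalize();
        // Reject names that resolve to the directory itself or outside it (for example "..").
        if (!target.startsWith(dir) || target.equals(dir)) {
            throw new IllegalArgumentException("Unsafe file name: " + untrustedName);
        }
        try {
            return Files.writeString(target, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        // Reverse order visits children before their parent directories.
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Used with try-with-resources (`try (TempWorkspace ws = new TempWorkspace("parse-")) { ... }`), the cleanup runs even when parsing throws, which is the same guarantee the `finally` blocks above provide.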
(2) MinerU: high-fidelity PDF parsing (formulas and tables)
MinerU is a deep-learning-based PDF parsing tool that outputs structured JSON, Markdown, and images.
```java
import cn.hutool.core.io.FileUtil;
import cn.hutool.core.io.IoUtil;
import cn.hutool.core.io.file.FileNameUtil;
import cn.hutool.core.util.CharsetUtil;
import cn.hutool.core.util.RuntimeUtil;
import cn.hutool.core.util.StrUtil;
import cn.hutool.core.util.ZipUtil;

@Override
public ParseResultVo minerUParse(MultipartFile file, Integer pageStartNo, Integer pageEndNo) {
    if (file == null || file.isEmpty()) {
        throw new ServiceException("File is empty");
    }
    // Per-request temp directory
    String relativeTempPath = ossProperties.getLocal().getPath() + File.separator
            + FileTypeEnum.TEMP.getFolder() + File.separator + IdGeneratorUtil.getSnowflakeNextIdStr();
    try {
        // Save the upload to a temp file
        String tempPath = relativeTempPath + File.separator + file.getOriginalFilename();
        File tempFile = FileUtil.writeFromStream(file.getInputStream(), tempPath);
        return this.minerUParse(tempFile, pageStartNo, pageEndNo);
    } catch (IOException e) {
        log.error(e.getMessage(), e);
        throw new ServiceException(BaseExceptionEnum.SERVER_ERROR);
    } finally {
        // Delete the temp directory
        FileUtil.del(relativeTempPath);
    }
}

@Override
public ParseResultVo minerUParse(File file, Integer pageStartNo, Integer pageEndNo) {
    // Reject missing, directory, or empty files
    if (file == null || file.isDirectory() || file.length() <= 0) {
        throw new ServiceException("Please pass a valid file!");
    }
    // Only PDF is accepted
    String extName = FileUtil.getSuffix(file);
    if (!StrUtil.equalsAnyIgnoreCase(extName, FileExtNameConstant.PDF)) {
        throw new ServiceException("Please pass a PDF file!");
    }
    return this.invokeMinerUParse(file, pageStartNo, pageEndNo);
}

@Override
public ParseResultVo minerUParse(FileInfoVo fileInfo, Integer pageStartNo, Integer pageEndNo) {
    // Only PDF is accepted
    String extName = FileUtil.getSuffix(fileInfo.getFileName());
    if (!StrUtil.equalsAnyIgnoreCase(extName, FileExtNameConstant.PDF)) {
        throw new ServiceException("Please pass a PDF file!");
    }
    String relativeTempPath = ossProperties.getLocal().getPath() + File.separator
            + FileTypeEnum.TEMP.getFolder() + File.separator + IdGeneratorUtil.getSnowflakeNextIdStr();
    try {
        // Download the file from storage and save it locally
        InputStream is = ossService.getInputStream(fileInfo.getFilePath());
        String tempPath = relativeTempPath + File.separator + fileInfo.getFileName();
        File file = FileUtil.writeFromStream(is, tempPath);
        return this.invokeMinerUParse(file, pageStartNo, pageEndNo);
    } finally {
        // Delete the temp directory, including after a failed download
        FileUtil.del(relativeTempPath);
    }
}

/**
 * Invokes MinerU to parse a file.
 *
 * @param file        the PDF file
 * @param pageStartNo first page to parse
 * @param pageEndNo   last page to parse
 * @return the parse result
 */
private ParseResultVo invokeMinerUParse(File file, Integer pageStartNo, Integer pageEndNo) {
    if (file == null || file.isDirectory() || file.length() <= 0) {
        throw new ServiceException("Please pass a valid file!");
    }
    // Build the MinerU command line
    StringJoiner command = new StringJoiner(StrUtil.SPACE);
    command.add(dictProperties.getDict().get("miner-u-command"));
    // Swap the page numbers if the range was given in reverse order
    if (pageStartNo != null && pageEndNo != null && pageStartNo > pageEndNo) {
        int temp = pageStartNo;
        pageStartNo = pageEndNo;
        pageEndNo = temp;
    }
    if (pageStartNo != null && pageStartNo >= 0) {
        command.add("-s").add(pageStartNo.toString());
    }
    if (pageEndNo != null && pageEndNo >= 0) {
        command.add("-e").add(pageEndNo.toString());
    }
    // Input file
    command.add("-p").add(file.getAbsolutePath());
    // Output directory
    String relativeOutputPath = ossProperties.getLocal().getPath() + File.separator
            + FileTypeEnum.TEMP.getFolder() + File.separator + IdGeneratorUtil.getSnowflakeNextIdStr();
    command.add("-o").add(relativeOutputPath);
    log.info("Executing command: {}", command);
    Process process = null;
    try {
        process = RuntimeUtil.exec(command.toString());
        // Reading the output blocks until the process finishes
        String consoleResult = IoUtil.read(process.getInputStream(), CharsetUtil.defaultCharset());
        log.info("Command finished: {}", consoleResult);
        // Package and upload the result
        FileInfoVo fileInfo = this.handlerExecComplete(file.getName(), relativeOutputPath);
        // File-level metadata returned alongside the result
        Map<String, Object> fileMetadata = new HashMap<>();
        fileMetadata.put("file_name", file.getName());
        fileMetadata.put("file_size", file.length());
        fileMetadata.put("file_type", FileUtil.getSuffix(file));
        fileMetadata.put("parse_time", LocalDateTime.now().toString());
        fileMetadata.put("document_id", IdGeneratorUtil.getSnowflakeNextIdStr());
        return new ParseResultVo(null, fileMetadata, fileInfo);
    } finally {
        // Delete the output directory
        FileUtil.del(relativeOutputPath);
        if (process != null) {
            process.destroy();
        }
    }
}

/**
 * Handles a finished MinerU run.
 *
 * @param fileName   the input file name
 * @param outputPath the output directory
 * @return info about the uploaded result archive
 */
private FileInfoVo handlerExecComplete(String fileName, String outputPath) {
    // File name without its suffix
    String name = FileNameUtil.getPrefix(fileName);
    // The run succeeded only if the model output file exists
    String resultFilePath = outputPath + File.separator + name + File.separator + "vlm"
            + File.separator + name + "_model.json";
    if (!FileUtil.isFile(resultFilePath)) {
        log.warn("MinerU parsing failed, result file does not exist: {}", resultFilePath);
        throw new ServiceException("File parsing failed; see the log for the cause!");
    }
    // Zip the result directory
    String outputResult = outputPath + File.separator + name + File.separator + "vlm";
    String zipPath = outputPath + File.separator + name + File.separator
            + IdGeneratorUtil.getSnowflakeNextIdStr() + StrUtil.DOT + FileExtNameConstant.ZIP;
    File zip = ZipUtil.zip(outputResult, zipPath, StandardCharsets.UTF_8, false);
    // Upload to the file server and return the file info
    return ossService.save(zip, FileTypeEnum.OTHER);
}
```
💡 Example output:
{ "pages": [ { "blocks": [...] } ] }, which the front end can render directly.
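The command above is assembled as one space-joined string and the call blocks until MinerU exits. As an alternative sketch using only the JDK, the arguments can be kept as a list (so a path containing spaces stays one argument) and the wait can be bounded. `MinerUCommand`, the single-token `executable` parameter, and the timeout are assumptions for illustration; the flags `-s`, `-e`, `-p`, and `-o` are the ones the article's code passes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class MinerUCommand {
    private MinerUCommand() {}

    /** Builds the argument list; null page numbers are omitted. */
    static List<String> build(String executable, Path input, Path outputDir, Integer start, Integer end) {
        // Swap the page range if it was given in reverse order.
        if (start != null && end != null && start > end) {
            Integer tmp = start;
            start = end;
            end = tmp;
        }
        List<String> args = new ArrayList<>();
        args.add(executable);
        if (start != null && start >= 0) {
            args.add("-s");
            args.add(start.toString());
        }
        if (end != null && end >= 0) {
            args.add("-e");
            args.add(end.toString());
        }
        args.add("-p");
        args.add(input.toString());
        args.add("-o");
        args.add(outputDir.toString());
        return args;
    }

    /** Runs the command and returns its merged stdout/stderr; fails on timeout or non-zero exit. */
    static String run(List<String> args, long timeoutSeconds) throws IOException, InterruptedException {
        // Output goes to a file so that waiting with a timeout is not blocked by a pipe read.
        Path logFile = Files.createTempFile("mineru-", ".log");
        try {
            Process process = new ProcessBuilder(args)
                    .redirectErrorStream(true)
                    .redirectOutput(logFile.toFile())
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("Timed out after " + timeoutSeconds + "s: " + args);
            }
            String output = new String(Files.readAllBytes(logFile), StandardCharsets.UTF_8);
            if (process.exitValue() != 0) {
                throw new IOException("Exit code " + process.exitValue() + ": " + output);
            }
            return output;
        } finally {
            Files.deleteIfExists(logFile);
        }
    }
}
```

If the configured command consists of several tokens (for example a wrapper script with its own arguments), each token would need to be added to the list separately.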
(3) OCR: image text recognition (PaddleOCR)
This uses PaddleOCR as wrapped by mymonstercat/ocr-java:
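```java
import cn.hutool.core.io.FileTypeUtil;
import cn.hutool.core.io.FileUtil;
import cn.hutool.core.util.StrUtil;

@Override
public String ocr(MultipartFile file) {
    if (file == null || file.isEmpty()) {
        throw new ServiceException("File is empty!");
    }
    // Per-request temp directory
    String relativePath = ossProperties.getLocal().getPath() + File.separator
            + FileTypeEnum.TEMP.getFolder() + File.separator + IdGeneratorUtil.getSnowflakeNextIdStr();
    try {
        // Optional image preprocessing to improve recognition
        // InputStream is = ImageUtil.preprocessImage(file.getInputStream());
        // Write the upload to a temp file
        String imgPath = relativePath + File.separator + file.getOriginalFilename();
        File imgFile = FileUtil.writeFromStream(file.getInputStream(), imgPath);
        // Check the real file type; FileTypeUtil returns an extension such as "jpg" or "png"
        String type = FileTypeUtil.getType(imgFile);
        if (!StrUtil.equalsAnyIgnoreCase(type, "jpg", "jpeg", "png")) {
            throw new ServiceException("Please pass an image file!");
        }
        // Get the OCR engine instance
        InferenceEngine engine = InferenceEngine.getInstance(Model.ONNX_PPOCR_V4);
        // Enable text-direction detection
        ParamConfig config = ParamConfig.getDefaultConfig();
        config.setDoAngle(true);
        config.setMostAngle(true);
        // Run OCR
        OcrResult result = engine.runOcr(imgPath, config);
        // Strip all whitespace (this also removes spaces between Latin words)
        return result.getStrRes().replaceAll("\\s", StrUtil.EMPTY);
    } catch (ServiceException e) {
        // Keep validation errors instead of masking them as a server error
        throw e;
    } catch (Exception e) {
        log.error(e.getMessage(), e);
        throw new ServiceException(BaseExceptionEnum.SERVER_ERROR);
    } finally {
        FileUtil.del(relativePath);
    }
}
```
✅ Handles Chinese, English, and digits, including text inside tables; accuracy depends on image quality.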
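Removing every whitespace character suits pure Chinese text but glues English words together ("Spring Boot" becomes "SpringBoot"). A gentler post-processing step, sketched here with a hypothetical `OcrTextCleaner` class, drops whitespace only between two Han characters and collapses the rest into single spaces. It assumes ASCII whitespace; full-width spaces are not handled.

```java
import java.util.regex.Pattern;

final class OcrTextCleaner {
    // Whitespace that sits directly between two Han (CJK ideograph) characters.
    private static final Pattern BETWEEN_HAN =
            Pattern.compile("(?<=\\p{IsHan})\\s+(?=\\p{IsHan})");
    // Any run of whitespace, including line breaks.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private OcrTextCleaner() {}

    static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        // Step 1: join Han characters that OCR split across spaces or lines.
        String joined = BETWEEN_HAN.matcher(raw).replaceAll("");
        // Step 2: collapse the remaining whitespace so Latin words stay separated by one space.
        return WHITESPACE_RUN.matcher(joined).replaceAll(" ").trim();
    }
}
```

Whether line breaks should survive (for example to keep paragraph boundaries for later chunking) is a design choice this sketch does not make; it flattens everything onto one line.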
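3. EmbeddingFileService: vectorization + SSE progress feedback
This service orchestrates the whole pipeline: it parses the file, vectorizes the text, and reports progress.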
```java
import com.baomidou.mybatisplus.core.conditions.query.LambdaQueryWrapper;
import jakarta.annotation.Resource;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;
import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

@Slf4j
@Service
public class EmbeddingFileServiceImpl implements EmbeddingFileService {

    public static final ExecutorService THREAD_POOL = Executors.newFixedThreadPool(10);

    // Latest processing status per emitter id
    private final Map<String, AtomicReference<String>> processingStatus = new ConcurrentHashMap<>();

    @Resource
    private OssService ossService;
    @Resource
    private LLMHandler llmHandler;
    @Resource
    private ModelService modelService;
    @Resource
    private DocumentParseService documentParseService;

    @Override
    public ParseResultVo tikaParseEmbeddingFile(MultipartFile file) {
        ParseResultVo parseResultVo = documentParseService.tikaParse(file);
        FileInfoVo info = parseResultVo.getInfo();
        Map<String, Object> metadata = parseResultVo.getMetadata();
        String documentId = (String) metadata.get("document_id");
        info.setDocumentId(documentId);
        // Vectorize and store, mirroring the asynchronous pipeline below
        llmHandler.vectorizeAndStore(documentId, parseResultVo.getText(), this.getEmbeddingParams());
        return parseResultVo;
    }

    @Override
    public SseEmitter tikaParseEmbeddingFileSse(MultipartFile file) {
        String emitterId = IdGeneratorUtil.getSnowflakeNextIdStr();
        SseEmitter emitter = new SseEmitter(300_000L); // 5-minute timeout (needed for large files)
        // Initialize the status before registering callbacks that read it
        processingStatus.put(emitterId, new AtomicReference<>("STARTED"));
        // SSE lifecycle callbacks
        emitter.onCompletion(() -> processingStatus.remove(emitterId));
        emitter.onTimeout(() -> {
            setStatus(emitterId, "TIMEOUT");
            processingStatus.remove(emitterId);
        });
        // Run the whole pipeline on the thread pool. The upload is read on a worker thread;
        // copy it to a temp file first if your container discards multipart data early.
        THREAD_POOL.execute(() -> processFileAsync(file, emitter, emitterId));
        return emitter;
    }

    @Override
    public SseEmitter ocrImage(MultipartFile file) {
        String emitterId = IdGeneratorUtil.getSnowflakeNextIdStr();
        SseEmitter emitter = new SseEmitter(300_000L);
        emitter.onCompletion(() -> processingStatus.remove(emitterId));
        emitter.onTimeout(() -> processingStatus.remove(emitterId));
        FileInfoVo upload = ossService.upload(file, FileTypeEnum.OTHER);
        String filePath = upload.getFilePath();
        processingStatus.put(emitterId, new AtomicReference<>("STARTED"));
        THREAD_POOL.execute(() -> {
            try {
                updateProgress(emitterId, "OCR_PROCESSING", emitter);
                String text = documentParseService.ocr(file);
                // Stage marker only: the OCR text is not vectorized in this method
                updateProgress(emitterId, "VECTORIZING_IMAGE", emitter);
                updateProgress(emitterId, "COMPLETED", emitter);
                emitter.send(SseEmitter.event()
                        .name("result")
                        .data(Map.of("text", text, "filePath", filePath, "status", "SUCCESS")));
            } catch (Exception e) {
                log.error("Image processing failed", e);
                updateProgress(emitterId, "FAILED", emitter);
                try {
                    emitter.send(SseEmitter.event()
                            .name("error")
                            .data(Map.of("message", "OCR failed: " + e.getMessage())));
                } catch (IOException ignored) {
                }
            } finally {
                try {
                    emitter.complete();
                } catch (Exception ignored) {
                }
                processingStatus.remove(emitterId);
            }
        });
        return emitter;
    }

    private void processFileAsync(MultipartFile file, SseEmitter emitter, String emitterId) {
        try {
            // ========= Stage 1: parse the file =========
            updateProgress(emitterId, "PARSING", emitter);
            ParseResultVo parseResultVo = documentParseService.tikaParse(file);
            FileInfoVo info = parseResultVo.getInfo();
            Map<String, Object> metadata = parseResultVo.getMetadata();
            String documentId = (String) metadata.get("document_id");
            info.setDocumentId(documentId);
            String text = parseResultVo.getText();
            // ========= Stage 2: vectorize =========
            updateProgress(emitterId, "VECTORIZING", emitter);
            AIParams params = this.getEmbeddingParams();
            llmHandler.vectorizeAndStore(documentId, text, params);
            // ========= Stage 3: notify completion =========
            updateProgress(emitterId, "COMPLETED", emitter);
            emitter.send(SseEmitter.event()
                    .name("result")
                    .data(Map.of("documentId", documentId, "fileUrl", info.getFilePath(), "status", "SUCCESS")));
        } catch (Exception e) {
            log.error("File processing failed: {}", file.getOriginalFilename(), e);
            updateProgress(emitterId, "FAILED", emitter);
            try {
                emitter.send(SseEmitter.event()
                        .name("error")
                        .data(Map.of("message", "Processing failed: " + e.getMessage())));
            } catch (IOException ioEx) {
                log.warn("SSE error send failed", ioEx);
            }
        } finally {
            try {
                emitter.complete(); // always close the connection
            } catch (Exception ex) {
                log.warn("SSE complete failed", ex);
            }
            processingStatus.remove(emitterId); // clean up the status entry
        }
    }

    // Null-safe status update: a lifecycle callback may already have removed the entry
    private void setStatus(String emitterId, String status) {
        AtomicReference<String> ref = processingStatus.get(emitterId);
        if (ref != null) {
            ref.set(status);
        }
    }

    // Records the status and sends a progress event, retrying the send up to 3 times
    private void updateProgress(String emitterId, String status, SseEmitter emitter) {
        setStatus(emitterId, status);
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                emitter.send(SseEmitter.event()
                        .name("progress")
                        .data(Map.of("status", status, "timestamp", System.currentTimeMillis())));
                return;
            } catch (IOException | IllegalStateException e) {
                if (attempt == 3) {
                    log.error("Progress update failed", e);
                    setStatus(emitterId, "ERROR");
                    return;
                }
                try {
                    Thread.sleep(100); // brief pause before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    private AIParams getEmbeddingParams() {
        LambdaQueryWrapper<AiModel> query = new LambdaQueryWrapper<>();
        query.eq(AiModel::getStatus, StatusEnum.ENABLE);
        query.eq(AiModel::getModelType, ModelTypeEnum.VSM);
        query.last("limit 1");
        AiModel model = modelService.getOne(query);
        // Check the model itself first so a missing row does not cause a NullPointerException
        AssertUtil.notNull(model, "No enabled embedding model configuration found!");
        AssertUtil.notNull(model.getModelParams(), "No enabled embedding model configuration found!");
        AIParams params = new AIParams();
        params.setApiKey(model.getApiKey());
        params.setBaseUrl(model.getBaseUrl());
        params.setModelName(model.getModelName());
        return params;
    }
}
```
How does the front end listen to SSE?
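```javascript
// EventSource can only issue GET requests, so it cannot upload a file to this
// POST multipart endpoint. Use fetch and read the event stream from the response body.
async function parseWithProgress(file) {
  const form = new FormData();
  form.append('file', file);
  const response = await fetch('/your-app/document/parse/tikaParseEmbeddingFileSse', {
    method: 'POST',
    body: form,
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const events = buffer.split('\n\n'); // events are separated by a blank line
    buffer = events.pop();               // keep the incomplete tail for the next read
    events.forEach(handleEvent);
  }
}

function handleEvent(raw) {
  const name = (raw.match(/^event:\s*(.*)$/m) || [])[1];
  const data = (raw.match(/^data:\s*(.*)$/m) || [])[1];
  if (!name || !data) return;
  const payload = JSON.parse(data);
  if (name === 'progress') console.log('Current status:', payload.status); // e.g. "PARSING", "VECTORIZING"
  else if (name === 'result') console.log('Done:', payload);
  else if (name === 'error') console.error('Failed:', payload.message);
}
```
✅ Better user experience: instead of a bare spinner, the user sees which stage is running (parsing, vectorizing, done). Progress is reported per stage, not per page.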
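What travels over the wire is plain text: each event is an `event:` line, a `data:` line, and a blank line. The Java sketch below encodes and parses that framing for the `progress` and `result` events. `SseFrames` is a hypothetical helper written for illustration; it assumes single-line data payloads and ignores other SSE fields such as `id:` and `retry:`.

```java
import java.util.ArrayList;
import java.util.List;

final class SseFrames {
    /** One parsed event: its name and its single-line data payload. */
    static final class Event {
        final String name;
        final String data;

        Event(String name, String data) {
            this.name = name;
            this.data = data;
        }
    }

    private SseFrames() {}

    /** Encodes one event; data must not contain line breaks in this sketch. */
    static String encode(String name, String data) {
        return "event:" + name + "\n" + "data:" + data + "\n\n";
    }

    /** Parses a buffer that holds zero or more complete events. */
    static List<Event> parse(String stream) {
        List<Event> events = new ArrayList<>();
        // A blank line terminates each event.
        for (String block : stream.split("\n\n")) {
            String name = null;
            String data = null;
            for (String line : block.split("\n")) {
                if (line.startsWith("event:")) {
                    name = line.substring("event:".length()).trim();
                } else if (line.startsWith("data:")) {
                    data = line.substring("data:".length()).trim();
                }
            }
            // Skip blocks that lack either field (for example comment-only keep-alives).
            if (name != null && data != null) {
                events.add(new Event(name, data));
            }
        }
        return events;
    }
}
```

A client reading from a socket would additionally have to buffer partial chunks until the terminating blank line arrives, as the JavaScript reader above does.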
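IV. Deployment and optimization tips
1. Installing dependencies
- Tika: no separate installation; the Java library is enough;
- MinerU: needs a Python environment with the mineru package installed;
- PaddleOCR: Java runs the ONNX models, so the model files must be downloaded;
- LibreOffice (implied): if .doc to PDF conversion is supported, it must be running in the background.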
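2. Performance
- Thread-pool isolation: give OCR, MinerU, and Tika separate thread pools so they do not block each other (the service shown earlier shares one fixed pool of 10 threads);
- Temp-file cleanup: every finally block deletes its temp files;
- OSS storage: parse results are not kept locally; they are uploaded straight to object storage;
- Rate limiting and circuit breaking: keep malicious uploads from bringing the system down.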
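Thread-pool isolation and basic back-pressure can be sketched with the JDK's `ThreadPoolExecutor`: one bounded pool per engine, and a submit method that reports rejection instead of queueing without limit. `EnginePools`, its `Engine` enum, and the sizing parameters are illustrative assumptions, not values from the article.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class EnginePools {
    enum Engine { TIKA, MINERU, OCR }

    private final Map<Engine, ThreadPoolExecutor> pools = new EnumMap<>(Engine.class);

    EnginePools(int threadsPerEngine, int queueCapacity) {
        for (Engine engine : Engine.values()) {
            // Fixed-size pool with a bounded queue; AbortPolicy throws when both are full.
            pools.put(engine, new ThreadPoolExecutor(
                    threadsPerEngine, threadsPerEngine,
                    0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<>(queueCapacity),
                    new ThreadPoolExecutor.AbortPolicy()));
        }
    }

    /** Returns false when the engine's pool and queue are saturated. */
    boolean trySubmit(Engine engine, Runnable task) {
        try {
            pools.get(engine).execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    void shutdown() {
        pools.values().forEach(ThreadPoolExecutor::shutdownNow);
    }
}
```

A controller could translate a `false` return into an HTTP 429 or 503 response; a slow MinerU run then fills only its own queue while Tika and OCR requests keep flowing.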
3. Security hardening
- File-type whitelist (reject .exe and the like);
- File-size limit (for example ≤ 50 MB);
- Access control on MinIO/OSS, to guard against leaked URLs.
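The first two checks can be sketched in plain Java. Because a file extension is trivial to fake, this sketch also compares the first bytes of the upload with well-known file signatures. `UploadValidator` and its allowed set are assumptions chosen for illustration; the 50 MB limit is the example figure from the list.

```java
import java.util.Locale;
import java.util.Set;

final class UploadValidator {
    static final long MAX_BYTES = 50L * 1024 * 1024; // 50 MB example limit
    private static final Set<String> ALLOWED = Set.of("pdf", "docx", "xlsx", "png", "jpg", "jpeg");

    private UploadValidator() {}

    /** Checks extension whitelist, size limit, and leading signature bytes. */
    static boolean isAllowed(String filename, long size, byte[] head) {
        if (filename == null || size <= 0 || size > MAX_BYTES) {
            return false;
        }
        int dot = filename.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (!ALLOWED.contains(ext)) {
            return false;
        }
        switch (ext) {
            case "pdf":
                return startsWith(head, 0x25, 0x50, 0x44, 0x46); // "%PDF"
            case "png":
                return startsWith(head, 0x89, 0x50, 0x4E, 0x47); // 0x89 "PNG"
            case "jpg":
            case "jpeg":
                return startsWith(head, 0xFF, 0xD8, 0xFF);       // JPEG start-of-image marker
            default:
                return startsWith(head, 0x50, 0x4B);             // "PK": docx/xlsx are ZIP archives
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The signature check only confirms the container format: any ZIP file passes as docx here, so deeper validation would still be left to the parser.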
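V. Summary
This article built a document processing service aimed at production use, with the following capabilities:
| Capability | Technology |
|---|---|
| General document parsing | Apache Tika |
| High-fidelity PDF parsing | MinerU |
| Image OCR | PaddleOCR (ONNX) |
| Vectorized storage | LLMHandler + embedding model |
| Real-time progress feedback | SSE (Server-Sent Events) |
| Secure file storage | OSS / MinIO |
🔮 Future extensions:
- Support chunking and metadata injection;
- Integrate LangChain / LlamaIndex;
- Add webhook callback notifications.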
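As a taste of the first item, here is a minimal fixed-size chunker with overlap. It is an assumption about how chunking might be added, not part of the current service: the pipeline above passes the full text to `vectorizeAndStore` in one piece. `TextChunker` splits by character count and ignores sentence boundaries and surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

final class TextChunker {
    private TextChunker() {}

    /** Splits text into windows of at most size characters; neighbours share overlap characters. */
    static List<String> chunk(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need 0 <= overlap < size");
        }
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return chunks;
        }
        int step = size - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            // Stop once a window reaches the end, so no tail-only duplicate is emitted.
            if (end == text.length()) {
                break;
            }
        }
        return chunks;
    }
}
```

Each chunk could then be embedded separately, with the document id and chunk index stored as metadata.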