End-to-End Python Inference with the cv_resnet18_ocr-detection Model - 智慧文博士

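1. Introduction

1.1 Background: OCR Text Detection

Optical character recognition (OCR) is central to modern information processing and is widely used for document digitization, receipt and invoice recognition, license plate recognition, and ID scanning. Text detection is the first stage of an OCR pipeline: it locates every region of an image that contains text, so the accuracy of the whole system depends on it.

Deep-learning-based text detection has advanced considerably in recent years. Segmentation-based detectors, with DBNet as a representative example, handle irregular, slanted, and curved text particularly well. cv_resnet18_ocr-detection is an efficient OCR text detection model built on this kind of architecture, offering high accuracy and fast inference.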

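1.2 Model Overview

cv_resnet18_ocr-detection is an OCR text detection model built and optimized by 科哥. Its main characteristics are:

Backbone: a lightweight ResNet-18 extracts image features, balancing accuracy and speed
Detection head: based on the DBNet architecture, using Differentiable Binarization to improve detection of small and blurry text
Output format: text box coordinates, confidence scores, and exported visualization results
Deployment-friendly: ships with a WebUI and ONNX export for integration across platforms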
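To make the Differentiable Binarization idea more concrete, here is a minimal sketch of the approximate binarization function from the DBNet paper, B = 1 / (1 + exp(-k(P - T))), where P is the predicted text probability at a pixel and T is the learned threshold at that pixel. The amplification factor k = 50 is the value used in the paper; whether this particular model uses the same value is an assumption, and the function name is illustrative.

```python
import math


def differentiable_binarization(prob, thresh, k=50.0):
    """Approximate step function from the DBNet paper for a single pixel.

    prob   -- predicted text probability P in [0, 1]
    thresh -- learned per-pixel threshold T in [0, 1]
    k      -- amplification factor (50 in the paper; assumed here)
    """
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))


# A pixel well above its threshold saturates toward 1, one well below toward 0.
above = differentiable_binarization(0.9, 0.3)
below = differentiable_binarization(0.1, 0.3)
```

Because the function is smooth, the threshold map can be learned jointly with the probability map during training, which a hard threshold would not allow.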

This article walks through calling the model from Python for end-to-end inference, covering environment setup, code, parameter tuning, and result parsing.

2. Environment Setup and Model Loading

2.1 Runtime Environment

Make sure the following dependencies are installed so the model can run:

    # Core dependencies
    pip install torch torchvision opencv-python numpy onnxruntime flask gradio

    # Optional: GPU-accelerated ONNX inference
    # (install this in place of onnxruntime rather than alongside it)
    pip install onnxruntime-gpu

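⚠️ Note: If you use the official image "cv_resnet18_ocr-detection OCR文字检测模型构建by科哥", the environment above is preinstalled in the container and needs no manual setup.

2.2 Starting the Service and Accessing the Interface

Change into the project directory and start the WebUI service: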

    cd /root/cv_resnet18_ocr-detection
    bash start_app.sh

Once the service has started, the terminal prints:

    ============================================================
    WebUI 服务地址: http://0.0.0.0:7860
    ============================================================

You can now open http://<server IP>:7860 in a browser to use the interactive interface, or send HTTP requests from Python for automated calls.
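For automated calls it can help to confirm that the service is accepting connections before sending requests. The following is a minimal sketch using only the standard library; the helper name wait_for_port and its defaults are illustrative and not part of the project.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Retries every `interval` seconds and returns False if no connection
    could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, a script could call wait_for_port("localhost", 7860) and only proceed to the API calls when it returns True. A successful TCP connection shows that the port is open; it does not prove that the model has finished loading.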

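3. Calling the API from Python

3.1 Single-Image Detection

API description

The WebUI exposes a RESTful API that accepts an uploaded image and returns the detection results. The request parameters are:

Parameter	Type	Description
`image`	file	Image file to run detection on
`threshold`	float	Detection threshold (0.0~1.0), default 0.2

The response contains:

List of text contents
Detection box coordinates (JSON format)
Path to the visualization image
Inference time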

Complete example

    import requests
    from PIL import Image
    import matplotlib.pyplot as plt

    # Service address
    url = "http://localhost:7860/api/predict"

    # Form fields sent along with the image
    data = {'threshold': 0.25}  # custom detection threshold

    # Send the POST request; the with-block closes the file afterwards
    with open('/path/to/test_image.jpg', 'rb') as f:
        response = requests.post(url, files={'image': f}, data=data, timeout=60)

    if response.status_code == 200:
        result = response.json()

        # Parse the result
        texts = result['texts']          # text contents
        boxes = result['boxes']          # detection box coordinates
        scores = result['scores']        # confidence scores
        vis_path = result['vis_path']    # path to the visualization image
        infer_time = result['inference_time']

        print(f"✅ Inference finished in {infer_time:.2f}s")
        print("📝 Recognized text:")
        for i, (text, score) in enumerate(zip(texts, scores), start=1):
            print(f"{i}. {text[0]} (confidence: {score:.2f})")

        # Show the visualization. vis_path is returned by the service, so
        # opening it assumes this script can read the server's filesystem.
        vis_img = Image.open(vis_path)
        plt.figure(figsize=(10, 8))
        plt.imshow(vis_img)
        plt.axis('off')
        plt.title("Detection Result")
        plt.show()
    else:
        print("❌ Request failed:", response.text)

✅ Best practices:
Keep images within 1536×1536 to avoid running out of memory
For blurry images, lower threshold to 0.1~0.2 to improve recall
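The size recommendation above can be applied before uploading. As a minimal sketch, the hypothetical helper below computes a target size that keeps the aspect ratio and caps the longer side at 1536 pixels; the actual resize would then be done with whatever imaging library is already in use (for example Pillow or OpenCV).

```python
def fit_within(width, height, max_side=1536):
    """Return (new_width, new_height) with the longer side capped at max_side.

    Images that already fit are returned unchanged; larger images are
    scaled down proportionally.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height


# A 4000x3000 photo is scaled down; an 800x600 image is left alone.
large = fit_within(4000, 3000)
small = fit_within(800, 600)
```

Downscaling shrinks small text as well, so for images with tiny characters it may be worth checking detection quality after resizing.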

3.2 Batch Detection

To process multiple images, call the single-image endpoint once per image:

    import os
    from concurrent.futures import ThreadPoolExecutor

    import requests

    API_URL = "http://localhost:7860/api/predict"

    def process_single_image(image_path):
        """Send one image to the detection API and summarize the response."""
        filename = os.path.basename(image_path)
        try:
            with open(image_path, 'rb') as f:
                response = requests.post(API_URL, files={'image': f},
                                         data={'threshold': 0.2}, timeout=60)
            if response.status_code == 200:
                result = response.json()
                return {
                    'filename': filename,
                    'num_texts': len(result['texts']),
                    'inference_time': result['inference_time'],
                }
            return {'filename': filename, 'error': response.text}
        except Exception as e:
            return {'filename': filename, 'error': str(e)}

    def batch_detect(image_dir, max_workers=4):
        """Run detection on every JPEG/PNG image in image_dir."""
        image_paths = [os.path.join(image_dir, f)
                       for f in os.listdir(image_dir)
                       if f.lower().endswith(('.jpg', '.png', '.jpeg'))]

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            results = list(executor.map(process_single_image, image_paths))

        # Summary statistics
        success_count = sum(1 for r in results if 'error' not in r)
        total_time = sum(r.get('inference_time', 0) for r in results)

        print(f"✅ Processed {success_count}/{len(results)} images successfully")
        if success_count:  # avoid dividing by zero when every request failed
            print(f"⏱️ Total inference time: {total_time:.2f}s, "
                  f"average per image: {total_time / success_count:.2f}s")
        return results

    # Example call
    batch_detect("/path/to/images/")

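💡 Performance tips:
Use a thread pool to send requests concurrently and raise throughput
Cap max_workers to avoid resource contention (4~8 recommended)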

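4. Local Inference with the ONNX Model (No Service Required)

Besides calling the WebUI API, you can export the model to ONNX and run inference locally, which suits production deployments.

4.1 Exporting the ONNX Model

In the WebUI, switch to the "ONNX 导出" (ONNX Export) tab, set the input size (for example 800×800), and click the "导出 ONNX" (Export ONNX) button. When the export finishes, the model file is located at: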

outputs/onnx/model_800x800.onnx

4.2 Loading the ONNX Model and Running Inference

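    import onnxruntime as ort
    import cv2
    import numpy as np

    # Load the ONNX model; prefer the GPU and fall back to the CPU
    session = ort.InferenceSession(
        "outputs/onnx/model_800x800.onnx",
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
    )

    def preprocess(image_path, target_size=(800, 800)):
        """Resize with preserved aspect ratio, pad bottom/right, return an NCHW blob."""
        image = cv2.imread(image_path)
        if image is None:
            raise FileNotFoundError(f"Cannot read image: {image_path}")
        h, w = image.shape[:2]
        scale = min(target_size[0] / h, target_size[1] / w)
        new_h, new_w = int(h * scale), int(w * scale)
        resized = cv2.resize(image, (new_w, new_h))

        pad_h = target_size[0] - new_h
        pad_w = target_size[1] - new_w
        padded = cv2.copyMakeBorder(resized, 0, pad_h, 0, pad_w,
                                    cv2.BORDER_CONSTANT, value=[0, 0, 0])

        # Normalize to [0, 1] and convert HWC -> NCHW
        input_blob = padded.astype(np.float32) / 255.0
        input_blob = input_blob.transpose(2, 0, 1)[np.newaxis, ...]
        return input_blob, (scale, new_h, new_w)

    # Run inference; read the input name from the model instead of hard-coding it
    input_data, meta = preprocess("/path/to/test.jpg")
    input_name = session.get_inputs()[0].name
    outputs = session.run(None, {input_name: input_data})

    # Parse the outputs (adjust to the model's actual output structure)
    prob_map = np.squeeze(outputs[0])  # assumed: first output is the probability map, (H, W) after squeeze
    threshold_map = np.squeeze(outputs[1]) if len(outputs) > 1 else None  # assumed: second output, if present, is the threshold map

    # Post-processing: simplified version of the DB box extraction
    _, binary = cv2.threshold(prob_map, 0.3, 1, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # Map boxes back to the original image; assumes the probability map has
    # the same resolution as the padded network input
    scale = meta[0]
    boxes = []
    for cnt in contours:
        rect = cv2.minAreaRect(cnt)
        box = cv2.boxPoints(rect) / scale
        boxes.append(box.round().astype(int).tolist())

    print(f"🔍 Detected {len(boxes)} text regions")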

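🔍 Notes:
Preprocess ONNX model inputs the same way as during training
For post-processing, reuse the original project's db_postprocess.py module to keep results consistent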
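One step the simplified post-processing above leaves out is box expansion. DBNet is trained on shrunken text regions, so the DB paper expands each detected polygon by an offset D' = A' × r' / L', where A' is the polygon area, L' its perimeter, and r' the unclip ratio (1.5 in the paper). The sketch below only computes that offset in pure Python; the function names are illustrative, and whether the project's db_postprocess.py uses the same ratio is an assumption. Applying the offset to a polygon is usually delegated to a clipping library.

```python
import math


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths of a closed polygon."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def unclip_distance(points, unclip_ratio=1.5):
    """Expansion offset D' = A' * r' / L' from the DB paper (r' = 1.5 there)."""
    perimeter = polygon_perimeter(points)
    if perimeter == 0:
        return 0.0
    return polygon_area(points) * unclip_ratio / perimeter


# A 100x20 text box: area 2000, perimeter 240, so the offset is 12.5 pixels.
offset = unclip_distance([(0, 0), (100, 0), (100, 20), (0, 20)])
```

Skipping this step tends to produce boxes that hug the text too tightly and clip the edges of characters.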

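5. Result Analysis and Tuning

5.1 Output Fields

Each inference call returns the following core fields:

Field	Type	Description
`texts`	List[List[str]]	Recognized text contents, in order
`boxes`	List[List[int]]	Four-point coordinates `[x1,y1,x2,y2,x3,y3,x4,y4]`
`scores`	List[float]	Confidence score of each text box
`success`	bool	Whether the call succeeded
`inference_time`	float	Inference time (seconds)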

Sample JSON output:

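    {
      "texts": [["欢迎使用OCR服务"], ["科哥出品"]],
      "boxes": [[102, 320, 450, 320, 450, 360, 102, 360]],
      "scores": [0.96, 0.93],
      "success": true,
      "inference_time": 1.24
    }

In this sample, `boxes` lists only one box for the two text entries; a real response is expected to contain one box per entry in `texts` and `scores`.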
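Each entry in `boxes` is a flat list of eight numbers, which is awkward to draw or crop with directly. As a minimal sketch, the helpers below (names are illustrative) turn one flat box into four (x, y) corner points and into an axis-aligned bounding rectangle.

```python
def flat_box_to_points(flat_box):
    """Convert [x1, y1, x2, y2, x3, y3, x4, y4] into four (x, y) tuples."""
    if len(flat_box) != 8:
        raise ValueError(f"expected 8 coordinates, got {len(flat_box)}")
    return [(flat_box[i], flat_box[i + 1]) for i in range(0, 8, 2)]


def axis_aligned_bounds(flat_box):
    """Return (left, top, right, bottom) enclosing the four corner points."""
    points = flat_box_to_points(flat_box)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# The box from the sample response above.
sample_box = [102, 320, 450, 320, 450, 360, 102, 360]
corners = flat_box_to_points(sample_box)
bounds = axis_aligned_bounds(sample_box)
```

The bounding rectangle is convenient for cropping a region out of the image; for rotated text the four corner points preserve the orientation that the rectangle loses.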

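5.2 Detection Threshold Tuning Guide

Scenario	Recommended threshold	Notes
Clear documents / printed text	0.3 ~ 0.4	Fewer false positives, higher precision
Blurry screenshots / low resolution	0.1 ~ 0.2	Higher recall, tolerates more noise
Complex backgrounds (billboards, etc.)	0.35 ~ 0.5	Suppresses activation in non-text regions
Handwriting	0.1 ~ 0.15	Strokes are poorly connected, so a looser threshold is needed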
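The table above can be encoded as a small lookup so that calling code picks a starting threshold per scenario. This is only a sketch: the scenario keys are made-up identifiers, and using the midpoint of each range is an arbitrary starting point rather than a recommendation from the model's author.

```python
# Recommended threshold ranges from the tuning table (keys are illustrative).
THRESHOLD_RANGES = {
    "clear_document": (0.3, 0.4),
    "blurry_screenshot": (0.1, 0.2),
    "complex_background": (0.35, 0.5),
    "handwriting": (0.1, 0.15),
}


def suggest_threshold(scenario):
    """Return the midpoint of the recommended range for a scenario."""
    if scenario not in THRESHOLD_RANGES:
        known = ", ".join(sorted(THRESHOLD_RANGES))
        raise KeyError(f"unknown scenario {scenario!r}; expected one of: {known}")
    low, high = THRESHOLD_RANGES[scenario]
    return round((low + high) / 2, 3)


# Starting value to pass as the `threshold` request parameter.
handwriting_threshold = suggest_threshold("handwriting")
```

From that starting value, raise the threshold if too many non-text regions are detected and lower it if text is being missed.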