Qwen1.5-0.5B-Chat离线部署：内网环境安装实战教程-智慧文博士

Qwen1.5-0.5B-Chat离线部署：内网环境安装实战教程

1. 引言

1.1 场景背景与技术需求

在企业级AI应用中，数据安全与网络隔离是核心要求。许多实际业务场景（如金融、医疗、政务系统）需要在无外网连接的内网环境中运行大模型服务，以确保敏感信息不外泄。然而，主流的大模型通常依赖公网下载权重、GPU加速推理和复杂的依赖管理，难以直接适配内网部署需求。

为解决这一问题，本文聚焦于轻量级开源对话模型Qwen1.5-0.5B-Chat，结合 ModelScope 生态能力，提供一套完整的纯离线、CPU 可用、低资源消耗的本地化部署方案。该方案特别适用于仅有基础服务器资源且不具备 GPU 的内网环境。

1.2 项目价值与学习目标

通过本教程，你将掌握： - 如何在无外网环境下完成 Qwen1.5-0.5B-Chat 模型的本地化部署 - 基于 Conda 的 Python 环境隔离与依赖管理 - 使用 Transformers + Flask 构建轻量 Web 对话界面 - 实现流式响应输出，提升用户体验

最终实现一个可通过浏览器访问的智能对话服务，支持多轮交互，内存占用低于 2GB，适合嵌入到私有系统中作为辅助问答模块。

2. 环境准备与依赖配置

2.1 系统要求与前置条件

本方案适用于以下环境：

项目	要求
操作系统	Linux (CentOS/Ubuntu) 或 Windows WSL
内存	≥4GB（推荐）
存储空间	≥6GB（含模型缓存）
Python 版本	3.8 - 3.10
是否需要 GPU	否（纯 CPU 推理）

注意：由于内网环境无法实时访问 PyPI 或 Hugging Face，所有依赖包需提前在外网机器打包并迁移至目标主机。

2.2 创建独立 Conda 环境

使用 Conda 进行环境隔离，避免污染系统 Python 环境：

# 创建名为 qwen_env 的虚拟环境 conda create -n qwen_env python=3.9 -y # 激活环境 conda activate qwen_env

2.3 离线依赖安装策略

若目标服务器无外网连接，请按以下流程操作：

在可联网机器上导出所需包列表：txt torch==2.1.0 transformers==4.37.0 modelscope==1.13.0 flask==2.3.3 gevent==2.2.0
使用pip download下载.whl文件：bash pip download -r requirements.txt -d ./offline_packages
将offline_packages目录拷贝至内网服务器，并执行：bash pip install --no-index --find-links ./offline_packages -r requirements.txt

3. 模型获取与本地加载

3.1 外网预下载模型（关键步骤）

Qwen1.5-0.5B-Chat 托管于 ModelScope 平台，必须通过modelscopeSDK 下载。建议在外网环境中预先拉取模型：

from modelscope import snapshot_download model_dir = snapshot_download('qwen/Qwen1.5-0.5B-Chat') print(f"模型已保存至: {model_dir}")

该命令会自动下载模型权重、Tokenizer 和配置文件，默认路径为~/.cache/modelscope/hub/qwen/Qwen1.5-0.5B-Chat。

3.2 内网迁移模型文件

将整个Qwen1.5-0.5B-Chat文件夹打包并复制到内网服务器的指定目录，例如：

scp -r ~/.cache/modelscope/hub/qwen user@intranet-server:/opt/models/

设置环境变量以指定本地模型路径：

export MODELSCOPE_CACHE=/opt/models

3.3 验证模型本地加载

编写测试脚本验证是否能成功加载模型：

from modelscope import AutoModelForCausalLM, AutoTokenizer model_path = "/opt/models/qwen/Qwen1.5-0.5B-Chat" tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( model_path, device_map="cpu", trust_remote_code=True ) inputs = tokenizer("你好，请介绍一下你自己", return_tensors="pt") outputs = model.generate(**inputs, max_new_tokens=100) response = tokenizer.decode(outputs[0], skip_special_tokens=True) print(response)

预期输出应包含对模型功能的简要介绍，表明模型已正确加载。

4. Web服务构建与接口开发

4.1 Flask应用结构设计

创建项目目录结构如下：

qwen-chat-deploy/ ├── app.py ├── config.py ├── utils.py └── templates/ └── index.html

4.2 核心代码实现

`app.py`：主服务入口

# -*- coding: utf-8 -*- from flask import Flask, request, jsonify, render_template, Response import json from utils import generate_stream app = Flask(__name__) @app.route('/') def index(): return render_template('index.html') @app.route('/chat', methods=['POST']) def chat(): data = request.json prompt = data.get("prompt", "") history = data.get("history", []) def generate(): for text in generate_stream(prompt, history): yield json.dumps({"text": text}, ensure_ascii=False) + "\n" return Response(generate(), content_type='application/x-ndjson') if __name__ == '__main__': app.run(host='0.0.0.0', port=8080, threaded=True)

`utils.py`：模型推理封装

# -*- coding: utf-8 -*- from modelscope import AutoModelForCausalLM, AutoTokenizer import torch model_path = "/opt/models/qwen/Qwen1.5-0.5B-Chat" tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( model_path, device_map="cpu", torch_dtype=torch.float32, trust_remote_code=True ) def generate_stream(prompt, history): # 构造输入文本 input_text = "" for h in history: input_text += f"用户：{h[0]}\n助手：{h[1]}\n" input_text += f"用户：{prompt}\n助手：" inputs = tokenizer(input_text, return_tensors="pt") streamer = TextIteratorStreamer(tokenizer) generation_kwargs = dict( inputs=inputs.input_ids, streamer=streamer, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9 ) thread = Thread(target=model.generate, kwargs=generation_kwargs) thread.start() for token in streamer: yield token

说明：TextIteratorStreamer来自transformers，用于实现流式输出。需手动导入：
python from transformers import TextIteratorStreamer from threading import Thread

`templates/index.html`：前端交互页面

<!DOCTYPE html> <html> <head> <title>Qwen1.5-0.5B-Chat 本地对话</title> <style> body { font-family: sans-serif; padding: 20px; } #chat { height: 70vh; overflow-y: scroll; border: 1px solid #ccc; padding: 10px; margin-bottom: 10px; } .user { color: blue; margin: 5px 0; } .assistant { color: green; margin: 5px 0; } input, button { padding: 10px; font-size: 16px; } #input-box { width: 70%; } </style> </head> <body> <h2>Qwen1.5-0.5B-Chat 轻量对话系统</h2> <div id="chat"></div> <input type="text" id="input-box" placeholder="请输入你的问题..." /> <button onclick="send()">发送</button> <script> const chatBox = document.getElementById("chat"); let history = []; function send() { const input = document.getElementById("input-box"); const prompt = input.value.trim(); if (!prompt) return; // 显示用户消息 addMessage(prompt, "user"); fetch("/chat", { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ prompt, history }) }) .then(stream => { const reader = stream.body.getReader(); let response = ""; function read() { reader.read().then(({ done, value }) => { if (done) { history.push([prompt, response]); input.value = ""; return; } const lines = new TextDecoder().decode(value).trim().split("\n"); lines.forEach(line => { try { const obj = JSON.parse(line); response += obj.text; updateLastMessage(response); } catch(e) {} }); scrollChat(); read(); }); } read(); }); input.value = ""; } function addMessage(text, role) { const div = document.createElement("div"); div.className = role; div.textContent = text; chatBox.appendChild(div); scrollChat(); } function updateLastMessage(text) { const items = chatBox.children; if (items.length > 0) items[items.length - 1].textContent = text; } function scrollChat() { chatBox.scrollTop = chatBox.scrollHeight; } </script> </body> </html>

5. 服务启动与访问验证

5.1 启动命令

确保当前处于qwen_env环境后，运行主程序：

cd /path/to/qwen-chat-deploy python app.py

正常启动日志如下：

* Running on http://0.0.0.0:8080 INFO:werkzeug:Press CTRL+C to quit

5.2 访问 Web 界面

打开浏览器，访问：

http://<服务器IP>:8080

即可看到聊天界面。输入“你好”等简单指令，观察是否返回合理回复。

提示：首次生成可能耗时较长（约 10-20 秒），后续响应速度将有所提升。

5.3 性能优化建议

启用 float16 推理（若有支持）：可减少显存/内存占用，但需注意 CPU 兼容性。
限制 max_new_tokens：防止生成过长内容导致延迟过高。
增加 swap 分区：当物理内存不足时，适当 swap 可避免 OOM 错误。

6. 总结

6.1 实践成果回顾

本文详细介绍了如何在无外网、无 GPU的内网环境中，成功部署Qwen1.5-0.5B-Chat轻量级对话模型。我们完成了以下关键任务：

利用 ModelScope SDK 预下载模型并迁移至内网
基于 Conda 实现环境隔离与依赖管理
使用 Transformers + Flask 构建支持流式输出的 Web 服务
实现低资源消耗（<2GB RAM）、高可用性的本地 AI 对话能力

该方案具备良好的工程实用性，可用于知识库问答、内部培训助手、自动化客服等场景。

6.2 最佳实践建议

定期更新模型缓存：在外网环境定期同步最新版本模型，保障功能迭代。
加强权限控制：生产环境中建议添加身份认证中间件（如 Nginx + Basic Auth）。
监控资源使用：通过psutil或 Prometheus 记录 CPU/内存占用情况，及时预警。

6.3 后续扩展方向

集成 RAG（检索增强生成）机制，接入企业文档库
支持多模型切换，构建本地模型路由网关
添加对话记录持久化功能，便于审计与分析

获取更多AI镜像
想探索更多AI镜像和应用场景？访问 CSDN星图镜像广场，提供丰富的预置镜像，覆盖大模型推理、图像生成、视频生成、模型微调等多个领域，支持一键部署。

Qwen1.5-0.5B-Chat离线部署：内网环境安装实战教程