
Running a qwen 32b model with vLLM on SCNet, but it underperforms when called from Auto-coder

g4f has become unreliable lately, so I set up vLLM on SCNet to run a coder model and let Auto-coder keep earning its keep.

This time I start by trying a qwen 32b model (the qwq-32b-gptq-int8 checkpoint, to be precise).

Conclusion up front: this 32b model doesn't cut it. It simply doesn't come across as very smart.

Starting the vLLM service

First, create an SCNet AI server

Log in to the SCNet website: https://www.scnet.cn/

Choose a DCU asynchronous server and start with a single card.

For the image, pick qwq32b_vllm; that way the vLLM environment is ready-made and there is nothing to set up or debug.

Launching the vLLM service

Once the server is up, enter the container.

First, try the command that ships in the image's Jupyter notebook and start the vLLM service from there:

python app.py # port:7860

The code of the app.py being launched:

import gradio as gr
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the model
tokenizer = AutoTokenizer.from_pretrained("/root/public_data/model/admin/qwq-32b-gptq-int8")
llm = LLM(model="/root/public_data/model/admin/qwq-32b-gptq-int8",
          tensor_parallel_size=1,
          gpu_memory_utilization=0.9,
          max_model_len=32768)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8,
                                 repetition_penalty=1.05, max_tokens=512)

# Define the inference function
def generate_response(prompt):
    # Use the model to generate an answer
    # prompt = "How many r's are in the word \"strawberry\""
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # Generate outputs
    outputs = llm.generate([text], sampling_params)
    # Extract the generated text
    response = outputs[0].outputs[0].text
    return response

# Build the Gradio interface
def create_interface():
    with gr.Blocks() as demo:
        gr.Markdown("# Qwen/QwQ-32B 大模型问答系统")
        with gr.Row():
            input_text = gr.Textbox(label="输入你的问题", placeholder="请输入问题...", lines=3)
            output_text = gr.Textbox(label="模型的回答", lines=5, interactive=False)
        submit_button = gr.Button("提交")
        submit_button.click(fn=generate_response, inputs=input_text, outputs=output_text)
    return demo

# Launch the Gradio app
if __name__ == "__main__":
    demo = create_interface()
    demo.launch(server_name="0.0.0.0", share=True, debug=True)

Note that the model is loaded straight from the shared public directory, so there is no need to download it.

The model finished loading in about 5 minutes, and the service was up by the 8-minute mark:

INFO 12-11 08:20:28 model_runner.py:1041] Starting to load model /root/public_data/model/admin/qwq-32b-gptq-int8...
INFO 12-11 08:20:28 selector.py:121] Using ROCmFlashAttention backend.
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  12% Completed | 1/8 [00:34<03:58, 34.04s/it]
Loading safetensors checkpoint shards:  25% Completed | 2/8 [01:21<04:12, 42.13s/it]
Loading safetensors checkpoint shards:  38% Completed | 3/8 [02:10<03:46, 45.34s/it]
Loading safetensors checkpoint shards:  50% Completed | 4/8 [02:57<03:02, 45.61s/it]
Loading safetensors checkpoint shards:  62% Completed | 5/8 [03:41<02:15, 45.10s/it]
Loading safetensors checkpoint shards:  75% Completed | 6/8 [04:31<01:33, 46.72s/it]
Loading safetensors checkpoint shards:  88% Completed | 7/8 [05:00<00:41, 41.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [05:04<00:00, 29.21s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [05:04<00:00, 38.05s/it]
INFO 12-11 08:25:34 model_runner.py:1052] Loading model weights took 32.8657 GB
INFO 12-11 08:26:58 gpu_executor.py:122] # GPU blocks: 4291, # CPU blocks: 1024
INFO 12-11 08:27:16 model_runner.py:1356] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-11 08:27:16 model_runner.py:1360] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 12-11 08:28:18 model_runner.py:1483] Graph capturing finished in 62 secs.
* Running on local URL:  http://0.0.0.0:7860
* Running on public URL: https://ad18c32dd20881d8aa.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)

Because app.py launches Gradio with share=True, the service is reachable from the public internet right away, namely via this:

* Running on local URL:  http://0.0.0.0:7860
* Running on public URL: https://ad18c32dd20881d8aa.gradio.live

I opened the page in a browser from the public internet

and asked it this question:

Please help me think this through: I want to use a single 64 GB DCU to run an LLM API service, mainly for AI-driven automated programming. Which large model should I launch with vLLM?
Its answer was underwhelming, so I won't paste it here; the reply was all deliberation and no conclusion. I wonder whether the token limit is simply too short?
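
One plausible culprit is visible in app.py above: SamplingParams caps max_tokens at 512, and QwQ is a reasoning model that can burn that whole budget on thinking before it ever states a conclusion. A minimal tweak to try, keeping the rest of app.py unchanged (4096 is an arbitrary larger budget, not a tuned value):

# Give the reasoning model room to finish its answer.
# 4096 is an assumed, untuned value; all other settings match app.py above.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8,
                                 repetition_penalty=1.05, max_tokens=4096)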

Starting the vLLM service directly from the command line

Not ready to give up, I launched the service from the command line so it could be called via API.

Start the service directly with the vllm command (the two extra flags come straight out of the debugging session at the end of this post):

vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.95 --max_model_len 105152

Once it is running, map port 8000 to the outside.

In my case it maps to: https://c-1998910428559491073.ksai.scnet.cn:58043/v1/models

which shows:

{"object":"list","data":[{"id":"/root/public_data/model/admin/qwq-32b-gptq-int8","object":"model","created":1765416800,"owned_by":"vllm","root":"/root/public_data/model/admin/qwq-32b-gptq-int8","parent":null,"max_model_len":105152,"permission":[{"id":"modelperm-17616f8047064f4dac923291dd0ce429","object":"model_permission","created":1765416800,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

So the model name is: /root/public_data/model/admin/qwq-32b-gptq-int8

and the base_url is: https://c-1998910428559491073.ksai.scnet.cn:58043/v1/

The API key can be anything at all, e.g. hello.

Now we can give it a quick test with CherryStudio:

The CherryStudio test passes, which proves the API works!
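
The same check can be scripted, since vLLM exposes an OpenAI-compatible API. A minimal sketch with the openai Python client (assuming pip install openai), reusing exactly the base_url, model name, and key from above:

from openai import OpenAI

# vLLM ignores the key unless the server was started with --api-key.
client = OpenAI(
    base_url="https://c-1998910428559491073.ksai.scnet.cn:58043/v1/",
    api_key="hello",
)

resp = client.chat.completions.create(
    model="/root/public_data/model/admin/qwq-32b-gptq-int8",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)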

Calling it from Auto-coder

Start Auto-coder:

auto-coder.chat

Configure the model:

/models /add_model name=qwq-32b-gptq-int8 model_name="/root/public_data/model/admin/qwq-32b-gptq-int8" base_url=https://c-1998910428559491073.ksai.scnet.cn:58043/v1/ api_key=hello
/conf model:qwq-32b-gptq-int8

Note that sometimes you need the add_provider variant instead:

/models /add_provider name=qwq-32b-gptq-int8 model_name="/root/public_data/model/admin/qwq-32b-gptq-int8" base_url=https://c-1998910428559491073.ksai.scnet.cn:58043/v1/ api_key=hello

Added successfully:

coding@auto-coder.chat:~$ /models /add_model name=qwq-32b-gptq-int8 model_name="/root/public_data/model/admin/qwq-32b-gptq-int8" base_url=https://c-1998910428559491073.ksai.scnet.cn:58043/v1/ api_key=hello
Successfully added custom model: qwq-32b-gptq-int8
coding@auto-coder.chat:~$ /conf model:qwq-32b-gptq-int8
Configuration updated: model = 'qwq-32b-gptq-int8'

No good. It's still clueless; simply not up to the job:

coding@auto-coder.chat:~$ Build me a browser translation extension for Chrome and Edge. It must support translate-on-select and whole-page translation. Implement the translation by calling an AI large model through the OpenAI API, with several common models configurable, plus support for custom OpenAI-compatible models.
────────────────────────── Starting Agentic Edit: autocoderwork ──────────────────────────
╭──────────────────────────────────── Objective ────────────────────────────────────╮
│ User Query:                                                                        │
│ Build me a browser translation extension for Chrome and Edge. It must support      │
│ translate-on-select and whole-page translation. Implement the translation by       │
│ calling an AI large model through the OpenAI API, with several common models       │
│ configurable, plus support for custom OpenAI-compatible models.                    │
╰────────────────────────────────────────────────────────────────────────────────────╯
wsl: Failed to start the systemd user session for 'skywalk'. See journalctl for more details.
Conversation ID: 4cbaf28c-bdce-410e-9f08-d6619efef059
conversation tokens: 19124 (conversation round: 1)
Student: I need help I want to know about the following Please write a story about a girl named Alice who went to the market to buy apples and oranges. She went to the market with her mother to buy apples and oranges. When she arrived at the market, she saw that the apples were expensive and the oranges were cheap. She bought some apples and oranges. She went home and her mother cooked them. She was happy.
</think> </think> </think> </think> </think> </think> </think> </think> </think> </think> </think>

I tried another machine; still no good, it just turned into a broken record:

def main():
    """This function is used to get the main function of this module """
    return self

def __init__(self):
    pass

def main():
    """This function is used to get the main function of this module """
    return self

def __init__(self):
    pass

def main():
    """This function is used to get the main function of this module """
    return self

def __init__(self):
    pass

def main():
    """This function is used to get the main function of this module """
    return self

def __init__(self)^C
──────────────────────────── Agentic Edit Finished ────────────────────────────

So the qwq-32b-gptq-int8 model falls short of what Auto-coder requires.

Or rather, it falls short on raw intelligence, and it also does not support function calling, which disqualifies it on its own.
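
As a side note on the function-calling point, here is a quick probe (my own sketch, not Auto-coder's actual mechanism): send a tools array through the same OpenAI-compatible endpoint and see whether a structured tool call ever comes back. The get_time tool below is purely illustrative, and keep in mind that vLLM itself must be launched with tool-call parsing enabled before this can ever succeed, so an empty result alone does not prove the model is at fault:

from openai import OpenAI

client = OpenAI(
    base_url="https://c-1998910428559491073.ksai.scnet.cn:58043/v1/",
    api_key="hello",
)

# A throwaway tool schema, just to see whether the model emits a tool call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_time",  # hypothetical tool, used only for this probe
        "description": "Return the current time.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

resp = client.chat.completions.create(
    model="/root/public_data/model/admin/qwq-32b-gptq-int8",
    messages=[{"role": "user", "content": "What time is it? Use the tool."}],
    tools=tools,
)
# None means no structured tool call came back.
print(resp.choices[0].message.tool_calls)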

Goal for the next run

Next time, the model I want to run is:

Qwen/Qwen3-Coder-30B-A3B-Instruct

First, find it in SCNet's model plaza.

Then clone it to the console, which lands it at this path: /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct

Launch with vLLM:

vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct
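
One thing worth checking before then: Auto-coder wants function calling, and vLLM only emits structured tool calls when launched with tool-call parsing enabled. A hedged sketch of the launch line (the --enable-auto-tool-choice and --tool-call-parser flags exist in recent vLLM releases, and hermes is the parser Qwen's docs recommend for Qwen models generally; the older vLLM in this image may not have them, and newer releases also ship dedicated parsers for Qwen3-Coder):

vllm serve /public/home/ac7sc1ejvp/SothisAI/model/Aihub/Qwen3-Coder-30B-A3B-Instruct/main/Qwen3-Coder-30B-A3B-Instruct --enable-auto-tool-choice --tool-call-parser hermes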

As for how well it works, stay tuned for the next installment!

Debugging

vllm serve errors out at startup

vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8
Loading safetensors checkpoint shards: 100% Completed | 8/8 [04:47<00:00, 35.90s/it]
INFO 12-11 08:53:55 model_runner.py:1052] Loading model weights took 32.8657 GB
INFO 12-11 08:54:03 gpu_executor.py:122] # GPU blocks: 5753, # CPU blocks: 1024
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 339, in __init__
    self._initialize_kv_caches()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cache
    self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 258, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 493, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (92048). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
I1211 08:54:04.332280 2611 ProcessGroupNCCL.cpp:1126] [PG 0 Rank 0] ProcessGroupNCCL destructor entered.
I1211 08:54:04.332350 2611 ProcessGroupNCCL.cpp:1111] [PG 0 Rank 0] Launching ProcessGroupNCCL abort asynchrounously.
I1211 08:54:04.332547 2611 ProcessGroupNCCL.cpp:1016] [PG 0 Rank 0] future is successfully executed for: ProcessGroup abort
I1211 08:54:04.332578 2611 ProcessGroupNCCL.cpp:1117] [PG 0 Rank 0] ProcessGroupNCCL aborts successfully.
I1211 08:54:04.332683 2611 ProcessGroupNCCL.cpp:1149] [PG 0 Rank 0] ProcessGroupNCCL watchdog thread joined.
I1211 08:54:04.332782 2611 ProcessGroupNCCL.cpp:1153] [PG 0 Rank 0] ProcessGroupNCCL heart beat monitor thread joined.
Traceback (most recent call last):
  File "/opt/conda/bin/vllm", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 165, in main
    args.dispatch_function(args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/scripts.py", line 37, in serve
    uvloop.run(run_server(args))
  File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 538, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/opt/conda/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 192, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start

The key part is these lines:

raise_if_cache_size_invalid(num_gpu_blocks,
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 493, in raise_if_cache_size_invalid
raise ValueError(
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (92048). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
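
Where does the 92048 come from? vLLM allocates its KV cache in fixed-size blocks, 16 tokens each by default, and the log above reports # GPU blocks: 5753. A quick check of the arithmetic:

# KV-cache capacity in tokens = GPU blocks * tokens per block
num_gpu_blocks = 5753   # from the startup log above
block_size = 16         # vLLM's default --block-size
print(num_gpu_blocks * block_size)  # 92048, well short of the model's 131072 max seq len

The same arithmetic accounts for the 105152 figure below: raising gpu_memory_utilization to 0.95 yields 6572 blocks, and 6572 x 16 = 105152.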

So the fix is simply to raise gpu_memory_utilization, i.e. --gpu_memory_utilization 0.95:

vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.95

A bit better this time:

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (105152). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

I could try bumping it to 0.98 next; if that still fails, the alternative is to lower max_model_len, down to 105152 or 101866.

vllm serve /root/public_data/model/admin/qwq-32b-gptq-int8 --gpu_memory_utilization 0.95 --max_model_len 105152

And that did it.
