Qwen3-8B Private Deployment Guide (Launched with SGLang)

张开发
2026/4/14 14:20:46, 15-minute read

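1. Environment Preparation

    Component         Recommended version   Notes
    Python            3.11.x                The current environment already runs 3.11, so it fits as-is
    CUDA              12.4                  Highest version the server supports; use it directly
    PyTorch (torch)   2.4.1                 Stable release compatible with SGLang 0.4.x
    SGLang            0.4.6.post1           Paired with torch 2.4.1; supports Qwen3-8B
    cuDNN             9.1.0.70              Matches CUDA 12.4 + torch 2.4.1

1.1 Log in to the server and update the base dependencies

    apt update
    apt install -y git git-lfs
    git lfs install

1.2 Install the Python dependencies (SGLang inference framework)

    pip install -U pip
    pip install "sglang[all]==0.4.6.post1" --default-timeout=300

or

    pip install -U sglang --default-timeout=1000

or

    pip install sglang -U --index-url https://pypi.org/simple --default-timeout=300
    pip install torch torchvision torchaudio --upgrade

Only the first command pins the version from the table. The unpinned alternatives install whatever release is newest at the time, so the resulting versions may differ from the table above.

2. Download the Qwen3-8B Model

Create the model directory:

    mkdir -p /hy-tmp/models/Qwen
    cd /hy-tmp/models/Qwen

Download the model:

    # ModelScope: the first choice for users in mainland China.
    # It is hosted in China by Alibaba Cloud, so downloads are usually fast.
    # Model page: https://modelscope.cn/models/Qwen/Qwen3-8B
    # Command-line download (recommended)
    # Install the client library first
    pip install modelscope
    modelscope download --model Qwen/Qwen3-8B --local_dir /hy-tmp/models/Qwen/Qwen3-8B

After the download finishes, the model is located at /hy-tmp/models/Qwen/Qwen3-8B.

3. Start the Model Service (SGLang high-performance inference)

Run in the foreground:

    sglang serve \
      --model-path /hy-tmp/models/Qwen/Qwen3-8B \
      --served-model-name qwen3-8b \
      --context-length 8192 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 8080 \
      --mem-fraction-static 0.85

Or run in the background and write the output to sglang.log:

    nohup sglang serve \
      --model-path /hy-tmp/models/Qwen/Qwen3-8B \
      --served-model-name qwen3-8b \
      --context-length 8192 \
      --trust-remote-code \
      --host 0.0.0.0 \
      --port 8080 \
      --mem-fraction-static 0.85 > sglang.log 2>&1 &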
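The background launch returns immediately while the model is still loading, so it can help to wait until the HTTP endpoint answers before sending requests. The following is a minimal sketch, not part of SGLang: `wait_for_server` is a hypothetical helper name, the timeout and interval defaults are arbitrary, and it assumes the polled URL is reachable without authentication.

```shell
#!/usr/bin/env bash
# Poll an HTTP endpoint until it answers or a timeout expires.
# wait_for_server is a hypothetical helper, not an SGLang command.
#   $1: URL to poll
#   $2: timeout in seconds (default 300, an arbitrary choice)
#   $3: polling interval in seconds (default 5, an arbitrary choice)
wait_for_server() {
  local url="$1"
  local timeout="${2:-300}"
  local interval="${3:-5}"
  local waited=0

  # curl -f turns HTTP errors (4xx/5xx) into a non-zero exit status,
  # so the loop keeps polling until a successful response arrives.
  until curl -sf -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server not ready after ${waited}s; check sglang.log" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "server ready after ${waited}s"
}

# Example (not executed here):
#   wait_for_server http://localhost:8080/v1/models 600 || tail -n 50 sglang.log
```

On failure the example falls back to printing the tail of sglang.log, which is where the nohup launch above redirects the server output.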
Set the API key environment variable, then restart the service so that it takes effect:

    export SGLANG_API_KEY=sk-123456789abcdefghijklmnopqrstuvwxyz

4. Verify That the Service Started

Open the following URL in a browser, or request it with curl:

    curl http://localhost:8080/v1/models

A response like the following means the service is up:

    {"data":[{"id":"qwen3-8b","object":"model","created":...}]}

5. Possible Problems

5.1 Missing system library libnuma

Error:

    root@I2804f44a0803101755:/hy-tmp/models/Qwen# sglang serve --model-path /hy-tmp/models/Qwen/Qwen3-8B --served-model-name qwen3-8b --context-length 8192 --trust-remote-code --host 0.0.0.0 --port 8080 --mem-fraction-static 0.85
    Traceback (most recent call last):
      File "/usr/local/bin/sglang", line 8, in <module>
        sys.exit(main())
                 ^^^^^^
      File "/usr/local/lib/python3.11/dist-packages/sglang/cli/main.py", line 40, in main
        serve(args, extra_argv)
      File "/usr/local/lib/python3.11/dist-packages/sglang/cli/serve.py", line 122, in serve
        server_args = prepare_server_args(dispatch_argv)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server_args.py", line 6539, in prepare_server_args
        return ServerArgs.from_cli_args(raw_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server_args.py", line 5975, in from_cli_args
        return cls(**{attr: getattr(args, attr) for attr in attrs})
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "<string>", line 352, in __init__
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server_args.py", line 778, in __post_init__
        self._handle_piecewise_cuda_graph()
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server_args.py", line 1086, in _handle_piecewise_cuda_graph
        if self.get_model_config().is_piecewise_cuda_graph_disabled_model:
           ^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/server_args.py", line 6021, in get_model_config
        from sglang.srt.configs.model_config import ModelConfig
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/configs/model_config.py", line 27, in <module>
        from sglang.srt.layers.quantization import QUANTIZATION_METHODS
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/quantization/__init__.py", line 19, in <module>
        from sglang.srt.layers.quantization.auto_round import AutoRoundConfig
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/quantization/auto_round.py", line 12, in <module>
        from sglang.srt.layers.quantization.utils import get_scalar_types
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/quantization/utils.py", line 13, in <module>
        from sglang.srt.layers.quantization.fp8_kernel import scaled_fp8_quant
      File "/usr/local/lib/python3.11/dist-packages/sglang/srt/layers/quantization/fp8_kernel.py", line 55, in <module>
        from sgl_kernel import sgl_per_token_quant_fp8
      File "/usr/local/lib/python3.11/dist-packages/sgl_kernel/__init__.py", line 6, in <module>
        common_ops = _load_architecture_specific_ops()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/dist-packages/sgl_kernel/load_utils.py", line 197, in _load_architecture_specific_ops
        raise ImportError(error_msg)
    ImportError:
    [sgl_kernel] CRITICAL: Could not load any common_ops library!

    Attempted locations:
    1. Architecture-specific pattern: /usr/local/lib/python3.11/dist-packages/sgl_kernel/sm100/common_ops.* - found files: ['/usr/local/lib/python3.11/dist-packages/sgl_kernel/sm100/common_ops.abi3.so']
    2. Fallback pattern: /usr/local/lib/python3.11/dist-packages/sgl_kernel/common_ops.* - found files: []
    3. Standard Python import: common_ops - failed

    GPU Info:
    - Compute capability: 75
    - Expected variant: SM75 (precise math for compatibility)
    - CUDA version: 12.8

    Please ensure sgl_kernel is properly installed with:
    pip install --upgrade sglang-kernel

    Error details from previous import attempts:
    - ImportError: libnuma.so.1: cannot open shared object file: No such file or directory
    - ModuleNotFoundError: No module named 'common_ops'
    root@I2804f44a0803101755:/hy-tmp/models/Qwen#

The decisive line is the first entry under "Error details": libnuma.so.1 cannot be opened. The sgl_kernel native library depends on the system NUMA library, which is not installed, so the import fails before the server starts.

Fix:

    apt update
    apt install -y libnuma-dev
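To confirm the fix before restarting the service, one option is to ask the dynamic linker cache whether libnuma.so.1 is now registered. This is a small sketch under assumptions: `has_shared_lib` and `report_libnuma` are hypothetical helper names, and it assumes a glibc-based system where `ldconfig -p` is available on the PATH (as it normally is for root).

```shell
#!/usr/bin/env bash
# has_shared_lib is a hypothetical helper: it succeeds when the dynamic
# linker cache (ldconfig -p) lists a library whose name contains $1.
has_shared_lib() {
  case "$(ldconfig -p 2>/dev/null)" in
    *"$1"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# report_libnuma prints whether libnuma.so.1, the library named in the
# ImportError above, is visible to the dynamic linker.
report_libnuma() {
  if has_shared_lib "libnuma.so.1"; then
    echo "libnuma.so.1 found"
  else
    echo "libnuma.so.1 missing; install it with: apt install -y libnuma-dev"
  fi
}

# Example (not executed here):
#   report_libnuma
```

If the library is reported as found and the same ImportError persists, the cause is likely elsewhere in the "Error details" list rather than in libnuma.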
