Creating a vLLM Inference Service on EGS and Running a Performance Test

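This tutorial shows how to run a vLLM large-model inference service, using an ECS virtual machine with a GPU (EGS) for the demonstration. Reference:
Alibaba Cloud - Quickly build a model inference environment with the vLLM image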

Install the driver and pull the vLLM image

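First, we need to pull the vLLM image, which bundles vLLM, CUDA, PyTorch, and related dependencies. Alibaba Cloud EGS provides this image: log in to the Container Registry console, then click Artifact Center in the left navigation pane.
In the repository name search box, search for vllm or egs and locate egs/vllm. The image address is: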

egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/vllm:[tag]

Alibaba Cloud also provides other AI images, such as ac2/vllm. For details, see the Alibaba Cloud AI Containers image list.

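To use the vLLM container image directly on a GPU instance, the Tesla driver must already be installed on that instance, at version 535 or later. When purchasing the GPU instance, it is recommended to select the option that installs the GPU driver at the same time.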
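
The driver requirement can be checked from a script before pulling the image. The following is a minimal sketch, assuming nvidia-smi is on the PATH of the instance; the helper names (driver_major, driver_is_new_enough, installed_driver_versions) are hypothetical and only used for illustration.

```python
import subprocess

MIN_DRIVER_MAJOR = 535  # minimum driver version stated in the text


def driver_major(version: str) -> int:
    """Return the major component of a driver version string such as '535.161.08'."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True when the driver's major version is at least `minimum`."""
    return driver_major(version) >= minimum


def installed_driver_versions() -> list[str]:
    """Ask nvidia-smi for the driver version; it prints one line per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On the GPU instance, all(driver_is_new_enough(v) for v in installed_driver_versions()) should be True before you continue. Only the major version is compared, which is enough for the "535 or later" rule.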

Next, install Docker. You can install it directly with linuxmirror.

Then install nvidia-container-toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Enable Docker to start at boot and restart the Docker service

sudo systemctl enable docker
sudo systemctl restart docker

Pull the vLLM image

# docker pull <vLLM image address>
docker pull egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/vllm:0.9.0.1-pytorch2.7-cu128-20250612
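
The image tag appears to encode the software versions baked into the image. The sketch below splits the tag used above into its parts. The layout (vLLM version, PyTorch version, CUDA build, build date) is an assumption inferred from this single tag and not a documented format, and parse_tag is a hypothetical helper.

```python
import re

# Assumed layout: <vllm version>-pytorch<version>-cu<digits>-<YYYYMMDD>
TAG_PATTERN = re.compile(
    r"^(?P<vllm>[\d.]+)-pytorch(?P<pytorch>[\d.]+)-cu(?P<cuda>\d+)-(?P<date>\d{8})$"
)


def parse_tag(tag: str) -> dict:
    """Split an image tag into named components, or raise if it does not fit."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"tag does not follow the assumed layout: {tag!r}")
    return match.groupdict()


info = parse_tag("0.9.0.1-pytorch2.7-cu128-20250612")
print(info)
# {'vllm': '0.9.0.1', 'pytorch': '2.7', 'cuda': '128', 'date': '20250612'}
```

If the layout assumption holds, cu128 would correspond to CUDA 12.8, following the naming used for PyTorch wheels.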

Run the vLLM container

sudo docker run -d -t --net=host --gpus all \
 --privileged \
 --ipc=host \
 --name vllm \
 -v /root:/root \
 egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/vllm:0.9.0.1-pytorch2.7-cu128-20250612

Then check the container status with docker ps.

Run the model

This test uses the Qwen3-0.6B model as an example to demonstrate inference with vLLM.

Run the following commands to install git-lfs, which makes it easier to download large language models.

apt install git-lfs
cd /root

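Run the following command to download the Qwen3-0.6B model. The ModelScope address is kept as a comment; in regions outside China you can clone directly from Hugging Face.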

# git lfs clone https://www.modelscope.cn/Qwen/Qwen3-0.6B.git
# git lfs clone has been superseded by plain git clone; the lfs subcommand is no longer needed.
git clone https://huggingface.co/Qwen/Qwen3-0.6B

Run the following command to enter the vLLM container.

docker exec -it vllm bash

Test vLLM online inference

Run the following command to start the vLLM inference service:

python3 -m vllm.entrypoints.openai.api_server \
--model /root/Qwen3-0.6B \
--trust-remote-code \
--tensor-parallel-size 1

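Once the log shows "Application startup complete", the vLLM service has started successfully. Open another SSH session, enter the container, and run the following command to test inference.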

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "/root/Qwen3-0.6B",
    "messages": [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Explain what large-model inference is."}
    ]}'
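
The same check can be scripted. The sketch below waits for the server to come up and then sends the same chat request from Python using only the standard library. It assumes the server listens on localhost:8000 as in the curl command and that the OpenAI-compatible /v1/models route answers once startup is complete; wait_until_ready, build_chat_request, and chat are hypothetical helper names, not part of vLLM.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl command


def wait_until_ready(probe=None, timeout_s=300.0, interval_s=2.0):
    """Poll until the probe succeeds (default: GET /v1/models returns 200) or time out."""
    def http_probe():
        try:
            with urllib.request.urlopen(f"{BASE_URL}/v1/models", timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False  # server not accepting connections yet

    probe = probe or http_probe
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False


def build_chat_request(model, user_text, system_text="You are a friendly AI assistant."):
    """Build the POST request with the same JSON shape as the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, user_text):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, user_text), timeout=120) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the service running, wait_until_ready() followed by chat("/root/Qwen3-0.6B", "Explain what large-model inference is.") would print the model's answer. The model field must match the path passed to --model, since no separate served model name was set.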

Performance test

git lfs install
# git lfs clone https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split.git
git clone https://github.com/vllm-project/vllm.git

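Enter the container and run the benchmark. Invoking python3 /root/vllm/benchmarks/benchmark_serving.py is no longer supported; only the vllm bench serve command is supported, and ShareGPT is not supported.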

vllm bench serve \
--model /root/Qwen3-0.6B \
--num-prompts 10 \
--dataset-name random \
--random-input-len 512 \
--random-output-len 4096 \
--max-concurrency 1
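
To get a feel for the size of this run, the token volume can be estimated from the flags. This is a rough sketch that assumes every random prompt has exactly the requested input length and that each response stops at or before the requested output length; token_budget is a hypothetical helper.

```python
def token_budget(num_prompts: int, input_len: int, output_len: int) -> dict:
    """Upper bounds on token counts for a fixed-length synthetic workload."""
    return {
        "input_tokens": num_prompts * input_len,
        "max_output_tokens": num_prompts * output_len,
    }


# Values taken from the vllm bench serve command above
budget = token_budget(num_prompts=10, input_len=512, output_len=4096)
print(budget)
# {'input_tokens': 5120, 'max_output_tokens': 40960}
```

Because --max-concurrency is 1, the requests are issued one at a time, so this configuration measures single-stream latency instead of throughput under load.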

I am keeping the old command I used previously, for reference:

# Flag notes (a comment placed after a trailing backslash would break the line continuation):
#   --sonnet-input-len   maximum input length
#   --sonnet-output-len  maximum output length
#   --sonnet-prefix-len  prefix length
#   --num-prompts        randomly select or process 400 prompts from the dataset for the test
#   --request-rate       simulate a stress test of 20 requests per second, lasting 20 seconds,
#                        for a total of 400 requests, to evaluate throughput and latency under load
pip install pandas datasets &&
 python3 /root/vllm/benchmarks/benchmark_serving.py \
 --backend vllm \
 --model /root/llm-model/Qwen/QwQ-32B \
 --served-model-name Qwen/QwQ-32B \
 --sonnet-input-len 1024 \
 --sonnet-output-len 4096 \
 --sonnet-prefix-len 50 \
 --num-prompts 400 \
 --request-rate 20 \
 --port 8000 \
 --trust-remote-code \
 --dataset-name sharegpt \
 --save-result \
 --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json
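
The "20 seconds" figure for the old command follows from dividing the number of prompts by the request rate. A small sketch of that arithmetic, assuming evenly spaced arrivals (the benchmark tool may randomize arrival times, so the real window will vary); send_window_seconds is a hypothetical helper.

```python
def send_window_seconds(num_prompts: int, request_rate: float) -> float:
    """Time needed to issue all prompts at a steady rate of requests per second."""
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    return num_prompts / request_rate


# --num-prompts 400 and --request-rate 20 from the old command
print(send_window_seconds(400, 20))
# 20.0
```

This is only the time over which requests are sent. The whole run lasts longer, because the last requests still have to finish generating.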

This concludes the tutorial.

vLLM basics

Create and use a container for vLLM

sudo docker run -d -t --net=host --gpus all \
 --privileged \
 --ipc=host \
 --name vllm \
 -v /root:/root \
 ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/cuda-profiling:1.2.2-cuda12.8.1-runtime-cudnn9-ubuntu24.04

Install PyTorch
The official site provides installation instructions: https://pytorch.org/get-started/locally/?ajs_aid=f57af34c-85d7-41ce-9ab4-b66c4cd82442

pip3 install torch torchvision

First reset the Ubuntu package sources with linuxmirror, then install vim.

Verify the installation

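# Create a file named test.py with the following contents
import torch
x = torch.rand(5, 3)
print(x)
################
# Then run:
python test.py
# Output similar to the following confirms that the installation succeeded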
tensor([[0.1387, 0.9382, 0.3777],
        [0.4984, 0.9640, 0.2412],
        [0.1817, 0.5296, 0.3098],
        [0.8049, 0.7036, 0.2136],
        [0.0949, 0.3358, 0.5394]])

A more thorough check:

Either run python -c "import torch; print('PyTorch Version:', torch.__version__)", or put the following in test.py:

import torch

# Print PyTorch version
print("PyTorch Version:", torch.__version__)

# Check CUDA (GPU) availability
print("CUDA Available:", torch.cuda.is_available())

# If CUDA is available, print GPU details
if torch.cuda.is_available():
    print("CUDA Version:", torch.version.cuda)
    print("GPU Device Name:", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available. Using CPU.")

Then run python3 test.py, which prints output like the following:

PyTorch Version: 2.8.0+cu128
CUDA Available: True
CUDA Version: 12.8
GPU Device Name: Tesla T4

Check the vLLM version

python -c "import vllm; print('vLLM Version:', vllm.__version__)"
# Example output:
vLLM Version: 0.10.2

vllm serve /root/Qwen3-0.6B --trust-remote-code --tensor-parallel-size 1

Install uv and vLLM

curl -LsSf https://astral.sh/uv/install.sh | sh

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

The next time you log in to the container with docker, run source .venv/bin/activate to activate this environment again.