Deploying the Qwen3-32B-AWQ Model on an 8x RTX 5090D Server and Running Performance Tests

1. Background

We recently got hold of an 8-GPU RTX 5090D server for testing and evaluation.

The GPU topology is as follows:

(test) root@ubuntu:/opt/models# nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    32-63,96-127    1               N/A
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    32-63,96-127    1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    32-63,96-127    1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      32-63,96-127    1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

The original plan was to serve the model directly with vLLM 0.9.0, but we kept bouncing between version-incompatibility problems and outright inference errors. In the end we found a workaround on GitHub: run vLLM inference inside the Docker image released by NVIDIA itself.

2. Errors That Affected Testing

2.1. Direct vLLM inference error, caused by a torch compatibility problem

RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with 'TORCH_USE_CUDA_DSA' to enable device-side assertions.
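
The RTX 5090 / 5090 D is a Blackwell card with compute capability 12.0 (sm_120), so this error usually means the installed PyTorch wheel was not built with sm_120 kernels; the cu128 wheels are the ones expected to include them. A quick diagnostic (a sketch added here, not part of the original troubleshooting log):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_arch_list())"

If sm_120 does not appear in the printed architecture list, the error above will reproduce regardless of the vLLM version.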

2.2. After fixing the compatibility issue, an NCCL error appears

raise RuntimeError(f"NCCL error: {error_str})
RuntimeError NCCL error: unhandled cuda error (run with NCCL DEBUG-INFO for details)


We then tried reinstalling torch:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --ignore-installed

and reinstalling NCCL:

python -m pip install "nvidia-nccl-cu12>=2.26.5"

After that the server started successfully, but as soon as inference began it fell back to the "no kernel image is available for execution on the device" error:

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /opt/models/Qwen3-32B-AWQ \
  --gpu-memory-utilization 0.8 \
  --tensor-parallel-size 4 \
  --served-model-name Qwen3-32B \
  --port 8801 \
  --host 0.0.0.0 \
  --distributed-executor-backend mp

In the end, following the GitHub issue "[Doc]: Steps to run vLLM on your RTX5080 or 5090! #14452", we ran the tests with the nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3 image. It appears that this version of vLLM is not yet fully compatible with the RTX 5090.

3. Running the Test with Docker

3.1. Prepare the inference environment

3.1.1. Install Docker

snap install docker

3.1.2. Install the NVIDIA Container Toolkit required by Docker

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt update
apt install nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
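
Installing the toolkit alone does not register the NVIDIA runtime with the Docker daemon, which the compose file below relies on via runtime: nvidia. A minimal sketch of that step (this is the standard command; for the snap-packaged Docker used here the daemon config path is an assumption):

nvidia-ctk runtime configure --runtime=docker
# For snap Docker the daemon.json may live under the snap directory (path assumed):
# nvidia-ctk runtime configure --runtime=docker --config=/var/snap/docker/current/config/daemon.json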

3.1.3. Restart the Docker service so the configuration takes effect

systemctl restart snap.docker.dockerd

3.1.4. Write a docker-compose.yml that serves the model on top of the NVIDIA base image

services:
  vllm-server:
    image: nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3
    command: >
      vllm serve /opt/tritonserver/models/Qwen3-32B-AWQ
      --trust-remote-code
      --enable-prefix-caching
      --disable-sliding-window
      --gpu-memory-utilization 0.9
      --port 8005
      --max-model-len 32768
      --max-num-seqs 2
      --tensor-parallel-size 4
    environment:
      - VLLM_ATTENTION_BACKEND=xformers
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache:/root/.cache
      - /root/models:/opt/tritonserver/models
    runtime: nvidia
    network_mode: host
    ipc: host
    restart: "no"

3.1.5. Start the service

docker-compose up -d
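
Once the container is up, the OpenAI-compatible endpoint can be smoke-tested before applying any load. A minimal sketch (model name and port taken from the compose file above):

curl http://127.0.0.1:8005/v1/models
curl http://127.0.0.1:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/opt/tritonserver/models/Qwen3-32B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'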

3.2. Run the load test with evalscope

3.2.1. Install the evalscope environment

Installing Miniconda is omitted here; follow the official documentation.

conda create -n evalscope python=3.10
conda activate evalscope
pip install 'evalscope[app,perf]' -U

3.2.2. Run the 4-GPU load test

(test) root@ubuntu:/opt/models# evalscope perf     --model /opt/tritonserver/models/Qwen3-32B-AWQ     --url "http://127.0.0.1:8005/v1/chat/completions"     --parallel 5     --number 20     --api openai     --dataset openqa     --stream
2025-06-29 00:50:50,964 - evalscope - INFO - Save the result to: outputs/20250629_005050/Qwen3-32B-AWQ
2025-06-29 00:50:50,964 - evalscope - INFO - Starting benchmark with args:
2025-06-29 00:50:50,964 - evalscope - INFO - {
    "model": "/opt/tritonserver/models/Qwen3-32B-AWQ",
    "model_id": "Qwen3-32B-AWQ",
    "attn_implementation": null,
    "api": "openai",
    "tokenizer_path": null,
    "port": 8877,
    "url": "http://127.0.0.1:8005/v1/chat/completions",
    "headers": {},
    "connect_timeout": 600,
    "read_timeout": 600,
    "api_key": null,
    "no_test_connection": false,
    "number": 20,
    "parallel": 5,
    "rate": -1,
    "log_every_n_query": 10,
    "debug": false,
    "wandb_api_key": null,
    "swanlab_api_key": null,
    "name": null,
    "outputs_dir": "outputs/20250629_005050/Qwen3-32B-AWQ",
    "max_prompt_length": 9223372036854775807,
    "min_prompt_length": 0,
    "prefix_length": 0,
    "prompt": null,
    "query_template": null,
    "apply_chat_template": true,
    "dataset": "openqa",
    "dataset_path": null,
    "frequency_penalty": null,
    "repetition_penalty": null,
    "logprobs": null,
    "max_tokens": 2048,
    "min_tokens": null,
    "n_choices": null,
    "seed": 0,
    "stop": null,
    "stop_token_ids": null,
    "stream": true,
    "temperature": 0.0,
    "top_p": null,
    "top_k": null,
    "extra_args": {}
}
2025-06-29 00:50:51,718 - evalscope - INFO - Test connection successful.
2025-06-29 00:50:52,308 - evalscope - INFO - Save the data base to: outputs/20250629_005050/Qwen3-32B-AWQ/benchmark_data.db
Processing:  45%|█████████████████████████████████████████████████████                                                                 | 9/20 [01:31<01:52, 10.24s/it]2025-06-29 00:52:27,819 - evalscope - INFO - {
  "Time taken for tests (s)": 95.5079,
  "Number of concurrency": 5,
  "Total requests": 10,
  "Succeed requests": 10,
  "Failed requests": 0,
  "Output token throughput (tok/s)": 133.7795,
  "Total token throughput (tok/s)": 136.8055,
  "Request throughput (req/s)": 0.1047,
  "Average latency (s)": 39.6521,
  "Average time to first token (s)": 21.0828,
  "Average time per output token (s)": 0.0145,
  "Average input tokens per request": 28.9,
  "Average output tokens per request": 1277.7,
  "Average package latency (s)": 0.0145,
  "Average package per request": 1277.7
}
Processing:  95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏     | 19/20 [03:01<00:07,  7.96s/it]2025-06-29 00:54:19,768 - evalscope - INFO - {
  "Time taken for tests (s)": 207.457,
  "Number of concurrency": 5,
  "Total requests": 20,
  "Succeed requests": 20,
  "Failed requests": 0,
  "Output token throughput (tok/s)": 128.4025,
  "Total token throughput (tok/s)": 131.2224,
  "Request throughput (req/s)": 0.0964,
  "Average latency (s)": 44.8448,
  "Average time to first token (s)": 25.5177,
  "Average time per output token (s)": 0.0145,
  "Average input tokens per request": 29.25,
  "Average output tokens per request": 1331.9,
  "Average package latency (s)": 0.0145,
  "Average package per request": 1331.9
}
Processing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [03:27<00:00, 10.37s/it]
2025-06-29 00:54:19,874 - evalscope - INFO -
Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |  207.457  |
+-----------------------------------+-----------+
| Number of concurrency             |    5      |
+-----------------------------------+-----------+
| Total requests                    |   20      |
+-----------------------------------+-----------+
| Succeed requests                  |   20      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  128.403  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  131.222  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0964 |
+-----------------------------------+-----------+
| Average latency (s)               |   44.8448 |
+-----------------------------------+-----------+
| Average time to first token (s)   |   25.5177 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0145 |
+-----------------------------------+-----------+
| Average input tokens per request  |   29.25   |
+-----------------------------------+-----------+
| Average output tokens per request | 1331.9    |
+-----------------------------------+-----------+
| Average package latency (s)       |    0.0145 |
+-----------------------------------+-----------+
| Average package per request       | 1331.9    |
+-----------------------------------+-----------+
2025-06-29 00:54:19,886 - evalscope - INFO -
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     | 14.8518  | 0.0142  |  0.0144  |   36.7898   |      21      |     1002      |    20.0306     |    20.8404    |
|     25%     |  22.124  | 0.0143  |  0.0145  |   40.9366   |      26      |     1111      |    25.1348     |    25.6956    |
|     50%     | 30.0105  | 0.0144  |  0.0145  |   48.3621   |      28      |     1289      |    28.3309     |    28.8152    |
|     66%     | 31.3366  | 0.0145  |  0.0146  |   52.2564   |      31      |     1305      |    31.6099     |    32.3916    |
|     75%     | 32.8376  | 0.0145  |  0.0146  |   52.5391   |      34      |     1630      |    35.9473     |    36.3159    |
|     80%     | 33.5577  | 0.0146  |  0.0146  |   52.8828   |      37      |     1645      |    40.1009     |    41.1051    |
|     90%     | 37.6004  | 0.0148  |  0.0147  |   56.9723   |      41      |     2048      |    67.9679     |    69.3509    |
|     95%     | 38.6689  |  0.015  |  0.0148  |   59.6889   |      45      |     2048      |    68.0682     |    69.528     |
|     98%     | 38.6689  | 0.0154  |  0.0148  |   59.6889   |      45      |     2048      |    68.0682     |    69.528     |
|     99%     | 38.6689  | 0.0157  |  0.0148  |   59.6889   |      45      |     2048      |    68.0682     |    69.528     |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-06-29 00:54:19,887 - evalscope - INFO - Save the summary to: outputs/20250629_005050/Qwen3-32B-AWQ

3.2.3. GPU utilization during the load test

Sun Jun 29 01:47:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090 D      Off |   00000000:16:00.0 Off |                  N/A |
| 39%   47C    P1            147W /  575W |   30263MiB /  32607MiB |     61%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090 D      Off |   00000000:38:00.0 Off |                  N/A |
| 40%   47C    P1            145W /  575W |   30235MiB /  32607MiB |     81%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 5090 D      Off |   00000000:49:00.0 Off |                  N/A |
| 43%   49C    P1            148W /  575W |   30235MiB /  32607MiB |     83%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 5090 D      Off |   00000000:5A:00.0 Off |                  N/A |
| 40%   47C    P1            141W /  575W |   30235MiB /  32607MiB |     33%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 5090 D      Off |   00000000:98:00.0 Off |                  N/A |
| 30%   33C    P8             12W /  575W |       3MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 5090 D      Off |   00000000:B8:00.0 Off |                  N/A |
| 30%   33C    P8             20W /  575W |       3MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 5090 D      Off |   00000000:C8:00.0 Off |                  N/A |
| 30%   34C    P8             22W /  575W |       3MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 5090 D      Off |   00000000:D8:00.0 Off |                  N/A |
| 30%   34C    P8             12W /  575W |       3MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

3.2.4. CPU utilization during the load test

top - 01:48:17 up 3 days, 17:28,  9 users,  load average: 4.42, 3.99, 3.48
Tasks: 1367 total,   5 running, 1362 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.1 us,  0.6 sy,  0.0 ni, 97.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 1031697.+total, 832427.9 free,  12809.9 used, 186459.2 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 995540.9 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 739296 root      20   0  195.6g   6.5g   4.9g R  85.4   0.6   5:46.54 python3
 739200 root      20   0  198.8g   6.8g   4.9g R  85.1   0.7   5:51.75 python3
 739294 root      20   0  195.7g   6.6g   5.0g R  85.1   0.7   5:43.39 python3
 739295 root      20   0  195.7g   6.6g   5.0g R  77.8   0.7   5:46.38 python3
 739897 root      20   0   49.1g 677088 281532 S   5.3   0.1   0:33.57 evalscope
 738281 root      20   0   16.8g 955408 318480 S   4.0   0.1   0:26.31 vllm
 740992 root      20   0   11964   5656   3428 R   1.3   0.0   0:00.70 top
 741137 root      20   0  166360  19872   5420 S   1.0   0.0   0:00.03 nvidia-smi
  20210 nvidia-+  20   0    5476   2068   1896 S   0.3   0.0   0:05.53 nvidia-persiste
 143225 root      20   0    7304   4084   2792 S   0.3   0.0   0:23.48 watch
 733652 ubuntu    20   0   17476   8548   5928 S   0.3   0.0   0:00.02 sshd
      1 root      20   0  166596  12252   8492 S   0.0   0.0   0:28.93 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:21.25 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
      4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
      5 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 slub_flushwq
      6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 netns
      8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-events_highpri
     11 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
     12 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_tasks_rude_
     13 root      20   0       0      0      0 S   0.0   0.0   0:00.00 rcu_tasks_trace
     14 root      20   0       0      0      0 S   0.0   0.0   0:00.52 ksoftirqd/0
     15 root      20   0       0      0      0 I   0.0   0.0   0:30.41 rcu_sched
     16 root      rt   0       0      0      0 S   0.0   0.0   0:02.42 migration/0
     17 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/0
     19 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/0
     20 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/1
     21 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/1
     22 root      rt   0       0      0      0 S   0.0   0.0   0:12.44 migration/1
     23 root      20   0       0      0      0 S   0.0   0.0   0:00.16 ksoftirqd/1
     25 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/1:0H-events_highpri
     26 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/2
     27 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/2
     28 root      rt   0       0      0      0 S   0.0   0.0   0:12.04 migration/2
     29 root      20   0       0      0      0 S   0.0   0.0   0:00.14 ksoftirqd/2
     31 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/2:0H-kblockd
     32 root      20   0       0      0      0 S   0.0   0.0   0:00.00 cpuhp/3
     33 root     -51   0       0      0      0 S   0.0   0.0   0:00.00 idle_inject/3
     34 root      rt   0       0      0      0 S   0.0   0.0   0:11.00 migration/3

3.2.5. Memory usage during the load test

(base) root@ubuntu:~# free -lm
               total        used        free      shared  buff/cache   available
Mem:         1031697       12803      832434       17298      186459      995547
Low:         1031697      199262      832434
High:              0           0           0
Swap:           8191           0        8191

3.2.6. Container log output during the test
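
The lines below come from the vLLM container; with the compose file above they can be followed with something like:

docker-compose logs -f vllm-server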

INFO 06-29 02:02:12 [engine.py:310] Added request cmpl-766a4b9729f2404aa7528a162026809b-0.
INFO 06-29 02:02:14 [metrics.py:489] Avg prompt throughput: 5718.9 tokens/s, Avg generation throughput: 50.7 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 4.3%, CPU KV cache usage: 0.0%.
INFO 06-29 02:02:14 [metrics.py:505] Prefix cache hit rate: GPU: 91.25%, CPU: 0.00%
INFO 06-29 02:02:19 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 4.4%, CPU KV cache usage: 0.0%.
INFO 06-29 02:02:19 [metrics.py:505] Prefix cache hit rate: GPU: 91.25%, CPU: 0.00%
INFO 06-29 02:02:24 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 4.5%, CPU KV cache usage: 0.0%.
INFO 06-29 02:02:24 [metrics.py:505] Prefix cache hit rate: GPU: 91.25%, CPU: 0.00%

3.3. 8-GPU test

Several parameter combinations were tried along the way; the details are omitted, and the last run in which every request succeeded is used as the example.

3.3.1. Inference parameters

version: '3.8'

services:
  vllm-server:
    image: nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3
    command: >
      vllm serve /opt/tritonserver/models/Qwen3-32B-AWQ
      --trust-remote-code
      --enable-prefix-caching
      --disable-sliding-window
      --gpu-memory-utilization 0.9
      --port 8005
      --max-model-len 32768
      --max-num-seqs 32
      --tensor-parallel-size 8
      --max-num-batched-tokens 32768
      --block-size 16
      --swap-space 4
      --enforce-eager
    environment:
      - VLLM_ATTENTION_BACKEND=xformers
      - NVIDIA_VISIBLE_DEVICES=all
      - NCCL_P2P_LEVEL=NVL
      - NCCL_IB_DISABLE=1
    volumes:
      - ~/.cache:/root/.cache
      - /root/models:/opt/tritonserver/models
    runtime: nvidia
    network_mode: host
    ipc: host
    restart: "no"
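
Before launching the 8-GPU run it is worth confirming that the container actually sees all eight cards (a sketch using the compose service name defined above):

docker-compose exec vllm-server nvidia-smi -L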

3.3.2. Load test parameters

evalscope perf \
 --parallel 1 10 50 100 200 \
 --number 10 20 100 200 400 \
 --url "http://127.0.0.1:8005/v1/completions" \
 --model /opt/tritonserver/models/Qwen3-32B-AWQ \
 --log-every-n-query 5 \
 --connect-timeout 6000 \
 --read-timeout 6000 \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --min-prompt-length 2048 \
 --max-prompt-length 2048 \
 --api openai \
 --dataset speed_benchmark 
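
Passing several --parallel and --number values makes evalscope run one stage per pair (1/10, 10/20, 50/100, 100/200, 200/400), which matches the five concurrency rows in the final report below. A single stage can also be reproduced on its own, for example only the 200-concurrency stage (same parameters as above):

evalscope perf \
 --parallel 200 \
 --number 400 \
 --url "http://127.0.0.1:8005/v1/completions" \
 --model /opt/tritonserver/models/Qwen3-32B-AWQ \
 --max-tokens 2048 \
 --min-tokens 2048 \
 --min-prompt-length 2048 \
 --max-prompt-length 2048 \
 --api openai \
 --dataset speed_benchmark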

3.3.3. Container log output

INFO 07-01 07:07:07 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:12 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 722.1 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.3%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:12 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:17 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 722.0 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.4%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:17 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:22 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 718.4 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.5%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:22 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:27 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 719.7 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.6%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:27 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:32 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 718.1 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.7%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:32 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:37 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 717.7 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:37 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:42 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 716.2 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.9%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:42 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:47 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 714.7 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 5.0%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:47 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:52 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 715.0 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 5.1%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:52 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%

3.3.4. Test results

2025-07-01 07:10:01,050 - evalscope - INFO -
Benchmarking summary:
+-----------------------------------+------------+
| Key                               |      Value |
+===================================+============+
| Time taken for tests (s)          |  1191.02   |
+-----------------------------------+------------+
| Number of concurrency             |   200      |
+-----------------------------------+------------+
| Total requests                    |   400      |
+-----------------------------------+------------+
| Succeed requests                  |   400      |
+-----------------------------------+------------+
| Failed requests                   |     0      |
+-----------------------------------+------------+
| Output token throughput (tok/s)   |   687.816  |
+-----------------------------------+------------+
| Total token throughput (tok/s)    |  4986.75   |
+-----------------------------------+------------+
| Request throughput (req/s)        |     0.3358 |
+-----------------------------------+------------+
| Average latency (s)               |   453.865  |
+-----------------------------------+------------+
| Average time to first token (s)   |   362.35   |
+-----------------------------------+------------+
| Average time per output token (s) |     0.0447 |
+-----------------------------------+------------+
| Average input tokens per request  | 12800.2    |
+-----------------------------------+------------+
| Average output tokens per request |  2048      |
+-----------------------------------+------------+
| Average package latency (s)       |     0.0447 |
+-----------------------------------+------------+
| Average package per request       |  2047      |
+-----------------------------------+------------+
2025-07-01 07:10:01,431 - evalscope - INFO -
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     | 92.0465  | 0.0431  |  0.0435  |  181.9807   |      1       |     2048      |     3.1836     |    3.7153     |
|     25%     | 271.3524 | 0.0438  |  0.0439  |  360.5895   |     6144     |     2048      |     3.6678     |    14.6708    |
|     50%     | 459.0579 | 0.0444  |  0.0447  |  551.4757   |    14336     |     2048      |     3.7184     |    29.6172    |
|     66%     | 461.576  | 0.0447  |  0.0447  |  553.1902   |    14336     |     2048      |     4.5274     |    50.3719    |
|     75%     | 466.7801 |  0.045  |  0.0449  |  558.3789   |    30720     |     2048      |     5.6796     |    51.1859    |
|     80%     | 466.7892 | 0.0451  |  0.0449  |  558.7151   |    30720     |     2048      |     7.5505     |    59.4955    |
|     90%     | 552.2437 | 0.0465  |  0.0451  |  643.2943   |    30720     |     2048      |     11.254     |    72.4224    |
|     95%     | 556.1562 | 0.0477  |  0.0478  |  647.7583   |    30720     |     2048      |    22.2845     |    90.0249    |
|     98%     | 558.924  | 0.0484  |  0.0478  |  650.5183   |    30720     |     2048      |    22.2869     |    178.27     |
|     99%     | 558.9281 | 0.0494  |  0.0478  |  650.5304   |    30720     |     2048      |    22.2874     |   356.5268    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-07-01 07:10:01,487 - evalscope - INFO -
Speed Benchmark Results:
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
|       1       |      7.04       |      0.0       |
|     6144      |       7.0       |      0.0       |
|     14336     |      6.69       |      0.0       |
|     30720     |      4.27       |      0.0       |
+---------------+-----------------+----------------+
2025-07-01 07:10:01,487 - evalscope - INFO - Save the summary to: outputs/20250701_061700/Qwen3-32B-AWQ/parallel_200_number_400
╭──────────────────────────────────────────────────────────╮
│ Performance Test Summary Report                          │
╰──────────────────────────────────────────────────────────╯

Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model                 │ Qwen3-32B-AWQ                    │
│ Total Generated       │ 1,495,040.0 tokens               │
│ Total Test Time       │ 3152.19 seconds                  │
│ Avg Output Rate       │ 474.29 tokens/sec                │
└───────────────────────┴──────────────────────────────────┘


Detailed Performance Metrics
+-------+------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+
| Conc. |  RPS | Avg Lat.(s) | P99 Lat.(s) | Gen. toks/s | Avg TTFT(s) | P99 TTFT(s) | Avg TPOT(s) | P99 TPOT(s) | Success Rate |
+-------+------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+
|     1 | 0.01 |      80.112 |      81.993 |       25.56 |       0.546 |       2.413 |       0.039 |       0.039 |       100.0% |
|    10 | 0.12 |      83.782 |      84.111 |      244.42 |       0.239 |       0.357 |       0.041 |       0.041 |       100.0% |
|    50 | 0.28 |     127.826 |     183.693 |      572.93 |      36.854 |      92.083 |       0.044 |       0.045 |       100.0% |
|   100 | 0.32 |     238.166 |     369.664 |      645.05 |     146.954 |     278.091 |       0.045 |       0.045 |       100.0% |
|   200 | 0.34 |     453.865 |     650.530 |      687.82 |     362.350 |     558.928 |       0.045 |       0.048 |       100.0% |
+-------+------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+--------------+


Best Performance Configuration
 Highest RPS         Concurrency 200 (0.34 req/sec)
 Lowest Latency      Concurrency 1 (80.112 seconds)

Performance Recommendations:
• The system seems not to have reached its performance bottleneck, try higher concurrency

4. Performance Test Summary

Load test results

  • Best configuration: max-model-len=32768 + max-num-seqs=32
  • Peak performance:
    • Output token throughput 687.8 tokens/s (at 200 concurrency)
    • Total token throughput 4986 tokens/s
  • GPU memory: about 28 GB of the 32.6 GB per card in use
  • Latency profile: average time to first token 362 s (at 200 concurrency), average time per output token 44.7 ms

Key findings

  • The prefix caching hit rate reached 99.66%, which significantly improves efficiency
  • Request success rate stayed at 100% even at high concurrency; the system has not yet hit its performance bottleneck
  • Container logs show GPU KV cache usage of only about 5%, so there is still room for optimization
