I recently got my hands on an eight-GPU RTX 5090 D server for testing and evaluation.
The GPU topology looks like this:
(test) root@ubuntu:/opt/models# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU1 NODE X NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU2 NODE NODE X NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU3 NODE NODE NODE X SYS SYS SYS SYS 0-31,64-95 0 N/A
GPU4 SYS SYS SYS SYS X NODE NODE NODE 32-63,96-127 1 N/A
GPU5 SYS SYS SYS SYS NODE X NODE NODE 32-63,96-127 1 N/A
GPU6 SYS SYS SYS SYS NODE NODE X NODE 32-63,96-127 1 N/A
GPU7 SYS SYS SYS SYS NODE NODE NODE X 32-63,96-127 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
The original plan was to serve the model directly with vLLM 0.9.0, but I kept bouncing between version incompatibilities and outright inference failures. I eventually found a workaround on GitHub: run vLLM inside one of NVIDIA's own Docker images.
RuntimeError: CUDA error: no kernel image is available for execution on the device
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
The error output:
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
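The usual root cause of "no kernel image is available" is an architecture mismatch: the RTX 5090 is Blackwell, compute capability 12.0 (sm_120), and a PyTorch wheel built before sm_120 existed ships no kernels for it. The check below is a simplified illustration of that rule, not PyTorch's actual code, and the arch lists are examples (the cu128 list only approximates current wheels):

```python
def wheel_supports(arch_list, capability):
    """Rough rule: a binary runs if it has SASS for the device's sm_XY,
    or PTX (compute_XY) for an equal-or-older architecture to JIT from."""
    major, minor = capability
    sm = f"sm_{major}{minor}"
    ptx_ok = any(
        a.startswith("compute_") and int(a.split("_")[1]) <= major * 10 + minor
        for a in arch_list
    )
    return sm in arch_list or ptx_ok

old_wheel = ["sm_70", "sm_80", "sm_86", "sm_90"]          # pre-Blackwell build
cu128_wheel = ["sm_70", "sm_80", "sm_86", "sm_90", "sm_120"]  # approximate

print(wheel_supports(old_wheel, (12, 0)))    # -> False, hence the error above
print(wheel_supports(cu128_wheel, (12, 0)))  # -> True
```

This is why reinstalling from the cu128 index (below) is the right direction, even though in my case it was not sufficient on its own.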
I then tried reinstalling torch:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --ignore-installed
and reinstalling NCCL:
python -m pip install "nvidia-nccl-cu12>=2.26.5"
With that, the server started successfully, but the moment inference began it fell back into the same "no kernel image is available for execution on the device" error.
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /opt/models/Qwen3-32B-AWQ \
--gpu-memory-utilization 0.8 \
--tensor-parallel-size 4 \
--served-model-name Qwen3-32B \
--port 8801 \
--host 0.0.0.0 \
--distributed-executor-backend mp
In the end, following [Doc]: Steps to run vLLM on your RTX5080 or 5090! #14452, I got the test running with the nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3 image. It looks like pip-installed vLLM had not yet sorted out compatibility with the 5090.
snap install docker
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt install nvidia-container-toolkit nvidia-container-toolkit-base libnvidia-container-tools libnvidia-container1
systemctl restart snap.docker.dockerd
services:
  vllm-server:
    image: nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3
    command: >
      vllm serve /opt/tritonserver/models/Qwen3-32B-AWQ
      --trust-remote-code
      --enable-prefix-caching
      --disable-sliding-window
      --gpu-memory-utilization 0.9
      --port 8005
      --max-model-len 32768
      --max-num-seqs 2
      --tensor-parallel-size 4
    environment:
      - VLLM_ATTENTION_BACKEND=xformers
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - ~/.cache:/root/.cache
      - /root/models:/opt/tritonserver/models
    runtime: nvidia
    network_mode: host
    ipc: host
    restart: "no"
docker-compose up -d
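With the container up, the endpoint can be exercised with a minimal standard-library client. This is a sketch: build_chat_request and send are my own helpers, and the model path must match what the server was launched with.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8005/v1/chat/completions"

def build_chat_request(model, prompt, max_tokens=64):
    """Build an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def send(req, timeout=600):
    """Send the request and return the first choice's text (server must be running)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Usage, with the compose service above running:
#   req = build_chat_request("/opt/tritonserver/models/Qwen3-32B-AWQ", "hello")
#   print(send(req))
```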
evalscope
Next, a stress test with evalscope. I will skip the miniconda environment setup here; follow the official docs for that.
conda create -n evalscope python=3.10
pip install 'evalscope[app,perf]' -U
(test) root@ubuntu:/opt/models# evalscope perf --model /opt/tritonserver/models/Qwen3-32B-AWQ --url "http://127.0.0.1:8005/v1/chat/completions" --parallel 5 --number 20 --api openai --dataset openqa --stream
2025-06-29 00:50:50,964 - evalscope - INFO - Save the result to: outputs/20250629_005050/Qwen3-32B-AWQ
2025-06-29 00:50:50,964 - evalscope - INFO - Starting benchmark with args:
2025-06-29 00:50:50,964 - evalscope - INFO - {
"model": "/opt/tritonserver/models/Qwen3-32B-AWQ",
"model_id": "Qwen3-32B-AWQ",
"attn_implementation": null,
"api": "openai",
"tokenizer_path": null,
"port": 8877,
"url": "http://127.0.0.1:8005/v1/chat/completions",
"headers": {},
"connect_timeout": 600,
"read_timeout": 600,
"api_key": null,
"no_test_connection": false,
"number": 20,
"parallel": 5,
"rate": -1,
"log_every_n_query": 10,
"debug": false,
"wandb_api_key": null,
"swanlab_api_key": null,
"name": null,
"outputs_dir": "outputs/20250629_005050/Qwen3-32B-AWQ",
"max_prompt_length": 9223372036854775807,
"min_prompt_length": 0,
"prefix_length": 0,
"prompt": null,
"query_template": null,
"apply_chat_template": true,
"dataset": "openqa",
"dataset_path": null,
"frequency_penalty": null,
"repetition_penalty": null,
"logprobs": null,
"max_tokens": 2048,
"min_tokens": null,
"n_choices": null,
"seed": 0,
"stop": null,
"stop_token_ids": null,
"stream": true,
"temperature": 0.0,
"top_p": null,
"top_k": null,
"extra_args": {}
}
2025-06-29 00:50:51,718 - evalscope - INFO - Test connection successful.
2025-06-29 00:50:52,308 - evalscope - INFO - Save the data base to: outputs/20250629_005050/Qwen3-32B-AWQ/benchmark_data.db
Processing: 45%|█████████████████████████████████████████████████████ | 9/20 [01:31<01:52, 10.24s/it]2025-06-29 00:52:27,819 - evalscope - INFO - {
"Time taken for tests (s)": 95.5079,
"Number of concurrency": 5,
"Total requests": 10,
"Succeed requests": 10,
"Failed requests": 0,
"Output token throughput (tok/s)": 133.7795,
"Total token throughput (tok/s)": 136.8055,
"Request throughput (req/s)": 0.1047,
"Average latency (s)": 39.6521,
"Average time to first token (s)": 21.0828,
"Average time per output token (s)": 0.0145,
"Average input tokens per request": 28.9,
"Average output tokens per request": 1277.7,
"Average package latency (s)": 0.0145,
"Average package per request": 1277.7
}
Processing: 95%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 19/20 [03:01<00:07, 7.96s/it]2025-06-29 00:54:19,768 - evalscope - INFO - {
"Time taken for tests (s)": 207.457,
"Number of concurrency": 5,
"Total requests": 20,
"Succeed requests": 20,
"Failed requests": 0,
"Output token throughput (tok/s)": 128.4025,
"Total token throughput (tok/s)": 131.2224,
"Request throughput (req/s)": 0.0964,
"Average latency (s)": 44.8448,
"Average time to first token (s)": 25.5177,
"Average time per output token (s)": 0.0145,
"Average input tokens per request": 29.25,
"Average output tokens per request": 1331.9,
"Average package latency (s)": 0.0145,
"Average package per request": 1331.9
}
Processing: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [03:27<00:00, 10.37s/it]
2025-06-29 00:54:19,874 - evalscope - INFO -
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 207.457 |
+-----------------------------------+-----------+
| Number of concurrency | 5 |
+-----------------------------------+-----------+
| Total requests | 20 |
+-----------------------------------+-----------+
| Succeed requests | 20 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 128.403 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 131.222 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 0.0964 |
+-----------------------------------+-----------+
| Average latency (s) | 44.8448 |
+-----------------------------------+-----------+
| Average time to first token (s) | 25.5177 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0145 |
+-----------------------------------+-----------+
| Average input tokens per request | 29.25 |
+-----------------------------------+-----------+
| Average output tokens per request | 1331.9 |
+-----------------------------------+-----------+
| Average package latency (s) | 0.0145 |
+-----------------------------------+-----------+
| Average package per request | 1331.9 |
+-----------------------------------+-----------+
2025-06-29 00:54:19,886 - evalscope - INFO -
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 14.8518 | 0.0142 | 0.0144 | 36.7898 | 21 | 1002 | 20.0306 | 20.8404 |
| 25% | 22.124 | 0.0143 | 0.0145 | 40.9366 | 26 | 1111 | 25.1348 | 25.6956 |
| 50% | 30.0105 | 0.0144 | 0.0145 | 48.3621 | 28 | 1289 | 28.3309 | 28.8152 |
| 66% | 31.3366 | 0.0145 | 0.0146 | 52.2564 | 31 | 1305 | 31.6099 | 32.3916 |
| 75% | 32.8376 | 0.0145 | 0.0146 | 52.5391 | 34 | 1630 | 35.9473 | 36.3159 |
| 80% | 33.5577 | 0.0146 | 0.0146 | 52.8828 | 37 | 1645 | 40.1009 | 41.1051 |
| 90% | 37.6004 | 0.0148 | 0.0147 | 56.9723 | 41 | 2048 | 67.9679 | 69.3509 |
| 95% | 38.6689 | 0.015 | 0.0148 | 59.6889 | 45 | 2048 | 68.0682 | 69.528 |
| 98% | 38.6689 | 0.0154 | 0.0148 | 59.6889 | 45 | 2048 | 68.0682 | 69.528 |
| 99% | 38.6689 | 0.0157 | 0.0148 | 59.6889 | 45 | 2048 | 68.0682 | 69.528 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-06-29 00:54:19,887 - evalscope - INFO - Save the summary to: outputs/20250629_005050/Qwen3-32B-AWQ
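As a sanity check, the headline throughput figures follow directly from the token counts and wall-clock time reported in the same summary:

```python
# Numbers copied from the benchmarking summary above.
requests = 20
wall_s = 207.457
avg_in, avg_out = 29.25, 1331.9

out_tps = requests * avg_out / wall_s            # output tokens per second
total_tps = requests * (avg_in + avg_out) / wall_s
req_ps = requests / wall_s

print(round(out_tps, 3))    # -> 128.403, matches "Output token throughput"
print(round(total_tps, 3))  # -> 131.222, matches "Total token throughput"
print(round(req_ps, 4))     # -> 0.0964, matches "Request throughput"
```

The other striking figure is the 25.5 s average time to first token at only 5-way concurrency, which already hints that prefill is the bottleneck on this setup.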
Sun Jun 29 01:47:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169 Driver Version: 570.169 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 D Off | 00000000:16:00.0 Off | N/A |
| 39% 47C P1 147W / 575W | 30263MiB / 32607MiB | 61% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 D Off | 00000000:38:00.0 Off | N/A |
| 40% 47C P1 145W / 575W | 30235MiB / 32607MiB | 81% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 5090 D Off | 00000000:49:00.0 Off | N/A |
| 43% 49C P1 148W / 575W | 30235MiB / 32607MiB | 83% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 5090 D Off | 00000000:5A:00.0 Off | N/A |
| 40% 47C P1 141W / 575W | 30235MiB / 32607MiB | 33% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 5090 D Off | 00000000:98:00.0 Off | N/A |
| 30% 33C P8 12W / 575W | 3MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 5090 D Off | 00000000:B8:00.0 Off | N/A |
| 30% 33C P8 20W / 575W | 3MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 5090 D Off | 00000000:C8:00.0 Off | N/A |
| 30% 34C P8 22W / 575W | 3MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA GeForce RTX 5090 D Off | 00000000:D8:00.0 Off | N/A |
| 30% 34C P8 12W / 575W | 3MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
top - 01:48:17 up 3 days, 17:28, 9 users, load average: 4.42, 3.99, 3.48
Tasks: 1367 total, 5 running, 1362 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.1 us, 0.6 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 1031697.+total, 832427.9 free, 12809.9 used, 186459.2 buff/cache
MiB Swap: 8192.0 total, 8192.0 free, 0.0 used. 995540.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
739296 root 20 0 195.6g 6.5g 4.9g R 85.4 0.6 5:46.54 python3
739200 root 20 0 198.8g 6.8g 4.9g R 85.1 0.7 5:51.75 python3
739294 root 20 0 195.7g 6.6g 5.0g R 85.1 0.7 5:43.39 python3
739295 root 20 0 195.7g 6.6g 5.0g R 77.8 0.7 5:46.38 python3
739897 root 20 0 49.1g 677088 281532 S 5.3 0.1 0:33.57 evalscope
738281 root 20 0 16.8g 955408 318480 S 4.0 0.1 0:26.31 vllm
740992 root 20 0 11964 5656 3428 R 1.3 0.0 0:00.70 top
741137 root 20 0 166360 19872 5420 S 1.0 0.0 0:00.03 nvidia-smi
20210 nvidia-+ 20 0 5476 2068 1896 S 0.3 0.0 0:05.53 nvidia-persiste
143225 root 20 0 7304 4084 2792 S 0.3 0.0 0:23.48 watch
733652 ubuntu 20 0 17476 8548 5928 S 0.3 0.0 0:00.02 sshd
1 root 20 0 166596 12252 8492 S 0.0 0.0 0:28.93 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:21.25 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
5 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 slub_flushwq
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 netns
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H-events_highpri
11 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
12 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_tasks_rude_
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_tasks_trace
14 root 20 0 0 0 0 S 0.0 0.0 0:00.52 ksoftirqd/0
15 root 20 0 0 0 0 I 0.0 0.0 0:30.41 rcu_sched
16 root rt 0 0 0 0 S 0.0 0.0 0:02.42 migration/0
17 root -51 0 0 0 0 S 0.0 0.0 0:00.00 idle_inject/0
19 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
20 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
21 root -51 0 0 0 0 S 0.0 0.0 0:00.00 idle_inject/1
22 root rt 0 0 0 0 S 0.0 0.0 0:12.44 migration/1
23 root 20 0 0 0 0 S 0.0 0.0 0:00.16 ksoftirqd/1
25 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/1:0H-events_highpri
26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/2
27 root -51 0 0 0 0 S 0.0 0.0 0:00.00 idle_inject/2
28 root rt 0 0 0 0 S 0.0 0.0 0:12.04 migration/2
29 root 20 0 0 0 0 S 0.0 0.0 0:00.14 ksoftirqd/2
31 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/2:0H-kblockd
32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/3
33 root -51 0 0 0 0 S 0.0 0.0 0:00.00 idle_inject/3
34 root rt 0 0 0 0 S 0.0 0.0 0:11.00 migration/3
(base) root@ubuntu:~# free -lm
total used free shared buff/cache available
Mem: 1031697 12803 832434 17298 186459 995547
Low: 1031697 199262 832434
High: 0 0 0
Swap: 8191 0 8191
INFO 06-29 02:02:12 [engine.py:310] Added request cmpl-766a4b9729f2404aa7528a162026809b-0.
INFO 06-29 02:02:14 [metrics.py:489] Avg prompt throughput: 5718.9 tokens/s, Avg generation throughput: 50.7 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 4.3%, CPU KV cache usage: 0.0%.
INFO 06-29 02:02:14 [metrics.py:505] Prefix cache hit rate: GPU: 91.25%, CPU: 0.00%
INFO 06-29 02:02:19 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 4.4%, CPU KV cache usage: 0.0%.
INFO 06-29 02:02:19 [metrics.py:505] Prefix cache hit rate: GPU: 91.25%, CPU: 0.00%
INFO 06-29 02:02:24 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.3 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 4.5%, CPU KV cache usage: 0.0%.
INFO 06-29 02:02:24 [metrics.py:505] Prefix cache hit rate: GPU: 91.25%, CPU: 0.00%
I tested many parameter combinations in between; skipping those details, here is the final configuration, under which every request succeeded.
version: '3.8'
services:
  vllm-server:
    image: nvcr.io/nvidia/tritonserver:25.05-vllm-python-py3
    command: >
      vllm serve /opt/tritonserver/models/Qwen3-32B-AWQ
      --trust-remote-code
      --enable-prefix-caching
      --disable-sliding-window
      --gpu-memory-utilization 0.9
      --port 8005
      --max-model-len 32768
      --max-num-seqs 32
      --tensor-parallel-size 8
      --max-num-batched-tokens 32768
      --block-size 16
      --swap-space 4
      --enforce-eager
    environment:
      - VLLM_ATTENTION_BACKEND=xformers
      - NVIDIA_VISIBLE_DEVICES=all
      - NCCL_P2P_LEVEL=NVL
      - NCCL_IB_DISABLE=1
    volumes:
      - ~/.cache:/root/.cache
      - /root/models:/opt/tritonserver/models
    runtime: nvidia
    network_mode: host
    ipc: host
    restart: "no"
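Why --max-model-len and --max-num-seqs matter here: every cached token costs KV memory in every layer. A back-of-envelope estimate, assuming the published Qwen3-32B shape (64 layers, 8 grouped KV heads of head_dim 128) and an fp16 KV cache; treat these as rough figures:

```python
# Assumed Qwen3-32B shape; fp16 KV cache (2 bytes per element).
layers, kv_heads, head_dim, dtype_bytes = 64, 8, 128, 2

# Factor of 2 for storing both K and V.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(kv_bytes_per_token)  # -> 262144 bytes, i.e. 256 KiB per token

# Worst case for this config: 32 sequences each at the full 32768-token context.
max_model_len, max_num_seqs = 32768, 32
worst_case_gib = max_num_seqs * max_model_len * kv_bytes_per_token / 2**30
print(round(worst_case_gib, 1))  # -> 256.0 GiB across the TP group
```

That worst case would be 32 GiB per GPU at TP=8, more than a 5090's 32 GB, which is why vLLM instead pre-allocates a fixed KV pool capped by --gpu-memory-utilization and why actual "GPU KV cache usage" in the serving logs stays in the single digits.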
evalscope perf \
--parallel 1 10 50 100 200 \
--number 10 20 100 200 400 \
--url "http://127.0.0.1:8005/v1/completions" \
--model /opt/tritonserver/models/Qwen3-32B-AWQ \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--max-tokens 2048 \
--min-tokens 2048 \
--min-prompt-length 2048 \
--max-prompt-length 2048 \
--api openai \
--dataset speed_benchmark
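A note on the sweep syntax: judging from evalscope's per-run output directories, the --parallel and --number lists are paired positionally, one benchmark run per pair:

```python
# The values passed to --parallel and --number above.
parallel = [1, 10, 50, 100, 200]
number = [10, 20, 100, 200, 400]

# Each run i uses parallel[i] concurrent clients for number[i] total requests.
runs = list(zip(parallel, number))
print(runs)  # -> [(1, 10), (10, 20), (50, 100), (100, 200), (200, 400)]
```

So the summary printed below for "parallel_200_number_400" is the last and heaviest of the five runs.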
INFO 07-01 07:07:07 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:12 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 722.1 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.3%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:12 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:17 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 722.0 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.4%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:17 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:22 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 718.4 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.5%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:22 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:27 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 719.7 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.6%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:27 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:32 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 718.1 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.7%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:32 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:37 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 717.7 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.8%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:37 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:42 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 716.2 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 4.9%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:42 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:47 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 714.7 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 5.0%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:47 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
INFO 07-01 07:07:52 [metrics.py:489] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 715.0 tokens/s, Running: 32 reqs, Swapped: 0 reqs, Pending: 16 reqs, GPU KV cache usage: 5.1%, CPU KV cache usage: 0.0%.
INFO 07-01 07:07:52 [metrics.py:505] Prefix cache hit rate: GPU: 99.66%, CPU: 0.00%
2025-07-01 07:10:01,050 - evalscope - INFO -
Benchmarking summary:
+-----------------------------------+------------+
| Key | Value |
+===================================+============+
| Time taken for tests (s) | 1191.02 |
+-----------------------------------+------------+
| Number of concurrency | 200 |
+-----------------------------------+------------+
| Total requests | 400 |
+-----------------------------------+------------+
| Succeed requests | 400 |
+-----------------------------------+------------+
| Failed requests | 0 |
+-----------------------------------+------------+
| Output token throughput (tok/s) | 687.816 |
+-----------------------------------+------------+
| Total token throughput (tok/s) | 4986.75 |
+-----------------------------------+------------+
| Request throughput (req/s) | 0.3358 |
+-----------------------------------+------------+
| Average latency (s) | 453.865 |
+-----------------------------------+------------+
| Average time to first token (s) | 362.35 |
+-----------------------------------+------------+
| Average time per output token (s) | 0.0447 |
+-----------------------------------+------------+
| Average input tokens per request | 12800.2 |
+-----------------------------------+------------+
| Average output tokens per request | 2048 |
+-----------------------------------+------------+
| Average package latency (s) | 0.0447 |
+-----------------------------------+------------+
| Average package per request | 2047 |
+-----------------------------------+------------+
2025-07-01 07:10:01,431 - evalscope - INFO -
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 92.0465 | 0.0431 | 0.0435 | 181.9807 | 1 | 2048 | 3.1836 | 3.7153 |
| 25% | 271.3524 | 0.0438 | 0.0439 | 360.5895 | 6144 | 2048 | 3.6678 | 14.6708 |
| 50% | 459.0579 | 0.0444 | 0.0447 | 551.4757 | 14336 | 2048 | 3.7184 | 29.6172 |
| 66% | 461.576 | 0.0447 | 0.0447 | 553.1902 | 14336 | 2048 | 4.5274 | 50.3719 |
| 75% | 466.7801 | 0.045 | 0.0449 | 558.3789 | 30720 | 2048 | 5.6796 | 51.1859 |
| 80% | 466.7892 | 0.0451 | 0.0449 | 558.7151 | 30720 | 2048 | 7.5505 | 59.4955 |
| 90% | 552.2437 | 0.0465 | 0.0451 | 643.2943 | 30720 | 2048 | 11.254 | 72.4224 |
| 95% | 556.1562 | 0.0477 | 0.0478 | 647.7583 | 30720 | 2048 | 22.2845 | 90.0249 |
| 98% | 558.924 | 0.0484 | 0.0478 | 650.5183 | 30720 | 2048 | 22.2869 | 178.27 |
| 99% | 558.9281 | 0.0494 | 0.0478 | 650.5304 | 30720 | 2048 | 22.2874 | 356.5268 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-07-01 07:10:01,487 - evalscope - INFO -
Speed Benchmark Results:
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
| 1 | 7.04 | 0.0 |
| 6144 | 7.0 | 0.0 |
| 14336 | 6.69 | 0.0 |
| 30720 | 4.27 | 0.0 |
+---------------+-----------------+----------------+
2025-07-01 07:10:01,487 - evalscope - INFO - Save the summary to: outputs/20250701_061700/Qwen3-32B-AWQ/parallel_200_number_400
╭──────────────────────────────────────────────────────────╮
│ Performance Test Summary Report │
╰──────────────────────────────────────────────────────────╯
Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model │ Qwen3-32B-AWQ │
│ Total Generated │ 1,495,040.0 tokens │
│ Total Test Time │ 3152.19 seconds │
│ Avg Output Rate │ 474.29 tokens/sec │
└───────────────────────┴──────────────────────────────────┘
Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ ┃ ┃ Avg ┃ P99 ┃ Gen. ┃ Avg ┃ P99 ┃ Avg ┃ P99 ┃ Success┃
┃Conc. ┃ RPS ┃ Lat.(s) ┃ Lat.(s) ┃ toks/s ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(s) ┃ Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 1 │ 0.01 │ 80.112 │ 81.993 │ 25.56 │ 0.546 │ 2.413 │ 0.039 │ 0.039 │ 100.0%│
│ 10 │ 0.12 │ 83.782 │ 84.111 │ 244.42 │ 0.239 │ 0.357 │ 0.041 │ 0.041 │ 100.0%│
│ 50 │ 0.28 │ 127.826 │ 183.693 │ 572.93 │ 36.854 │ 92.083 │ 0.044 │ 0.045 │ 100.0%│
│ 100 │ 0.32 │ 238.166 │ 369.664 │ 645.05 │ 146.954 │ 278.091 │ 0.045 │ 0.045 │ 100.0%│
│ 200 │ 0.34 │ 453.865 │ 650.530 │ 687.82 │ 362.350 │ 558.928 │ 0.045 │ 0.048 │ 100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
Best Performance Configuration
Highest RPS Concurrency 200 (0.34 req/sec)
Lowest Latency Concurrency 1 (80.112 seconds)
Performance Recommendations:
• The system seems not to have reached its performance bottleneck, try higher concurrency
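Dividing aggregate generation throughput by concurrency (numbers taken from the table above) shows where the capacity goes:

```python
# Concurrency -> aggregate "Gen. toks/s" from the detailed metrics table.
runs = {1: 25.56, 10: 244.42, 50: 572.93, 100: 645.05, 200: 687.82}

for conc, toks in runs.items():
    # Per-request decode speed seen by an individual client.
    print(conc, round(toks / conc, 2))
```

Per-request speed falls from about 25.6 tok/s at concurrency 1 to about 3.4 tok/s at 200, while aggregate throughput roughly plateaus past 50 concurrent requests (573 to 688 tok/s) and average TTFT balloons from ~37 s to ~362 s: added load is mostly absorbed as queueing delay, not extra decode speed.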
Stress test results
max-model-len=32768
+ max-num-seqs=32
Key findings