#9521 [Bug] Speculative decoding runtime memory issues
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
```
[2025-08-22 21:41:04] Scheduler hit an exception: Traceback (most recent call last):
  File "/workspace/output/SpecForge/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2769, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/workspace/output/SpecForge/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/output/SpecForge/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 772, in event_loop_normal
    result = self.run_batch(batch)
  File "/workspace/output/SpecForge/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 1756, in run_batch
    ) = self.draft_worker.forward_batch_speculative_generation(batch)
  File "/workspace/output/SpecForge/.venv/lib/python3.12/site-packages/sglang/srt/speculative/eagle_worker.py", line 323, in forward_batch_speculative_generation
    spec_info = self.draft(batch)
  File "/workspace/output/SpecForge/.venv/lib/python3.12/site-packages/sglang/srt/speculative/eagle_worker.py", line 526, in draft
    score_list, token_list, parents_list = self.cuda_graph_runner.replay(
  File "/workspace/output/SpecForge/.venv/lib/python3.12/site-packages/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py", line 346, in replay
    self.graphs[bs].replay()
  File "/workspace/output/SpecForge/.venv/lib/python3.12/site-packages/torch/cuda/graphs.py", line 88, in replay
    super().replay()
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
EAGLE3 speculative decoding hits an illegal memory access at runtime, during CUDA graph replay in the draft step.
Reproduction
I trained a draft model for a Qwen3 MoE model with SpecForge, then served it with EAGLE3 for inference. To narrow down the issue, I used the most basic configuration (e.g., tp=1, bs=1):
```shell
CUDA_VISIBLE_DEVICES=1 uv run python -m sglang.launch_server \
    --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path ./cache/Qwen3-30B-A3B-Instruct-2507-Eagle3/epoch_19 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.75 \
    --cuda-graph-max-bs 1 \
    --tp 1 \
    --context-length 8192 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16
```
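Since the crash surfaces inside `CUDAGraph.replay()`, one way to localize it is to re-run the same launch with CUDA graphs disabled and synchronous kernel launches, so the faulting kernel is reported at its actual call site instead of at graph replay. This is a debugging sketch, not part of the original report: `--disable-cuda-graph` is sglang's flag for turning off graph capture, and `CUDA_LAUNCH_BLOCKING=1` is the standard CUDA runtime environment variable for synchronous launches.

```shell
# Debugging sketch (assumption, not from the report): same launch as above,
# but with CUDA graphs disabled and synchronous launches so the illegal
# access surfaces at the faulting kernel rather than at graph replay.
CUDA_VISIBLE_DEVICES=1 CUDA_LAUNCH_BLOCKING=1 uv run python -m sglang.launch_server \
    --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path ./cache/Qwen3-30B-A3B-Instruct-2507-Eagle3/epoch_19 \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --mem-fraction-static 0.75 \
    --disable-cuda-graph \
    --tp 1 \
    --context-length 8192 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000 \
    --dtype bfloat16
```

If the error no longer reproduces with `--disable-cuda-graph`, that would point at the EAGLE draft CUDA graph runner rather than the draft kernels themselves.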
The issue can be reproduced at the 16th question of MT-Bench (while running the SpecForge tests):
```shell
uv run python benchmarks/run_mtbench.py --parallel 1 --port 30000
# 20%|███████████▍ | 16/80
```
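Rather than running the full benchmark, it may be easier to replay just the failing request. Below is a minimal single-request sketch (an assumption, not from the report) using only the Python standard library against sglang's OpenAI-compatible endpoint; the prompt content is a placeholder, not the actual MT-Bench question 16 text.

```python
import json
import urllib.request

# Hypothetical single-request reproducer against the server launched above.
# The endpoint follows sglang's OpenAI-compatible API; the prompt is a
# placeholder, not the actual MT-Bench question 16 text.
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "<paste MT-Bench question 16 here>"}],
    "max_tokens": 512,
}
req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, sending the request should trigger the crash:
# resp = urllib.request.urlopen(req)
# print(resp.read().decode())
```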
Environment
```
Python: 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 550.127.08
PyTorch: 2.7.1+cu126
sglang: 0.4.9.post2
sgl_kernel: 0.2.5
flashinfer_python: 0.2.7.post1
triton: 3.3.1
transformers: 4.53.0
torchao: 0.9.0
numpy: 2.3.2
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.3
interegular: 0.3.3
modelscope: 1.28.1
orjson: 3.11.1
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.1
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.21
openai: 1.98.0
tiktoken: 0.9.0
anthropic: 0.60.0
litellm: 1.74.15.post1
decord: 0.6.0

NVIDIA Topology:
       GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X     NV18  NV18  NV18  NV18  NV18  NV18  NV18  PIX  PIX  PIX  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS    0-12          0              N/A
GPU1   NV18  X     NV18  NV18  NV18  NV18  NV18  NV18  SYS  SYS  SYS  PIX  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS    26-38         2              N/A
GPU2   NV18  NV18  X     NV18  NV18  NV18  NV18  NV18  SYS  SYS  SYS  SYS  PIX  SYS  SYS  SYS  SYS  SYS  SYS   SYS    39-51         3              N/A
GPU3   NV18  NV18  NV18  X     NV18  NV18  NV18  NV18  SYS  SYS  SYS  SYS  SYS  PIX  SYS  SYS  SYS  SYS  SYS   SYS    13-25         1              N/A
GPU4   NV18  NV18  NV18  NV18  X     NV18  NV18  NV18  SYS  SYS  SYS  SYS  SYS  SYS  PIX  PIX  PIX  SYS  SYS   SYS    52-64         4              N/A
GPU5   NV18  NV18  NV18  NV18  NV18  X     NV18  NV18  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  PIX  SYS   SYS    78-90         6              N/A
GPU6   NV18  NV18  NV18  NV18  NV18  NV18  X     NV18  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  PIX   SYS    91-103        7              N/A
GPU7   NV18  NV18  NV18  NV18  NV18  NV18  NV18  X     SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   PIX    65-77         5              N/A
NIC0   PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   X    PIX  PIX  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC1   PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX  X    PIX  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC2   PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX  PIX  X    SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC3   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS   SYS  SYS  SYS  X    SYS  SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC4   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS   SYS  SYS  SYS  SYS  X    SYS  SYS  SYS  SYS  SYS  SYS   SYS
NIC5   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS   SYS  SYS  SYS  SYS  SYS  X    SYS  SYS  SYS  SYS  SYS   SYS
NIC6   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS  SYS  SYS  SYS  SYS  SYS  X    PIX  PIX  SYS  SYS   SYS
NIC7   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS  SYS  SYS  SYS  SYS  SYS  PIX  X    PIX  SYS  SYS   SYS
NIC8   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS   SYS  SYS  SYS  SYS  SYS  SYS  PIX  PIX  X    SYS  SYS   SYS
NIC9   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  X    SYS   SYS
NIC10  SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  X     SYS
NIC11  SYS   SYS   SYS   SYS   SYS   SYS   SYS   PIX   SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS  SYS   X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0   NIC1: mlx5_1   NIC2: mlx5_2   NIC3: mlx5_3
  NIC4: mlx5_4   NIC5: mlx5_5   NIC6: mlx5_6   NIC7: mlx5_7
  NIC8: mlx5_8   NIC9: mlx5_9   NIC10: mlx5_10 NIC11: mlx5_11

ulimit soft: 1048576
```
SpecForge is at commit a42238765df239fbebf75bfa8ce110c71027f608.