#14692 Eval bug: CUDA error: operation not supported
Issue Details
Name and Version
version: 1 (cbc68be) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Built from source: https://github.com/ggml-org/llama.cpp/tree/cbc68be51d88b1d5531643b926a4b359c3cff131
Operating systems
Linux
GGML backends
CUDA
Hardware
INTEL(R) XEON(R) PLATINUM 8580 + NVIDIA 8x H200
Models
https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf
Problem description & steps to reproduce
I was trying to use llama.cpp (built from master) to serve the DeepSeek V3 0324 GGUF UD-Q2_K_XL quant:
./llama-server -m /mnt/models/deepseek-v3-0324-guff-UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
    --port 8000 \
    --threads 24 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --tensor-split 1,1,1,1,1,1,1,1 \
    --flash-attn \
    --batch-size 128 \
    --no-warmup
The server started successfully, but it reported a CUDA error as soon as a request came in.
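In case it helps with triage: the failing call in the log below is cudaMemcpyPeerAsync, so my guess (an assumption on my side, not something I have verified) is that peer-to-peer access between some of the GPUs is not usable in this environment. A minimal CUDA sketch, independent of llama.cpp, to check which device pairs report peer access:

    // peer_check.cu - print whether CUDA peer access is reported for every device pair.
    // Build (assuming a standard CUDA toolchain): nvcc -o peer_check peer_check.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaError_t err = cudaGetDeviceCount(&n);
        if (err != cudaSuccess) {
            printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("found %d CUDA devices\n", n);
        for (int src = 0; src < n; ++src) {
            for (int dst = 0; dst < n; ++dst) {
                if (src == dst) continue;
                int can = 0;
                cudaDeviceCanAccessPeer(&can, src, dst);
                printf("peer access %d -> %d: %s\n", src, dst, can ? "yes" : "NO");
            }
        }
        return 0;
    }

(nvidia-smi topo -m shows similar information, but the C check above matches the failing call path more closely.)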
Another problem: when I set a large context (--ctx-size 16384), the server is usually killed because of a RAM OOM. I found that llama-server was using almost 1 TB of RAM.
Thank you!
First Bad Commit
No response
Relevant log output
main: server is listening on http://127.0.0.1:8000 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 9, n_tokens = 9
/vllm-workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:79: CUDA error
CUDA error: operation not supported
  current device: 0, in function ggml_backend_cuda_cpy_tensor_async at /vllm-workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2557
  cudaMemcpyPeerAsync(dst->data, cuda_ctx_dst->device, src->data, cuda_ctx_src->device, ggml_nbytes(dst), cuda_ctx_src->stream())
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(+0x16ceb)[0x7fb11dec1ceb]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21f)[0x7fb11dec214f]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x152)[0x7fb11dec2322]
/vllm-workspace/llama.cpp/build/bin/libggml-cuda.so(+0xb6976)[0x7fb11bcbd976]
/vllm-workspace/llama.cpp/build/bin/libggml-cuda.so(+0xbd822)[0x7fb11bcc4822]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x2c9)[0x7fb11deda609]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x99)[0x7fb11dfe6f99]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x103)[0x7fb11dfe7253]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x338)[0x7fb11dfebb58]
/vllm-workspace/llama.cpp/build/bin/libllama.so(llama_decode+0x10)[0x7fb11dfecd30]
./llama-server(+0xdaf61)[0x5581ec541f61]
./llama-server(+0x8402d)[0x5581ec4eb02d]
./llama-server(+0x4c7d5)[0x5581ec4b37d5]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fb11d978d90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fb11d978e40]
./llama-server(+0x4e225)[0x5581ec4b5225]
Aborted (core dumped)
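For completeness, a minimal standalone sketch of the same call pattern (my own code, not llama.cpp code; device ids 0/1 and the 1 MiB buffer size are arbitrary). If the peer copy itself is what is unsupported on this setup, I would expect this to fail with the same "operation not supported" error:

    // peer_copy_repro.cu - issue an async device-to-device copy with cudaMemcpyPeerAsync,
    // the call that aborts in the log above (ggml-cuda.cu:2557), outside of llama.cpp.
    // Build (assuming a standard CUDA toolchain): nvcc -o peer_copy_repro peer_copy_repro.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    #define CHECK(call) do { \
        cudaError_t e = (call); \
        if (e != cudaSuccess) { \
            printf("%s failed: %s\n", #call, cudaGetErrorString(e)); \
            return 1; \
        } \
    } while (0)

    int main() {
        const size_t nbytes = 1 << 20;  // 1 MiB test buffer, arbitrary size
        void *src = nullptr, *dst = nullptr;

        CHECK(cudaSetDevice(0));
        CHECK(cudaMalloc(&src, nbytes));
        CHECK(cudaSetDevice(1));
        CHECK(cudaMalloc(&dst, nbytes));

        // same API call as in the backtrace, on the default stream instead of the backend stream
        CHECK(cudaSetDevice(0));
        CHECK(cudaMemcpyPeerAsync(dst, /*dstDevice=*/1, src, /*srcDevice=*/0, nbytes, 0));
        CHECK(cudaDeviceSynchronize());

        printf("peer copy 0 -> 1 succeeded\n");
        return 0;
    }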