#14692 Eval bug: CUDA error: operation not supported

Issue Details

Opened by 0xshawn about 2 months ago
No assignee
Labels: bug-unconfirmed, stale

Name and Version

version: 1 (cbc68be) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Built from source: https://github.com/ggml-org/llama.cpp/tree/cbc68be51d88b1d5531643b926a4b359c3cff131

Operating systems

Linux

GGML backends

CUDA

Hardware

INTEL(R) XEON(R) PLATINUM 8580, NVIDIA 8x H200

Models

https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf

Problem description & steps to reproduce

I was trying to use llama.cpp (built from master) to serve DeepSeek-V3-0324 GGUF UD-Q2_K_XL:

./llama-server -m /mnt/models/deepseek-v3-0324-guff-UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
    --port 8000 \
    --threads 24 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --tensor-split 1,1,1,1,1,1,1,1 \
    --flash-attn \
    --batch-size 128 \
    --no-warmup

The server started successfully but reported a CUDA error ("operation not supported") as soon as a request came in.
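The backtrace in the log output below points at cudaMemcpyPeerAsync inside ggml_backend_cuda_cpy_tensor_async, so the failure seems to happen on a GPU-to-GPU copy. To check whether peer copies work between these GPUs at all, I would run something like the sketch below. This is my own standalone diagnostic, not part of llama.cpp; it probes cudaDeviceCanAccessPeer for every device pair and attempts a small cudaMemcpyPeerAsync in the same way the failing call does.

// p2p_check.cpp: my own standalone diagnostic, NOT part of llama.cpp.
// Probes peer-to-peer support between every GPU pair and attempts a small
// cudaMemcpyPeerAsync, the same call that aborts in
// ggml_backend_cuda_cpy_tensor_async.
// Build (assumed): nvcc -o p2p_check p2p_check.cpp  (or g++ with -lcudart)
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            printf("    %s -> %s\n", #call, cudaGetErrorString(err_));    \
        }                                                                 \
    } while (0)

int main() {
    int n = 0;
    CHECK(cudaGetDeviceCount(&n));

    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;

            int can = 0;
            CHECK(cudaDeviceCanAccessPeer(&can, src, dst));
            printf("GPU %d -> GPU %d: canAccessPeer = %d\n", src, dst, can);

            // Allocate 1 MiB on each device and try an async peer copy.
            void *src_buf = nullptr, *dst_buf = nullptr;
            CHECK(cudaSetDevice(src));
            CHECK(cudaMalloc(&src_buf, 1 << 20));
            CHECK(cudaSetDevice(dst));
            CHECK(cudaMalloc(&dst_buf, 1 << 20));

            CHECK(cudaSetDevice(src));
            CHECK(cudaMemcpyPeerAsync(dst_buf, dst, src_buf, src, 1 << 20, 0));
            CHECK(cudaDeviceSynchronize());

            CHECK(cudaFree(src_buf));
            CHECK(cudaSetDevice(dst));
            CHECK(cudaFree(dst_buf));
        }
    }
    return 0;
}

If this small repro also returns "operation not supported" for some pairs, the problem is likely in the peer-to-peer setup of my environment (the build runs inside a container, see the /vllm-workspace paths) rather than in the model or the server flags.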

Another problem: when I set a large context (--ctx-size 16384), the server is usually killed because the host runs out of RAM. I found that llama-server was using almost 1 TB of RAM.
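To see whether that ~1 TB is mostly file-backed pages from the mmap-ed GGUF (which the kernel can reclaim) or anonymous allocations such as KV cache and compute buffers, I would use a small diagnostic like the one below. This is my own sketch, assuming a Linux /proc filesystem; the pid argument and the field selection are mine, not anything from llama.cpp.

// rss_check.cpp: my own sketch, NOT part of llama.cpp. Assumes Linux /proc.
// Prints the memory breakdown of a running process so you can see whether
// the huge RSS is file-backed pages (the mmap-ed GGUF) or anonymous memory.
#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <llama-server pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        // VmSize: total virtual size, VmRSS: resident set,
        // RssFile: file-backed (mmap) pages, RssAnon: anonymous allocations.
        if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0 ||
            strncmp(line, "RssFile:", 8) == 0 || strncmp(line, "RssAnon:", 8) == 0) {
            printf("%s", line);
        }
    }
    fclose(f);
    return 0;
}

If RssFile dominates, the number mostly reflects the mmap-ed model file; if RssAnon dominates, the memory is coming from actual allocations.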

Thank you!

First Bad Commit

No response

Relevant log output

main: server is listening on http://127.0.0.1:8000 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 9, n_tokens = 9
/vllm-workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:79: CUDA error
CUDA error: operation not supported
  current device: 0, in function ggml_backend_cuda_cpy_tensor_async at /vllm-workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2557
  cudaMemcpyPeerAsync(dst->data, cuda_ctx_dst->device, src->data, cuda_ctx_src->device, ggml_nbytes(dst), cuda_ctx_src->stream())
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(+0x16ceb)[0x7fb11dec1ceb]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21f)[0x7fb11dec214f]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x152)[0x7fb11dec2322]
/vllm-workspace/llama.cpp/build/bin/libggml-cuda.so(+0xb6976)[0x7fb11bcbd976]
/vllm-workspace/llama.cpp/build/bin/libggml-cuda.so(+0xbd822)[0x7fb11bcc4822]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x2c9)[0x7fb11deda609]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x99)[0x7fb11dfe6f99]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x103)[0x7fb11dfe7253]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x338)[0x7fb11dfebb58]
/vllm-workspace/llama.cpp/build/bin/libllama.so(llama_decode+0x10)[0x7fb11dfecd30]
./llama-server(+0xdaf61)[0x5581ec541f61]
./llama-server(+0x8402d)[0x5581ec4eb02d]
./llama-server(+0x4c7d5)[0x5581ec4b37d5]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fb11d978d90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fb11d978e40]
./llama-server(+0x4e225)[0x5581ec4b5225]
Aborted (core dumped)