#14692 Eval bug: CUDA error: operation not supported

Issue Details

Opened by 0xshawn about 2 months ago
No assignee
Labels: bug-unconfirmed, stale

Name and Version

version: 1 (cbc68be) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Built from source: https://github.com/ggml-org/llama.cpp/tree/cbc68be51d88b1d5531643b926a4b359c3cff131

Operating systems

Linux

GGML backends

CUDA

Hardware

INTEL(R) XEON(R) PLATINUM 8580, NVIDIA 8x H200

Models

https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf

Problem description & steps to reproduce

I was trying to use llama.cpp (built from master) to serve DeepSeek-V3-0324 GGUF UD-Q2_K_XL:

./llama-server -m /mnt/models/deepseek-v3-0324-guff-UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
    --port 8000 \
    --threads 24 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --tensor-split 1,1,1,1,1,1,1,1 \
    --flash-attn \
    --batch-size 128 \
    --no-warmup

The server started successfully but reported a CUDA error ("operation not supported") as soon as a request came in.
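The backtrace in the log output below points at cudaMemcpyPeerAsync inside ggml_backend_cuda_cpy_tensor_async, so the failure seems to happen on a GPU-to-GPU copy. To check whether peer copies work between these GPUs at all, I would run something like the sketch below. This is my own standalone diagnostic, not part of llama.cpp; it probes cudaDeviceCanAccessPeer for every device pair and attempts a small cudaMemcpyPeerAsync in the same way the failing call does.

// p2p_check.cpp: my own standalone diagnostic, NOT part of llama.cpp.
// Probes peer-to-peer support between every GPU pair and attempts a small
// cudaMemcpyPeerAsync, the same call that aborts in
// ggml_backend_cuda_cpy_tensor_async.
// Build (assumed): nvcc -o p2p_check p2p_check.cpp  (or g++ with -lcudart)
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            printf("    %s -> %s\n", #call, cudaGetErrorString(err_));    \
        }                                                                 \
    } while (0)

int main() {
    int n = 0;
    CHECK(cudaGetDeviceCount(&n));

    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;

            int can = 0;
            CHECK(cudaDeviceCanAccessPeer(&can, src, dst));
            printf("GPU %d -> GPU %d: canAccessPeer = %d\n", src, dst, can);

            // Allocate 1 MiB on each device and try an async peer copy.
            void *src_buf = nullptr, *dst_buf = nullptr;
            CHECK(cudaSetDevice(src));
            CHECK(cudaMalloc(&src_buf, 1 << 20));
            CHECK(cudaSetDevice(dst));
            CHECK(cudaMalloc(&dst_buf, 1 << 20));

            CHECK(cudaSetDevice(src));
            CHECK(cudaMemcpyPeerAsync(dst_buf, dst, src_buf, src, 1 << 20, 0));
            CHECK(cudaDeviceSynchronize());

            CHECK(cudaFree(src_buf));
            CHECK(cudaSetDevice(dst));
            CHECK(cudaFree(dst_buf));
        }
    }
    return 0;
}

If this small repro also returns "operation not supported" for some pairs, the problem is likely in the peer-to-peer setup of my environment (the build runs inside a container, see the /vllm-workspace paths) rather than in the model or the server flags.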

Another problem: when I set a large context (--ctx-size 16384), the server is usually killed because the host runs out of RAM. I found that llama-server was using almost 1 TB of RAM.
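To see whether that ~1 TB is mostly file-backed pages from the mmap-ed GGUF (which the kernel can reclaim) or anonymous allocations such as KV cache and compute buffers, I would use a small diagnostic like the one below. This is my own sketch, assuming a Linux /proc filesystem; the pid argument and the field selection are mine, not anything from llama.cpp.

// rss_check.cpp: my own sketch, NOT part of llama.cpp. Assumes Linux /proc.
// Prints the memory breakdown of a running process so you can see whether
// the huge RSS is file-backed pages (the mmap-ed GGUF) or anonymous memory.
#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <llama-server pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        // VmSize: total virtual size, VmRSS: resident set,
        // RssFile: file-backed (mmap) pages, RssAnon: anonymous allocations.
        if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0 ||
            strncmp(line, "RssFile:", 8) == 0 || strncmp(line, "RssAnon:", 8) == 0) {
            printf("%s", line);
        }
    }
    fclose(f);
    return 0;
}

If RssFile dominates, the number mostly reflects the mmap-ed model file; if RssAnon dominates, the memory is coming from actual allocations.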

Thank you!

First Bad Commit

No response

Relevant log output

main: server is listening on http://127.0.0.1:8000 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 16384, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 9, n_tokens = 9
/vllm-workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:79: CUDA error
CUDA error: operation not supported
  current device: 0, in function ggml_backend_cuda_cpy_tensor_async at /vllm-workspace/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2557
  cudaMemcpyPeerAsync(dst->data, cuda_ctx_dst->device, src->data, cuda_ctx_src->device, ggml_nbytes(dst), cuda_ctx_src->stream())
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(+0x16ceb)[0x7fb11dec1ceb]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21f)[0x7fb11dec214f]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x152)[0x7fb11dec2322]
/vllm-workspace/llama.cpp/build/bin/libggml-cuda.so(+0xb6976)[0x7fb11bcbd976]
/vllm-workspace/llama.cpp/build/bin/libggml-cuda.so(+0xbd822)[0x7fb11bcc4822]
/vllm-workspace/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x2c9)[0x7fb11deda609]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0x99)[0x7fb11dfe6f99]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x103)[0x7fb11dfe7253]
/vllm-workspace/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x338)[0x7fb11dfebb58]
/vllm-workspace/llama.cpp/build/bin/libllama.so(llama_decode+0x10)[0x7fb11dfecd30]
./llama-server(+0xdaf61)[0x5581ec541f61]
./llama-server(+0x8402d)[0x5581ec4eb02d]
./llama-server(+0x4c7d5)[0x5581ec4b37d5]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fb11d978d90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fb11d978e40]
./llama-server(+0x4e225)[0x5581ec4b5225]
Aborted (core dumped)