#14795 Eval bug: LLAMA_SET_ROWS=1 gibberish output with Dual GPU offload
Issue Details
Author
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: no
version: 5944 (36c15324)
built with MSVC 19.44.35208.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
Ryzen 5800X, RTX 3090 + Tesla P40
Models
magnum-v4-22b-Q6_K.gguf
TheSkullery_L3.3-Unnamed-Exp-70B-v0.8-IQ4_XS.gguf
Problem description & steps to reproduce
Running with the environment variable LLAMA_SET_ROWS=0 results in normal output. Setting it to 1 results in gibberish:
Helpful AI: How can I help?
User: Can you recite for me the intro to sesame street
Helpful AI: Sunnynynynyy dayyy day,, day!… AItSunIt''''ssIts unnme of time for for to to talk you play play S S S ThisSThis is is is is the is a the way street best w song sest way of way you a to explore to be come in a to play learn play
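For clarity, this is a minimal sketch of how the variable was toggled between runs in PowerShell (the model path matches the commands in the log output below; llama.cpp reads the flag from the environment at startup):

# normal output
$env:LLAMA_SET_ROWS = "0"
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768

# gibberish output
$env:LLAMA_SET_ROWS = "1"
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768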
However, if I restrict to just a single GPU (P40 or 3090), LLAMA_SET_ROWS=1 works with no issues.
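For anyone reproducing this: one way to restrict llama-server to a single device (an assumption on my part about how the single-GPU runs were done) is the standard CUDA_VISIBLE_DEVICES variable, using the device indices reported by ggml_cuda_init above:

# 3090 only
$env:CUDA_VISIBLE_DEVICES = "0"
# or P40 only
$env:CUDA_VISIBLE_DEVICES = "1"
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -fa -c 16368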
First Bad Commit
Haven't tested earlier versions.
Relevant log output
With LLAMA_SET_ROWS=1 (gibberish):
.\llama-server.exe -m D:\text-generation-webui\models\TheSkullery_L3.3-Unnamed-Exp-70B-v0.8-IQ4_XS.gguf -ngl 99 -ts 40/43 -fa -c 32768
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768

Sane (I can't fit 32k ctx on one GPU):
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -fa -c 16368
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 53 -fa -c 32768 (CPU has 4 layers offloaded)