#14795 Eval bug: LLAMA_SET_ROWS=1 gibberish output with Dual GPU offload
Issue Details
Author
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: no
version: 5944 (36c15324)
built with MSVC 19.44.35208.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
Ryzen 5800X, RTX 3090 + Tesla P40
Models
magnum-v4-22b-Q6_K.gguf
TheSkullery_L3.3-Unnamed-Exp-70B-v0.8-IQ4_XS.gguf
Problem description & steps to reproduce
Running with the environment variable LLAMA_SET_ROWS=0 results in normal output. Setting it to 1 results in gibberish:
Helpful AI: How can I help?
User: Can you recite for me the intro to sesame street
Helpful AI: Sunnynynynyy dayyy day,, day!… AItSunIt''''ssIts unnme of time for for to to talk you play play S S S ThisSThis is is is is the is a the way street best w song sest way of way you a to explore to be come in a to play learn play
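For clarity, this is a minimal sketch of how the variable was toggled between runs in PowerShell (the model path matches the commands in the log output below; llama.cpp reads the flag from the environment at startup):

# normal output
$env:LLAMA_SET_ROWS = "0"
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768

# gibberish output
$env:LLAMA_SET_ROWS = "1"
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768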
However, if I restrict to just a single GPU (P40 or 3090), LLAMA_SET_ROWS=1 works with no issues.
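For anyone reproducing this: one way to restrict llama-server to a single device (an assumption on my part about how the single-GPU runs were done) is the standard CUDA_VISIBLE_DEVICES variable, using the device indices reported by ggml_cuda_init above:

# 3090 only
$env:CUDA_VISIBLE_DEVICES = "0"
# or P40 only
$env:CUDA_VISIBLE_DEVICES = "1"
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -fa -c 16368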
First Bad Commit
Haven't tested earlier versions.
Relevant log output
With LLAMA_SET_ROWS=1 (gibberish):
.\llama-server.exe -m D:\text-generation-webui\models\TheSkullery_L3.3-Unnamed-Exp-70B-v0.8-IQ4_XS.gguf -ngl 99 -ts 40/43 -fa -c 32768
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -ts 40/43 -fa -c 32768

Sane (I can't fit 32k ctx on one GPU):
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 99 -fa -c 16368
.\llama-server.exe -m D:\text-generation-webui\models\magnum-v4-22b-Q6_K.gguf -ngl 53 -fa -c 32768 (CPU has 4 layers offloaded)