Misc. bug: llama-server embedding endpoint returns vectors with just null values after a while

#14812

Issue Details

No assignee
bug-unconfirmed
ngladitz
opened 22 days ago
Author

Name and Version

/opt/homebrew/bin/llama-server --version version: 5920 (d9b69108) built with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0

Operating systems

Mac

Which llama.cpp modules do you know to be affected?

llama-server

Command line

/opt/homebrew/bin/llama-server -m Qwen3-Embedding-8B-Q4_K_M.gguf --alias Qwen3-embedding --embedding --pooling last -ub 8192 --verbose-prompt --offline -c 40960 --no-mmap --mlock --port 9008

Problem description & steps to reproduce

I successfully generate working embeddings via the server for a while (it works for hours, sometimes days; roughly one embedding is requested per minute), but eventually the embedding vectors start coming back with only null elements. I see no errors or other indicators in the log output when this happens, and I need to restart the server to recover.

When the server is in the error state (I omitted the repetitive middle of the vector in the response):

% curl -X POST http://localhost:9008/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"input": "test"}'
{"model":"gpt-3.5-turbo","object":"list","usage":{"prompt_tokens":2,"total_tokens":2},"data":[{"embedding":[null, ... ,null],"index":0,"object":"embedding"}]}
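One possible explanation (an assumption, not confirmed against the server code): JSON has no NaN literal, and common C++ JSON serializers emit null in its place, so the null entries may actually be NaN floats produced by the model. A minimal Python sketch of that serialization behavior:

```python
import json
import math

# JSON cannot represent NaN; serializers typically substitute null.
# Hypothesis: the server's embedding vector is full of NaNs, which
# arrive at the client as null elements.
vec = [0.1, float("nan"), 0.2]
safe = [None if math.isnan(x) else x for x in vec]
print(json.dumps(safe))  # → [0.1, null, 0.2]
```

If that hypothesis holds, the bug would be upstream of serialization, somewhere in the computation of the embedding itself.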

Repeating the query after process restart:

% curl -X POST http://localhost:9008/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"input": "test"}'
{"model":"gpt-3.5-turbo","object":"list","usage":{"prompt_tokens":2,"total_tokens":2},"data":[{"embedding":[0.027558811008930206, ..., ,0.021016428247094154],"index":0,"object":"embedding"}]}

I am currently unsure how to reproduce or reduce this, or how to come up with a usable test case.
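Lacking a known trigger, a soak test might at least narrow down when the failure starts. The sketch below is hypothetical (the helper names and the one-request-per-minute cadence are assumptions; the port comes from the command line above): it polls the endpoint and reports how many requests succeeded before null elements appeared.

```python
import json
import time
import urllib.request

ENDPOINT = "http://localhost:9008/v1/embeddings"  # port from the command line above


def null_positions(response: dict) -> list:
    """Return indices of null elements in the first embedding vector."""
    vec = response["data"][0]["embedding"]
    return [i for i, x in enumerate(vec) if x is None]


def soak(interval_s: float = 60.0) -> None:
    """Request one embedding per interval until a null vector appears."""
    count = 0
    while True:
        req = urllib.request.Request(
            ENDPOINT,
            data=json.dumps({"input": "test"}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        count += 1
        if null_positions(body):
            print(f"null elements appeared after {count} requests")
            break
        time.sleep(interval_s)


# To run against a live server: soak()
```

Logging the request count at failure time might reveal whether the breakage correlates with total requests served rather than wall-clock time.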

First Bad Commit

No response

Relevant log output