Misc. bug: llama-server embedding endpoint returns vectors with just null values after a while

#14812

Issue Details

No assignee
bug-unconfirmed
ngladitz
opened 22 days ago
Author

Name and Version

/opt/homebrew/bin/llama-server --version version: 5920 (d9b69108) built with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0

Operating systems

Mac

Which llama.cpp modules do you know to be affected?

llama-server

Command line

/opt/homebrew/bin/llama-server -m Qwen3-Embedding-8B-Q4_K_M.gguf --alias Qwen3-embedding --embedding --pooling last -ub 8192 --verbose-prompt --offline -c 40960 --no-mmap --mlock --port 9008

Problem description & steps to reproduce

I successfully generate working embeddings via the server for a while (it works for hours, sometimes days; roughly one embedding is requested per minute), but eventually the embedding vectors start coming back with only null elements. I see no errors or other indicators in the log output when this happens, and I need to restart the server to recover.

When the server is in the error state (I omitted the repetitive middle of the vector in the response):

% curl -X POST http://localhost:9008/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"input": "test"}'
{"model":"gpt-3.5-turbo","object":"list","usage":{"prompt_tokens":2,"total_tokens":2},"data":[{"embedding":[null, ... ,null],"index":0,"object":"embedding"}]}
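One possible explanation (an assumption, not confirmed against the server code): JSON has no NaN literal, and common C++ JSON serializers emit null in its place, so the null entries may actually be NaN floats produced by the model. A minimal Python sketch of that serialization behavior:

```python
import json
import math

# JSON cannot represent NaN; serializers typically substitute null.
# Hypothesis: the server's embedding vector is full of NaNs, which
# arrive at the client as null elements.
vec = [0.1, float("nan"), 0.2]
safe = [None if math.isnan(x) else x for x in vec]
print(json.dumps(safe))  # → [0.1, null, 0.2]
```

If that hypothesis holds, the bug would be upstream of serialization, somewhere in the computation of the embedding itself.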

Repeating the query after process restart:

% curl -X POST http://localhost:9008/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"input": "test"}'
{"model":"gpt-3.5-turbo","object":"list","usage":{"prompt_tokens":2,"total_tokens":2},"data":[{"embedding":[0.027558811008930206, ..., ,0.021016428247094154],"index":0,"object":"embedding"}]}

I am currently unsure how to reproduce or reduce this, or how to come up with a usable test case.
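Lacking a known trigger, a soak test might at least narrow down when the failure starts. The sketch below is hypothetical (the helper names and the one-request-per-minute cadence are assumptions; the port comes from the command line above): it polls the endpoint and reports how many requests succeeded before null elements appeared.

```python
import json
import time
import urllib.request

ENDPOINT = "http://localhost:9008/v1/embeddings"  # port from the command line above


def null_positions(response: dict) -> list:
    """Return indices of null elements in the first embedding vector."""
    vec = response["data"][0]["embedding"]
    return [i for i, x in enumerate(vec) if x is None]


def soak(interval_s: float = 60.0) -> None:
    """Request one embedding per interval until a null vector appears."""
    count = 0
    while True:
        req = urllib.request.Request(
            ENDPOINT,
            data=json.dumps({"input": "test"}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        count += 1
        if null_positions(body):
            print(f"null elements appeared after {count} requests")
            break
        time.sleep(interval_s)


# To run against a live server: soak()
```

Logging the request count at failure time might reveal whether the breakage correlates with total requests served rather than wall-clock time.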

First Bad Commit

No response

Relevant log output