#14566 Misc. bug: OpenAI HTTP interface returns "HTTP-200" with error details in streamed chunk
Issue Details
Name and Version
```
llama-server --version
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
version: 5630 (4c763c8d)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
```
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
```sh
/app/llama-server --port 3080 -m /data/model/gemma-3-1b-it-q4_0.gguf
```
Problem description & steps to reproduce
When using the OpenAI-compatible HTTP interface with streaming enabled, the server returns HTTP/1.1 200 OK even for invalid inputs, and delivers the error as part of the streamed response instead. This makes automating clients somewhat tricky and, more importantly, does not match how other inference servers behave.
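Because the status line already says 200 OK by the time the error is produced, a client cannot rely on the status code alone; it has to sniff every streamed SSE line for the in-band error object. A minimal sketch of what this forces clients to do (assuming the third-party requests package; stream_chat is a hypothetical helper, and the error:/data: line shapes are taken from the curl output below):

```python
import json
import requests  # third-party: pip install requests

def stream_chat(url: str, payload: dict):
    """Yield parsed `data:` chunks; raise if the server smuggles an error into the stream."""
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()  # never triggers here: the status is 200 even on failure
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw:
                continue  # skip SSE blank separator lines
            if raw.startswith("error: "):
                # llama-server puts the error object into the stream body instead
                # of failing the request, so it must be parsed out by hand
                err = json.loads(raw[len("error: "):])
                raise RuntimeError(f"server error {err['code']}: {err['message']}")
            if raw.startswith("data: ") and raw != "data: [DONE]":
                yield json.loads(raw[len("data: "):])
```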
A simple way to reproduce this is to load a model that has a limited context window (e.g., gemma):
```sh
/app/llama-server --port 3080 -m /data/model/gemma-3-1b-it-q4_0.gguf
```
then generate a rather large context using a chat completion with stream=True:
```python
import json

d = {
    "temperature": 0.0,
    "n": 1,
    "stop": ["End"],
    "stream": True,
    "model": "gemma",
    "messages": [
        {
            "role": "user",
            "content": "The quick brown fox shows that llama.cpp's OpenAI interface does something weird. \n" * 10000,
        }
    ],
}

# write where the curl command below expects it
with open("/tmp/data.json", encoding="ascii", mode="w") as fp:
    json.dump(d, fp=fp)
```
and then send the resulting body to the server:
```sh
curl --location 'http://127.0.0.1:3080/chat/completions' \
     --header 'Content-Type: application/json' \
     --header 'Accept: application/json' \
     --data @/tmp/data.json -v
```
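The two steps can also be collapsed into one short script. This is only a convenience sketch (assuming the third-party requests package), equivalent to the repro above, and it makes the 200 status visible before the error chunk ever arrives:

```python
import requests  # third-party: pip install requests

resp = requests.post(
    "http://127.0.0.1:3080/chat/completions",
    json={
        "temperature": 0.0,
        "n": 1,
        "stop": ["End"],
        "stream": True,
        "model": "gemma",
        "messages": [
            {
                "role": "user",
                "content": "The quick brown fox shows that llama.cpp's OpenAI interface does something weird. \n" * 10000,
            }
        ],
    },
    stream=True,
)
print(resp.status_code)  # prints 200, even though the request is invalid
for line in resp.iter_lines(decode_unicode=True):
    if line:
        print(line)  # eventually prints the error: {...} chunk, then data: [DONE]
```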
Expected behavior: the server returns HTTP-400, saying that the request exceeds the available context size.

Actual behavior: the server returns HTTP-200, but then streams the error as an in-band chunk:
```
curl ...
[...]
< HTTP/1.1 200 OK
< Keep-Alive: timeout=5, max=100
< Content-Type: text/event-stream
< Server: llama.cpp
< Transfer-Encoding: chunked
< Access-Control-Allow-Origin:
<
error: {"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}

data: [DONE]
```
First Bad Commit
No response
Relevant log output
```
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 201 | processing task
slot update_slots: id 0 | task 201 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 190008
slot release: id 0 | task 201 | stop processing: n_past = 0, truncated = 0
srv send_error: task id = 201, error: the request exceeds the available context size. try increasing the context size or enable context shift
srv update_slots: no tokens to decode
srv update_slots: all slots are idle
srv cancel_tasks: cancel task, id_task = 201
srv log_server_r: request: POST /chat/completions 127.0.0.1 200
srv update_slots: all slots are idle
```