#14566 Misc. bug: OpenAI HTTP interface returns "HTTP-200" with error details in streamed chunk
Issue Details
Name and Version
```
llama-server --version
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
version: 5630 (4c763c8d)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
```
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
```sh
/app/llama-server --port 3080 -m /data/model/gemma-3-1b-it-q4_0.gguf
```
Problem description & steps to reproduce
When using the OpenAI-compatible HTTP interface with streaming enabled, the server returns HTTP/1.1 200 OK even for invalid inputs, and delivers the error as part of the streamed response instead. This makes automating clients somewhat tricky and, more importantly, does not match how other inference servers behave.
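Because the status line already says 200 OK by the time the error is produced, a client cannot rely on the status code alone; it has to sniff every streamed SSE line for the in-band error object. A minimal sketch of what this forces clients to do (assuming the third-party requests package; stream_chat is a hypothetical helper, and the error:/data: line shapes are taken from the curl output below):

```python
import json
import requests  # third-party: pip install requests

def stream_chat(url: str, payload: dict):
    """Yield parsed `data:` chunks; raise if the server smuggles an error into the stream."""
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()  # never triggers here: the status is 200 even on failure
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw:
                continue  # skip SSE blank separator lines
            if raw.startswith("error: "):
                # llama-server puts the error object into the stream body instead
                # of failing the request, so it must be parsed out by hand
                err = json.loads(raw[len("error: "):])
                raise RuntimeError(f"server error {err['code']}: {err['message']}")
            if raw.startswith("data: ") and raw != "data: [DONE]":
                yield json.loads(raw[len("data: "):])
```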
A simple way to reproduce this is to load a model that has a limited context window (e.g., gemma):
```sh
/app/llama-server --port 3080 -m /data/model/gemma-3-1b-it-q4_0.gguf
```
then generate a rather large context using a chat completion with stream=True:
```python
import json

d = {
    "temperature": 0.0,
    "n": 1,
    "stop": ["End"],
    "stream": True,
    "model": "gemma",
    "messages": [
        {
            "role": "user",
            "content": "The quick brown fox shows that llama.cpp's OpenAI interface does something weird. \n" * 10000,
        }
    ],
}

# write where the curl command below expects it
with open("/tmp/data.json", encoding="ascii", mode="w") as fp:
    json.dump(d, fp=fp)
```
and then send the resulting body to the server:
```sh
curl --location 'http://127.0.0.1:3080/chat/completions' \
     --header 'Content-Type: application/json' \
     --header 'Accept: application/json' \
     --data @/tmp/data.json -v
```
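The two steps can also be collapsed into one short script. This is only a convenience sketch (assuming the third-party requests package), equivalent to the repro above, and it makes the 200 status visible before the error chunk ever arrives:

```python
import requests  # third-party: pip install requests

resp = requests.post(
    "http://127.0.0.1:3080/chat/completions",
    json={
        "temperature": 0.0,
        "n": 1,
        "stop": ["End"],
        "stream": True,
        "model": "gemma",
        "messages": [
            {
                "role": "user",
                "content": "The quick brown fox shows that llama.cpp's OpenAI interface does something weird. \n" * 10000,
            }
        ],
    },
    stream=True,
)
print(resp.status_code)  # prints 200, even though the request is invalid
for line in resp.iter_lines(decode_unicode=True):
    if line:
        print(line)  # eventually prints the error: {...} chunk, then data: [DONE]
```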
Expected behavior: the server returns HTTP-400, saying that the request exceeds the available context size.

Actual behavior: the server returns HTTP-200, but then streams the error as an in-band chunk:
```
curl ...
[...]
< HTTP/1.1 200 OK
< Keep-Alive: timeout=5, max=100
< Content-Type: text/event-stream
< Server: llama.cpp
< Transfer-Encoding: chunked
< Access-Control-Allow-Origin:
<
error: {"code":400,"message":"the request exceeds the available context size. try increasing the context size or enable context shift","type":"invalid_request_error"}

data: [DONE]
```
First Bad Commit
No response
Relevant log output
```
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 201 | processing task
slot update_slots: id 0 | task 201 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 190008
slot release: id 0 | task 201 | stop processing: n_past = 0, truncated = 0
srv send_error: task id = 201, error: the request exceeds the available context size. try increasing the context size or enable context shift
srv update_slots: no tokens to decode
srv update_slots: all slots are idle
srv cancel_tasks: cancel task, id_task = 201
srv log_server_r: request: POST /chat/completions 127.0.0.1 200
srv update_slots: all slots are idle
```