#9726 [PD router] Why do we not abort decode request if prefill request failed?

sglang

sgl-project

SGLang is a fast serving framework for large language models and vision language models.

Issue Details

about 1 month ago

slin1237

router

datdo-msft

opened about 1 month ago

Hi,

I'm reading the pd_router.rs code. In execute_dual_dispatch_internal(), when we don't need the logprobs, we wait only for the decode response and hand the prefill response to a background worker for draining. My question is: why don't we drop the decode future/response when the prefill request fails first (i.e., with status code >= 400)? If the prefill request fails first, the decode request will most likely, if not certainly, fail as well, right?

This matters when the prefill server times out on a request while waiting for the decode server to bootstrap (i.e., BOOTSTRAP_TIMEOUT is reached on prefill), while the decode server is still waiting for memory to free up so it can allocate KV cache before bootstrapping. For example, if the bootstrap timeout is 2 min and the decode server takes 10 min to go from the PreAlloc state to the TransferQueue state, we would wait at least 10 min before the decode server fails the request, even though the prefill had already timed out long before. If we instead fail/drop the decode request as soon as we detect the failed prefill request, no further waiting is needed.
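
To make the proposal concrete, here is a minimal sketch of the behavior I have in mind. It is not the actual pd_router.rs code: it models the two requests with std threads and a channel rather than async futures, and the names dual_dispatch, spawn_call, Side, and Outcome are made up for illustration. The only point is the control flow: return as soon as prefill reports a status >= 400 instead of waiting for decode.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Which side of the prefill/decode pair responded (hypothetical type).
#[derive(Debug)]
enum Side {
    Prefill,
    Decode,
}

/// Result of the dual dispatch (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Decode responded; carries its status code.
    DecodeDone(u16),
    /// Prefill failed first; we stopped waiting for decode.
    AbortedOnPrefillFailure(u16),
}

/// Simulates one backend call: sleep for `delay`, then report `status`.
fn spawn_call(side: Side, delay: Duration, status: u16, tx: mpsc::Sender<(Side, u16)>) {
    thread::spawn(move || {
        thread::sleep(delay);
        // The receiver may be gone if the dispatch already returned; ignore the error.
        let _ = tx.send((side, status));
    });
}

/// Waits for whichever response arrives first. A failed prefill (>= 400)
/// ends the wait immediately; a successful prefill is ignored and we keep
/// waiting for decode.
fn dual_dispatch(prefill: (Duration, u16), decode: (Duration, u16)) -> Outcome {
    let (tx, rx) = mpsc::channel();
    spawn_call(Side::Prefill, prefill.0, prefill.1, tx.clone());
    spawn_call(Side::Decode, decode.0, decode.1, tx);

    for (side, status) in rx {
        match side {
            Side::Prefill if status >= 400 => {
                return Outcome::AbortedOnPrefillFailure(status);
            }
            Side::Prefill => continue,
            Side::Decode => return Outcome::DecodeDone(status),
        }
    }
    unreachable!("decode sender dropped without sending a response")
}

fn main() {
    let start = Instant::now();
    // Prefill fails after 20 ms; decode would only answer after 2 s.
    let outcome = dual_dispatch(
        (Duration::from_millis(20), 504),
        (Duration::from_millis(2000), 200),
    );
    println!("{:?} after {:?}", outcome, start.elapsed());
}
```

In this model the simulated decode thread simply keeps sleeping after the early return. In the router, the equivalent step would be dropping the decode future so that the in-flight request is cancelled, assuming the underlying HTTP client cancels on drop.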

cc @slin1237