fix(proxy): harden continuity recovery, safe WS replay, and shutdown/restart bridge lifecycle #415
It looks like this one does much more than my solution. I think this one could be the better fix, if you take a good look! If this one is accepted, close mine: #416
Tested this on my live server. Before this patch, on I deployed a build based on From what I observed, this PR really helps with the post-compact continuity/token blowup class of issues.
Summary
- `previous_response_id` recovery for both WebSocket and HTTP Responses flows
- `previous_response_id` continuity, without forcing client-side context blowup
- `invalid_request_error` cases
- `response.created` with quota/rate-limit errors
- `session_id` in `request_logs` and preferring turn-state scope over shared session scope
- `previous_response_id` from session scope for normal downstream requests, which was causing context blowup after restart/rebind
- `/v1/responses` requests without bridge can recover the original owner from real streamed response IDs

Problem
We were seeing several production failure modes around follow-up turns and restarts:
- `previous_response_not_found`
- `invalid_request_error` with `param=previous_response_id`
- `context_length_exceeded` after restart or local rebind
- `/v1/responses` sessions eventually failing with `bridge_kind=session_header ... context_length_exceeded`
- `working/reconnecting`

We were also seeing WebSocket failures before `response.created` on quota/rate-limit conditions (for example `usage_limit_reached`), which surfaced as stream termination and forced manual resend even when other accounts could continue the run.
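To make this failure mode concrete, here is a minimal sketch of how a proxy might decide that a quota/rate-limit error arrived early enough to be replayed on another account instead of terminating the stream. The error codes are the ones listed under the pre-created failure hardening changes. The `TurnProgress` type, its fields, the function name, and the reading of the replay limit as "at most one transparent replay" are assumptions made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass
from typing import Optional

# Codes named in this PR's pre-created failure hardening list.
PRE_CREATED_QUOTA_CODES = frozenset(
    {
        "rate_limit_exceeded",
        "usage_limit_reached",
        "insufficient_quota",
        "usage_not_included",
        "quota_exceeded",
    }
)


@dataclass
class TurnProgress:
    """Hypothetical per-request state tracked by the proxy."""

    created_seen: bool = False          # has response.created been forwarded downstream?
    response_id: Optional[str] = None   # upstream response id, once known
    replay_count: int = 0               # transparent replays already attempted


def can_replay_on_other_account(error_code: Optional[str], turn: TurnProgress) -> bool:
    """Return True only if nothing was surfaced downstream yet and no replay was used."""
    return (
        error_code in PRE_CREATED_QUOTA_CODES
        and not turn.created_seen
        and turn.response_id is None
        and turn.replay_count < 1  # assumed limit: one transparent replay
    )
```

Under these assumptions, an error after `response.created`, or a second failure after one replay, would still be surfaced to the client rather than retried silently.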
A separate issue was that continuity and owner recovery could bleed across scopes:
- `prompt_cache` continuity into unintended hard bridge identities

Changes
WebSocket path (`previous_response_id` recovery)
- (`code`, `param`, `message`)
- `code=previous_response_not_found`
- `code=invalid_request_error` + `param=previous_response_id` + message semantics matching not found
- `response.failed(stream_incomplete)` and trigger reconnect
- `invalid_request_error` responses untouched for downstream visibility
- `400`

WebSocket path (pre-created failure hardening)
- `response.created`
- `rate_limit_exceeded`
- `usage_limit_reached`
- `insufficient_quota`
- `usage_not_included`
- `quota_exceeded`
- `response.created`
- `response_id`
- `< 1`
- (`sticky_key`, `sticky_kind`, `reallocate_sticky`) across reconnect/replay
- `response.create` gate correctly on fail-closed connect / terminal-error paths so later requests on the same downstream socket do not get blocked

HTTP Responses / fallback path
- `previous_response_id` in fallback HTTP streaming
- `upstream_unavailable` when the previous-response owner is unavailable, instead of silently failing over to another account
- `previous_response_not_found` to retryable `stream_incomplete`
- `1009 message too big` family of bridge-adjacent failures
- `response.id` and `session_id` for successful non-bridge streamed responses so later `previous_response_id` follow-ups can recover the original owner from `request_logs`

HTTP bridge path
- `_stream_http_bridge_session_events(...)` to unify primary and retry stream handling
- `stream_incomplete`

Session scope and request-log continuity lookup
- `session_id` persistence to `request_logs`
- `(request_id, status, api_key_id, requested_at desc, id desc)`
- `(request_id, status, api_key_id, session_id, requested_at desc, id desc)`
- `previous_response_id`
- `x-codex-turn-state`
- `x-codex-session-id` / `x-codex-conversation-id`
- `previous_response_id` from request-log session scope for normal HTTP / compact / WebSocket downstream requests

Continuity and restart hardening
- `prompt_cache` back to synthetic `turn_state_header`
- `latest_turn_state` / `latest_response_id` when alias continuity is missing after restart
- `prompt_cache` semantics instead of promoting it to hard session identity

Continuity observability
Shutdown and reconnect lifecycle
- `close_all_http_bridge_sessions()` now fails inflight bridge waiters with a terminal error instead of leaving them blocked
- `stream_incomplete`

Recovery guardrails
- `_http_bridge_should_attempt_local_previous_response_recovery(...)` using the same recoverable predicate

Testing
Added or updated unit coverage for:
- `invalid_request_error` not-found semantics
- `invalid_request_error` messages
- `param != previous_response_id`
- `response.failed(usage_limit_reached)`
- `error(usage_limit_reached)`
- `previous_response_id` inference from session scope for normal requests

Added or updated integration coverage for:
- `invalid_request_error(previous_response_id, ...not found...)`
- `/v1/responses` continuity behavior without bridge, including owner pinning and retryable fail-closed responses

Validation
- `uvx ruff format --check .`
- `uvx ruff check .`
- `uv run ty check`
- `openspec validate --specs`
- `uv run pytest -q`

Result:
`1789 passed, 7 skipped, 4 warnings`

Result
This keeps previous-response recovery deterministic for the real failure mode, preserves correct error surfacing for unrelated invalid requests, makes previous-response owner routing
session-aware, prevents restart-time and session-header continuity regressions that were causing context blowup, aligns fail-closed behavior across WS / HTTP / bridge paths, and
makes shutdown/reconnect behavior fail fast and cleanly instead of hanging clients.
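
The first two points (deterministic recovery for the real failure mode, untouched surfacing of unrelated invalid requests) can be illustrated with a small sketch of a recoverable-error predicate over the (`code`, `param`, `message`) triple named under the WebSocket path changes. The function name and the exact "not found" pattern are assumptions for illustration. The PR text only says the message must have "semantics matching not found", so the real matcher may differ.

```python
import re
from typing import Optional

# Assumed wording check; the PR's actual message matching is not shown here.
_NOT_FOUND = re.compile(r"\bnot found\b", re.IGNORECASE)


def is_recoverable_previous_response_error(
    code: Optional[str], param: Optional[str], message: Optional[str]
) -> bool:
    """Classify an upstream error as a recoverable lost-previous-response case."""
    # Explicit code: always recoverable.
    if code == "previous_response_not_found":
        return True
    # Generic invalid_request_error: recoverable only when it points at
    # previous_response_id AND the message reads as "not found".
    return (
        code == "invalid_request_error"
        and param == "previous_response_id"
        and bool(_NOT_FOUND.search(message or ""))
    )
```

Any other `invalid_request_error`, for example one with a different `param` or with a validation message on `previous_response_id`, returns `False` here and would be passed downstream unchanged.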
`/v1/responses` can fail with `bridge_kind=session_header ... context_length_exceeded` #423