-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: broken invariants in agent-loop, "400 No tool output found" #490
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Im east coast and taking off for the day, will circle back tmr |
64dde7f
to
4f200f2
Compare
Hey @jonchurch, thank you for the incredible sleuthing here, lots of good insights here with subtle issues in the current logical flow. I'm excited to see if we can land this, can I help in some way? |
Hey @tibo-openai, after spending more time analyzing the agent loop, I'm confident in the surgical fix this diff provides for the 400 "No function output" issues. However, fixing the stream abort reveals a hidden trade off in the current implementation. With this diff, we could potentially lose conversation state if a user cancels midstream (before the Here's why: the current code never truly aborts the stream upon cancel. This bug actually has an unintended benefit, it ensures that even during cancellation, we still receive the When we fix the abort mechanism, we'll properly stop the stream before For a complete solution, we should consider implementing a partial transcript approach alongside this fix (very similar to what the Zero Data Retention feature implemented). This would track user inputs and partial responses during a canceled turn and include them in the next request to maintain conversation continuity. Given this trade off, do you think we should:
I'm inclined toward option 3 since the current fix addresses the critical 400 errors, but I'd appreciate your thoughts. I am picking at a reworking of the agent loop, as this diff is still sort of a band aid. The current implementation is clear that flush should not run on cancelation. The guard of unfortunately Im having too much fun reimplementing the loop, so progress has been slow 🤣 |
I prefer we do not merge the fix in the current state as perhaps counterintuitively I think that the current broken state with 400s is better than a half fixed state with subtle downsides and bugs. At least in the current version, things just break and the user has to explicitly restart. In terms of sequencing, either option 2) that you suggest, with explicit atomic commits and known valid cancellation points sgtm or we can wait for your agent-loop rewrite if it's something that you think can be done in the next couple of days, but I wouldn't want to block on this for too long. |
I'm not certain this fully fixes the issue, but these changes address several clear problems in the current cancellation flow
This is a draft because I think we can do better still, and I welcome pushes to this PR or publishing if yall wanna run with it.
There is an issue of broken invariants, where we are running logic such as
flush
even though the code as written expects it will not run aftercancel
.related to #114 aka
400 No tool output found for function call
.Problem 1: Abort Controller is null
Upon cancel, we attempt to abort the stream via abortController.
After #178 we are nulling out the handle to the stream BEFORE we call the abort method on its controller. TS assertions hide that fact from the compiler, giving us a silent no-op instead of an abort:
codex/codex-cli/src/utils/agent/agent-loop.ts
Lines 122 to 133 in 09f0ae3
This means cancel will not abort the stream, and we continue processing events.
Problem 2: Broken control flow expectations
Problem 1 is an issue, but not one that should lead to the
400 No tool output found for function call
.The agent loop is written in such a way that it appears to expect to break out of the async iterator loop when the stream is aborted. That never actually happens even after fixing Problem 1 and you can confirm that w/ a console.log in the catch handler.
codex/codex-cli/src/utils/agent/agent-loop.ts
Lines 790 to 800 in 09f0ae3
That is the broken invariant. When we do not break out of the loop on abort, we continue processing events, and our cleanup code later runs.
"Fix Problem 1, and Problem 2 heals itself then" you may say. Under testing, it does not! I have not observed the
catch
executing in response to a proper abort.Problem 3 -- Inconsistent State Guards
There are inconsistent
canceled
state guards in the event processing and leads to state drift when we realize that:flush
even though we may havethis.canceled === true
impacting the processing of events.With probelm 2 being the broken invariant, the rest of the flow gets unpredictable. Most places are guarded, like
codex/codex-cli/src/utils/agent/agent-loop.ts
Lines 1112 to 1121 in 09f0ae3
But we will still increment the
lastReponseId
even when we are in a cancel situation, and stopped processing function calls.I focused on the
flush
code. It will unconditionally clear thependingAborts
array, despite it expecting that we are in a cancel free state. I didn't trace the exact specifics to limit my brain running out of context space for while debugging these async flows.Solution
Well, I ran out of time to really come up with a bullet proof solution. Testing this is difficult, as it relies on timing. Asking to prefix tool runs w/ sleep can help a bit, but I had a harder time reproing today vs last week.
For now, ensuring Abort actually is triggered, and guarding the flush is an improvement
Ideally, we can get the throw to happen on an abort, or otherwise return early from the processing. That's the smallest surgical fix until this logic can be reworked to be entirely signal aware. I propose that in the near future instead of a canceled flag, we observe the AbortSignal as the single source of truth for cancellation state and esnure it propogates consistently through all operations.
bugbears
Stream processing can be finicky, as you may still receive buffered events after a stream has been canceled. Idk how this maps exactly to async generators
Once the abort is successful, assuming we have no more events that leak out, we can't easily break from the loop and return early. I wasn't able to consistently repro today getting another event even when we are in the canceled state, but throwing a manually constructed
AbortError
if we process an event duringthis.canceled === true
would fix the control flow.