fix: broken invariants in agent-loop, "400 No tool output found" #490

jonchurch · 2025-04-21T21:59:41Z

I'm not certain this fully fixes the issue, but these changes address several clear problems in the current cancellation flow

This is a draft because I think we can do better still, and I welcome pushes to this PR or publishing if yall wanna run with it.

There is an issue of broken invariants, where we are running logic such as flush even though the code as written expects it will not run after cancel.

related to #114 aka 400 No tool output found for function call.

Problem 1: Abort Controller is null

Upon cancel, we attempt to abort the stream via abortController.

After #178 we are nulling out the handle to the stream BEFORE we call the abort method on its controller. TS assertions hide that fact from the compiler, giving us a silent no-op instead of an abort:

codex/codex-cli/src/utils/agent/agent-loop.ts

Lines 122 to 133 in 09f0ae3

    
               // Reset the current stream to allow new requests 
        
               this.currentStream = null; 
        
               log( 
        
                 `AgentLoop.cancel() invoked – currentStream=${Boolean( 
        
                   this.currentStream, 
        
                 )} execAbortController=${Boolean(this.execAbortController)} generation=${ 
        
                   this.generation 
        
                 }`, 
        
               ); 
        
               ( 
        
                 this.currentStream as { controller?: { abort?: () => void } } | null 
        
               )?.controller?.abort?.();

This means cancel will not abort the stream, and we continue processing events.

Problem 2: Broken control flow expectations

Problem 1 is an issue, but not one that should lead to the 400 No tool output found for function call.

The agent loop is written in such a way that it appears to expect to break out of the async iterator loop when the stream is aborted. That never actually happens even after fixing Problem 1 and you can confirm that w/ a console.log in the catch handler.

codex/codex-cli/src/utils/agent/agent-loop.ts

Lines 790 to 800 in 09f0ae3

    
           } catch (err: unknown) { 
        
             // Gracefully handle an abort triggered via `cancel()` so that the 
        
             // consumer does not see an unhandled exception. 
        
             if (err instanceof Error && err.name === "AbortError") { 
        
               if (!this.canceled) { 
        
                 // It was aborted for some other reason; surface the error. 
        
                 throw err; 
        
               } 
        
               this.onLoading(false); 
        
               return; 
        
             }

That is the broken invariant. When we do not break out of the loop on abort, we continue processing events, and our cleanup code later runs.

"Fix Problem 1, and Problem 2 heals itself then" you may say. Under testing, it does not! I have not observed the catch executing in response to a proper abort.

The implementation of the Stream in the openai sdk does not seem to bubble up aborts, but will instead return silently. (assuming that i've located the codepath correctly here) https://github.com/openai/openai-node/blob/2785c1186b528e4ab3a2a7c9282e041aaa4c13f6/src/streaming.ts#L82-L85

Problem 3 -- Inconsistent State Guards

There are inconsistent canceled state guards in the event processing and leads to state drift when we realize that:

Problem 1 meant we continue to process the stream
Problem 2 means that we will run cleanup code flush even though we may have this.canceled === true impacting the processing of events.

With probelm 2 being the broken invariant, the rest of the flow gets unpredictable. Most places are guarded, like

codex/codex-cli/src/utils/agent/agent-loop.ts

Lines 1112 to 1121 in 09f0ae3

    
           private async processEventsWithoutStreaming( 
        
             output: Array<ResponseInputItem>, 
        
             emitItem: (item: ResponseItem) => void, 
        
           ): Promise<Array<ResponseInputItem>> { 
        
             // If the agent has been canceled we should short‑circuit immediately to 
        
             // avoid any further processing (including potentially expensive tool 
        
             // calls). Returning an empty array ensures the main run‑loop terminates 
        
             // promptly. 
        
             if (this.canceled) { 
        
               return [];

But we will still increment the lastReponseId even when we are in a cancel situation, and stopped processing function calls.

I focused on the flush code. It will unconditionally clear the pendingAborts array, despite it expecting that we are in a cancel free state. I didn't trace the exact specifics to limit my brain running out of context space for while debugging these async flows.

Solution

Well, I ran out of time to really come up with a bullet proof solution. Testing this is difficult, as it relies on timing. Asking to prefix tool runs w/ sleep can help a bit, but I had a harder time reproing today vs last week.

For now, ensuring Abort actually is triggered, and guarding the flush is an improvement

Ideally, we can get the throw to happen on an abort, or otherwise return early from the processing. That's the smallest surgical fix until this logic can be reworked to be entirely signal aware. I propose that in the near future instead of a canceled flag, we observe the AbortSignal as the single source of truth for cancellation state and esnure it propogates consistently through all operations.

bugbears

Stream processing can be finicky, as you may still receive buffered events after a stream has been canceled. Idk how this maps exactly to async generators

Once the abort is successful, assuming we have no more events that leak out, we can't easily break from the loop and return early. I wasn't able to consistently repro today getting another event even when we are in the canceled state, but throwing a manually constructed AbortError if we process an event during this.canceled === true would fix the control flow.

p.s. I include this note whenever I’ve spent significant time on a contribution to a project I don’t maintain. It’s part disclosure, part invitation. ❤️

this investigation took a few days of work to fully unwind and validate. working on open source full-time has given me the freedom to do the kind of deep, slow, careful work that rarely fits into a sprint. that freedom made this fix possible.

if this kind of work resonates with you, or you want to connect about open source systems work more broadly, feel free to reach out.

jonchurch · 2025-04-21T22:26:02Z

Im east coast and taking off for the day, will circle back tmr

tibo-openai · 2025-04-24T18:11:08Z

Hey @jonchurch, thank you for the incredible sleuthing here, lots of good insights here with subtle issues in the current logical flow. I'm excited to see if we can land this, can I help in some way?

jonchurch · 2025-04-25T18:57:01Z

Hey @tibo-openai, after spending more time analyzing the agent loop, I'm confident in the surgical fix this diff provides for the 400 "No function output" issues.

However, fixing the stream abort reveals a hidden trade off in the current implementation.

With this diff, we could potentially lose conversation state if a user cancels midstream (before the response.completed event is processed).

Here's why: the current code never truly aborts the stream upon cancel. This bug actually has an unintended benefit, it ensures that even during cancellation, we still receive the response.completed event and update lastResponseId.
This preserves the user's input and model responses in the server-side conversation state.

When we fix the abort mechanism, we'll properly stop the stream before response.completed if user cancels during response streaming, meaning we won't get a new lastResponseId. The next request will use the previous conversation state, effectively "losing" the user's most recent input and any partial model responses they already saw between the conclusion of that last complete response and the new turn they canceled.

For a complete solution, we should consider implementing a partial transcript approach alongside this fix (very similar to what the Zero Data Retention feature implemented). This would track user inputs and partial responses during a canceled turn and include them in the next request to maintain conversation continuity.

Given this trade off, do you think we should:

Proceed with this surgical fix and accept the potential conversation discontinuity (prolly not, as that is confusing UX, and "data loss")
Expand the fix to include a basic partial transcript implementation
Merge this fix but open an issue for the transcript implementation as follow-up work

I'm inclined toward option 3 since the current fix addresses the critical 400 errors, but I'd appreciate your thoughts.

I am picking at a reworking of the agent loop, as this diff is still sort of a band aid. The current implementation is clear that flush should not run on cancelation. The guard of pendingAborts.clear() in the flush method is only required because the abort (even after this diff) doesn't throw and land us in the actually expected termination branch, the catch block which would prevent flush from running via an early return. So band aids which are a code smell IMO until control flow is better managed.

unfortunately Im having too much fun reimplementing the loop, so progress has been slow 🤣

tibo-openai · 2025-04-27T15:06:31Z

I prefer we do not merge the fix in the current state as perhaps counterintuitively I think that the current broken state with 400s is better than a half fixed state with subtle downsides and bugs. At least in the current version, things just break and the user has to explicitly restart.

In terms of sequencing, either option 2) that you suggest, with explicit atomic commits and known valid cancellation points sgtm or we can wait for your agent-loop rewrite if it's something that you think can be done in the next couple of days, but I wouldn't want to block on this for too long.

jonchurch marked this pull request as draft April 21, 2025 22:00

jonchurch changed the title ~~fix: broken invariants in agent-loop~~ fix: broken invariants in agent-loop, "400 No tool output found" Apr 21, 2025

jonchurch added 2 commits April 21, 2025 18:31

null stream after aborting

4159584

guard flush

4f200f2

jonchurch force-pushed the cancel-invariant-fix branch from 64dde7f to 4f200f2 Compare April 21, 2025 22:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: broken invariants in agent-loop, "400 No tool output found" #490

fix: broken invariants in agent-loop, "400 No tool output found" #490

jonchurch commented Apr 21, 2025 •

edited

Loading

jonchurch commented Apr 21, 2025

tibo-openai commented Apr 24, 2025

jonchurch commented Apr 25, 2025 •

edited

Loading

tibo-openai commented Apr 27, 2025

	// Reset the current stream to allow new requests
	this.currentStream = null;
	log(
	`AgentLoop.cancel() invoked – currentStream=${Boolean(
	this.currentStream,
	)} execAbortController=${Boolean(this.execAbortController)} generation=${
	this.generation
	}`,
	);
	(
	this.currentStream as { controller?: { abort?: () => void } } \| null
	)?.controller?.abort?.();

	} catch (err: unknown) {
	// Gracefully handle an abort triggered via `cancel()` so that the
	// consumer does not see an unhandled exception.
	if (err instanceof Error && err.name === "AbortError") {
	if (!this.canceled) {
	// It was aborted for some other reason; surface the error.
	throw err;
	}
	this.onLoading(false);
	return;
	}

	private async processEventsWithoutStreaming(
	output: Array<ResponseInputItem>,
	emitItem: (item: ResponseItem) => void,
	): Promise<Array<ResponseInputItem>> {
	// If the agent has been canceled we should short‑circuit immediately to
	// avoid any further processing (including potentially expensive tool
	// calls). Returning an empty array ensures the main run‑loop terminates
	// promptly.
	if (this.canceled) {
	return [];

fix: broken invariants in agent-loop, "400 No tool output found" #490

Are you sure you want to change the base?

fix: broken invariants in agent-loop, "400 No tool output found" #490

Conversation

jonchurch commented Apr 21, 2025 • edited Loading

I'm not certain this fully fixes the issue, but these changes address several clear problems in the current cancellation flow

Problem 1: Abort Controller is null

Problem 2: Broken control flow expectations

Problem 3 -- Inconsistent State Guards

Solution

bugbears

jonchurch commented Apr 21, 2025

tibo-openai commented Apr 24, 2025

jonchurch commented Apr 25, 2025 • edited Loading

tibo-openai commented Apr 27, 2025

jonchurch commented Apr 21, 2025 •

edited

Loading

jonchurch commented Apr 25, 2025 •

edited

Loading