Skip to content

fix(screentracker): make writeStabilize Phase 1 non-fatal when agents don't echo input#208

Merged
johnstcn merged 7 commits intomainfrom
fix/write-stabilize-non-fatal-phase1
Apr 10, 2026
Merged

fix(screentracker): make writeStabilize Phase 1 non-fatal when agents don't echo input#208
johnstcn merged 7 commits intomainfrom
fix/write-stabilize-non-fatal-phase1

Conversation

@johnstcn
Copy link
Copy Markdown
Member

@johnstcn johnstcn commented Mar 31, 2026

Fixes #123.

Changes

  • Make Phase 1 (echo detection) of writeStabilize non-fatal on timeout -- proceed to Phase 2 instead of returning HTTP 500
  • Guard non-fatal path with errors.Is(err, util.WaitTimedOut) so context cancellation still propagates
  • Reduce Phase 1 timeout from 15s to 2s (terminal echo is near-instant)
  • Extract writeStabilizeEchoTimeout and writeStabilizeProcessTimeout constants
  • Log at Info level (not Warn) since non-echoing agents hit this on every message
  • Add doc comment on formatPaste in claude.go documenting the ESC limitation with TUI selection prompts
  • Add tests for non-echoing agents (reacts, unresponsive, context-cancelled)

Known limitation

For TUI selection prompts (numbered/arrow-key lists), this fix eliminates the 500 but does not deliver the correct selection -- the \x1b (ESC) in bracketed paste cancels the selection widget. The correct approach is MessageTypeRaw. Documented via a comment on formatPaste in lib/httpapi/claude.go.

Also discovered a separate issue during smoke-testing: #209 (fixed in #215, included in this branch).

🤖 Written by a Coder Agent. Reviewed by a human and a veritable army of robots.

…agents don't echo input

Phase 1 of writeStabilize waited 15s for the screen to change after
writing message text (echo detection), returning HTTP 500 if it didn't.
Many TUI agents using bracketed paste don't echo input until Enter is
pressed, causing every message send to fail.

Phase 1 timeout is now non-fatal (2s) — if the screen doesn't change,
we log at Info level and proceed to Phase 2 (send carriage return).
Phase 2 remains the real indicator of agent responsiveness.

Key changes:
- Guard non-fatal path with errors.Is(err, util.WaitTimedOut) so
  context cancellation still propagates as a fatal error
- Reduce Phase 1 timeout from 15s to 2s (echo is near-instant)
- Extract named constants for both timeouts
- Add tests for no-echo-success and no-echo-no-react-failure
- Add documentation test for TUI selection prompt ESC limitation

Closes #123
@github-actions
Copy link
Copy Markdown

✅ Preview binaries are ready!

To test with modules: agentapi_version = "agentapi_208" or download from: https://github.com/coder/agentapi/releases/tag/agentapi_208

…tests

Restructure test comments to follow Cucumber-style Given/When/Then
pattern for clarity. Also fix send-no-echo-agent-reacts assertion
to scan for the user message instead of assuming it's the last
message in the conversation (the snapshot loop may append an agent
message after Send returns).
@johnstcn johnstcn self-assigned this Mar 31, 2026
@johnstcn johnstcn requested a review from mafredri March 31, 2026 11:58
Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean design. Making Phase 1 non-fatal for WaitTimedOut while preserving context cancellation as fatal is the right call. The errors.Is guard, extracted constants, and 2s timeout are all well-calibrated. Two P2 findings (missing test coverage for the key invariant, gofmt failure), two P3s (doc accuracy, test layering), and a handful of notes.

"Oh, this test suite looks lovely! Fifty rows, full coverage, green across the board. It's fake. Every row hits the same code path. You dressed up one test in fifty outfits." -- Bisky, on a different test. The new tests here are mostly genuine.

Severity count: 0 P0, 0 P1, 2 P2, 2 P3, 3 Notes.

🤖 This review was automatically generated with Coder Agents.

- Add send-message-no-echo-context-cancelled test: verifies the
  errors.Is(WaitTimedOut) guard by cancelling context during Phase 1
  and asserting context.Canceled propagates (P2)
- Fix gofmt: correct indentation, proper brace placement (P2)
- Fix constant comment: describe WaitFor timeout semantics accurately,
  note 1s stability check can extend past timeout, add TODO tag (P3)
- Drop send-tui-selection-esc-cancels test from screentracker, add
  ESC limitation comment to formatPaste in claude.go instead (P3)
- Shorten log message to match codebase style (Note)
- Rename tests to send-message-* prefix, use newConversation helper
  with opts callbacks (Note)
The test had a race: advanceFor could complete before the Send()
goroutine enqueued, so the stableSignal never fired, and sendCancel
ran while the message was still queued (never reaching writeStabilize).

Fix: use onWrite callback as a synchronization point. advanceUntil
waits for writeStabilize to start writing (onWrite fires), then
cancels. This guarantees Phase 1 WaitFor is running when ctx is
cancelled, and its sleep select sees ctx.Done() immediately.
@johnstcn johnstcn marked this pull request as ready for review March 31, 2026 15:28
Copilot AI review requested due to automatic review settings March 31, 2026 15:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the screentracker PTY conversation send pipeline to avoid failing requests when an agent doesn’t echo typed input during writeStabilize Phase 1 (echo detection), and adds tests to codify the new behavior.

Changes:

  • Make writeStabilize Phase 1 (echo detection) timeout non-fatal and proceed to Phase 2 (processing detection), while still propagating context cancellation.
  • Reduce Phase 1 timeout and extract Phase 1/2 timeouts into constants.
  • Add new test coverage for non-echoing agents (reacting vs unresponsive) and context-cancellation behavior; add clarifying documentation comment about bracketed paste and TUI selection cancellation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
lib/screentracker/pty_conversation.go Makes echo-detection timeout non-fatal, adds timeouts as constants, and adjusts logging/error handling.
lib/screentracker/pty_conversation_test.go Adds tests for non-echoing agents, unresponsive agents, and context cancellation during Phase 1.
lib/httpapi/claude.go Documents bracketed-paste ESC interaction and suggests MessageTypeRaw for TUI selection prompts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Collaborator

@35C4n0r 35C4n0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@35C4n0r
Copy link
Copy Markdown
Collaborator

35C4n0r commented Apr 1, 2026

@johnstcn phase 2 in a different PR ?
Can you explain what is the intended behaviour ?

@35C4n0r 35C4n0r self-requested a review April 1, 2026 09:18
@johnstcn
Copy link
Copy Markdown
Member Author

johnstcn commented Apr 1, 2026

@johnstcn phase 2 in a different PR ? Can you explain what is the intended behaviour ?

writeStabilize is a two-phase operation. Phase 1 would return a 500 if the screen didn't change. More info in the implementation plan details toggle.

@35C4n0r
Copy link
Copy Markdown
Collaborator

35C4n0r commented Apr 1, 2026

@johnstcn but I still get the 500 response from the server.

run agentapi with claude -> send /status -> wait for stabilise -> send a message -> wait till you get a 500

@johnstcn
Copy link
Copy Markdown
Member Author

johnstcn commented Apr 1, 2026

@johnstcn but I still get the 500 response from the server.

run agentapi with claude -> send /status -> wait for stabilise -> send a message -> wait till you get a 500

could possibly be also related to #209 - I have a separate fix for this one in the works

@35C4n0r
Copy link
Copy Markdown
Collaborator

35C4n0r commented Apr 1, 2026

@johnstcn, my bad it's a seperate 500 from phase 2

unexpected error occurred: failed to send message: failed to send message: failed to wait for processing to start: timeout waiting for condition

@mafredri mafredri requested review from mafredri and removed request for mafredri April 8, 2026 15:58
Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 2. All 7 findings from round 1 are resolved. The code is cleaner, tests are well-structured, and the core invariant (errors.Is guard) is locked down with the context-cancelled test. Three P3s remain, two notes.

Hisoka ran the suite 10x with -race and found nothing. Mafuuu traced all four adversarial scenarios. Razor verified every doc claim. Meruem confirmed no structural concerns.

"Boring code gets silence. This earned it." -- Hisoka

P3 (convention scan): Four em-dashes (U+2014) in new code. Replace with -- or rephrase.

  • lib/screentracker/pty_conversation.go:441: "this is non-fatal — many TUI"
  • lib/screentracker/pty_conversation.go:446: "fatal on timeout — if the"
  • lib/screentracker/pty_conversation_test.go:490: "completely unresponsive — it"
  • lib/screentracker/pty_conversation_test.go:519: "message parts — this is used"

Severity count: 0 P0, 0 P1, 0 P2, 3 P3, 2 Notes.

🤖 This review was automatically generated with Coder Agents.

- Drop hardcoded '1s' from writeStabilizeEchoTimeout comment so
  it won't go stale if the stability check duration changes
- Replace em-dashes (U+2014) with periods in four comments
- Use Phase 1/Phase 2 vocabulary in inline comments to match
  the doc comment on writeStabilize
- Replace loop-and-flag with slices.ContainsFunc in test
Copy link
Copy Markdown
Member Author

johnstcn commented Apr 9, 2026

🤖 Round 2 feedback addressed in dfd3caa:

  • P3 (hardcoded 1s): Dropped the specific duration from the writeStabilizeEchoTimeout comment
  • P3 (slices.ContainsFunc): Replaced loop-and-flag with slices.ContainsFunc
  • P3 (em-dashes): Replaced all four U+2014 em-dashes with periods
  • Note (Phase vocabulary): Updated inline comments at lines 455/491 to use Phase 1/Phase 2 vocabulary

@mafredri mafredri self-requested a review April 9, 2026 16:30
Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 3 progress: 4 of 5 open findings from round 2 are addressed in dfd3caa. Good work.

One finding has no author response:

  • F12 (Note, pty_conversation.go:478, thread): Go select non-determinism when context cancellation and Phase 1 timeout race. The reviewer noted this is inherent to Go select and needs no fix, but the thread is unresolved and has no author acknowledgment.

Further review is blocked until the author responds to F12 (even a brief "acknowledged" resolves it) or resolves the thread.

🤖 This review was automatically generated with Coder Agents.

@johnstcn johnstcn requested a review from mafredri April 10, 2026 08:06
Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 12 findings from rounds 1-2 are resolved or acknowledged. 6 reviewers + Netero on round 4 found no new issues above P4. One P4, one Note.

Clean work across 4 rounds. The errors.Is(WaitTimedOut) guard and statusLocked fix are structurally sound -- confirmed independently by Bisky, Mafuuu, Meruem, and Knov. Tests are genuine: each fails without its corresponding production change. Race detector clean.

P4 pty_conversation.go:465,504 -- Two duration literals in writeStabilize remain unnamed alongside the two extracted constants: 1 * time.Second (stability check, line 465) and 3*time.Second (carriage return retry interval, line 504). Someone tuning this function finds two named knobs and two hidden ones -- the 3s retry interval interacts with the 15s process timeout (5 retries max). Pre-existing; the PR extracted the values it changed and left the rest, which is reasonable scoping. (Gon)

Knov put it well: "The statusLocked guard makes the Stable contract honest."

🤖 This review was automatically generated with Coder Agents.

@johnstcn johnstcn merged commit 00ff7ff into main Apr 10, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

failed to send message: failed to send message: failed to wait for screen to stabilize: timeout waiting for condition

4 participants