CI timeout reliability improvements for flaky test failures #1194
Open
brooke-hamilton wants to merge 3 commits into devcontainers:main
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
Summary
This PR adds timeout guards and retry logic to prevent CI failures caused by transient network timeouts when tests make OCI registry calls.
Problem
Recent non-dependabot CI runs have been failing due to mocha test timeouts on network operations. Every failure in the last two weeks was caused by a test exceeding its timeout while making HTTP calls to OCI registries like ghcr.io. Examples:
- Run #4064 (PR #1189): containerFeaturesOrder.test.ts
- Run #4063 (PR #1188): featuresCLICommands.test.ts. This test had no this.timeout() set, so it inherited mocha's 2-second default, which is far too short for network I/O.
- Run #4062 (PR #1183): featureHelpers.test.ts

Earlier dependabot runs also experienced multi-hour hangs (1.5–3+ hours) before eventually failing, wasting CI resources.
Solution
1. Add --retries 1 to the test-matrix npm script
This automatically retries a failing test once before marking it as failed, which absorbs transient failures. This alone would have prevented all 4 failing jobs across the 3 recent non-dependabot runs.
2. Add .mocharc.yml with a 6-minute global timeout
Tests with explicit this.timeout() calls override this, but tests without a timeout (like the getVersionsStrictSorted tests) now get a reasonable default instead of 2 seconds.
3. Add timeout-minutes to all GitHub Actions jobs
Workflow and job pairs covered:
- dev-containers.yml: cli
- dev-containers.yml: tests-matrix
- dev-containers.yml: features-registry-compatibility
- dev-containers.yml: install-script
- test-windows.yml: tests-matrix
- test-docker-v29.yml: test-docker-v29
- test-docker-v20.yml: test-docker-v20

This prevents jobs from hanging for hours (as seen in dependabot runs #4055 and #4056, which ran for 1h 50m+).
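As a rough illustration of item 3, a job-level timeout-minutes entry might look like the excerpt below. Only the job name tests-matrix and the 15-minute value for test-windows.yml come from this PR; the runner and the step are assumptions added to keep the fragment readable, not the repository's actual workflow.

```yaml
# Hypothetical excerpt of .github/workflows/test-windows.yml.
# Only the job name and the 15-minute value are taken from this PR.
jobs:
  tests-matrix:
    runs-on: windows-latest      # assumed runner
    timeout-minutes: 15          # job is cancelled if it runs longer than this
    steps:
      - run: npm run test-matrix # assumed step; existing steps stay unchanged
```

Without an explicit value, GitHub Actions applies its default job timeout of 360 minutes, which is what allows a stuck job to run for hours before it is stopped.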
Files Changed
- .github/workflows/dev-containers.yml: Added timeout-minutes to 4 jobs
- .github/workflows/test-windows.yml: Added timeout-minutes: 15
- .github/workflows/test-docker-v29.yml: Added timeout-minutes: 20
- .github/workflows/test-docker-v20.yml: Added timeout-minutes: 20
- .mocharc.yml: New file with 6-minute global timeout
- package.json: Added --retries 1 to test-matrix script

Impact
(Table residue: --retries 1?, .mocharc.yml?, this.timeout(); remaining cells not recovered.)
All recent non-dependabot failures would be prevented by these changes.
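For the new .mocharc.yml listed under Files Changed, a minimal sketch of a 6-minute global timeout could look like the following. The PR's actual file contents are not shown in this description, so the exact form is an assumption; the timeout key itself is a standard Mocha config option and takes milliseconds.

```yaml
# .mocharc.yml (sketch only; the real file in this PR may differ)
# Default per-test timeout in milliseconds: 6 minutes.
# Tests that call this.timeout() still override this value.
timeout: 360000

# Assumption for illustration: Mocha also accepts a retries key here,
# but this PR passes --retries 1 in the test-matrix npm script instead.
# retries: 1
```

Keeping the default in a config file means tests without their own this.timeout() call get the longer limit without any per-test changes.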