CI hang during "yarn --inline-builds" #3059

Open
philrz opened this issue Apr 26, 2024 · 3 comments
philrz commented Apr 26, 2024

In the time after #2972 merged, I've started to notice a repeating intermittent failure in CI during the "Setup Zui" Actions Workflow in the yarn --inline-builds step. For example, here's how it looked in the most recent failure which was during a "Create Insiders Release" run on macOS:

Link step
  ➤ YN0007: │ @swc/core@npm:1.3.41 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.17.12 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.18.14 must be built because it never has been before or the last one failed
  ➤ YN0007: │ @parcel/watcher@npm:2.0.4 must be built because it never has been before or the last one failed
  ➤ YN0007: │ brimcap@https://github.com/brimdata/brimcap.git#commit=bf7fb4996738767bb4f27eee939ec67dd21aab52 must be built because it never has been before or the last one failed
  ➤ YN0007: │ electron@npm:28.0.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ keytar@npm:7.7.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ msw@npm:0.36.8 must be built because it never has been before or the last one failed
  ➤ YN0007: │ styled-components@npm:5.3.5 [cb7c7] must be built because it never has been before or the last one failed
  ➤ YN0007: │ playwright-chromium@npm:1.41.1 must be built because it never has been before or the last one failed
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Downloading Chromium 121.0.6167.57 (playwright build v1097) from https://playwright.azureedge.net/builds/chromium/1097/chromium-mac.zip
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |                                                                                |   0% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■                                                                        |  10% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  50% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  90% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Chromium 121.0.6167.57 (playwright build v1097) downloaded to /Users/runner/Library/Caches/ms-playwright/chromium-1097
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Downloading FFMPEG playwright build v1009 from https://playwright.azureedge.net/builds/ffmpeg/1009/ffmpeg-mac.zip
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |                                                                                |   1% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■                                                                        |  11% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  51% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  91% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT FFMPEG playwright build v1009 downloaded to /Users/runner/Library/Caches/ms-playwright/ffmpeg-1009
  Error: The operation was canceled.

The job ultimately gets killed by Actions after hanging for 6 hours.

By comparison, here's how the same step proceeded to completion on the prior successful Create Insiders Release run on macOS:

Link step
  ➤ YN0007: │ @swc/core@npm:1.3.41 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.17.12 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.18.14 must be built because it never has been before or the last one failed
  ➤ YN0007: │ @parcel/watcher@npm:2.0.4 must be built because it never has been before or the last one failed
  ➤ YN0007: │ brimcap@https://github.com/brimdata/brimcap.git#commit=bf7fb4996738767bb4f27eee939ec67dd21aab52 must be built because it never has been before or the last one failed
  ➤ YN0007: │ electron@npm:28.0.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ keytar@npm:7.7.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ msw@npm:0.36.8 must be built because it never has been before or the last one failed
  ➤ YN0007: │ styled-components@npm:5.3.5 [cb7c7] must be built because it never has been before or the last one failed
  ➤ YN0007: │ playwright-chromium@npm:1.41.1 must be built because it never has been before or the last one failed
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Downloading Chromium 121.0.6167.57 (playwright build v1097) from https://playwright.azureedge.net/builds/chromium/1097/chromium-mac.zip
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |                                                                                |   0% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■                                                                        |  10% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  50% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  90% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 139.7 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Chromium 121.0.6167.57 (playwright build v1097) downloaded to /Users/runner/Library/Caches/ms-playwright/chromium-1097
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT Downloading FFMPEG playwright build v1009 from https://playwright.azureedge.net/builds/ffmpeg/1009/ffmpeg-mac.zip
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |                                                                                |   1% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■                                                                        |  11% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  51% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  91% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 1.1 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.41.1 STDOUT FFMPEG playwright build v1009 downloaded to /Users/runner/Library/Caches/ms-playwright/ffmpeg-1009
  ➤ YN0007: │ zui@workspace:apps/zui must be built because it never has been before or the last one failed
  ➤ YN0007: │ nx@npm:16.10.0 [a0cba] must be built because it never has been before or the last one failed
  ➤ YN0000: │ zui@workspace:apps/zui STDOUT Downloading zdeps (skip with ZDEPS=false yarn)
  ➤ YN0000: │ zui@workspace:apps/zui STDOUT copied brimcap artifacts Version: v1.7.0
  ➤ YN0000: │ zui@workspace:apps/zui STDOUT copied Zed artifacts Version: v1.15.0-12-gec3f004c

I have no idea what it's hanging on in the failure case, but at minimum I wanted to open this bug to have a place to start logging incidents to look for patterns. I expect I'll start looking into debug approaches, such as maybe increasing log verbosity and/or running the steps manually/interactively on the runner so I can watch top and see if I can catch a hanging process.
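One possible way to approach the "catch a hanging process" idea without having to sit and watch top: a minimal sketch of a watchdog wrapper that runs a command, and if it is still running after a deadline, prints a process snapshot before killing it. The function name `run_with_watchdog` and the 124 exit code are made up for illustration and are not part of the Zui workflow. It uses only `ps`, `kill`, and `wait`, so it should behave the same on macOS and Linux runners. GitHub Actions' own `timeout-minutes` setting could shorten the 6-hour wait, but it would not say what was stuck.

```shell
#!/bin/bash
# Hypothetical helper (not part of the Zui workflow): run a command with a
# deadline in seconds. On timeout, dump the process table to stderr so the
# hung process can be identified, then kill the command.
run_with_watchdog() {
  local limit="$1"; shift
  "$@" &                      # start the command in the background
  local pid=$!
  local waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$limit" ]; then
      echo "watchdog: '$*' still running after ${limit}s; process snapshot:" >&2
      ps -ef >&2              # snapshot of every process on the machine
      kill -TERM "$pid" 2>/dev/null
      sleep 1
      kill -KILL "$pid" 2>/dev/null
      wait "$pid" 2>/dev/null
      return 124              # arbitrary code chosen here to mean "timed out"
    fi
    sleep 1
    waited=$((waited + 1))
  done
  wait "$pid"                 # propagate the command's own exit status
}
```

In a debug branch this might be invoked as `run_with_watchdog 1800 yarn --inline-builds`, where the 30-minute limit is only an illustrative value. Note that the sketch kills only the top-level process, so children it spawned may survive.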

Looking back over recent Actions runs, here's other incidents I see of the same symptom.

| Run | Workflow | Platform |
|---|---|---|
| https://github.com/brimdata/zui/actions/runs/8804933973 | Zui CI | macOS |
| https://github.com/brimdata/zui/actions/runs/8443742929 | Zui CI | macOS |
| https://github.com/brimdata/zui/actions/runs/8851891223 | Zui CI | macOS |
| https://github.com/brimdata/zui/actions/runs/8854325998 | Zui CI | macOS |

In conclusion, thus far it's only been observed on macOS, though we're still looking at small numbers.

@philrz philrz added bug Something isn't working test Creating/improving test automation labels Apr 26, 2024
@philrz philrz self-assigned this Apr 26, 2024

philrz commented Apr 27, 2024

Yesterday I did a bunch of runs to try to establish repro patterns and see if I could catch it in the act. Angles I pursued and findings:

  1. I did 40 separate runs of "Zui CI" at commit d7c1c3a, out of which it failed 2 times with the symptom shown above. I've added these incidents to the bottom of the table in the issue's opening text.

  2. I did a looping repro attempt on my MacBook using the following script that covers the same commands leading up to the repro in CI. It ran successfully 404 times without hanging before I stopped it for the night.

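#!/bin/bash
# Repeatedly clone Zui and run the same install commands as CI until interrupted.
NUM=1
while true; do
  echo "Run #: $NUM"
  echo "=============="
  git clone https://github.com/brimdata/zui.git
  # Stop if the clone failed, so the "cd .." below can't climb out of the work dir
  cd zui || exit 1
  # nvm is a shell function; nvm.sh must already be loaded in this shell
  nvm use "$(cat .node-version)"
  # The step that hangs in CI
  yarn --inline-builds
  cd ..
  rm -rf zui
  NUM=$((NUM + 1))
done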
  3. I looped the same repro script after having used ngrok-ssh to start an interactive login with a GitHub Actions runner running macos-12 (just as "Zui CI" runs on). It ran successfully 221 times without hanging before I stopped it for the night.

  4. I did 20 separate one-at-a-time ngrok-ssh logins to GitHub Actions macos-12 runners, each executing the same commands below leading up to the repro in CI. It did not hang once.

git clone https://github.com/brimdata/zui.git
cd zui
nvm install $(cat .node-version)
nvm use $(cat .node-version)
yarn --inline-builds

This all unfortunately doesn't do a whole lot to narrow it down. It does look pretty certain that the repro is unique to macOS. The fact it didn't repro through a loop locally on my MacBook nor on a single Actions Runner leads me to speculate that the essential ingredients for repro could be:

  1. Somehow related to the job landing on a "bad" runner. That said, the experiences I've had with "bad" runners in the past were usually more random and one-off (e.g., failures to load cache, network drops, etc.) and not a symptom like this with quiet hangs in the same spot.

  2. Something about the workflow setup other than the essential commands in my looping script. This hang is so early in the job that there's not a whole lot it could be, but if I want to be meticulous, there's things I could correct for like how it does the Go/Node installations through separate workflow steps or that it runs jongwooo/next-cache@v1.

The fact I wasn't able to catch it with my 20 manual ngrok-ssh interactive logins is disappointing, but the low repro rate makes that not altogether surprising, and if there's something specific about it being run as a "Zui CI" job I might be wasting my time with that approach. For my next attempts I'll look at grafting ngrok-ssh onto "Zui CI" in hopes I can catch it that way.


philrz commented Apr 29, 2024

I've got a branch zui-3059-debug rigged up to start ngrok-ssh before doing the rest of the "Zui CI" setup steps. I ran it 41 times without it hanging once. It seems this problem doesn't want to be caught in the act, or it magically fixed itself. I'm going to pause chasing it for the moment and will resume if it starts flaring up again.


philrz commented Apr 10, 2025

tl;dr

Some variant of this symptom seems to have been occurring the past couple days and I've been studying it. The evidence suggests it may be a problem that afflicts a particular type of hosted GitHub Actions runner.

Details

The first recent incident was https://github.com/brimdata/zui/actions/runs/14341070751/job/40200161153 where it hung during the "Link step" portion of the yarn --inline-builds command of "setup app". In this regard it was somewhat similar to the failures shown earlier in this issue, but with two notable differences:

  1. The previous incidents were on macos-12 runners, whereas this job happens to be on ubuntu-22.04.
  2. This seemed to get a little further along than in the previous incidents, as the Completed and Done with warnings messages appear at the end of the output, though the job never proceeds to the next step. After it hung at that spot for 6 hours, GitHub timed out the job and the Error: The operation was canceled. message appears.

The job's output shown for that step:

Link step
  ➤ YN0007: │ @swc/core@npm:1.10.18 [1dea9] must be built because it never has been before or the last one failed
  ➤ YN0007: │ @parcel/watcher@npm:2.0.4 must be built because it never has been before or the last one failed
  ➤ YN0007: │ electron@npm:30.0.5 must be built because it never has been before or the last one failed
  ➤ YN0007: │ esbuild@npm:0.18.14 must be built because it never has been before or the last one failed
  ➤ YN0007: │ keytar@npm:7.7.0 must be built because it never has been before or the last one failed
  ➤ YN0007: │ msw@npm:0.36.8 must be built because it never has been before or the last one failed
  ➤ YN0007: │ styled-components@npm:5.3.5 [2527b] must be built because it never has been before or the last one failed
  ➤ YN0007: │ playwright-chromium@npm:1.44.0 must be built because it never has been before or the last one failed
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT Downloading Chromium 125.0.6422.26 (playwright build v1117) from https://playwright.azureedge.net/builds/chromium/1117/chromium-linux.zip
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |                                                                                |   0% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■                                                                        |  10% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  50% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  90% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 156.8 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT Chromium 125.0.6422.26 (playwright build v1117) downloaded to /home/runner/.cache/ms-playwright/chromium-1117
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT Downloading FFMPEG playwright build v1009 from https://playwright.azureedge.net/builds/ffmpeg/1009/ffmpeg-linux.zip
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |                                                                                |   0% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■                                                                        |  10% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■                                                                |  20% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■                                                        |  30% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                                |  40% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                        |  50% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                                |  60% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                        |  70% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■                |  80% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■        |  90% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■| 100% of 2.6 MiB
  ➤ YN0000: │ playwright-chromium@npm:1.44.0 STDOUT FFMPEG playwright build v1009 downloaded to /home/runner/.cache/ms-playwright/ffmpeg-1009
  ➤ YN0007: │ superdb-desktop@workspace:apps/superdb-desktop must be built because it never has been before or the last one failed
  ➤ YN0007: │ nx@npm:16.10.0 [1dea9] must be built because it never has been before or the last one failed
  ➤ YN0000: │ superdb-desktop@workspace:apps/superdb-desktop STDOUT Downloading zdeps (skip with ZDEPS=false yarn)
  ➤ YN0000: │ superdb-desktop@workspace:apps/superdb-desktop STDOUT copied artifacts Version: v1.18.0-376-gde8a72e4b
  ➤ YN0007: │ superdb@workspace:. must be built because it never has been before or the last one failed
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT  >  NX   Running target build for 2 projects:
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT     - superdb-types
  ➤ YN0000: │ superdb@workspace:. STDOUT     - superdb-node-client
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT  
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT > nx run superdb-types:build
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT > nx run superdb-node-client:build
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT  
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT  >  NX   Successfully ran target build for 2 projects
  ➤ YN0000: │ superdb@workspace:. STDOUT 
  ➤ YN0000: │ superdb@workspace:. STDOUT 
➤ YN0000: └ Completed in 31s 876ms
➤ YN0000: Done with warnings in 3m 33s
Error: The operation was canceled.

Another job https://github.com/brimdata/zui/actions/runs/14362563690/job/40267559741 failed several hours later in exactly the same fashion.

Remembering this issue, I started doing some debug similar to what I described in prior comments above, specifically using ngrok-ssh to try to reproduce the problem by repeating the same commands from the Workflow in an interactive shell on an ubuntu-22.04 runner. Interestingly, this time I was able to witness the hang in the interactive session, though it didn't give me much more to go on in terms of knowing why it was happening. One thing I will note is that after letting it hang for several minutes I hit Ctrl-C and Ctrl-\ several times to exit the hung yarn --inline-builds, then did an up-arrow to repeat the yarn --inline-builds again, and it ran and finished instantly.
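Since the rerun after the forced exit finished instantly, one possible stopgap is to put a time limit on the command and rerun it if the limit is hit. The following is only a sketch under assumptions: `retry_on_hang` is a hypothetical helper name, it relies on the GNU coreutils `timeout` command (expected on Ubuntu runners, but not available by default on macOS), and it would mask the hang rather than explain it. `timeout` exits with 124 when the limit is reached, or 137 if it had to follow up with KILL.

```shell
#!/bin/bash
# Hypothetical helper: run a command under a time limit (seconds) and rerun it
# if the limit is hit. Any other failure is returned immediately.
retry_on_hang() {
  local attempts="$1" limit="$2"; shift 2
  local n status
  for ((n = 1; n <= attempts; n++)); do
    status=0
    # Send TERM at the limit, then KILL 10 seconds later if still alive
    timeout --kill-after=10 "$limit" "$@" || status=$?
    # 124: stopped by TERM at the limit; 137: KILL was needed
    if [ "$status" -ne 124 ] && [ "$status" -ne 137 ]; then
      return "$status"
    fi
    echo "attempt $n of $attempts exceeded ${limit}s" >&2
  done
  return 124
}
```

For example, `retry_on_hang 3 900 yarn --inline-builds` would allow up to three attempts of 15 minutes each. Both numbers are illustrative and not taken from the workflow.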

At this point I recalled that I'd noticed a couple Actions jobs on our "superdb-website" repo had also failed yesterday. Specifically there were https://github.com/brimdata/superdb-website/actions/runs/14340927065/job/40199684873 and https://github.com/brimdata/superdb-website/actions/runs/14362563675 that also run on ubuntu-22.04 and those both hung in the yarn command of their "Install JS dependencies" phase and similarly got killed after 6 hours of hang time. Their closing output:

yarn install v1.22.22
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
Error: The operation was canceled.

In an interesting twist, just a few hours ago another job in the Zui repo actually made it past that "setup-app" phase and has been hung in the yarn test phase for over 3 hours, so it seems on track to get killed a few hours from now. Like the other hangs, though, the output makes it seem like it finished the step on which it's hanging, but it's just not proceeding to the next step. Here's how the output currently looks while it's actively hung:

yarn test
  shell: /usr/bin/bash -e {0}
  env:
    super_ref: 9fb5ae653b45b4687a6eeebe338eeeb9afabb1ee
 >  NX   Running target test for 3 projects:
    - superdb-node-client
    - superdb-types
    - superdb-desktop
 
> nx run superdb-types:test
 PASS  src/util/time.test.ts
 PASS  src/util/error.test.ts
 PASS  src/values/duration.test.ts
 PASS  src/values/time.test.ts
 PASS  src/values/record.test.ts
Test Suites: 5 passed, 5 total
Tests:       61 passed, 61 total
Snapshots:   0 total
Time:        1.868 s
Ran all test suites.
> nx run superdb-node-client:test
 PASS  src/field.test.ts
context canceled
 PASS  src/lake.test.ts
 PASS  src/binpath.test.ts
 PASS  src/zjson.test.ts
 PASS  src/zq.test.ts
  ● Console
    console.log
      { query: '{num: this}', f: 'sup' }

      at log (src/zq.ts:138:11)
Test Suites: 5 passed, 5 total
Tests:       24 passed, 24 total
Snapshots:   3 passed, 3 total
Time:        2.616 s
Ran all test suites.

This got me to wondering if there was something specifically cursed about these ubuntu-22.04 hosted runners, though this whole time the GitHub Status has not reported any kind of outage or degraded status.

To debug further, I created a debug branch that invokes just the "setup-app" steps on an ubuntu-22.04 runner. Of the 10 runs I started, 7 finished OK and the remaining 3 have been hung for 1+ hour and are surely on track to time out. I then made another debug branch that ran the same steps on an ubuntu-24.04 runner, and with that I saw 10 out of 10 successful runs. I then did another batch of runs on ubuntu-22.04, where 6 succeeded and 4 hung, and finally one more batch on ubuntu-24.04, which once again saw 10 out of 10 successful runs.
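The batches above were separate Actions runs, but the same kind of tally could be collected inside a single interactive session on a runner. A minimal sketch, assuming GNU coreutils `timeout` is available and treating any run that exceeds the limit as a hang; `tally_runs` is a hypothetical name, and the command is expected to clean up after itself between iterations.

```shell
#!/bin/bash
# Hypothetical helper: run a command N times under a per-run time limit
# (seconds) and print how many runs succeeded, hung, or failed.
tally_runs() {
  local runs="$1" limit="$2"; shift 2
  local ok=0 hung=0 failed=0 i status
  for ((i = 1; i <= runs; i++)); do
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    case "$status" in
      0)   ok=$((ok + 1)) ;;          # finished normally
      124) hung=$((hung + 1)) ;;      # timeout's exit code for "limit reached"
      *)   failed=$((failed + 1)) ;;  # any other non-zero exit
    esac
  done
  echo "runs=$runs ok=$ok hung=$hung failed=$failed"
}
```

Something like `tally_runs 10 900 yarn --inline-builds` would then print a one-line summary in the form `runs=10 ok=... hung=... failed=...`.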

Conclusion

As things stand, the failure rate on ubuntu-22.04 (7 hangs in 20 debug runs) compared with the consistent success on ubuntu-24.04 (20 of 20 runs), combined with the failures we saw on ubuntu-22.04 in another repo, seems like strong evidence that the problem is specific to that type of runner. What's still unclear is whether the problem might just disappear on its own, much like what was reported earlier in this issue, where it seemed to disappear suddenly on the macos-12 runner. Since we're not doing a lot of active development on Zui at the moment, it seems like we could let this ride for a couple days and see if it clears up. If it sticks around and/or gets too annoying, we could just switch these jobs over to ubuntu-24.04 runners and hope that it remains stable on those and doesn't resurface down the road on them as well.
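As a rough check on how strong that evidence is, the debug batches can be treated as a 2×2 table: 7 hangs and 13 successes on ubuntu-22.04, 0 hangs and 20 successes on ubuntu-24.04. Under the assumption (not verified here) that the runs are independent and that both runner types have the same underlying hang rate, Fisher's exact test gives the probability that all 7 hangs would land in the ubuntu-22.04 group purely by chance:

```latex
\[
P(\text{all 7 hangs on ubuntu-22.04})
  = \frac{\binom{20}{7}\binom{20}{0}}{\binom{40}{7}}
  = \frac{77520}{18643560}
  \approx 0.0042
\]
```

Doubling this for the equally extreme opposite outcome gives a two-sided value of roughly 0.008, which is consistent with the difference between the runner types being real. The independence assumption is the weak point, since the batches were started close together in time.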
