How calculate-docker-image works in PyTorch CI
The GH composite action calculate-docker-image is a fairly complex GHA that probably tries to do more than it should. It is used throughout PyTorch CI whenever a Docker image is involved, for example in PyTorch _linux_build/_linux_test, Docker build, or Nova Linux jobs.
At a high level, the action does two main things:
- Given a Docker image name, it checks whether the image is available on our AWS ECR `308535385114.dkr.ecr.us-east-1.amazonaws.com`.
- When the requested image is not available, the GHA tries to build the Docker image so that the calling workflow has the image it needs to continue.
The actual implementation of the GHA, however, is more complex with a set of parameters interacting with each other in a not-so-straightforward way. So, when it works, it's great. But when there are issues, it's not an easy thing to debug.
Let's start with the list of parameters of the GHA. There are 7 of them at the moment. The trio `docker-image-name`, `docker-build-dir`, and `docker-registry` is used to check for the image on ECR, while the other three amigos `always-rebuild`, `push`, and `force-push` control the build process. The last parameter, `working-directory`, plays a smaller role and is only used by the Nova workflow to point to the checked-out repository.
- `docker-image-name`. This is the name of the Docker image that the GHA is looking for.
- `docker-build-dir`. This is the directory where the Docker build script lives. The convention is to use the `.ci/docker` directory, as PyTorch and ExecuTorch do. The GHA looks for a `build.sh` script under this directory to trigger the Docker build process.
- `docker-registry`. This defaults to `308535385114.dkr.ecr.us-east-1.amazonaws.com` until we maybe decide to move the ECR somewhere else, like the LF AWS account.
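To make the roles of these three parameters concrete, here is a minimal Python sketch of how they could combine into a full image reference. It follows the naming scheme `${DOCKER_REGISTRY}/${REPO_NAME}/${DOCKER_IMAGE_NAME}:${DOCKER_TAG}` and the `git rev-parse HEAD:"${DOCKER_BUILD_DIR}"` tag computation that the flowchart below mentions. The function names `compute_tag` and `resolve_image` are hypothetical, and the rule used to detect a full name (the name already starts with the registry) is an assumption, not the action's actual check.

```python
import subprocess

DEFAULT_REGISTRY = "308535385114.dkr.ecr.us-east-1.amazonaws.com"


def compute_tag(build_dir: str = ".ci/docker") -> str:
    """Return the git tree hash of build_dir at HEAD.

    The hash only changes when files tracked by git under build_dir change,
    which is why untracked files copied in at build time lead to stale tags.
    Must be run from inside a git checkout.
    """
    result = subprocess.run(
        ["git", "rev-parse", f"HEAD:{build_dir}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def resolve_image(
    image_name: str,
    repo_name: str,
    tag: str,
    registry: str = DEFAULT_REGISTRY,
) -> str:
    """Expand a short image name into a full reference on the registry."""
    # Assumption for this sketch: a name that already starts with the
    # registry is treated as a full name and returned unchanged.
    if image_name.startswith(f"{registry}/"):
        return image_name
    return f"{registry}/{repo_name}/{image_name}:{tag}"
```

For example, `resolve_image("pytorch-linux-focal-linter", "pytorch", compute_tag())` run from a PyTorch checkout would produce a reference of the same shape as the one shown in the flowchart, with the tag taken from the current state of `.ci/docker`.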
These parameters are used to check if the requested Docker image exists. Simple, right? Not so fast. Let's take a look at how the check is performed.
---
title: How calculate-docker-image checks for a Docker image
---
flowchart TD
B@{ shape: circle, label: "Start" }
--> check-short-name@{shape: diamond, label: "Is short name?"}
check-short-name
-->|y| short-name@{ shape: lean-r, label: "#36;#123;DOCKER_IMAGE_NAME#125;, i.e. pytorch-linux-focal-linter" }
--> compute-short-form-tag[**Compute the docker image tag** using git rev-parse HEAD:#34;#36;#123;DOCKER_BUILD_DIR#125;#34;. The tag depends on the content of DOCKER_BUILD_DIR. When files in that directory are updated, a new tag is generated signifying that a new Docker image is needed#185;#178;]
--> full-name@{ shape: lean-r, label: "#36;#123;DOCKER_REGISTRY#125;/#36;#123;REPO_NAME#125;/#36;#123;DOCKER_IMAGE_NAME#125;:#36;#123;DOCKER_TAG#125;, i.e. 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-linter:88e25063afb45411a0d16539a1335bd864b6f2be" }
check-short-name
-->|n| full-name
full-name
--> login-ecr[Login to ECR]
--> check-image-exist@{shape: diamond, label: "Does it exist?"}
check-image-exist
-->|y| E@{ shape: circle, label: "Stop" }
check-image-exist
-->|n| should-build-image@{shape: diamond, label: "PUSH is set?"}
should-build-image
-->|y| build-image[[Build the image]]
should-build-image
-->|n| wait((Wait up to 90 minutes#179;#8308;))
wait
--> should-build-image-2@{shape: diamond, label: "Is it there yet?"}
should-build-image-2
-->|y| E
should-build-image-2
-->|n| build-image
build-image
--> E
Footnotes:
1. Using a bash script to copy files into the `.ci/docker` directory is an anti-pattern because the copied files are not version-tracked in git, which leads to stale tags and other nasty business. It's a very easy mistake to make because Docker build only accepts files in the build context (https://docs.docker.com/build/concepts/context), a.k.a. the `.ci/docker` folder. So people wrongly assume that they can just copy any files they need there before the build starts by tweaking the `.ci/docker/build.sh` script.
2. The GHA requires a full checkout because it performs a check against the merge base (for a PR) or the parent commit (for a trunk commit). Maybe there is a way to get rid of this check, but it's not that critical because PyTorch performs a full checkout in CI anyway.
3. The 90-minute wait was a recent change from https://github.com/pytorch/pytorch/issues/141885. PyTorch Docker images have grown to the point that they can no longer be built on a `linux.2xlarge` runner; the build fails with either an OOM error or a timeout. So https://github.com/pytorch/test-infra/pull/6013 made it so that when a new Docker image is needed, all the build jobs running on `linux.2xlarge` wait up to 90 minutes for the dedicated Docker build job running on a larger `linux.12xlarge` to get the image ready. Once it's there, PyTorch build jobs continue as usual.
4. Building the new Docker image in a dedicated Docker build job also addresses the rate limit issue with docker.io, which is covered in the next section.
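The check-and-wait flow in the chart above can be sketched in Python as follows. This is an illustration under stated assumptions, not the action's implementation: `ensure_image`, `image_exists`, and `build_image` are hypothetical names, the 60-second poll interval is invented for the example, and only the 90-minute ceiling comes from the text.

```python
import time
from typing import Callable


def ensure_image(
    image_exists: Callable[[], bool],
    build_image: Callable[[], None],
    push: bool,
    timeout_s: float = 90 * 60,   # 90-minute ceiling from the text
    poll_s: float = 60,           # hypothetical poll interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Mirror the flowchart: check, optionally wait, then build as a fallback."""
    if image_exists():
        return "found"

    # Jobs that do not push (e.g. regular build jobs) wait for the dedicated
    # Docker build job to publish the image instead of building right away.
    if not push:
        deadline = clock() + timeout_s
        while clock() < deadline:
            sleep(poll_s)
            if image_exists():
                return "found-after-wait"

    # Either push is set, or the wait timed out: build the image here.
    build_image()
    return "built"
```

The `sleep` and `clock` parameters exist only so the waiting branch can be exercised without real delays; a caller would normally leave them at their defaults.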
- `always-rebuild`. As its name implies, if this is set, the above check will be skipped and the image will always be built.
- `push`. If this is set, the GHA will upload the image to ECR ONLY WHEN it doesn't exist.
- `force-push`. If this is set and if `push` is also set, the action will always upload the image to ECR.
---
title: How calculate-docker-image builds a new Docker image
---
flowchart TD
B@{ shape: circle, label: "Start" }
--> check-image[[Check for the image on ECR ]]
--> check-image-exist@{shape: diamond, label: "Does it exist?"}
check-image-exist
-->|y| always-rebuild@{shape: diamond, label: "Always rebuild?"}
always-rebuild
-->|n| E@{ shape: circle, label: "Stop" }
always-rebuild
-->|y| login-docker[Login to docker.io#185;]
--> build-image[Build docker image by calling build.sh in .ci/docker]
check-image-exist
-->|n| login-docker
build-image
--> should-push-image@{shape: diamond, label: "PUSH is set?"}
should-push-image
-->|n| E
should-push-image
-->|y| check-image-exist-2@{shape: diamond, label: "Check if the image is there on ECR again?"}
-->|n| push-image[Push the image to ECR]
-->E
check-image-exist-2
-->|y| should-force-push-image@{shape: diamond, label: "FORCE_PUSH is set?"}
-->|n| E
should-force-push-image
-->|y| push-image
Straightforward, eh?
Footnotes:
1. Logging in to docker.io is needed because the base Docker image is usually there. If the base image comes from elsewhere, for example quay.io, we might need to log in there too, but that's not implemented at the moment. Logging in to docker.io is done both at the runner level, using the post-installation script, and at the workflow level, as a step in the GHA. The credential is stored in AWS Secrets Manager and is accessible only by the runner. It's a read-only credential.
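The build-and-push flowchart above can be condensed into a small decision function. The sketch below is a hedged illustration of how `always-rebuild`, `push`, and `force-push` interact according to that chart; `plan_build` and the step labels it returns are hypothetical and do not correspond to names in the action itself.

```python
from typing import List, Optional


def plan_build(
    exists: bool,
    always_rebuild: bool,
    push: bool,
    force_push: bool,
    exists_after_build: Optional[bool] = None,
) -> List[str]:
    """Return the ordered steps the flowchart would take for these inputs."""
    # Image already on ECR and no forced rebuild: nothing to do.
    if exists and not always_rebuild:
        return []

    steps = ["login-docker.io", "build"]

    # Without push, the freshly built image stays local.
    if not push:
        return steps

    # The chart re-checks ECR after the build, since another job may have
    # pushed the image in the meantime. Default to the first check's result.
    if exists_after_build is None:
        exists_after_build = exists

    # Push when the image is still missing, or overwrite when force-push is set.
    if not exists_after_build or force_push:
        steps.append("push")
    return steps
```

For instance, `plan_build(exists=True, always_rebuild=True, push=True, force_push=False)` rebuilds the image but does not push it, which matches the chart's path for an image that is already on ECR when `force-push` is unset.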
- As part of the _linux_build and _linux_test workflows. The requested Docker image is guaranteed to be there before these workflows pull it locally. The `push` parameter is not set here, so they won't push anything to ECR.
- As part of the Docker build workflows. These are the dedicated workflows that build all Docker images used by PyTorch CI. They set the `always-rebuild` and `push` parameters, so we know for sure that the Docker image will be built and pushed to ECR if it's not there yet. The `always-rebuild` parameter is there to ensure that the image is rebuilt periodically in trunk so that any failures surface early.
- As part of the CD workflow to build manywheel images. The same principle applies.
1. Push a non-fork PR to test-infra with the change, e.g. https://github.com/pytorch/test-infra/pull/6013
2. Create a test PR on PyTorch using the branch from step 1 to trigger the workflows there, e.g. https://github.com/pytorch/pytorch/pull/142177/commits/cc45015329fd579a9bbc4e75a5676fb66f17d604