Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ensure containerd-related directories removed on failed bootstrap/join-cluster. #863

Merged
merged 2 commits into from
Dec 16, 2024

Conversation

aznashwan
Copy link
Contributor

@aznashwan aznashwan commented Dec 3, 2024

fix: ensure containerd-related directories removed on failed bootstrap/join-cluster

k8sd automatically sets up some directories with the appropriate
ownership/permissions to be used by containerd in the early stages
of the bootstrap and join-cluster commands.

In the classic (non-strict) version of the k8s-snap, these
containerd directories are system-wide (e.g. /etc/containerd,
/run/containerd, etc).

Should any of the other setup steps fail after the containerd
directories were set up, the directories would still remain on
disk and thus lead to a 'partial installation' of on the host system.

This patch ensures that k8s will automatically remove any
containerd-related directories which were created in the event of
the bootstrap / join-cluster commands failing.

fix: ensure containerd Base Dir lockfile is never accidentally deleted.

The containerd Base Dir is the special path all other containerd-related
paths on the snap are derived from.

Under classic confinement and default settings, this path defaults
to the host's root (/), and thus extreme care must be taken to
not accidentally include it in k8sd's cleanup routine or the
k8s-snap's remove hook.

@aznashwan aznashwan requested a review from a team as a code owner December 3, 2024 19:00
@aznashwan aznashwan force-pushed the containerd-bootstrap-cleanup branch 6 times, most recently from 773eb93 to fa16a12 Compare December 3, 2024 20:58
Copy link
Member

@berkayoz berkayoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, we should probably wait and adjust for #817

src/k8s/pkg/k8sd/setup/containerd.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/setup/containerd.go Outdated Show resolved Hide resolved
tests/integration/tests/test_cleanup.py Show resolved Hide resolved
@aznashwan
Copy link
Contributor Author

@berkayoz note that it just hit me that because of the new pytest.mark.tags(tags.NIGHTLY) test tagging system the integration test doesn't get executed as part of the PR, sorry 🤦

While I believe it's better to leave that test for NIGHTLY, I've manually run the test on a 22.04 VM and I have.

I needed to add this small fix (I had forgotten to actually call socket.listen() in the contextmanager).

Can vouch that the added integration test seems to work when run manually now BUT it does sometimes exhibit the EOF error on my machine (which I am still positive is unrelated and hope Claudiu will push a fix for soon).

Copy link
Contributor

@bschimke95 bschimke95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @aznashwan - a couple of comments.

tests/integration/tests/test_cleanup.py Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/setup/containerd.go Show resolved Hide resolved
src/k8s/pkg/k8sd/setup/containerd.go Outdated Show resolved Hide resolved
src/k8s/pkg/k8sd/setup/containerd.go Outdated Show resolved Hide resolved
@aznashwan aznashwan force-pushed the containerd-bootstrap-cleanup branch 2 times, most recently from 3d69d81 to edb3fa8 Compare December 10, 2024 10:29
@claudiubelu
Copy link
Contributor

The mentioned PR has been merged, so this should be fine now to go in as well, after the conflicts have been resolved.

It seems I have been doing something similar as well. Though, just cleaning up the containerd-related directories is not going to be enough, we should probably also shutdown the Kubernetes services, otherwise, on the next bootstrap / join-cluster, we're going to see that the ports are in use.

@aznashwan aznashwan force-pushed the containerd-bootstrap-cleanup branch from edb3fa8 to 80cb072 Compare December 10, 2024 13:29
@aznashwan aznashwan changed the title fix: ensure containerd-related directories removed on failed bootstrap/join-cluster. [DO NOT MERGE] fix: ensure containerd-related directories removed on failed bootstrap/join-cluster. Dec 10, 2024
@aznashwan aznashwan changed the title [DO NOT MERGE] fix: ensure containerd-related directories removed on failed bootstrap/join-cluster. fix: ensure containerd-related directories removed on failed bootstrap/join-cluster. Dec 10, 2024
@aznashwan aznashwan force-pushed the containerd-bootstrap-cleanup branch from a2145d2 to 57f5b04 Compare December 10, 2024 14:44
@aznashwan
Copy link
Contributor Author

aznashwan commented Dec 10, 2024

@bschimke95 @berkayoz related to this latest defensive commit I had to add to the cleanup

@claudiubelu literally accidentally deleted his root dir using this cleanup code after I re-based my patch over his recently-merged configurable containerd paths one (thank you Claudiu, this was the definition of taking one for the team!)

I think it's really bad that we have our cleanup logic split between the snap's remove hook (on snap remove) and in k8sd (on failed bootstrap/join-cluster), and we'll always be exposing ourselves to us/users messing up the lockfiles.

I would strongly suggest we move ALL cleanup logic to a dedicated routine which we re-use internally and expose via a dedicated k8s cleanup command which should be best-effort; and directly called in the snap's remove hook.

@aznashwan aznashwan force-pushed the containerd-bootstrap-cleanup branch from 57f5b04 to 3a2ca37 Compare December 10, 2024 14:50
Copy link
Member

@berkayoz berkayoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, agreed on turning this into a cleanup command we can call on both the snap hook and in bootstrap/join cleanup. We can pick that up as a separate task and backport it to 1.32.

Copy link
Contributor

@bschimke95 bschimke95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor nit and Berkays comment

src/k8s/pkg/k8sd/app/hooks_remove.go Outdated Show resolved Hide resolved
@bschimke95
Copy link
Contributor

please also add some description to this PR. IMHO the title is not sufficient anymore.

@aznashwan aznashwan force-pushed the containerd-bootstrap-cleanup branch from 3a2ca37 to 17ed765 Compare December 12, 2024 11:51
@aznashwan
Copy link
Contributor Author

please also add some description to this PR. IMHO the title is not sufficient anymore.

Made the commit messages very explicit for posterity; and pasted their descriptions in the PR.

Nashwan Azhari added 2 commits December 12, 2024 17:05
…ap/join-cluster`

`k8sd` automatically sets up some directories with the appropriate
ownership/permissions to be used by containerd in the early stages
of the `bootstrap` and `join-cluster` commands.

In the classic (non-strict) version of the k8s-snap, these
containerd directories are system-wide (e.g. `/etc/containerd`,
`/run/containerd`, etc).

Should any of the other setup steps fail after the containerd
directories were set up, the directories would still remain on
disk and thus lead to a 'partial installation' of on the host system.

This patch ensures that `k8s` will automatically remove any
containerd-related directories which were created in the event of
the  `bootstrap` / `join-cluster` commands failing.

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
The containerd Base Dir is the special path all other containerd-related
paths on the snap are derived from.

Under classic confinement and default settings, this path defaults
to the host's root (`/`), and thus extreme care must be taken to
not accidentally include it in k8sd's cleanup routine or the
k8s-snap's remove hook.

Signed-off-by: Nashwan Azhari <nashwan.azhari@canonical.com>
@aznashwan aznashwan force-pushed the containerd-bootstrap-cleanup branch from 17ed765 to 235bc36 Compare December 12, 2024 15:05
@berkayoz berkayoz merged commit eb495a0 into canonical:main Dec 16, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants