Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.18](backport #3090) cnvm: Delete snapshots after scanning them #3124

Closed
wants to merge 1 commit into from

Conversation

mergify[bot]
Copy link

@mergify mergify bot commented Mar 20, 2025

Summary of your changes

Fixes various underlying issues with CNVM snapshot deletion. The logic here is to do a best-effort attempt to clean up snapshots created during the run both continuously (after we are done scanning the snapshot) and on shutdown. Cleaning old snapshots that we don't use anymore is part of #3105. Issues fixed:

  • internal/flavors/vulnerability.go: Wait for Run() to finish, this ensures that final snapshot clean-up is done after execution finishes
  • internal/resources/providers/awslib/ec2/provider.go: Give extra retries to snapshot deletion, mainly avoiding "too many requests" errors
  • internal/vulnerability/snapshot.go: New snapshot manager to handle creation, deletion and clean-up of snapshots. The deletion extends the context.Context with an extra 30s timeout to give a grace period to clean-up snapshots during shutdown/restart.
  • internal/vulnerability/replicator.go: Add dependency to the snapshot manager instead of provider to track created snapshots
  • internal/vulnerability/scanner.go: Delete snapshot after scanning
  • internal/vulnerability/worker.go: defer a call snapshot manager's cleanup

Screenshot/Data

  1. The way I verified we avoid leftover snapshots is to change the name of the snapshots:
diff --git a/internal/resources/providers/awslib/ec2/provider.go b/internal/resources/providers/awslib/ec2/provider.go
index 14abc5bf..3faeef7d 100644
--- a/internal/resources/providers/awslib/ec2/provider.go
+++ b/internal/resources/providers/awslib/ec2/provider.go
@@ -78,7 +78,7 @@ func (p *Provider) CreateSnapshots(ctx context.Context, ins *Ec2Instance) ([]EBS
  		  {
  			  ResourceType: "snapshot",
  			  Tags: []types.Tag{
-					{Key: aws.String("Name"), Value: aws.String(fmt.Sprintf("elastic-vulnerability-%s", *ins.InstanceId))},
+					{Key: aws.String("Name"), Value: aws.String(fmt.Sprintf("orestis-elastic-vulnerability-%s", *ins.InstanceId))},
  				  {Key: aws.String("Workload"), Value: aws.String("Cloudbeat Vulnerability Snapshot")},
  			  },
  		  },
  1. Deploy cloudbeat based on this PR + the diff above.
  2. Do various tests like:
    • kill cloudbeat while its running
    • restart the agent
    • run continuously over the weekend (note: CNVM takes quite a while to finish a single cycle)
  3. Verify no leftovers are with command:
aws ec2 describe-snapshots --owner-ids self --query 'Snapshots[?StartTime>=`2025-03-10` && Tags[?Key==`Name` && starts_with(Value, `orestis-elastic-vulnerability-`)]] | [].SnapshotId' --output table

Related Issues

See Development to the right

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added the necessary README/documentation (if appropriate)

This is an automatic backport of pull request #3090 done by [Mergify](https://mergify.com).

### Summary of your changes
Fixes various underlying issues with CNVM snapshot deletion. The logic here is to do a best-effort attempt to clean up snapshots created during the run both continuously (after we are done scanning the snapshot) and on shutdown. Cleaning old snapshots that we don't use anymore is part of #3105. Issues fixed:
- `internal/flavors/vulnerability.go`: Wait for `Run()` to finish, this ensures that final snapshot clean-up is done after execution finishes
- `internal/resources/providers/awslib/ec2/provider.go`: Give extra retries to snapshot deletion, mainly avoiding "too many requests" errors
- `internal/vulnerability/snapshot.go`: New snapshot manager to handle creation, deletion and clean-up of snapshots. The deletion extends the `context.Context` with an extra 30s timeout to give a grace period to clean-up snapshots during shutdown/restart.
- `internal/vulnerability/replicator.go`: Add dependency to the snapshot manager instead of `provider` to track created snapshots
- `internal/vulnerability/scanner.go`: Delete snapshot after scanning
- `internal/vulnerability/worker.go`: `defer` a call snapshot manager's cleanup

### Screenshot/Data
1. The way I verified we avoid leftover snapshots is to change the name of the snapshots:
  ```diff
  diff --git a/internal/resources/providers/awslib/ec2/provider.go b/internal/resources/providers/awslib/ec2/provider.go
  index 14abc5bf..3faeef7d 100644
  --- a/internal/resources/providers/awslib/ec2/provider.go
  +++ b/internal/resources/providers/awslib/ec2/provider.go
  @@ -78,7 +78,7 @@ func (p *Provider) CreateSnapshots(ctx context.Context, ins *Ec2Instance) ([]EBS
 			  {
 				  ResourceType: "snapshot",
 				  Tags: []types.Tag{
  -					{Key: aws.String("Name"), Value: aws.String(fmt.Sprintf("elastic-vulnerability-%s", *ins.InstanceId))},
  +					{Key: aws.String("Name"), Value: aws.String(fmt.Sprintf("orestis-elastic-vulnerability-%s", *ins.InstanceId))},
 					  {Key: aws.String("Workload"), Value: aws.String("Cloudbeat Vulnerability Snapshot")},
 				  },
 			  },
  ```
2. Deploy cloudbeat based on this PR + the diff above.
3. Do various tests like:
    - kill cloudbeat while its running
    - restart the agent
    - run continuously over the weekend (note: CNVM takes quite a while to finish a single cycle)
4. Verify no leftovers are with command:
  ```shell
  aws ec2 describe-snapshots --owner-ids self --query 'Snapshots[?StartTime>=`2025-03-10` && Tags[?Key==`Name` && starts_with(Value, `orestis-elastic-vulnerability-`)]] | [].SnapshotId' --output table
  ```

### Related Issues
See `Development` to the right

### Checklist
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have added the necessary README/documentation (if appropriate)

(cherry picked from commit 68ff40d)

# Conflicts:
#	internal/resources/providers/awslib/config.go
#	internal/vulnerability/cleaner_test.go
@mergify mergify bot requested a review from a team as a code owner March 20, 2025 15:55
Copy link
Author

mergify bot commented Mar 20, 2025

Cherry-pick of 68ff40d has failed:

On branch mergify/bp/8.18/pr-3090
Your branch is up to date with 'origin/8.18'.

You are currently cherry-picking commit 68ff40dd.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   internal/flavors/vulnerability.go
	modified:   internal/resources/providers/awslib/ec2/provider.go
	deleted:    internal/vulnerability/cleaner.go
	modified:   internal/vulnerability/events_creator.go
	deleted:    internal/vulnerability/mock_replicator_provider.go
	deleted:    internal/vulnerability/mock_snapshot_cleaner.go
	new file:   internal/vulnerability/mock_snapshot_creator_deleter.go
	renamed:    internal/vulnerability/mock_snapshot_provider.go -> internal/vulnerability/mock_snapshot_describer.go
	deleted:    internal/vulnerability/mock_worker_provider.go
	modified:   internal/vulnerability/replicator.go
	modified:   internal/vulnerability/replicator_test.go
	modified:   internal/vulnerability/runner.go
	modified:   internal/vulnerability/runner_test.go
	modified:   internal/vulnerability/scanner.go
	modified:   internal/vulnerability/scanner_test.go
	new file:   internal/vulnerability/snapshot.go
	new file:   internal/vulnerability/snapshot_test.go
	modified:   internal/vulnerability/verifier.go
	modified:   internal/vulnerability/verifier_test.go
	modified:   internal/vulnerability/worker.go
	modified:   internal/vulnerability/worker_test.go

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	both modified:   internal/resources/providers/awslib/config.go
	deleted by them: internal/vulnerability/cleaner_test.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

Copy link
Author

mergify bot commented Mar 24, 2025

This pull request has not been merged yet. Could you please review and merge it @orestisfl? 🙏

@orestisfl
Copy link
Contributor

We've decided to only backport this to 8.x

@orestisfl orestisfl closed this Mar 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant