Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Garbage collect manifest files #192

Closed
mvandenburgh opened this issue Oct 24, 2024 · 9 comments · Fixed by #199 or #205
Closed

Garbage collect manifest files #192

mvandenburgh opened this issue Oct 24, 2024 · 9 comments · Fixed by #199 or #205
Assignees

Comments

@mvandenburgh
Copy link
Member

Following up on #190 (comment).

Currently, the trailing delete mechanism for blobs is intended to work like this:

  1. User "deletes" an asset (via replacing an existing one, or an explicit delete request), which just causes it to be delinked from the dandiset
  2. Garbage collection picks up this "orphaned" asset and sends a delete request to s3, resulting in a delete marker being placed on the object.
  3. In 30 days, the trailing delete lifecycle rule will clean up the deleted object permanently.

In the case of manifest files, we can exclusively use a lifecycle policy to accomplish garbage collection, and avoid adding anything to the dandi-archive application logic. See https://docs.aws.amazon.com/AmazonS3/latest/API/API_NoncurrentVersionExpiration.html.

@mvandenburgh
Copy link
Member Author

mvandenburgh commented Oct 24, 2024

@yarikoptic what should the garbage collection policy be for manifest files? We can make it so old versions (i.e. versions that have a delete marker on them) are deleted after they reach a certain age, or we can make it so that only the N most recent versions are preserved while any older ones are deleted (see the two policy options here https://docs.aws.amazon.com/AmazonS3/latest/API/API_NoncurrentVersionExpiration.html).

@yarikoptic
Copy link
Member

I do not see any need for some policy difference from blobs lifecycle, so why not the same

In 30 days, the trailing delete lifecycle rule will clean up the deleted object permanently.

?

@yarikoptic
Copy link
Member

Sorry, I think I misunderstood as mixed up cleanup of objects with DeleteMarker from "prior versions of an object" but not necessarily with a DeleteMarker. Frankly, I am not yet fully grasping the life cycle documentations there on either there is a way to Expire only older (not the current "visible") versions so they become Noncurrent and then subject to those two possible choices of NewerNoncurrentVersions and NoncurrentDays. But it feels that may be NewerNoncurrentVersions of 1 would be sufficient/accomplish what we need? (pickup/remove older versions of those keys)

@mvandenburgh mvandenburgh self-assigned this Oct 29, 2024
@mvandenburgh
Copy link
Member Author

Sorry, I think I misunderstood as mixed up cleanup of objects with DeleteMarker from "prior versions of an object" but not necessarily with a DeleteMarker. Frankly, I am not yet fully grasping the life cycle documentations there on either there is a way to Expire only older (not the current "visible") versions so they become Noncurrent and then subject to those two possible choices of NewerNoncurrentVersions and NoncurrentDays. But it feels that may be NewerNoncurrentVersions of 1 would be sufficient/accomplish what we need? (pickup/remove older versions of those keys)

@waxlamp and I looked into this yesterday, and there are actually two things going on here. In normal operations, manifest files are never deleted - they are simply replaced by newer versions when metadata changes. So the expectation would be that each manifest file has just a single object, with potentially multiple older "versions". In other words, something like this (dandiset 217):

Screenshot from 2024-10-29 10-36-42

But, like you pointed out in #190 (comment), there are some manifest files that also have delete markers in their history, like dandiset 4

Screenshot from 2024-10-29 10-38-54

This was surprising, since we don't currently do any deletion of manifest files in the dandi-archive application. However, we observed that the manifest files that contained delete markers seemed to be from older dandisets (created in 2021), and the delete markers were all from around September 2021. We looked into the git history, and there was a PR around that time that introduced a script to delete things from the bucket (dandi/dandi-archive#528). Furthermore, it looks like that script has a bug that causes delete markers to get left behind. It seems likely that this is the cause of these strange delete markers. So, cleaning those up can just be a manual step that we take asynchronously. I'll file another issue to discuss that in.

For this particular issue, we are just concerned with garbage collecting older versions of manifest files.

But it feels that may be NewerNoncurrentVersions of 1 would be sufficient/accomplish what we need? (pickup/remove older versions of those keys)

This would result in the number of "noncurrent versions" being capped at one. In other words, at any point in time, each manifest file would have two versions - the "current" one that is visible in the bucket, and the latest "noncurrent" one. When a new manifest file is generated, the "current" one would become the "noncurrent" one, and the "noncurrent" one would be deleted by the lifecycle rule. If we wanted to keep things even cleaner, we could also enable the NoncurrentDays policy to delete noncurrent versions after one day.

@yarikoptic at a high level, we have three options:

  1. Cap the number of noncurrent versions to 1
  2. Auto delete noncurrent versions that are older than 1 day
  3. Do both 1) and 2)

@waxlamp
Copy link
Member

waxlamp commented Oct 29, 2024

Do both 1) and 2)

This would be my recommendation.

@yarikoptic
Copy link
Member

Thank you for the analysis @waxlamp and @mvandenburgh .

Greedy me would prefer 1) just in case ;) size effect is minimal and having at least 1 version back might come handy. 1 day is too arbitrary -- we might want to troubleshoot past that duration

@waxlamp
Copy link
Member

waxlamp commented Nov 12, 2024

Greedy me would prefer 1) just in case ;) size effect is minimal and having at least 1 version back might come handy. 1 day is too arbitrary -- we might want to troubleshoot past that duration

Ok, then it seems like we can just keep N non-current versions, and not worry about deleting things that are older than some arbitrary age.

The remaining question: what value of N do you like, @yarikoptic?

@yarikoptic
Copy link
Member

I am ok with any non-0. Let's say 1? ;-) hence we keep "current" and immediate "undo"

mvandenburgh added a commit that referenced this issue Nov 21, 2024
See #192 for
rationale behind one day rule.
mvandenburgh added a commit that referenced this issue Dec 2, 2024
See #192 for
rationale behind one day rule.
@mvandenburgh
Copy link
Member Author

Reopening, #205 is required for this to be resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants