Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete files that don't match inventory items #117

Merged
merged 25 commits into from
Jan 14, 2025
Merged

Delete files that don't match inventory items #117

merged 25 commits into from
Jan 14, 2025

Conversation

jwodder
Copy link
Member

@jwodder jwodder commented Jan 7, 2025

Closes #33.

To do:

  • Process CSV files in order of first keys
  • Before downloading a key, add it to a "tree tracker" that detects when we've moved past a directory
  • When past a directory, spawn a task to delete extraneous files
  • Test in action
  • Add doc comments to new code items
  • Update CHANGELOG

@jwodder jwodder self-assigned this Jan 7, 2025
Copy link

codecov bot commented Jan 7, 2025

Codecov Report

Attention: Patch coverage is 78.47534% with 288 lines in your changes missing coverage. Please review.

Project coverage is 59.95%. Comparing base (ee14ade) to head (132cc1d).
Report is 26 commits behind head on main.

Files with missing lines Patch % Lines
src/syncer/mod.rs 0.00% 132 Missing ⚠️
src/syncer/metadata.rs 0.00% 88 Missing ⚠️
src/syncer/treetracker/mod.rs 95.79% 34 Missing ⚠️
src/inventory/item.rs 0.00% 13 Missing ⚠️
src/nursery.rs 93.38% 9 Missing ⚠️
src/s3/mod.rs 0.00% 5 Missing ⚠️
src/util.rs 0.00% 3 Missing ⚠️
src/main.rs 0.00% 2 Missing ⚠️
src/syncer/treetracker/inner.rs 98.61% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #117       +/-   ##
===========================================
+ Coverage   31.37%   59.95%   +28.57%     
===========================================
  Files          15       19        +4     
  Lines        1224     2412     +1188     
===========================================
+ Hits          384     1446     +1062     
- Misses        840      966      +126     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jwodder jwodder mentioned this pull request Jan 8, 2025
@jwodder jwodder force-pushed the gh-33 branch 7 times, most recently from fd2cf70 to 41ac719 Compare January 10, 2025 17:27
@jwodder jwodder marked this pull request as ready for review January 11, 2025 00:29
@jwodder
Copy link
Member Author

jwodder commented Jan 11, 2025

@yarikoptic I tested this by first taking a backup of dandisets/ and then running s3invsync again on the same backup directory but targeting the manifest from a week prior. The run exited successfully, and files were indeed deleted, though I didn't check the actual inventories for correctness. You can see a report on the runs here; in particular, note that though the first run took 2 hours, the second only took 42 seconds.

@jwodder
Copy link
Member Author

jwodder commented Jan 13, 2025

@yarikoptic Should this be merged now, or are there any other scenarios you want me to test first?

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge, even if just to facilitate further testing by @aaronkanzer and @kabilar on their use case. I left a few questions to give me a few bits of understanding .

Also please test 1 more "scenario". Following up on

@yarikoptic I tested this by first taking a backup of dandisets/ and then running s3invsync again on the same backup directory but targeting the manifest from a week prior. The run exited successfully, and files were indeed deleted, though I didn't check the actual inventories for correctness

That is cute sneaky. Overall, if we do

  • run1: DATE1
  • run2: DATE2 (back in time)
  • run3: DATE1
  • run4: DATE2 (again)

unless some "trailing delete" policy deletes anything, I expect results from run3 to bringing back state identical to run1, and then run4 identical to run2 "regardless" (since nothing new should emerge but we should delete the same).

Since trailing delete on manifests is not yet implemented in production, run1 and run3 must also be identical (ref).

You could use cp --reflink=always (on typhon's btrfs) to copy after each run without incurring heavy data transfer/time. And then just straight diff to see if anything different.

use tokio::{sync::oneshot, time::timeout};

#[test]
fn nursery_is_send() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this test work if both internal functions just announced not used or are they used by those Nursery's in the tokio async tests somehow automagically?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test asserts that Nursery implements the Send trait (meaning it can be sent between threads) as long as the return type of the tasks does so as well. If it doesn't, the test code won't even compile, because the argument to require_send() won't meet the type contraint.

} {
if let Some(entry) = clnt.peek_inventory_csv(&fspec).await? {
if sender.send((fspec, entry)).await.is_err() {
// Assume we're shutting down
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codecov reports that this condition is never hit. Does it mean that within tests we just never have "big enough" file that we do partial read or that we always downloading full files during such "peeking"?

Or related -- did you track sample executions to see that we are indeed not downloading full manifests here, thus downloading them fully twice altogether (once in peeking for sorting and once then for full use)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are currently no tests that do any I/O, so nothing in src/syncer/mod.rs is tested. I don't know why codecov is highlighting just that line.

Also, for the record, this check has nothing to do with whether the read was partial or not; line 245 is reached if the channel for sending the peek results was closed, but I don't believe that can actually happen for the code here, so maybe codecov is smarter than it looks....

Or related -- did you track sample executions to see that we are indeed not downloading full manifests here, thus downloading them fully twice altogether (once in peeking for sorting and once then for full use)?

I did not track that. I'm not sure how you'd do that, given that the peeking code doesn't even write anything to disk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have assumed it logged on its operation -- then could have been checked in the logs on either read in full or just in part

@jwodder
Copy link
Member Author

jwodder commented Jan 13, 2025

@yarikoptic OK, I did the following: I still have the backup taken for the first scenario downloaded, and I updated it to the latest (2025-01-12) manifest earlier today to make sure my last-minute tweaks hadn't broken anything. So, I used cp --reflink=always to copy the download directory, then I synced the download to the 2025-01-05 manifest, then synced it back to 2025-01-12; both sync runs took about 42 seconds. Now I'm running diff -Naur on the download directory and the copy, but since the directories are over a terabyte in size, I expect that to take a while. Do you know a faster way to diff here?

@jwodder
Copy link
Member Author

jwodder commented Jan 13, 2025

@yarikoptic Regarding diffing the download dir and its backup, the paths of all the files & directories are the same between the two, at least. Do you need their contents to be compared as well?

@yarikoptic
Copy link
Member

I didn't see you not, works ensure that downloads are ok, doesn't diff do that anyways?

@jwodder
Copy link
Member Author

jwodder commented Jan 14, 2025

@yarikoptic The problem with diff is that it takes multiple hours to compare two TB-sized directories. Fortunately, the diff process finished some time last night, and it found no discrepancies.

@yarikoptic
Copy link
Member

That's great! So let's consider this one good and proceed! I will do my mighty contribution then by clicking the green button ;)

@yarikoptic yarikoptic merged commit 9211fd8 into main Jan 14, 2025
13 checks passed
@yarikoptic yarikoptic deleted the gh-33 branch January 14, 2025 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delete files from backup that don't match any inventory items
2 participants