Delete files that don't match inventory items #117

jwodder · 2025-01-07T21:23:06Z

Closes #33.

To do:

Process CSV files in order of first keys
Before downloading a key, add it to a "tree tracker" that detects when we've moved past a directory
When past a directory, spawn a task to delete extraneous files
Test in action
Add doc comments to new code items
Update CHANGELOG
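
As a rough illustration of the "tree tracker" idea in the list above, here is a minimal sketch. It is not the PR's actual `TreeTracker`: the `DirTracker` type and its `add`/`finish` methods are hypothetical names, and it assumes keys are `/`-separated and arrive in an order where all keys of a directory are contiguous.

```rust
/// Hypothetical, simplified stand-in for the PR's "tree tracker".
#[derive(Default)]
struct DirTracker {
    /// Directory components of the most recently added key.
    open: Vec<String>,
}

impl DirTracker {
    /// Record `key` (a `/`-separated path ending in a filename) and return
    /// the directories we have now moved past, deepest first.
    fn add(&mut self, key: &str) -> Vec<String> {
        let mut parts: Vec<&str> = key.split('/').collect();
        parts.pop(); // drop the filename, keep only directory components
        // Number of leading components shared with the currently open path
        let common = self
            .open
            .iter()
            .zip(&parts)
            .take_while(|(a, b)| a.as_str() == **b)
            .count();
        let mut closed = Vec::new();
        while self.open.len() > common {
            closed.push(self.open.join("/"));
            self.open.pop();
        }
        self.open
            .extend(parts[common..].iter().map(|s| s.to_string()));
        closed
    }

    /// Close every directory that is still open once the input ends.
    fn finish(&mut self) -> Vec<String> {
        let mut closed = Vec::new();
        while !self.open.is_empty() {
            closed.push(self.open.join("/"));
            self.open.pop();
        }
        closed
    }
}
```

Each directory returned by `add` or `finish` is one that will receive no further keys, so under these assumptions it would be a safe point to schedule cleanup of files in it that matched no inventory item.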

codecov · 2025-01-07T21:23:52Z

Codecov Report

Attention: Patch coverage is 78.47534% with 288 lines in your changes missing coverage. Please review.

Project coverage is 59.95%. Comparing base (ee14ade) to head (132cc1d).
Report is 26 commits behind head on main.

Files with missing lines	Patch %	Lines
src/syncer/mod.rs	0.00%	132 Missing ⚠️
src/syncer/metadata.rs	0.00%	88 Missing ⚠️
src/syncer/treetracker/mod.rs	95.79%	34 Missing ⚠️
src/inventory/item.rs	0.00%	13 Missing ⚠️
src/nursery.rs	93.38%	9 Missing ⚠️
src/s3/mod.rs	0.00%	5 Missing ⚠️
src/util.rs	0.00%	3 Missing ⚠️
src/main.rs	0.00%	2 Missing ⚠️
src/syncer/treetracker/inner.rs	98.61%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main     #117       +/-   ##
===========================================
+ Coverage   31.37%   59.95%   +28.57%     
===========================================
  Files          15       19        +4     
  Lines        1224     2412     +1188     
===========================================
+ Hits          384     1446     +1062     
- Misses        840      966      +126


jwodder · 2025-01-11T00:32:50Z

@yarikoptic I tested this by first taking a backup of dandisets/ and then running s3invsync again on the same backup directory but targeting the manifest from a week prior. The run exited successfully, and files were indeed deleted, though I didn't check the actual inventories for correctness. You can see a report on the runs here; in particular, note that though the first run took 2 hours, the second only took 42 seconds.

jwodder · 2025-01-13T14:03:30Z

@yarikoptic Should this be merged now, or are there any other scenarios you want me to test first?

yarikoptic

Let's merge, even if just to facilitate further testing by @aaronkanzer and @kabilar on their use case. I left a few questions to give me a few bits of understanding.

Also please test 1 more "scenario". Following up on

@yarikoptic I tested this by first taking a backup of dandisets/ and then running s3invsync again on the same backup directory but targeting the manifest from a week prior. The run exited successfully, and files were indeed deleted, though I didn't check the actual inventories for correctness

That is cute and sneaky. Overall, if we do

run1: DATE1
run2: DATE2 (back in time)
run3: DATE1
run4: DATE2 (again)

unless some "trailing delete" policy deletes anything, I expect run3 to bring back a state identical to run1, and run4 to then be identical to run2 "regardless" (since nothing new should emerge, but we should delete the same files).

Since trailing delete on manifests is not yet implemented in production, run1 and run3 must also be identical (ref).

You could use cp --reflink=always (on typhon's btrfs) to copy after each run without incurring heavy data transfer/time. And then just do a straight diff to see if anything is different.

yarikoptic · 2025-01-13T15:56:28Z

src/nursery.rs

+    use tokio::{sync::oneshot, time::timeout};
+
+    #[test]
+    fn nursery_is_send() {


How does this test work if both internal functions are just declared and never used? Or are they used by those Nurserys in the tokio async tests somehow automagically?

This test asserts that Nursery implements the Send trait (meaning it can be sent between threads) as long as the return type of the tasks does so as well. If it doesn't, the test code won't even compile, because the argument to require_send() won't meet the type constraint.
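
A minimal sketch of this compile-time check, assuming a hypothetical stand-in `Nursery<T>` type (the real one lives in src/nursery.rs and is not reproduced here). The helper `check_nursery_send` is also an illustrative name; only `require_send()` is mentioned in the thread.

```rust
use std::marker::PhantomData;

// Hypothetical stand-in for the real Nursery; only its Send-ness matters here.
// PhantomData<T> is Send exactly when T is Send.
struct Nursery<T> {
    _output: PhantomData<T>,
}

// Accepts only values whose type implements Send. The body is intentionally empty.
fn require_send<T: Send>(_t: T) {}

// Never needs to be called: the compiler type-checks the body regardless,
// so a non-Send Nursery<T> would be a build error rather than a test failure.
#[allow(dead_code)]
fn check_nursery_send<T: Send>(nursery: Nursery<T>) {
    require_send(nursery);
}
```

The assertion lives entirely in the type system, which is why the helper functions can be marked as unused without weakening the check.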

yarikoptic · 2025-01-13T16:00:45Z

src/syncer/mod.rs

+                    } {
+                        if let Some(entry) = clnt.peek_inventory_csv(&fspec).await? {
+                            if sender.send((fspec, entry)).await.is_err() {
+                                // Assume we're shutting down


codecov reports that this condition is never hit. Does that mean that within tests we just never have a "big enough" file for a partial read, or that we are always downloading full files during such "peeking"?

Or related -- did you track sample executions to see that we are indeed not downloading full manifests here, thus downloading them fully twice altogether (once in peeking for sorting and once then for full use)?

There are currently no tests that do any I/O, so nothing in src/syncer/mod.rs is tested. I don't know why codecov is highlighting just that line.

Also, for the record, this check has nothing to do with whether the read was partial or not; line 245 is reached if the channel for sending the peek results was closed, but I don't believe that can actually happen for the code here, so maybe codecov is smarter than it looks....
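
To illustrate when a send can fail at all, here is a minimal sketch that swaps in the standard library's blocking `std::sync::mpsc` channel for the async channel used in the PR. The `produce` function is a hypothetical name; the point is only that `send` returns an error once the receiving half has been dropped.

```rust
use std::sync::mpsc::SyncSender;

/// Sends items until the receiver is gone and returns how many were delivered.
/// Hypothetical helper, not part of s3invsync.
fn produce(items: Vec<String>, sender: SyncSender<String>) -> usize {
    let mut sent = 0;
    for item in items {
        if sender.send(item).is_err() {
            // The receiver was dropped, so nobody is listening any more;
            // treat this as a shutdown and stop producing.
            break;
        }
        sent += 1;
    }
    sent
}
```

If the receiver always outlives the senders, as is suspected for the code under review, the error branch is effectively unreachable.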

Or related -- did you track sample executions to see that we are indeed not downloading full manifests here, thus downloading them fully twice altogether (once in peeking for sorting and once then for full use)?

I did not track that. I'm not sure how you'd do that, given that the peeking code doesn't even write anything to disk.

I would have assumed it logged its operations; then the logs could have been checked to see whether each read was done in full or just in part.

jwodder · 2025-01-13T17:54:24Z

@yarikoptic OK, I did the following: I still have the backup taken for the first scenario downloaded, and I updated it to the latest (2025-01-12) manifest earlier today to make sure my last-minute tweaks hadn't broken anything. So, I used cp --reflink=always to copy the download directory, then I synced the download to the 2025-01-05 manifest, then synced it back to 2025-01-12; both sync runs took about 42 seconds. Now I'm running diff -Naur on the download directory and the copy, but since the directories are over a terabyte in size, I expect that to take a while. Do you know a faster way to diff here?

jwodder · 2025-01-13T20:40:30Z

@yarikoptic Regarding diffing the download dir and its backup, the paths of all the files & directories are the same between the two, at least. Do you need their contents to be compared as well?
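
For reference, a path-only comparison like the one described can be sketched with the standard library alone. This is an illustration, not the command that was actually run; `relative_paths` and `same_layout` are hypothetical names, and file contents are deliberately not compared.

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect the paths of all files and directories under `root`, relative to it.
fn relative_paths(root: &Path) -> io::Result<BTreeSet<PathBuf>> {
    let mut found = BTreeSet::new();
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            // file_type() does not follow symlinks, so linked dirs are not descended into
            if entry.file_type()?.is_dir() {
                stack.push(path.clone());
            }
            let rel = path
                .strip_prefix(root)
                .expect("entry is under root")
                .to_path_buf();
            found.insert(rel);
        }
    }
    Ok(found)
}

/// True if both trees contain exactly the same relative paths.
fn same_layout(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(relative_paths(a)? == relative_paths(b)?)
}
```

A check like this only reads directory listings, so it is much cheaper than a content diff, but it cannot detect files whose names match and whose bytes differ.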

yarikoptic · 2025-01-13T22:26:22Z

I don't see why not; it would ensure that the downloads are OK. Doesn't diff do that anyway?

jwodder · 2025-01-14T12:57:05Z

@yarikoptic The problem with diff is that it takes multiple hours to compare two TB-sized directories. Fortunately, the diff process finished some time last night, and it found no discrepancies.

yarikoptic · 2025-01-14T14:12:23Z

That's great! So let's consider this one good and proceed! I will do my mighty contribution then by clicking the green button ;)

jwodder self-assigned this Jan 7, 2025

jwodder mentioned this pull request Jan 8, 2025

Run Reports #106

Open

yarikoptic mentioned this pull request Jan 8, 2025

Delete files from backup that don't match any inventory items #33

Closed

jwodder force-pushed the gh-33 branch 7 times, most recently from fd2cf70 to 41ac719 Compare January 10, 2025 17:27

jwodder marked this pull request as ready for review January 11, 2025 00:29

jwodder force-pushed the gh-33 branch from cd6507c to a7701e1 Compare January 11, 2025 15:33

jwodder added 16 commits January 13, 2025 08:17

Process CSVs in order of first keys

b54500f

Move metadata code to submodule of syncer

7694a9d

First draft of TreeTracker

0d572cb

Work on CmpName

7bdb8e6

Add more tests, including a failing one

774d947

Fix failing test

e228aa1

Move TreeTracker utilities to their own file

9f25b4c

Test KeyComponents

a1605be

More TreeTracker tests

c417333

Remove some unreachable code

dd070c8

Update TreeTracker to track filenames of non-latest versions

ebe1b7d

Finish up syncer/mod.rs

9618273

Improve some code

136a725

Roll my own task group

240169e

Don't try to clean up nonexistent dirs

9fd6631

Missed a thing

bee489e

jwodder added 5 commits January 13, 2025 08:19

Add some basic tests for Nursery

1afa117

Update changelog & README

8daea98

Doc comments

c5efd2d

Fix

5637bc5

Typo caught by codespell

a4abc26

jwodder force-pushed the gh-33 branch from a7701e1 to a4abc26 Compare January 13, 2025 13:20

jwodder added 4 commits January 13, 2025 08:26

Increase MSRV to 1.81

cd658fe

For AWS SDK crates

Improve nursery.rs

1c08933

Break up Syncer::run()

b3b1deb

More tests for KeyPath

132cc1d

yarikoptic approved these changes Jan 13, 2025

yarikoptic merged commit 9211fd8 into main Jan 14, 2025
13 checks passed

yarikoptic deleted the gh-33 branch January 14, 2025 14:12
