Merge pull request #117 from dandi/gh-33
Delete files that don't match inventory items
yarikoptic authored Jan 14, 2025
2 parents ee14ade + 132cc1d commit 9211fd8
Showing 15 changed files with 2,204 additions and 267 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,11 @@ In Development
- Add `--list-dates` option
- The `<outdir>` command-line argument is now optional and defaults to the
current directory
- The `--inventory-jobs` and `--object-jobs` options have been eliminated in
favor of a new `--jobs` option
- Files & directories in the backup tree that are not listed in the inventory
are deleted
- Increased MSRV to 1.81

v0.1.0-alpha.2 (2025-01-06)
---------------------------
73 changes: 33 additions & 40 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 5 additions & 3 deletions Cargo.toml
@@ -2,7 +2,7 @@
name = "s3invsync"
version = "0.1.0-alpha.2"
edition = "2021"
rust-version = "1.80"
rust-version = "1.81"
description = "AWS S3 Inventory-based backup tool with efficient incremental & versionId support"
authors = [
"DANDI Developers <team@dandiarchive.org>",
@@ -24,22 +24,24 @@ aws-smithy-async = "1.2.3"
aws-smithy-runtime-api = "1.7.3"
clap = { version = "4.5.26", default-features = false, features = ["derive", "error-context", "help", "std", "suggestions", "usage", "wrap_help"] }
csv = "1.3.1"
either = "1.13.0"
flate2 = "1.0.35"
fs-err = { version = "3.0.0", features = ["tokio"] }
futures-util = "0.3.31"
futures-util = { version = "0.3.31", default-features = false, features = ["std"] }
hex = "0.4.3"
lockable = "0.1.1"
md-5 = "0.10.6"
memory-stats = "1.2.0"
percent-encoding = "2.3.1"
pin-project-lite = "0.2.16"
regex = "1.11.1"
serde = { version = "1.0.217", features = ["derive"] }
serde_json = "1.0.135"
strum = { version = "0.26.3", features = ["derive"] }
tempfile = "3.15.0"
thiserror = "2.0.11"
time = { version = "0.3.37", features = ["macros", "parsing"] }
tokio = { version = "1.43.0", features = ["macros", "rt-multi-thread", "signal"] }
tokio = { version = "1.43.0", features = ["macros", "rt-multi-thread", "signal", "sync"] }
tokio-util = { version = "0.7.13", features = ["rt"] }
tracing = "0.1.41"
tracing-subscriber = { version = "0.3.19", features = ["local-time", "time"] }
13 changes: 6 additions & 7 deletions README.md
@@ -1,7 +1,7 @@
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
[![CI Status](https://github.com/dandi/s3invsync/actions/workflows/test.yml/badge.svg)](https://github.com/dandi/s3invsync/actions/workflows/test.yml)
[![codecov.io](https://codecov.io/gh/dandi/s3invsync/branch/main/graph/badge.svg)](https://codecov.io/gh/dandi/s3invsync)
[![Minimum Supported Rust Version](https://img.shields.io/badge/MSRV-1.80-orange)](https://www.rust-lang.org)
[![Minimum Supported Rust Version](https://img.shields.io/badge/MSRV-1.81-orange)](https://www.rust-lang.org)
[![MIT License](https://img.shields.io/github/license/dandi/s3invsync.svg)](https://opensource.org/licenses/MIT)

[GitHub](https://github.com/dandi/s3invsync) | [Issues](https://github.com/dandi/s3invsync/issues) | [Changelog](https://github.com/dandi/s3invsync/blob/main/CHANGELOG.md)
@@ -92,7 +92,9 @@ When downloading a given key from S3, the latest version (if not deleted) is
stored at `{outdir}/{key}`, and the versionIds and etags of all latest object
versions in a given directory are stored in `.s3invsync.versions.json` in that
directory. Each non-latest, non-deleted version of a given key is stored at
`{outdir}/{key}.old.{versionId}.{etag}`.
`{outdir}/{key}.old.{versionId}.{etag}`. Any other files or directories under
`<outdir>` that do not correspond to an object listed in the inventory are
deleted.
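
The deletion rule described above can be sketched for a single directory as follows. This is a minimal illustration using only the standard library, not s3invsync's actual implementation: the helper name `prune_unlisted` and the idea of passing in a precomputed `keep` set of expected file names are assumptions made for the example.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

/// Delete every entry directly under `dir` whose file name is not in `keep`.
/// Returns the names that were removed, sorted for predictable output.
///
/// ASSUMPTION: hypothetical helper for illustration only; the real tool
/// derives the expected names from the inventory while it runs.
fn prune_unlisted(dir: &Path, keep: &HashSet<String>) -> io::Result<Vec<String>> {
    // Collect the listing first so that nothing is deleted mid-iteration.
    let entries = fs::read_dir(dir)?.collect::<io::Result<Vec<fs::DirEntry>>>()?;
    let mut removed = Vec::new();
    for entry in entries {
        let name = entry.file_name().to_string_lossy().into_owned();
        if keep.contains(&name) {
            continue;
        }
        // `file_type()` does not follow symlinks, so a symlink is unlinked
        // rather than having its target removed.
        if entry.file_type()?.is_dir() {
            fs::remove_dir_all(entry.path())?;
        } else {
            fs::remove_file(entry.path())?;
        }
        removed.push(name);
    }
    removed.sort();
    Ok(removed)
}
```

In a real backup tree the `keep` set would presumably need to include the latest-version files, the `.old.{versionId}.{etag}` files, and the per-directory metadata file, so that only genuinely unlisted entries are removed.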

Options
-------
@@ -110,8 +112,8 @@ Options
inventory for the given date is used) or in the format `YYYY-MM-DDTHH-MMZ`
(to specify a specific inventory).

- `-I <INT>`, `--inventory-jobs <INT>` — Specify the maximum number of inventory
list files to download & process at once [default: 20]
- `-J <INT>`, `--jobs <INT>` — Specify the maximum number of concurrent
download jobs [default: 20]

- `--list-dates` — List available inventory manifest dates instead of
backing anything up
@@ -120,9 +122,6 @@ Options
Possible values are "`ERROR`", "`WARN`", "`INFO`", "`DEBUG`", and "`TRACE`"
(all case-insensitive). [default value: `DEBUG`]

- `-O <INT>`, `--object-jobs <INT>` — Specify the maximum number of inventory
entries to download & process at once [default: 20]

- `--path-filter <REGEX>` — Only download objects whose keys match the given
[regular expression](https://docs.rs/regex/latest/regex/#syntax)

4 changes: 4 additions & 0 deletions src/consts.rs
@@ -1,3 +1,7 @@
/// The name of the file in which metadata (version ID and etag) are stored for
/// the latest versions of objects in each directory
pub(crate) static METADATA_FILENAME: &str = ".s3invsync.versions.json";

/// The number of initial bytes of an inventory csv.gz file to fetch when
/// peeking at just the first entry
pub(crate) const CSV_GZIP_PEEK_SIZE: usize = 1024;
25 changes: 25 additions & 0 deletions src/inventory/item.rs
@@ -1,5 +1,6 @@
use crate::keypath::KeyPath;
use crate::s3::S3Location;
use crate::util::make_old_filename;
use time::OffsetDateTime;

/// An entry in an inventory list file
@@ -9,6 +10,16 @@ pub(crate) enum InventoryEntry {
    Item(InventoryItem),
}

impl InventoryEntry {
    /// Returns the entry's key
    pub(crate) fn key(&self) -> &str {
        match self {
            InventoryEntry::Directory(Directory { key, .. }) => key,
            InventoryEntry::Item(InventoryItem { key, .. }) => key.as_ref(),
        }
    }
}

/// An entry in an inventory list file pointing to a directory object
#[derive(Clone, Debug, Eq, PartialEq)]
pub(crate) struct Directory {
@@ -60,6 +71,20 @@ impl InventoryItem {
        S3Location::new(self.bucket.clone(), String::from(&self.key))
            .with_version_id(self.version_id.clone())
    }

    /// Returns whether the object is a delete marker
    pub(crate) fn is_deleted(&self) -> bool {
        self.details == ItemDetails::Deleted
    }

    /// If the object is not a delete marker and is not the latest version of
    /// the key, return the base filename at which it will be backed up.
    pub(crate) fn old_filename(&self) -> Option<String> {
        let ItemDetails::Present { ref etag, .. } = self.details else {
            return None;
        };
        (!self.is_latest).then(|| make_old_filename(self.key.name(), &self.version_id, etag))
    }
}
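
To make the control flow of `old_filename()` easier to follow, here is a standalone sketch of the same `let ... else` plus `bool::then` pattern. The `Details` and `Item` types are simplified stand-ins for the crate's `ItemDetails` and `InventoryItem`, and `make_old_filename_sketch` is hypothetical: the body of the real `make_old_filename` is not part of this diff, so the sketch assumes it follows the `{key}.old.{versionId}.{etag}` layout given in the README.

```rust
/// Simplified stand-in for the crate's `ItemDetails` (assumption: the real
/// `Present` variant carries more fields than just the etag).
#[derive(Clone, Debug, PartialEq, Eq)]
enum Details {
    Present { etag: String },
    Deleted,
}

/// Simplified stand-in for `InventoryItem`, keeping only the fields used here.
struct Item {
    name: String,
    version_id: String,
    is_latest: bool,
    details: Details,
}

/// ASSUMPTION: mirrors the README's `{key}.old.{versionId}.{etag}` naming;
/// the real `make_old_filename` is not shown in this diff.
fn make_old_filename_sketch(name: &str, version_id: &str, etag: &str) -> String {
    format!("{name}.old.{version_id}.{etag}")
}

impl Item {
    fn old_filename(&self) -> Option<String> {
        // Delete markers have no etag and no backup file: bail out early.
        let Details::Present { ref etag } = self.details else {
            return None;
        };
        // `bool::then` runs the closure only for non-latest versions,
        // yielding `Some(filename)`, and returns `None` otherwise.
        (!self.is_latest).then(|| make_old_filename_sketch(&self.name, &self.version_id, etag))
    }
}
```

Under these assumptions, a non-latest item named `data.bin` with version ID `v1` and etag `abc123` maps to `data.bin.old.v1.abc123`, while latest versions and delete markers both yield `None`.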

/// Metadata about an object's content