Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-11946. Require all ozone repair commands to support a --dry-run option #7682

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

adoroszlai
Copy link
Contributor

What changes were proposed in this pull request?

  1. Add --dry-run option to all ozone repair subcommands that modify state. The new option currently defaults to false, so commands perform repair by default. (Otherwise we'd need to use it as --dry-run=false.)
  2. Implement dry-run mode where needed.

https://issues.apache.org/jira/browse/HDDS-11946

How was this patch tested?

Enforce --dry-run option

Added unit test to verify that each subcommand under ozone repair satisfies one of the following:

  • has subcommands
  • has --dry-run option
  • implements ReadOnlySubcommand (currently only QuotaStatus)

Test failure before implementing ReadOnlyCommand in QuotaStatus:

[ERROR] Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.261 s <<< FAILURE! - in org.apache.hadoop.ozone.repair.TestOzoneRepair
[ERROR] org.apache.hadoop.ozone.repair.TestOzoneRepair.subcommandsSupportDryRun  Time elapsed: 0.211 s  <<< FAILURE!
java.lang.AssertionError: 
['ozone repair quota status' defined by class org.apache.hadoop.ozone.repair.quota.QuotaStatus should support --dry-run or implement interface org.apache.hadoop.ozone.repair.ReadOnlyCommand] 
Expecting UnmodifiableSet:
  ["--service-id",
    "--om-service-id",
    "--service-host",
    "-h",
    "--help",
    "-V",
    "--version"]
to contain:
  ["--dry-run"]
but could not find the following element(s):
  ["--dry-run"]

	at org.apache.hadoop.ozone.repair.TestOzoneRepair.assertSubcommandOptionRecursively(TestOzoneRepair.java:84)
	at org.apache.hadoop.ozone.repair.TestOzoneRepair.assertSubcommandOptionRecursively(TestOzoneRepair.java:89)
	at org.apache.hadoop.ozone.repair.TestOzoneRepair.assertSubcommandOptionRecursively(TestOzoneRepair.java:89)
	at org.apache.hadoop.ozone.repair.TestOzoneRepair.subcommandsSupportDryRun(TestOzoneRepair.java:71)

Tool operation

Sample output of ozone repair om fso-tree from integration tests:

[dry run] No running OM service detected. Proceeding with repair.
[dry run] Creating database of reachable directories at hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-430e6515-4d93-4d63-99d9-2d0e1801aa90/ozone-meta/reachable.db
[dry run] Processing volume: /vol2
[dry run] Processing bucket: vol2/bucket1
[dry run] Deleting unreferenced directory /-9223372036854768896/-9223372036854768384/-9223372036854767871/dir2
[dry run] Deleting unreferenced file /-9223372036854768896/-9223372036854768384/-9223372036854767359/file3
[dry run] Deleting unreferenced file /-9223372036854768896/-9223372036854768384/-9223372036854767871/file1
[dry run] Deleting unreferenced file /-9223372036854768896/-9223372036854768384/-9223372036854767871/file2
[dry run] Processing bucket: vol2/bucket2
2025-01-10 16:46:50,087 [main] INFO  rocksdiff.RocksDBCheckpointDiffer (RocksDBCheckpointDiffer.java:close(307)) - Shutting down CompactionDagPruningService.
[dry run] 
Reachable:
	Directories: 4
	Files: 5
	Bytes: 50
Unreachable:
	Directories: 0
	Files: 0
	Bytes: 0
Unreferenced:
	Directories: 1
	Files: 3
	Bytes: 30

Manual test of ozone repair om update-transaction in compose environment:

$ ozone repair om update-transaction --db=/data/metadata/om.db --term 2 --index 3 --dry-run
[dry run] No running OM service detected. Proceeding with repair.
[dry run] The original highest transaction Info was (t:1, i:14)
[dry run] Updating transaction info to (t:2, i:3)

$ ozone repair om update-transaction --db=/data/metadata/om.db --term 2 --index 3 
ATTENTION: Running as user hadoop. Make sure this is the same user used to run the Ozone process. Are you sure you want to continue (y/N)? y
Run as user: hadoop
No running OM service detected. Proceeding with repair.
The original highest transaction Info was (t:1, i:14)
Updating transaction info to (t:2, i:3)
The highest transaction info has been updated to: (t:2, i:3)

$ ozone repair om update-transaction --db=/data/metadata/om.db --term 2 --index 3 --dry-run
[dry run] No running OM service detected. Proceeding with repair.
[dry run] The original highest transaction Info was (t:2, i:3)
[dry run] Updating transaction info to (t:2, i:3)

CI:
https://github.com/adoroszlai/ozone/actions/runs/12714010228

@adoroszlai adoroszlai added the tools Tools that helps with debugging label Jan 10, 2025
@adoroszlai adoroszlai self-assigned this Jan 10, 2025
@adoroszlai adoroszlai requested a review from errose28 January 10, 2025 18:39
@adoroszlai
Copy link
Contributor Author

@sarvekshayr @Tejaskriya please review

@sarvekshayr
Copy link
Contributor

In my opinion, it would be better to make the --dry-run option the default behaviour. This allows users to see what changes would be performed before executing the actual repair, reducing the risk of accidental modifications to the database. While users can explicitly specify the --dry-run option to preview changes, making it the default ensures it won't be overlooked.

Since the primary focus of this PR is to add the --dry-run option for all repair commands (and any commands to be added in the future), we can consider making --dry-run the default behaviour in a follow-up JIRA.

@adoroszlai
Copy link
Contributor Author

we can consider making --dry-run the default behaviour in a follow-up JIRA.

I think we can decide and implement the default behavior here, that's why I initially created the PR as draft.

I have no objection against making dry run the default, but I find --dry-run=false cumbersome. IMO we need a different name for the option in this case. --repair is already used in fso-tree, so that would be my primary choice.

@sarvekshayr
Copy link
Contributor

I have no objection against making dry run the default, but I find --dry-run=false cumbersome. IMO we need a different name for the option in this case. --repair is already used in fso-tree, so that would be my primary choice.

We can go ahead and use the --repair flag.

@adoroszlai adoroszlai marked this pull request as ready for review January 13, 2025 18:15
@adoroszlai adoroszlai changed the title HDDS-11946. Require all ozone repair commands to support a --dry-run option HDDS-11946. Let ozone repair run dry by default, require --repair option Jan 13, 2025
@adoroszlai
Copy link
Contributor Author

@errose28 could you please review?

@errose28
Copy link
Contributor

errose28 commented Jan 14, 2025

I actually prefer --dry-run as an option and the default to be doing the repair. The user has already entered ozone repair and acknowledged the warning prompt, so after all this not doing the repair seems counter-intuitive. Not to mention the repetitiveness of ozone repair ... --repair just to make the command do what was intended. FSO repair was originally made before there as a dedicated ozone repair command with its own safeguards in place hence dry run was the default in that implementation, but we have a better framework in place now.

@adoroszlai adoroszlai changed the title HDDS-11946. Let ozone repair run dry by default, require --repair option HDDS-11946. Require all ozone repair commands to support a --dry-run option Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tools Tools that helps with debugging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants