Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable JetStream on disk errors #6292

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

souravagrawal
Copy link

@souravagrawal souravagrawal commented Dec 22, 2024

Currently, there are scenarios where NATS JetStream may encounter permission errors when file system goes into read only mode, which can lead to an inconsistent state. In such cases, the system continues to allow publishing messages by resetting stream state, leading to a misaligned consumer stream sequence.

This PR introduces changes to gracefully handle these permission errors and prevent NATS from continuing in an inconsistent state when:

  • Flushing stream state to disk
  • Deleting expired messages on startup
  • Creating new block for messages
    After this PR, If NATS is running in non-clustered mode, the user will be unable to issue write requests until the issue is resolved. In clustered mode, only the affected node will stop accepting requests, while the system will continue to function as long as the required quorum remains healthy.

PR potentially fixes : #6211 which leads to consumer sequence reaching higher than stream sequence.
Signed-off-by: Sourabh Agrawal souravagrawal1111@gmail.com

Copy link
Member

@neilalexander neilalexander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably have better detection of a wider array of filesystem errors like permissions etc, agreed, but I'm not entirely sure why this would be a configurable option and I don't particularly like it either.

The disk being pulled out from underneath JetStream is effectively a catastrophic operational issue and for JS to continue to try to operate in those circumstances feels problematic, i.e. with the potential for data loss that no one may even notice. Even more so in a clustered scenario where the metalayer and Raft logs are being written out to disk.

@souravagrawal
Copy link
Author

We should probably have better detection of a wider array of filesystem errors like permissions etc, agreed, but I'm not entirely sure why this would be a configurable option and I don't particularly like it either.

The disk being pulled out from underneath JetStream is effectively a catastrophic operational issue and for JS to continue to try to operate in those circumstances feels problematic, i.e. with the potential for data loss that no one may even notice. Even more so in a clustered scenario where the metalayer and Raft logs are being written out to disk.

Thank you for taking the time to review this.
I will remove the flag. As this is my first patch in the server, I wanted to err on the side of caution.
Are there any other cases you think I should address as part of this?

Signed-off-by: Sourabh Agrawal <souravagrawal1111@gmail.com>
Signed-off-by: Sourabh Agrawal <souravagrawal1111@gmail.com>
@souravagrawal
Copy link
Author

Reopening PR, as it was closed mistakenly

@souravagrawal souravagrawal reopened this Dec 25, 2024
@souravagrawal
Copy link
Author

We should probably have better detection of a wider array of filesystem errors like permissions etc, agreed, but I'm not entirely sure why this would be a configurable option and I don't particularly like it either.
The disk being pulled out from underneath JetStream is effectively a catastrophic operational issue and for JS to continue to try to operate in those circumstances feels problematic, i.e. with the potential for data loss that no one may even notice. Even more so in a clustered scenario where the metalayer and Raft logs are being written out to disk.

Thank you for taking the time to review this. I will remove the flag. As this is my first patch in the server, I wanted to err on the side of caution. Are there any other cases you think I should address as part of this?

Thank you for the suggestion @neilalexander. I have implemented the recommended change and removed the flag.

@souravagrawal
Copy link
Author

Hello @neilalexander, Happy New Year.
Just following up to check if you’ve had a chance to review the changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stream state resets if the Jetstream store directory goes into read only mode [v2.10.23]
2 participants