Enable cluster pause functionality. #69
base: main
Conversation
Incorporate a pause element into the CRD definition. Configure the StatefulSet to scale down to 0 when paused. A more nuanced status element should be introduced, beyond a simple boolean for Ready. A possible enumeration could include **Ready**, **Paused**, **Pausing**, **Initializing**, **Reconciling**, and **Needs Attention**.
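As a rough illustration of the description above, a minimal sketch of the spec and status shape might look like the following; the field names, the `ConditionType` alias, and the phase values are assumptions for illustration, not the PR's actual types:

```go
// Sketch only: hypothetical CRD fields for a pause switch and a richer status.
// None of these names are taken from the PR; they just illustrate the idea.
package v1alpha1

// TypesenseClusterSpec would gain an optional pause switch alongside the
// existing replica count.
type TypesenseClusterSpec struct {
	// Replicas is the desired cluster size while the cluster is running.
	Replicas *int32 `json:"replicas,omitempty"`

	// Paused scales the StatefulSet down to 0 when true.
	Paused bool `json:"paused,omitempty"`
}

// ConditionType replaces a plain Ready boolean with an enumeration of phases.
type ConditionType string

const (
	ConditionReady          ConditionType = "Ready"
	ConditionPaused         ConditionType = "Paused"
	ConditionPausing        ConditionType = "Pausing"
	ConditionInitializing   ConditionType = "Initializing"
	ConditionReconciling    ConditionType = "Reconciling"
	ConditionNeedsAttention ConditionType = "NeedsAttention"
)

// TypesenseClusterStatus reports the current phase instead of a boolean.
type TypesenseClusterStatus struct {
	Phase ConditionType `json:"phase,omitempty"`
}
```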
Hi @mleklund, can you give me some more context about the idea driving this feature? I have the feeling it strays away from the core objective of the operator, which is purely managing the OLM aspects of Typesense. If you want to pause or stop workloads you can accomplish it in various ways, like:
To restore a backup, you need to pause a cluster. This is a precursor to some backup and restore code I'm working on. While overriding the cluster size works, you still need to keep track of the cluster size. We're going to create multiple merge requests for each of the needed pieces; that way, it'll be easier for everyone to review and comment on the changes.
I will start checking the PR today. In the meantime please make sure to re-sync your branch from main (context: #70)
r.logger.V(debugLevel).Info("replicas is nil")
if *sts.Spec.Replicas == 0 {
    replicas = ptr.To[int32](1)
    r.logger.V(debugLevel).Info("StatefulSet has been paused, update replicas to 1")
why are you updating replicas to 1 and not to sts.Spec.Replicas after pause?
Typesense does not handle an empty nodes file gracefully and will crash, which leads to a CrashLoopBackOff when it is time to scale back up. While harmless overall, this looks bad and slows scale-up. Always keeping 1 node was my workaround, as I feel it is harmless to assume a cluster size of 1.
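To make the workaround above concrete, here is a minimal sketch of the fallback, assuming a hypothetical helper used when building the nodes ConfigMap; the name, signature, and package are illustrative, not the PR's actual code:

```go
package controller

// Sketch only: never let the replica count used for the nodes ConfigMap drop
// below 1, so the generated nodes file is never empty.
//
// Per the Typesense docs, the nodes file holds comma-separated entries of the
// form <peering_address>:<peering_port>:<api_port>, and an instance that
// starts against an empty nodes file crashes.
func nodesFileReplicas(specReplicas *int32) int32 {
	if specReplicas == nil || *specReplicas < 1 {
		// Assume a cluster size of 1 so the nodes file always has one entry,
		// even while the StatefulSet itself is scaled down to 0.
		return 1
	}
	return *specReplicas
}
```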
but isn't this contradicting how you handle it at:
and at:
https://github.com/dealnews/typesense-operator/blob/7c0bd0c084baec94be10bd06fe5c189f9525ab39/internal/controller/typesensecluster_controller.go#L196 (I guess the else clause will never be met)
Maybe I am missing something here and have failed to comprehend the logic.
Additionally:
> Typesense does not handle an empty nodes file gracefully and will crash[...]
- but if we scale down to zero while pausing there would be no Typesense instances to crash; we could/should break the reconciliation circuit at the very beginning of typesensecluster_controller.go once the pause state has been established (see the sketch after this list). Why should the controller perform any actions on the other resources if the cluster is paused anyway? (The backup should definitely run in its own controller, independently from the cluster controller, if this is where the discussion will eventually lead us.)
- what about this: https://typesense.org/docs/guide/backups.html#restore-steps ? If you leave a single instance running, how are you going to restore the snapshot?
- Completely theoretical: if you leave a single instance running, aren't you putting the integrity of the backup in question, as some writes might occur while you're taking your snapshot and will eventually not be included in the backup?
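To make the "break the reconciliation circuit" suggestion concrete, the guard could be a minimal early return at the top of Reconcile, roughly like the sketch below. The `Paused` field, the API import path, and the reconciler shape are assumptions for illustration, not the operator's actual code:

```go
// Sketch only: short-circuit reconciliation while the cluster is paused.
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	tsv1alpha1 "github.com/dealnews/typesense-operator/api/v1alpha1" // assumed import path
)

type TypesenseClusterReconciler struct {
	client.Client
}

func (r *TypesenseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var ts tsv1alpha1.TypesenseCluster
	if err := r.Get(ctx, req.NamespacedName, &ts); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if ts.Spec.Paused {
		// Nothing else to reconcile while paused: skip quorum checks,
		// ConfigMap syncs, and the rest of the resource management. A real
		// implementation would first ensure the StatefulSet is scaled to 0
		// and the status reflects Pausing/Paused before returning.
		return ctrl.Result{}, nil
	}

	// ...normal reconciliation continues here...
	return ctrl.Result{}, nil
}
```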
Within the realm of what this operator is doing, it is safe to assume a cluster size of 1, at least as far as the configmap is concerned. In the configmap for the nodes file, I am never allowing it to be empty, because if an instance starts with an empty nodes file it crashes. If you are running a single node, you must either have that node in the nodes file or no nodes file at all; a blank nodes file causes a crash.
> but isn't this contradicting how you handle it at:
> and at:
> https://github.com/dealnews/typesense-operator/blob/7c0bd0c084baec94be10bd06fe5c189f9525ab39/internal/controller/typesensecluster_controller.go#L196 (I guess the else clause will never be met)
I feel we are not understanding each other, so I apologize.
The code I added always scales the statefulset to 0 when pausing; it will, however, leave the configmap set to a single instance, as a blank nodes file causes crashes. There are times when the statefulset is set to 0 replicas and the cluster is still running, such as during termination. There is also the possibility of a race condition where the configmap can be blank on startup, depending on where k8s is in the refresh cycle.
In L147 I am modifying the statefulset to be either 0, or the number of replicas set in the spec.
In L196 I am updating the status to be either pausing or paused.
In typesensecluster_configmap.go L108 I am ensuring that the configmap is not set to empty. It looks like replicas is being set to nil here; since there is a ts object, the call could certainly pass the number of replicas instead. However, that would just push this logic somewhere into updateConfigMap, since the nodes configmap cannot be empty without causing a crash.
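Putting L147 and L196 together, a rough sketch of the decision flow might look like the following; the helper names, the use of ready replicas to distinguish Pausing from Paused, and the phase strings are my assumptions for illustration, not the PR's actual code:

```go
package controller

// Sketch only: the two decisions described above, expressed as pure helpers.

// desiredReplicas mirrors the behaviour described for L147: scale the
// StatefulSet to 0 when paused, otherwise use the replica count from the spec.
func desiredReplicas(paused bool, specReplicas int32) int32 {
	if paused {
		return 0
	}
	return specReplicas
}

// desiredPhase mirrors the behaviour described for L196: report Pausing while
// pods are still winding down, and Paused once the StatefulSet is fully down.
func desiredPhase(paused bool, readyReplicas int32) string {
	switch {
	case !paused:
		return "Ready"
	case readyReplicas > 0:
		return "Pausing"
	default:
		return "Paused"
	}
}
```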
> - but if we scale down to zero while pausing there would be no Typesense instances to crash; we could/should break the reconciliation circuit at the very beginning of typesensecluster_controller.go once the pause state has been established. Why should the controller perform any actions on the other resources if the cluster is paused anyway? (The backup should definitely run in its own controller, independently from the cluster controller, if this is where the discussion will eventually lead us.)
The configmap will be updated to empty before the instances terminate, and they will crash.
> - what about this: https://typesense.org/docs/guide/backups.html#restore-steps ? If you leave a single instance running, how are you going to restore the snapshot?
There is no way to restore with a single instance running. You have to shut it down completely, restore the files, then restart the cluster.
> - Completely theoretical: if you leave a single instance running, aren't you putting the integrity of the backup in question, as some writes might occur while you're taking your snapshot and will eventually not be included in the backup?
Snapshots are taken online, preferably on the leader, and represent only a single point in time. There is the possibility that your snapshot does not include every update, but that is the case with every backup system; they are only valid at a single point in time. If the snapshot API is called, the instance will flush memory to disk, including anything not yet written to disk. If you were to take a filesystem snapshot instead, there would be no guarantee that all memory is written to disk. As for leaving a single instance running, I do not do that; I only set the configmap to have a single node.
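For reference, the online snapshot mentioned above is exposed through Typesense's operations API (see the backups guide linked earlier). A minimal sketch of triggering it from Go might look like this; the host, port, API key, and snapshot path are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
)

// Sketch only: trigger an online snapshot via Typesense's operations API,
// as described in https://typesense.org/docs/guide/backups.html.
// The host, API key, and snapshot path below are placeholders.
func main() {
	url := "http://typesense.example.svc:8108/operations/snapshot?snapshot_path=/tmp/typesense-snapshot"
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-TYPESENSE-API-KEY", "xyz") // placeholder API key

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Typesense flushes in-memory state to disk before writing the snapshot,
	// so the snapshot is consistent at the moment the call is served.
	fmt.Println("snapshot request status:", resp.Status)
}
```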
concerning:
Yes, we can try that and see how it will work out; I can pick it up in a separate issue #77 and expand on it there.
Add message about skipping quorum check because cluster is paused.