Enable cluster pause functionality. #69
base: main
Conversation
Incorporate a pause element into the CRD definition. Configure the StatefulSet to scale down to 0 when paused. A more nuanced status element should be introduced, beyond a simple boolean for Ready. A possible enumeration could include **Ready**, **Paused**, **Pausing**, **Initializing**, **Reconciling**, and **Needs Attention**.
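As a rough illustration of the description above, a minimal sketch of the spec and status shape might look like the following; the field names, the `ConditionType` alias, and the phase values are assumptions for illustration, not the PR's actual types:

```go
// Sketch only: hypothetical CRD fields for a pause switch and a richer status.
// None of these names are taken from the PR; they just illustrate the idea.
package v1alpha1

// TypesenseClusterSpec would gain an optional pause switch alongside the
// existing replica count.
type TypesenseClusterSpec struct {
	// Replicas is the desired cluster size while the cluster is running.
	Replicas *int32 `json:"replicas,omitempty"`

	// Paused scales the StatefulSet down to 0 when true.
	Paused bool `json:"paused,omitempty"`
}

// ConditionType replaces a plain Ready boolean with an enumeration of phases.
type ConditionType string

const (
	ConditionReady          ConditionType = "Ready"
	ConditionPaused         ConditionType = "Paused"
	ConditionPausing        ConditionType = "Pausing"
	ConditionInitializing   ConditionType = "Initializing"
	ConditionReconciling    ConditionType = "Reconciling"
	ConditionNeedsAttention ConditionType = "NeedsAttention"
)

// TypesenseClusterStatus reports the current phase instead of a boolean.
type TypesenseClusterStatus struct {
	Phase ConditionType `json:"phase,omitempty"`
}
```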
Hi @mleklund, can you give me some more context about the idea driving this feature? I have the feeling it strays away from the core objective of the operator, which is purely managing the OLM aspects of Typesense. If you want to pause or stop workloads you can accomplish it in various ways, like:
To restore a backup, you need to pause a cluster. This is a precursor to some backup and restore code I'm working on. While overriding the cluster size works, you still need to keep track of the cluster size. We're going to create multiple merge requests for each of the needed pieces; that way, it'll be easier for everyone to review and comment on the changes.
I will start checking the PR today. In the meantime please make sure to re-sync your branch from main (context: #70)
r.logger.V(debugLevel).Info("replicas is nil")
if *sts.Spec.Replicas == 0 {
    replicas = ptr.To[int32](1)
    r.logger.V(debugLevel).Info("StatefulSet has been paused, update replicas to 1")
why are you updating replicas to 1 and not to sts.Spec.Replicas after pause?
Typesense does not handle an empty nodes file gracefully and will crash, which leads to a CrashLoopBackOff when it is time to scale back up. While harmless overall, this looks bad and slows scale-up. Always keeping 1 node was my workaround, as I feel it is harmless to assume a cluster size of 1.
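To make the workaround above concrete, here is a minimal sketch of the fallback, assuming a hypothetical helper used when building the nodes ConfigMap; the name, signature, and package are illustrative, not the PR's actual code:

```go
package controller

// Sketch only: never let the replica count used for the nodes ConfigMap drop
// below 1, so the generated nodes file is never empty.
//
// Per the Typesense docs, the nodes file holds comma-separated entries of the
// form <peering_address>:<peering_port>:<api_port>, and an instance that
// starts against an empty nodes file crashes.
func nodesFileReplicas(specReplicas *int32) int32 {
	if specReplicas == nil || *specReplicas < 1 {
		// Assume a cluster size of 1 so the nodes file always has one entry,
		// even while the StatefulSet itself is scaled down to 0.
		return 1
	}
	return *specReplicas
}
```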
but isn't this contradicting how you handle it at:
and at:
https://github.com/dealnews/typesense-operator/blob/7c0bd0c084baec94be10bd06fe5c189f9525ab39/internal/controller/typesensecluster_controller.go#L196 (I guess the else clause will never be met)
Maybe I am missing something here and have failed to comprehend the logic.
Additionally:
> Typesense does not handle an empty nodes file gracefully and will crash[...]
- but if we scale down to zero while pausing there would be no Typesense instances to crash; we could/should break the reconciliation circuit at the very beginning of typesensecluster_controller.go once the pause state has been established (see the sketch after this list). Why should the controller perform any actions on the other resources if the cluster is paused anyway? (The backup should definitely run in its own controller, independently from the cluster controller, if this is where the discussion will eventually lead us.)
- what about this: https://typesense.org/docs/guide/backups.html#restore-steps ? If you leave a single instance running, how are you going to restore the snapshot?
- Completely theoretical: if you leave a single instance running, aren't you putting the integrity of the backup in question, as some writes might occur while you're taking your snapshot and will eventually not be included in the backup?
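To make the "break the reconciliation circuit" suggestion concrete, the guard could be a minimal early return at the top of Reconcile, roughly like the sketch below. The `Paused` field, the API import path, and the reconciler shape are assumptions for illustration, not the operator's actual code:

```go
// Sketch only: short-circuit reconciliation while the cluster is paused.
package controller

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	tsv1alpha1 "github.com/dealnews/typesense-operator/api/v1alpha1" // assumed import path
)

type TypesenseClusterReconciler struct {
	client.Client
}

func (r *TypesenseClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var ts tsv1alpha1.TypesenseCluster
	if err := r.Get(ctx, req.NamespacedName, &ts); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if ts.Spec.Paused {
		// Nothing else to reconcile while paused: skip quorum checks,
		// ConfigMap syncs, and the rest of the resource management. A real
		// implementation would first ensure the StatefulSet is scaled to 0
		// and the status reflects Pausing/Paused before returning.
		return ctrl.Result{}, nil
	}

	// ...normal reconciliation continues here...
	return ctrl.Result{}, nil
}
```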
Within the realm of what this operator is doing, it is safe to assume a cluster size of 1, at least as far as the configmap is concerned. In the configmap for the nodes file, I am never allowing it to be empty, because if an instance starts with an empty nodes file it crashes. If you are running a single node, you must either have that node in the nodes file or no nodes file at all; a blank nodes file causes a crash.
> but isn't this contradicting how you handle it at:
> and at:
> https://github.com/dealnews/typesense-operator/blob/7c0bd0c084baec94be10bd06fe5c189f9525ab39/internal/controller/typesensecluster_controller.go#L196 (I guess the else clause will never be met)
I feel we are not understanding each other, so I apologize.
The code I added always scales the statefulset to 0 when pausing; it will, however, leave the configmap set to a single instance, as a blank nodes file causes crashes. There are times when the statefulset is set to 0 replicas and the cluster is still running, such as during termination. There is also the possibility of a race condition where the configmap can be blank on startup, depending on where k8s is in the refresh cycle.
In L147 I am modifying the statefulset to be either 0, or the number of replicas set in the spec.
In L196 I am updating the status to be either pausing or paused.
In typesensecluster_configmap.go L108 I am ensuring that the configmap is not set to empty. It looks like replicas is being set to nil here; since there is a ts object, the call could certainly pass the number of replicas instead. However, that would just push this logic somewhere into updateConfigMap, since the nodes configmap cannot be empty without causing a crash.
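Putting L147 and L196 together, a rough sketch of the decision flow might look like the following; the helper names, the use of ready replicas to distinguish Pausing from Paused, and the phase strings are my assumptions for illustration, not the PR's actual code:

```go
package controller

// Sketch only: the two decisions described above, expressed as pure helpers.

// desiredReplicas mirrors the behaviour described for L147: scale the
// StatefulSet to 0 when paused, otherwise use the replica count from the spec.
func desiredReplicas(paused bool, specReplicas int32) int32 {
	if paused {
		return 0
	}
	return specReplicas
}

// desiredPhase mirrors the behaviour described for L196: report Pausing while
// pods are still winding down, and Paused once the StatefulSet is fully down.
func desiredPhase(paused bool, readyReplicas int32) string {
	switch {
	case !paused:
		return "Ready"
	case readyReplicas > 0:
		return "Pausing"
	default:
		return "Paused"
	}
}
```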
> - but if we scale down to zero while pausing there would be no Typesense instances to crash; we could/should break the reconciliation circuit at the very beginning of typesensecluster_controller.go once the pause state has been established. Why should the controller perform any actions on the other resources if the cluster is paused anyway? (The backup should definitely run in its own controller, independently from the cluster controller, if this is where the discussion will eventually lead us.)
The configmap will be updated to empty before the instances terminate, and they will crash.
> - what about this: https://typesense.org/docs/guide/backups.html#restore-steps ? If you leave a single instance running, how are you going to restore the snapshot?
There is no way to restore with a single instance running. You have to shut it down completely, restore the files, then restart the cluster.
> - Completely theoretical: if you leave a single instance running, aren't you putting the integrity of the backup in question, as some writes might occur while you're taking your snapshot and will eventually not be included in the backup?
Snapshots are taken online, preferably on the leader, and represent only a single point in time. There is the possibility that your snapshot does not include every update, but that is the case with every backup system; they are only valid at a single point in time. If the snapshot API is called, the instance will flush memory to disk, including anything not yet written to disk. If you were to take a filesystem snapshot instead, there would be no guarantee that all memory is written to disk. As for leaving a single instance running, I do not do that; I only set the configmap to have a single node.
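For reference, the online snapshot mentioned above is exposed through Typesense's operations API (see the backups guide linked earlier). A minimal sketch of triggering it from Go might look like this; the host, port, API key, and snapshot path are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
)

// Sketch only: trigger an online snapshot via Typesense's operations API,
// as described in https://typesense.org/docs/guide/backups.html.
// The host, API key, and snapshot path below are placeholders.
func main() {
	url := "http://typesense.example.svc:8108/operations/snapshot?snapshot_path=/tmp/typesense-snapshot"
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-TYPESENSE-API-KEY", "xyz") // placeholder API key

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Typesense flushes in-memory state to disk before writing the snapshot,
	// so the snapshot is consistent at the moment the call is served.
	fmt.Println("snapshot request status:", resp.Status)
}
```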
concerning:
Yes, we can try that and see how it will work out; I can pick it up in a separate issue #77 and expand on it there.
Add message about skipping quorum check because cluster is paused.