etcd sometimes (very rarely) gets broken after a cluster-reset action #11992

Open · 1 of 2 tasks
aganesh-suse opened this issue Mar 20, 2025 · 2 comments

aganesh-suse commented Mar 20, 2025

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP5"

$ uname -m
x86_64

Cluster Configuration:

HA: 3 server/ 1 agent

k3s version:

$ k3s -v
k3s version v1.29.15-rc1+k3s1 (5bc2f0ce)
go version go1.23.6

Describe the bug:

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

Testing Steps to Reproduce:

  1. Copy config.yaml:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s:
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_VERSION='v1.29.15-rc1+k3s1' sh -s - server
  3. Verify cluster status:
kubectl get nodes -o wide
kubectl get pods -A
  4. Using the killall script, stop two server nodes (server 2 and server 3):
sudo /usr/local/bin/k3s-killall.sh
  5. Shut down the server on the remaining node (server 1):
$ sudo systemctl stop k3s
  6. Run cluster-reset:
$ sudo /usr/local/bin/k3s server --cluster-reset
  7. Restart the server process:
$ sudo systemctl start k3s
  8. Move/delete the db directories on the other servers (2 and 3):
sudo mv /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/db-backup
  9. Restart the server process on the other servers.
  10. Re-verify cluster status:
kubectl get nodes -o wide
kubectl get pods -A
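
For reference, the sequence above can be rolled into a single script run from server 1. This is only a sketch of the manual steps, assuming passwordless SSH to the other two servers and the hypothetical hostnames server2/server3; the real test harness may do this differently.

#!/bin/sh
# Sketch of the reset sequence from the steps above, run on server 1.
# Assumes k3s v1.29.15-rc1+k3s1 is already installed on all nodes (steps 1-3)
# and that server2/server3 are placeholder hostnames reachable over SSH.
set -e

OTHER_SERVERS="server2 server3"

# Stop k3s on the other two servers, then on this node.
for host in $OTHER_SERVERS; do
  ssh "$host" 'sudo /usr/local/bin/k3s-killall.sh'
done
sudo systemctl stop k3s

# Reset the cluster to a single-member etcd on this node, then restart k3s.
sudo /usr/local/bin/k3s server --cluster-reset
sudo systemctl start k3s

# Move the old db directories aside on the other servers and rejoin them.
for host in $OTHER_SERVERS; do
  ssh "$host" 'sudo mv /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/db-backup && sudo systemctl start k3s'
done

# Re-verify.
kubectl get nodes -o wide
kubectl get pods -A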

Expected behavior:

Nodes should be in Ready state
Pods should be in Running state

Actual behavior:

Each server thinks the nodes are in different states:

ip-172-31-28-209:~ # kubectl get node
NAME                                          STATUS     ROLES                       AGE     VERSION
ip-172-31-24-161.us-east-2.compute.internal   NotReady   control-plane,etcd,master   5h9m    v1.29.15-rc1+k3s1
ip-172-31-24-34.us-east-2.compute.internal    NotReady   control-plane,etcd,master   5h8m    v1.29.15-rc1+k3s1
ip-172-31-26-218.us-east-2.compute.internal   NotReady   <none>                      5h6m    v1.29.15-rc1+k3s1
ip-172-31-28-209.us-east-2.compute.internal   Ready      control-plane,etcd,master   5h12m   v1.29.15-rc1+k3s1

ip-172-31-24-34:~ # kubectl get node
NAME                                          STATUS     ROLES                       AGE     VERSION
ip-172-31-24-161.us-east-2.compute.internal   Ready      control-plane,etcd,master   5h9m    v1.29.15-rc1+k3s1
ip-172-31-24-34.us-east-2.compute.internal    Ready      control-plane,etcd,master   5h9m    v1.29.15-rc1+k3s1
ip-172-31-26-218.us-east-2.compute.internal   Ready      <none>                      5h6m    v1.29.15-rc1+k3s1
ip-172-31-28-209.us-east-2.compute.internal   NotReady   control-plane,etcd,master   5h12m   v1.29.15-rc1+k3s1

ip-172-31-24-161:~ # kubectl get node
NAME                                          STATUS     ROLES                       AGE     VERSION
ip-172-31-24-161.us-east-2.compute.internal   Ready      control-plane,etcd,master   5h10m   v1.29.15-rc1+k3s1
ip-172-31-24-34.us-east-2.compute.internal    Ready      control-plane,etcd,master   5h10m   v1.29.15-rc1+k3s1
ip-172-31-26-218.us-east-2.compute.internal   Ready      <none>                      5h7m    v1.29.15-rc1+k3s1
ip-172-31-28-209.us-east-2.compute.internal   NotReady   control-plane,etcd,master   5h13m   v1.29.15-rc1+k3s1
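
To see the disagreement side by side, one option (a sketch, assuming SSH access and the hypothetical hostnames server1/server2/server3) is to query each server's local apiserver in a loop and compare the output:

# Compare each server's view of node status (hostnames are placeholders).
for host in server1 server2 server3; do
  echo "=== view from $host ==="
  ssh "$host" 'sudo k3s kubectl get nodes'
done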

Some journal logs on the main server:

Mar 19 19:37:06 ip-172-31-28-209 k3s[2729]: time="2025-03-19T19:37:06Z" level=error msg="Sending HTTP/1.1 503 response to 127.0.0.1:52240: runtime core not ready"
.
Mar 19 19:37:10 ip-172-31-28-209 k3s[2729]: time="2025-03-19T19:37:10Z" level=info msg="Failed to get existing traefik HelmChart" error="helmcharts.helm.cattle.io \"traefik\" not found"
.
Mar 19 19:51:26 ip-172-31-28-209 k3s[8493]: {"level":"error","ts":"2025-03-19T19:51:26.637477Z","caller":"etcdserver/server.go:2381","msg":"Validation on configuration change failed","shouldApplyV3":false,"error":"membership: too many learner members in cluster","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:2381\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:2250\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:1462\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:1277\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:1149\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/go/pkg/mod/github.com/k3s-io/etcd/pkg/v3@v3.5.19-k3s1.30/schedule/schedule.go:157"}
.
.
Mar 19 19:40:56 ip-172-31-28-209 k3s[2729]: I0319 19:40:56.786805    2729 node_controller.go:431] Initializing node ip-172-31-24-34.us-east-2.compute.internal with cloud provider
Mar 19 19:40:56 ip-172-31-28-209 k3s[2729]: E0319 19:40:56.787012    2729 node_controller.go:240] error syncing 'ip-172-31-24-34.us-east-2.compute.internal': failed to get instance metadata for node ip-172-31-24-34.us-east-2.compute.internal: address annotations not yet set, requeuing
.
.
Mar 20 00:19:24 ip-172-31-28-209 k3s[22284]: E0320 00:19:24.674052   22284 server.go:310] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
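
The "too many learner members in cluster" error suggests the embedded etcd's member list is worth inspecting when the cluster ends up in this state. A sketch of how to do that with a separately installed etcdctl, assuming the default TLS paths used by the k3s embedded etcd (adjust if your layout differs):

# etcdctl is not bundled with k3s; install it separately, then point it at the
# embedded etcd using k3s's etcd client certificates (default paths shown).
sudo etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list -w table
# Look for members stuck with IS LEARNER = true, or members that should have been removed.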

Please note that this is not easily reproduced; it doesn't happen every time.


aganesh-suse commented Mar 20, 2025


brandond commented Mar 20, 2025

@dereknola and I have run into this independently in the past. I suspect that the experimental force-cluster-reset flag that we are using to remove other cluster members has some edge conditions that can cause a split brain. Unfortunately we haven't been able to reproduce it with standalone etcd, nor have we identified anything we can do differently in k3s to avoid it.

The workaround is to restore a snapshot and then rejoin the other nodes. Restoring a snapshot when resetting cluster membership has not been observed to have the same issue with inconsistent cluster state across nodes.

It doesn't even have to be an OLD snapshot - when things are broken you can just take a snapshot on any of the affected nodes, then restore it, and things will work again.
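
For anyone hitting this, the workaround described above roughly corresponds to the following commands. This is a sketch based on the standard k3s etcd-snapshot tooling; the snapshot name is illustrative and the default snapshot directory is assumed.

# On one of the affected servers: take a fresh on-demand snapshot.
sudo k3s etcd-snapshot save --name recover

# Stop k3s on all servers, then on that same server reset the cluster from the snapshot.
# Snapshot files land under /var/lib/rancher/k3s/server/db/snapshots/ by default;
# the actual file name includes the node name and a timestamp (placeholders below).
sudo systemctl stop k3s
sudo k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/recover-<node>-<timestamp>
sudo systemctl start k3s

# On the other servers: move the old db aside and rejoin, as in the original steps.
sudo mv /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/db-backup
sudo systemctl start k3s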

@brandond brandond moved this from New to Stalled in K3s Development Mar 21, 2025
@brandond brandond added this to the Backlog milestone Mar 21, 2025
@brandond brandond moved this to Bugs in K3s Backlog Mar 21, 2025