Expected behavior:
Nodes should be in Ready state
Pods should be in Running state
Actual behavior:
Each server thinks the nodes are in different states:
ip-172-31-28-209:~ # kubectl get node
NAME                                          STATUS     ROLES                       AGE     VERSION
ip-172-31-24-161.us-east-2.compute.internal   NotReady   control-plane,etcd,master   5h9m    v1.29.15-rc1+k3s1
ip-172-31-24-34.us-east-2.compute.internal    NotReady   control-plane,etcd,master   5h8m    v1.29.15-rc1+k3s1
ip-172-31-26-218.us-east-2.compute.internal   NotReady   <none>                      5h6m    v1.29.15-rc1+k3s1
ip-172-31-28-209.us-east-2.compute.internal   Ready      control-plane,etcd,master   5h12m   v1.29.15-rc1+k3s1

ip-172-31-24-34:~ # kubectl get node
NAME                                          STATUS     ROLES                       AGE     VERSION
ip-172-31-24-161.us-east-2.compute.internal   Ready      control-plane,etcd,master   5h9m    v1.29.15-rc1+k3s1
ip-172-31-24-34.us-east-2.compute.internal    Ready      control-plane,etcd,master   5h9m    v1.29.15-rc1+k3s1
ip-172-31-26-218.us-east-2.compute.internal   Ready      <none>                      5h6m    v1.29.15-rc1+k3s1
ip-172-31-28-209.us-east-2.compute.internal   NotReady   control-plane,etcd,master   5h12m   v1.29.15-rc1+k3s1

ip-172-31-24-161:~ # kubectl get node
NAME                                          STATUS     ROLES                       AGE     VERSION
ip-172-31-24-161.us-east-2.compute.internal   Ready      control-plane,etcd,master   5h10m   v1.29.15-rc1+k3s1
ip-172-31-24-34.us-east-2.compute.internal    Ready      control-plane,etcd,master   5h10m   v1.29.15-rc1+k3s1
ip-172-31-26-218.us-east-2.compute.internal   Ready      <none>                      5h7m    v1.29.15-rc1+k3s1
ip-172-31-28-209.us-east-2.compute.internal   NotReady   control-plane,etcd,master   5h13m   v1.29.15-rc1+k3s1
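One way to confirm whether the etcd membership itself has diverged, and not just the node heartbeats, is to compare the member list that each server reports. The following is only a diagnostic sketch, not something from the original report: it assumes a separately installed etcdctl binary (k3s does not ship one) and the default k3s etcd client certificate paths.

ETCDCTL_API=3 etcdctl member list -w table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key

If the member IDs or the IS LEARNER column differ from one server to the next, the servers are no longer operating as a single etcd cluster, which would line up with the inconsistent node states shown above.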
Some journal logs on the main server:
Mar 19 19:37:06 ip-172-31-28-209 k3s[2729]: time="2025-03-19T19:37:06Z" level=error msg="Sending HTTP/1.1 503 response to 127.0.0.1:52240: runtime core not ready"
.
Mar 19 19:37:10 ip-172-31-28-209 k3s[2729]: time="2025-03-19T19:37:10Z" level=info msg="Failed to get existing traefik HelmChart" error="helmcharts.helm.cattle.io \"traefik\" not found"
.
Mar 19 19:51:26 ip-172-31-28-209 k3s[8493]: {"level":"error","ts":"2025-03-19T19:51:26.637477Z","caller":"etcdserver/server.go:2381","msg":"Validation on configuration change failed","shouldApplyV3":false,"error":"membership: too many learner members in cluster","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:2381\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:2250\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:1462\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:1277\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.19-k3s1.30/etcdserver/server.go:1149\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\t/go/pkg/mod/github.com/k3s-io/etcd/pkg/v3@v3.5.19-k3s1.30/schedule/schedule.go:157"}
.
.
Mar 19 19:40:56 ip-172-31-28-209 k3s[2729]: I0319 19:40:56.786805 2729 node_controller.go:431] Initializing node ip-172-31-24-34.us-east-2.compute.internal with cloud provider
Mar 19 19:40:56 ip-172-31-28-209 k3s[2729]: E0319 19:40:56.787012 2729 node_controller.go:240] error syncing 'ip-172-31-24-34.us-east-2.compute.internal': failed to get instance metadata for node ip-172-31-24-34.us-east-2.compute.internal: address annotations not yet set, requeuing
.
.
Mar 20 00:19:24 ip-172-31-28-209 k3s[22284]: E0320 00:19:24.674052 22284 server.go:310] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token has been invalidated]"
Please note that this is not easily reproduced; it doesn't happen every time.
@dereknola and I have run into this independently in the past. I suspect that the experimental force-cluster-reset flag that we are using to remove other cluster members has some edge conditions that can cause a split brain. Unfortunately we haven't been able to reproduce it with standalone etcd, nor have we identified anything we can do differently in k3s to avoid it.
The workaround is to restore a snapshot and then rejoin the other nodes. Restoring a snapshot while resetting cluster membership has not been observed to produce the same inconsistent cluster state across nodes.
It doesn't even have to be an OLD snapshot - when things are broken you can just take a snapshot on any of the affected nodes, then restore it, and things will work again.
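For completeness, a minimal sketch of that workaround using the documented k3s etcd-snapshot and cluster-reset commands; the snapshot filename is a placeholder and the paths assume the default data directory under /var/lib/rancher/k3s:

# take a fresh on-demand snapshot on one of the affected servers
k3s etcd-snapshot save --name recovery

# stop k3s on all servers, then restore on the node that holds the snapshot
systemctl stop k3s
k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot-file>

# after the reset completes, start k3s normally on that node
systemctl start k3s

# on the other servers, clear the old etcd state before rejoining
systemctl stop k3s
rm -rf /var/lib/rancher/k3s/server/db
systemctl start k3s

Assuming the other servers were originally joined with --server pointing at a fixed registration address, they should rejoin the restored cluster when they come back up.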