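Note: the environment used below is k8s v1.21.0; the steps may not apply to other versions.
My k8s cluster suddenly would not start. I checked the logs with journalctl -xefu kubelet
and found the following message: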
part of the existing bootstrap client certificate in /etc/kubernetes/kubelet.conf is expired: 2023-05-19 01:57:11 +0000 UTC
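When this message is buried in a long journal, it can help to pull out just the expiry timestamp. The following is a minimal sketch that assumes the log line has exactly the format shown above; the function name kubelet_cert_expiry is made up for illustration.

```shell
# Hypothetical helper: read kubelet log lines on stdin and print the
# timestamp that follows "is expired: " on every matching line.
kubelet_cert_expiry() {
    sed -n 's/.*is expired: \(.*\)$/\1/p'
}

# Possible usage on the affected machine (assumes journalctl is available):
#   journalctl -u kubelet --no-pager | kubelet_cert_expiry | tail -n 1
```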
This is clearly an expired certificate. To confirm, check the certificate expiry dates with
kubeadm certs check-expiration:
master> kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration
CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 May 22, 2023 05:24 UTC   <invalid>                               no
apiserver                  May 22, 2023 05:24 UTC   <invalid>       ca                      no
apiserver-etcd-client      May 22, 2023 05:24 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   May 22, 2023 05:24 UTC   <invalid>       ca                      no
controller-manager.conf    May 22, 2023 05:25 UTC   <invalid>                               no
etcd-healthcheck-client    May 22, 2023 05:24 UTC   <invalid>       etcd-ca                 no
etcd-peer                  May 22, 2023 05:24 UTC   <invalid>       etcd-ca                 no
etcd-server                May 22, 2023 05:24 UTC   <invalid>       etcd-ca                 no
front-proxy-client         May 22, 2023 05:24 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             May 22, 2023 05:25 UTC   <invalid>                               no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Nov 20, 2030 05:24 UTC   7y              no
etcd-ca                 Nov 20, 2030 05:24 UTC   7y              no
front-proxy-ca          Nov 20, 2030 05:24 UTC   7y              no
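To avoid reading the table by eye, the expired entries can be filtered out of this output. This is only a sketch: it assumes the column layout printed by kubeadm v1.21 as shown above, where an expired certificate has <invalid> in the RESIDUAL TIME column, and expired_certs is a hypothetical name.

```shell
# Hypothetical helper: read `kubeadm certs check-expiration` output on stdin
# and print the name (first column) of every row that contains "<invalid>".
expired_certs() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i == "<invalid>") { print $1; break }
        }
    }'
}

# Possible usage on the master:
#   kubeadm certs check-expiration | expired_certs
```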
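This confirms that the certificates have expired. Certificates issued by kubeadm are valid for only one year by default, while the CAs last ten years, as the output shows. Renewing the certificates as described below fixes the problem.
First, run kubeadm certs renew all on the master
to issue new certificates: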
master> kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration
certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed
Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.
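The last line of this output asks for the control plane components to be restarted. On a kubeadm cluster they run as static Pods, and one approach described in the kubeadm documentation is to move their manifests out of the manifest directory for a moment and then back. The sketch below assumes the default directory /etc/kubernetes/manifests and the kubelet's default 20-second file check interval; restart_static_pods is a hypothetical name. It only has an effect while the kubelet is running, so in the situation described in this post it would come after the kubelet fix further down.

```shell
# Hypothetical helper: bounce static Pods by moving their manifests away and back.
# $1: manifest directory (kubeadm default: /etc/kubernetes/manifests)
# $2: seconds to wait for the kubelet to notice each change (default: 20)
restart_static_pods() {
    manifest_dir="${1:-/etc/kubernetes/manifests}"
    wait_seconds="${2:-20}"
    tmp_dir="$(mktemp -d)" || return 1

    # Move every manifest out so that the kubelet stops the Pods.
    for f in "$manifest_dir"/*.yaml; do
        [ -e "$f" ] || continue
        mv "$f" "$tmp_dir"/ || return 1
    done
    sleep "$wait_seconds"

    # Move them back so that the kubelet recreates the Pods,
    # which then load the renewed certificates.
    for f in "$tmp_dir"/*.yaml; do
        [ -e "$f" ] || continue
        mv "$f" "$manifest_dir"/ || return 1
    done
    rmdir "$tmp_dir"
}

# Possible usage on the master, as root:
#   restart_static_pods
```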
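You can then run kubeadm certs check-expiration again
to verify that the expiry dates have been updated.
Next I restarted the kubelet service, but it still failed to start, this time reporting failed to run Kubelet: unable to load bootstrap kubeconfig. Note that kubelet.conf does not appear in the renew output above, so the kubelet's own client certificate was still the expired one.
I found the following fix online: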
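# Move the old certificates and keys (including the front-proxy CA) out of the way.
$ cd /etc/kubernetes/pki/
$ mv {apiserver.crt,apiserver-etcd-client.key,apiserver-kubelet-client.crt,front-proxy-ca.crt,front-proxy-client.crt,front-proxy-client.key,front-proxy-ca.key,apiserver-kubelet-client.key,apiserver.key,apiserver-etcd-client.crt} ~/
# kubeadm keeps the files that still exist and generates the missing ones.
$ kubeadm init phase certs all
# Do the same for the kubeconfig files, kubelet.conf included.
$ cd /etc/kubernetes/
$ mv {admin.conf,controller-manager.conf,kubelet.conf,scheduler.conf} ~/
$ kubeadm init phase kubeconfig all
# Refresh kubectl's copy of the admin kubeconfig.
$ cp -i /etc/kubernetes/admin.conf $HOME/.kube/config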
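To check that a regenerated kubeconfig really carries a fresh certificate, the embedded client certificate can be decoded and handed to openssl. This is a sketch under the assumption that the file stores the certificate inline in a client-certificate-data field, as the kubeadm-generated admin.conf does; kubeconfig_client_cert is a hypothetical name.

```shell
# Hypothetical helper: print the decoded (PEM) client certificate that is
# embedded in the kubeconfig file given as $1.
kubeconfig_client_cert() {
    sed -n 's/^[[:space:]]*client-certificate-data:[[:space:]]*//p' "$1" \
        | head -n 1 \
        | base64 -d
}

# Possible usage (assumes openssl is installed):
#   kubeconfig_client_cert /etc/kubernetes/admin.conf | openssl x509 -noout -enddate
```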
After restarting the kubelet service once more, it started successfully. Now check the cluster status:
master> kubectl get nodes
NAME            STATUS     ROLES    AGE      VERSION
s1001.lab.org   Ready      master   2y182d   v1.21.0
s1002.lab.org   NotReady   node     2y182d   v1.21.0
s1003.lab.org   NotReady   node     2y182d   v1.21.0
At this point the master node is running normally, but the worker nodes have not come up yet.
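With more than a handful of nodes, the ones that still need attention can be listed from the same output. This is a minimal sketch assuming the default kubectl get nodes columns shown above; not_ready_nodes is a hypothetical name.

```shell
# Hypothetical helper: read `kubectl get nodes` output on stdin and print the
# names of nodes whose STATUS column is not exactly "Ready". Cordoned nodes
# ("Ready,SchedulingDisabled") are therefore listed as well.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Possible usage on the master:
#   kubectl get nodes | not_ready_nodes
```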
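My approach here is to have the worker nodes rejoin the cluster, which renews their certificates automatically. First, create a token on the master node: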
master> kubeadm token create --print-join-command
kubeadm join 172.16.10.11:6443 --token xu2m0g.7t5rlkq2yjwpkrtv --discovery-token-ca-cert-hash sha256:0bc85cc15d5b5b76e4209d8c3781330aa38309f29c271f3c45617a053d4bb23a
Then, on the worker node, delete the old certificate files (back them up first, in case something goes wrong) and run the join command printed above:
# Run on the worker node. The join command is the one printed on the master.
rm -rf /etc/kubernetes/kubelet.conf
rm -rf /etc/kubernetes/pki/ca.crt
kubeadm join 172.16.10.11:6443 --token xu2m0g.7t5rlkq2yjwpkrtv --discovery-token-ca-cert-hash sha256:0bc85cc15d5b5b76e4209d8c3781330aa38309f29c271f3c45617a053d4bb23a
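Since a backup is recommended before deleting anything, one option is to move the files into a timestamped directory instead of removing them. The helper below is only a sketch; backup_and_remove and the k8s-backup- directory prefix are hypothetical names.

```shell
# Hypothetical helper: move the given files into a timestamped backup
# directory instead of deleting them, then print that directory.
# $1: backup root directory; remaining arguments: files to move.
backup_and_remove() {
    backup_root="$1"
    shift
    backup_dir="$backup_root/k8s-backup-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    for f in "$@"; do
        [ -e "$f" ] || continue    # skip files that are already gone
        mv "$f" "$backup_dir"/ || return 1
    done
    printf '%s\n' "$backup_dir"
}

# Possible usage on a worker node, in place of the two rm commands:
#   backup_and_remove /root /etc/kubernetes/kubelet.conf /etc/kubernetes/pki/ca.crt
```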
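Once the node has rejoined the cluster successfully, repeat the steps above for the other nodes.
Checking the cluster status again shows that every node is now Ready.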
master> kubectl get nodes
NAME            STATUS   ROLES    AGE      VERSION
s1001.lab.org   Ready    master   2y182d   v1.21.0
s1002.lab.org   Ready    node     2y182d   v1.21.0
s1003.lab.org   Ready    node     2y182d   v1.21.0
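PS: The k8s v1.21 release I am running is two years old and counts as an old version by now; the latest k8s release is currently v1.27. This post only records how I solved the problem. I will upgrade k8s when I have more time.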