k8s cluster disaster recovery - the original machines come back up

Environment:
192.168.244.11  k8s-company01-master01
192.168.244.12  k8s-company01-master02
192.168.244.13  k8s-company01-master03
192.168.244.15  k8s-company01-lb
192.168.244.14  k8s-company01-worker001

Two or all three of the three masters go down
After two or three masters have gone down, the control plane is unavailable, but the pods on the worker nodes keep running. This article covers the case where the machines can be repaired and started again. For the recovery process when the machines cannot be repaired, see: k8s cluster recovery test.
Simulation test:
  • Stop the two master machines 192.168.244.12 and 192.168.244.13.
  • Keep etcd on 192.168.244.11 running normally.
  • After 192.168.244.12 and 192.168.244.13 are started again, recover the whole cluster.
    Stop machines 12 and 13 so that the cluster no longer works.
    Before the shutdown, the cluster is healthy:
    [root@k8s-company01-master01 ~]# kubectl get nodes
    NAME                      STATUS   ROLES    AGE     VERSION
    k8s-company01-master01    Ready    master   11m     v1.14.1
    k8s-company01-master02    Ready    master   9m23s   v1.14.1
    k8s-company01-master03    Ready    master   7m10s   v1.14.1
    k8s-company01-worker001   Ready    <none>   13s     v1.14.1
    [root@k8s-company01-master01 ~]#  kubectl -n kube-system get pod
    NAME                                             READY   STATUS    RESTARTS   AGE
    calico-kube-controllers-749f7c8df8-dqqkb         1/1     Running   1          5m6s
    calico-kube-controllers-749f7c8df8-mdrnz         1/1     Running   1          5m6s
    calico-kube-controllers-749f7c8df8-w89sk         1/1     Running   0          5m6s
    calico-node-6r9jj                                1/1     Running   0          22s
    calico-node-cnlqs                                1/1     Running   0          5m6s
    calico-node-fb5dh                                1/1     Running   0          5m6s
    calico-node-pmxrh                                1/1     Running   0          5m6s
    calico-typha-646cdc958c-gd6xj                    1/1     Running   0          5m6s
    coredns-56c9dc7946-hw4s8                         1/1     Running   1          11m
    coredns-56c9dc7946-nr5zp                         1/1     Running   1          11m
    etcd-k8s-company01-master01                      1/1     Running   0          10m
    etcd-k8s-company01-master02                      1/1     Running   0          9m31s
    etcd-k8s-company01-master03                      1/1     Running   0          7m18s
    kube-apiserver-k8s-company01-master01            1/1     Running   0          10m
    kube-apiserver-k8s-company01-master02            1/1     Running   0          9m31s
    kube-apiserver-k8s-company01-master03            1/1     Running   0          6m12s
    kube-controller-manager-k8s-company01-master01   1/1     Running   1          10m
    kube-controller-manager-k8s-company01-master02   1/1     Running   0          9m31s
    kube-controller-manager-k8s-company01-master03   1/1     Running   0          6m24s
    kube-proxy-gnkxl                                 1/1     Running   0          7m19s
    kube-proxy-jd82z                                 1/1     Running   0          11m
    kube-proxy-rsswz                                 1/1     Running   0          9m32s
    kube-proxy-tcx5s                                 1/1     Running   0          22s
    kube-scheduler-k8s-company01-master01            1/1     Running   1          10m
    kube-scheduler-k8s-company01-master02            1/1     Running   0          9m31s
    kube-scheduler-k8s-company01-master03            1/1     Running   0          6m14s
    

    After shutting them down:
    [root@k8s-company01-master01 ~]# kubectl get nodes
    Unable to connect to the server: unexpected EOF
    [root@k8s-company01-master01 ~]#  kubectl -n kube-system get pod
    Unable to connect to the server: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
    [root@k8s-company01-master01 ~]# ETCDCTL_API=2 etcdctl  --endpoints https://192.168.244.11:2379,https://192.168.244.12:2379,https://192.168.244.13:2379  --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
    cluster may be unhealthy: failed to list members
    Error:  client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://192.168.244.11:2379 exceeded header timeout
    ; error #1: dial tcp 192.168.244.13:2379: connect: no route to host
    ; error #2: client: endpoint https://192.168.244.12:2379 exceeded header timeout
    
    error #0: client: endpoint https://192.168.244.11:2379 exceeded header timeout
    error #1: dial tcp 192.168.244.13:2379: connect: no route to host
    error #2: client: endpoint https://192.168.244.12:2379 exceeded header timeout
    
    

    The cluster no longer works, and etcd cannot be used either.
    Start the etcd on node 11 as a single-node cluster.
    The etcd configuration the cluster uses is read from the static pod manifest /etc/kubernetes/manifests/etcd.yaml.
    Add two flags so that etcd starts as a single-node cluster; the edited command section follows the optional backup sketch below.
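    Before editing the manifest, it may be worth keeping a copy of the etcd data directory and of the original file; a minimal precaution sketch (not part of the original procedure, and the backup paths are arbitrary):
    # Run on 192.168.244.11 before editing the manifest.
    cp -a /var/lib/etcd /var/lib/etcd.bak
    # Keep the copy outside /etc/kubernetes/manifests so the kubelet does not try to run it.
    cp /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.orig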
        - etcd
        - --advertise-client-urls=https://192.168.244.11:2379
        - --cert-file=/etc/kubernetes/pki/etcd/server.crt
        - --client-cert-auth=true
        - --data-dir=/var/lib/etcd
        - --initial-advertise-peer-urls=https://192.168.244.11:2380
        - --initial-cluster=k8s-company01-master01=https://192.168.244.11:2380
        - --initial-cluster-state=new      ##   1
        - --force-new-cluster              ##   2
        - --key-file=/etc/kubernetes/pki/etcd/server.key
        - --listen-client-urls=https://127.0.0.1:2379,https://192.168.244.11:2379
        - --listen-peer-urls=https://192.168.244.11:2380
        - --name=k8s-company01-master01
        - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
        - --peer-client-cert-auth=true
        - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
        - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
        - --snapshot-count=10000
        - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    
    

    Note the two new flags marked above.
    With this change, etcd starts with single-node-cluster parameters. Once etcd is back, the cluster starts working again, as the output below shows.
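    Because etcd.yaml is a static pod manifest, the kubelet on master01 picks up the edit and re-creates the etcd container by itself; nothing has to be applied through kubectl. A quick way to watch for the restart on the node, assuming the Docker runtime used by this cluster (a sketch, not part of the original steps):
    # On 192.168.244.11: a freshly restarted etcd container should show a recent "Up ..." time
    # once the edited manifest has been saved (the pause/sandbox container is filtered out).
    docker ps --filter name=etcd | grep -v pause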
    [root@k8s-company01-master01 ~]#  kubectl get node
    NAME                      STATUS     ROLES    AGE   VERSION
    k8s-company01-master01    Ready      master   23m   v1.14.1
    k8s-company01-master02    NotReady   master   21m   v1.14.1
    k8s-company01-master03    NotReady   master   19m   v1.14.1
    k8s-company01-worker001   Ready      <none>   12m   v1.14.1
    [root@k8s-company01-master01 ~]#  kubectl -n kube-system get pod
    NAME                                             READY   STATUS             RESTARTS   AGE
    calico-kube-controllers-749f7c8df8-dqqkb         1/1     Running            3          20m
    calico-kube-controllers-749f7c8df8-mdrnz         1/1     Running            1          20m
    calico-kube-controllers-749f7c8df8-w89sk         1/1     Running            2          20m
    calico-node-6r9jj                                1/1     Running            0          15m
    calico-node-cnlqs                                1/1     Running            1          20m
    calico-node-fb5dh                                1/1     Running            0          20m
    calico-node-pmxrh                                1/1     Running            1          20m
    calico-typha-646cdc958c-gd6xj                    1/1     Running            0          20m
    coredns-56c9dc7946-hw4s8                         1/1     Running            7          26m
    coredns-56c9dc7946-nr5zp                         1/1     Running            3          26m
    etcd-k8s-company01-master01                      1/1     Running            0          3m36s
    etcd-k8s-company01-master02                      0/1     CrashLoopBackOff   4          24m
    etcd-k8s-company01-master03                      0/1     CrashLoopBackOff   4          22m
    kube-apiserver-k8s-company01-master01            1/1     Running            7          25m
    kube-apiserver-k8s-company01-master02            0/1     CrashLoopBackOff   3          24m
    kube-apiserver-k8s-company01-master03            0/1     CrashLoopBackOff   3          21m
    kube-controller-manager-k8s-company01-master01   1/1     Running            1          25m
    kube-controller-manager-k8s-company01-master02   1/1     Running            1          24m
    kube-controller-manager-k8s-company01-master03   1/1     Running            1          21m
    kube-proxy-gnkxl                                 1/1     Running            1          22m
    kube-proxy-jd82z                                 1/1     Running            0          26m
    kube-proxy-rsswz                                 1/1     Running            1          24m
    kube-proxy-tcx5s                                 1/1     Running            0          15m
    kube-scheduler-k8s-company01-master01            1/1     Running            1          25m
    kube-scheduler-k8s-company01-master02            1/1     Running            1          24m
    kube-scheduler-k8s-company01-master03            1/1     Running            1          21m
    [root@k8s-company01-master01 ~]# ETCDCTL_API=2 etcdctl  --endpoints https://192.168.244.11:2379,https://192.168.244.12:2379,https://192.168.244.13:2379  --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
    member eff3fafa1597fbf0 is healthy: got healthy result from https://192.168.244.11:2379
    cluster is healthy
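    The v2 cluster-health check above reports a single healthy member. The same can be confirmed with the v3 API, which also prints the member ID that matters when members are added back later (a sketch; the certificate flags mirror the ones used elsewhere in this walkthrough):
    ETCDCTL_API=3 etcdctl member list \
      --endpoints https://192.168.244.11:2379 \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt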
    
    

    The etcd cluster now has only one member. Some pods are still not running normally, and the etcd members on the other two nodes are in CrashLoopBackOff.
    Restore the etcd cluster and bring the whole k8s cluster back to normal
    Start servers 12 and 13.
    From server 11, add the etcd members for 12 and 13 back to re-form the cluster. Before adding them, wipe the etcd data on 12 and 13:
    cd /var/lib/etcd
    rm -rf member/
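    The wipe has to be done on both 192.168.244.12 and 192.168.244.13. A small sketch, assuming passwordless root SSH from master01 (the original steps can just as well be run locally on each node):
    for ip in 192.168.244.12 192.168.244.13; do
        ssh root@${ip} 'rm -rf /var/lib/etcd/member'
    done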
    

    Add node 12 (the member add operation is run on node 11):
    [root@k8s-company01-master01 ~]# ETCDCTL_API=3 etcdctl member add etcd-k8s-company01-master02 --peer-urls="https://192.168.244.12:2380" --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
    Member 9ada83de146cad81 added to cluster 7b96e402e17890a5
    
    ETCD_NAME="etcd-k8s-company01-master02"
    ETCD_INITIAL_CLUSTER="etcd-k8s-company01-master02=https://192.168.244.12:2380,k8s-company01-master01=https://192.168.244.11:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.244.12:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    

    The output above shows that the member was added successfully. At this point, the kubelet service on node 12 needs to be restarted:
    systemctl restart kubelet.service
    
    [root@k8s-company01-master01 ~]# ETCDCTL_API=2 etcdctl  --endpoints https://192.168.244.11:2379,https://192.168.244.12:2379,https://192.168.244.13:2379  --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
    member 9ada83de146cad81 is healthy: got healthy result from https://192.168.244.12:2379
    member eff3fafa1597fbf0 is healthy: got healthy result from https://192.168.244.11:2379
    cluster is healthy
    

    The etcd cluster now has two members.
    Continue by adding node 13 (after adding it, restart the kubelet service on node 13):
    [root@k8s-company01-master01 ~]# ETCDCTL_API=3 etcdctl member add etcd-k8s-company01-master03 --peer-urls="https://192.168.244.13:2380" --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key --cacert=/etc/kubernetes/pki/etcd/ca.crt
    Member efa2b7e4c407fb7a added to cluster 7b96e402e17890a5
    
    ETCD_NAME="etcd-k8s-company01-master03"
    ETCD_INITIAL_CLUSTER="k8s-company01-master02=https://192.168.244.12:2380,etcd-k8s-company01-master03=https://192.168.244.13:2380,k8s-company01-master01=https://192.168.244.11:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.244.13:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    [root@k8s-company01-master01 ~]# ETCDCTL_API=2 etcdctl  --endpoints https://192.168.244.11:2379,https://192.168.244.12:2379,https://192.168.244.13:2379  --cert-file=/etc/kubernetes/pki/etcd/server.crt --key-file=/etc/kubernetes/pki/etcd/server.key --ca-file=/etc/kubernetes/pki/etcd/ca.crt cluster-health
    member 9ada83de146cad81 is healthy: got healthy result from https://192.168.244.12:2379
    member efa2b7e4c407fb7a is healthy: got healthy result from https://192.168.244.13:2379
    member eff3fafa1597fbf0 is healthy: got healthy result from https://192.168.244.11:2379
    cluster is healthy
    

    The etcd cluster has fully recovered.
    The Kubernetes cluster is back to normal as well:
    [root@k8s-company01-master01 ~]#  kubectl get node
    NAME                      STATUS   ROLES    AGE   VERSION
    k8s-company01-master01    Ready    master   33m   v1.14.1
    k8s-company01-master02    Ready    master   31m   v1.14.1
    k8s-company01-master03    Ready    master   29m   v1.14.1
    k8s-company01-worker001   Ready    <none>   22m   v1.14.1
    [root@k8s-company01-master01 ~]#  kubectl -n kube-system get pod
    NAME                                             READY   STATUS    RESTARTS   AGE
    calico-kube-controllers-749f7c8df8-dqqkb         1/1     Running   3          27m
    calico-kube-controllers-749f7c8df8-mdrnz         1/1     Running   1          27m
    calico-kube-controllers-749f7c8df8-w89sk         1/1     Running   2          27m
    calico-node-6r9jj                                1/1     Running   0          22m
    calico-node-cnlqs                                1/1     Running   1          27m
    calico-node-fb5dh                                1/1     Running   0          27m
    calico-node-pmxrh                                1/1     Running   1          27m
    calico-typha-646cdc958c-gd6xj                    1/1     Running   0          27m
    coredns-56c9dc7946-hw4s8                         1/1     Running   7          33m
    coredns-56c9dc7946-nr5zp                         1/1     Running   3          33m
    etcd-k8s-company01-master01                      1/1     Running   0          10m
    etcd-k8s-company01-master02                      1/1     Running   7          31m
    etcd-k8s-company01-master03                      1/1     Running   7          29m
    kube-apiserver-k8s-company01-master01            1/1     Running   7          32m
    kube-apiserver-k8s-company01-master02            1/1     Running   6          31m
    kube-apiserver-k8s-company01-master03            1/1     Running   7          28m
    kube-controller-manager-k8s-company01-master01   1/1     Running   2          32m
    kube-controller-manager-k8s-company01-master02   1/1     Running   1          31m
    kube-controller-manager-k8s-company01-master03   1/1     Running   1          28m
    kube-proxy-gnkxl                                 1/1     Running   1          29m
    kube-proxy-jd82z                                 1/1     Running   0          33m
    kube-proxy-rsswz                                 1/1     Running   1          31m
    kube-proxy-tcx5s                                 1/1     Running   0          22m
    kube-scheduler-k8s-company01-master01            1/1     Running   2          32m
    kube-scheduler-k8s-company01-master02            1/1     Running   1          31m
    kube-scheduler-k8s-company01-master03            1/1     Running   1          28m
    
    

    If a member add succeeded but the etcd cluster still fails to come up, kick the member that failed to join out of the cluster from node 11 and simply add it again, as sketched below.
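    For reference, a broken member can be removed from node 11 with the v3 API before being re-added (a sketch; the member ID placeholder is hypothetical and has to be taken from the member list output):
    # Find the ID of the broken member first.
    ETCDCTL_API=3 etcdctl member list \
      --endpoints https://192.168.244.11:2379 \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt
    # Remove it, then repeat the member add and kubelet restart steps above.
    ETCDCTL_API=3 etcdctl member remove <MEMBER_ID> \
      --endpoints https://192.168.244.11:2379 \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt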