Dev environment k8s restart problem
k8s 1.17.2 HA cluster restart problem. The dev environment restarts on average every 10 minutes, the test environment about once a day over the past six days, and the cube environment about once an hour. With a ramdisk, there have been no restarts for 18 hours and counting.
1. Test environment: 192.168.1.135, password: xxxx, configuration: 3 masters, 8 CPU / 32 GB, etcd version: 3.2.20
1.1 Restart counts:
[root@portal135 ~]# kubectl get pods -n kube-system|grep "kube-.*bocloud.com"
kube-apiserver-portal134.bocloud.com 1/1 Running 2 5d23h
kube-apiserver-portal135.bocloud.com 1/1 Running 8 5d23h
kube-apiserver-portal136.bocloud.com 1/1 Running 3 5d23h
kube-controller-manager-portal134.bocloud.com 1/1 Running 2 5d23h
kube-controller-manager-portal135.bocloud.com 1/1 Running 7 5d23h
kube-controller-manager-portal136.bocloud.com 1/1 Running 5 5d23h
kube-scheduler-portal134.bocloud.com 1/1 Running 2 5d23h
kube-scheduler-portal135.bocloud.com 1/1 Running 7 5d23h
kube-scheduler-portal136.bocloud.com 1/1 Running 5 5d23h
1.2 Response times:
[root@portal134 upload]# export NODE_IPS="192.168.1.134 192.168.1.135 192.168.1.136"
[root@portal134 upload]# for ip in ${NODE_IPS}; do ETCDCTL_API=3 etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key endpoint health; done
https://192.168.1.134:2379 is healthy: successfully committed proposal: took = 2.711272ms
https://192.168.1.135:2379 is healthy: successfully committed proposal: took = 2.089683ms
https://192.168.1.136:2379 is healthy: successfully committed proposal: took = 1.935061ms
2. Dev environment: 192.168.2.103, password: xxxx, configuration: 3 masters, 4 CPU / 8 GB, etcd version: 3.4.3
2.1 Restart counts:
[root@boc-108 ~]# kubectl get pods -n kube-system|grep "kube-.*.dev"
kube-apiserver-boc-103.dev 1/1 Running 63 4d4h
kube-apiserver-boc-104.dev 1/1 Running 67 4d4h
kube-apiserver-boc-108.dev 1/1 Running 82 4d4h
kube-controller-manager-boc-103.dev 1/1 Running 543 4d4h
kube-controller-manager-boc-104.dev 1/1 Running 561 4d4h
kube-controller-manager-boc-108.dev 1/1 Running 558 4d4h
kube-scheduler-boc-103.dev 1/1 Running 562 4d4h
kube-scheduler-boc-104.dev 1/1 Running 556 4d4h
kube-scheduler-boc-108.dev 1/1 Running 561 4d4h
2.2 Response times:
[root@boc-103 test]# export NODE_IPS="192.168.2.103 192.168.2.104 192.168.2.108"
[root@boc-103 test]# for ip in ${NODE_IPS}; do ETCDCTL_API=3 etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key endpoint health; done
https://192.168.2.103:2379 is healthy: successfully committed proposal: took = 1.318040913s
https://192.168.2.104:2379 is healthy: successfully committed proposal: took = 1.557571675s
https://192.168.2.108:2379 is healthy: successfully committed proposal: took = 54.749774ms
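A commit latency above 1 s is far beyond etcd's default election timeout (1000 ms), so leader elections, and the leaderelection failures shown below, are to be expected. To confirm the disk is at fault, etcd's /metrics endpoint exposes disk-latency histograms; a quick check, reusing the cert paths from the health check above (the etcd docs suggest the 99th percentile of WAL fsync should stay under 10 ms):
curl -s --cacert /etc/etcd/ssl/ca.crt --cert /etc/etcd/ssl/client.crt --key /etc/etcd/ssl/client.key https://192.168.2.103:2379/metrics | grep -E "etcd_disk_(wal_fsync|backend_commit)_duration_seconds"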
3. Error symptoms:
Cause of the kube-controller-manager and kube-scheduler restarts:
etcdserver: request timed out
E0224 03:54:36.433286 1 cronjob_controller.go:125] Failed to extract job list: etcdserver: request timed out
E0224 03:54:37.592575 1 leaderelection.go:331] error retrieving resource lock kube-system/kube-controller-manager: etcdserver: request timed out
I0224 03:54:38.047024 1 leaderelection.go:288] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
I0224 03:54:38.047089 1 event.go:281] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' portal135.bocloud.com_46d96c59-fa2a-43d3-aa0e-4c969a287338 stopped leading
F0224 03:54:38.047158 1 controllermanager.go:279] leaderelection lost
The same error also shows up repeatedly in the apiserver logs:
E0224 03:54:37.591267 1 status.go:71] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
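Because the controller-manager exits with a fatal "leaderelection lost" (the F0224 line above), kubelet restarts the container, and the previous container instance's log tail confirms the cause after each restart. A hedged example, using a pod name from this environment:
kubectl -n kube-system logs --previous kube-controller-manager-boc-103.dev | tail -n 20
kubectl -n kube-system describe pod kube-controller-manager-boc-103.dev | grep -A 5 "Last State"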
4. Cube environment: 10.10.5.8, password: xxxx, configuration: 3 masters, 8 CPU / 16 GB
4.1 Restart counts:
[root@master ~]# kubectl get pods -n kube-system|grep .novalocal
kube-apiserver-master.novalocal 1/1 Running 1 8h
kube-apiserver-node-1.novalocal 1/1 Running 1 8h
kube-apiserver-node-2.novalocal 1/1 Running 1 8h
kube-controller-manager-master.novalocal 1/1 Running 7 8h
kube-controller-manager-node-1.novalocal 1/1 Running 8 8h
kube-controller-manager-node-2.novalocal 1/1 Running 6 8h
kube-scheduler-master.novalocal 1/1 Running 8 8h
kube-scheduler-node-1.novalocal 1/1 Running 6 8h
kube-scheduler-node-2.novalocal 1/1 Running 8 8h
4.2 Response times:
https://10.10.5.8:2379 is healthy: successfully committed proposal: took = 15.970648ms
https://10.10.5.48:2379 is healthy: successfully committed proposal: took = 13.325127ms
https://10.10.5.38:2379 is healthy: successfully committed proposal: took = 18.190006ms
5. Solution
5.1 Scaled the masters up to 6 CPU / 32 GB, but I/O operations remained slow; the bottleneck is the disk read/write speed.
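One way to verify that the disk, not CPU or memory, is the bottleneck is iostat from the sysstat package (a sketch; device names differ per host). An await of tens of milliseconds and %util near 100% point at a saturated disk:
iostat -x 1 5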
5.2 Create a ramdisk: mount a tmpfs at /var/lib/etcd so that etcd's data directory is backed by memory instead of the hard disk:
mkdir -p /var/lib/etcd && mount -t tmpfs -o size=2G tmpfs /var/lib/etcd && echo "tmpfs /var/lib/etcd tmpfs defaults,size=2G 0 0" >> /etc/fstab
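Note that mounting a tmpfs over /var/lib/etcd hides the existing data, so the member comes up with an empty data directory and must re-sync from its peers or be restored from a snapshot. A cautious sequence, sketched with this environment's cert paths, is to back up first and verify the mount afterwards:
ETCDCTL_API=3 etcdctl --endpoints=https://192.168.2.103:2379 --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key snapshot save /root/etcd-backup.db
df -h /var/lib/etcd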
5.2.1 I/O comparison (ramdisk vs. data disk):
[root@node-139 ~]# dd bs=1M count=1000 if=/dev/zero of=/media/tmp/a.txt conv=fdatasync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.457043 s, 2.3 GB/s
[root@node-139 ~]# dd bs=1M count=1000 if=/dev/zero of=/home/a.txt conv=fdatasync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 43.7001 s, 24.0 MB/s
Note: /media/tmp is the ramdisk mount point; /home is on the data disk.
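dd measures sequential throughput, but etcd is mostly sensitive to the latency of small fsync'd writes to its WAL. A closer approximation, assuming fio is installed, is the write pattern recommended in the etcd documentation; the fsync latency percentiles in fio's output are the numbers to watch:
fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd --size=22m --bs=2300 --name=etcd-wal-check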
5.3 Verification: etcd response times improved dramatically.
[root@node-218 ~]# for ip in ${NODE_IPS}; do ETCDCTL_API=3 etcdctl --endpoints=https://${ip}:2379 --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key endpoint health; done
https://192.168.2.218:2379 is healthy: successfully committed proposal: took = 11.960397ms
https://192.168.2.219:2379 is healthy: successfully committed proposal: took = 10.482817ms
https://192.168.2.220:2379 is healthy: successfully committed proposal: took = 10.786569ms
[root@node-218 ~]# kubectl get pods -n kube-system|grep .dev
kube-apiserver-node-218.dev 1/1 Running 0 10m
kube-apiserver-node-219.dev 1/1 Running 0 10m
kube-apiserver-node-220.dev 1/1 Running 0 11m
kube-controller-manager-node-218.dev 1/1 Running 1 10m
kube-controller-manager-node-219.dev 1/1 Running 1 10m
kube-controller-manager-node-220.dev 1/1 Running 1 11m
kube-scheduler-node-218.dev 1/1 Running 1 10m
kube-scheduler-node-219.dev 1/1 Running 0 11m
kube-scheduler-node-220.dev 1/1 Running 1 11m
[root@node-218 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 28G 3.5G 25G 13% /
devtmpfs 5.8G 0 5.8G 0% /dev
tmpfs 5.8G 0 5.8G 0% /dev/shm
tmpfs 5.8G 34M 5.8G 1% /run
tmpfs 5.8G 0 5.8G 0% /sys/fs/cgroup
/dev/sda1 1014M 145M 870M 15% /boot
tmpfs 1.2G 0 1.2G 0% /run/user/0
tmpfs 2.0G 125M 1.9G 7% /var/lib/etcd
192.168.2.214:/opt/share 28G 2.5G 26G 9% /abcsys/upload
5.4 After 18 hours, no further restarts:
[root@node-218 ~]# kubectl get pods -n kube-system|grep dev
kube-apiserver-node-218.dev 1/1 Running 0 18h
kube-apiserver-node-219.dev 1/1 Running 0 18h
kube-apiserver-node-220.dev 1/1 Running 0 18h
kube-controller-manager-node-218.dev 1/1 Running 1 18h
kube-controller-manager-node-219.dev 1/1 Running 1 18h
kube-controller-manager-node-220.dev 1/1 Running 1 18h
kube-scheduler-node-218.dev 1/1 Running 1 18h
kube-scheduler-node-219.dev 1/1 Running 0 18h
kube-scheduler-node-220.dev 1/1 Running 1 18h
5.5 Problems inherent to the ramdisk approach
Because the data lives only in memory, it is lost if the node loses power.
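This risk can be reduced (not eliminated) by periodically snapshotting etcd to persistent storage, e.g. the NFS share visible in the df output above. A sketch of an hourly /etc/cron.d entry; the endpoint, cert paths, and target directory are assumptions to adapt ('%' must be escaped in crontab):
0 * * * * root ETCDCTL_API=3 /usr/bin/etcdctl --endpoints=https://192.168.2.218:2379 --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key snapshot save /abcsys/upload/etcd-snap-$(date +\%F-\%H).db
Losing a single member's ramdisk is also survivable in a 3-member cluster: the affected member can be removed and re-added, after which it re-syncs from the remaining peers.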
6. Summary
The dev environment needs faster disk I/O: the current disk read/write throughput is only 24.0 MB/s, which causes etcd to hit frequent request timeouts, failed health checks, and service restarts.