Running a descheduler on your own K8S cluster
v1.0
2019-08-01
1. Introduction
Once we have built a K8S cluster with multiple worker nodes, the cluster itself already provides a fair degree of fault tolerance: when a worker node fails, or is taken out of service with the drain command, the pods running on it are automatically moved to other healthy worker nodes. However, once the failed node is repaired and comes back online, or is re-added to the cluster with the uncordon command, pods that are already running are not automatically moved back to the recovered node; only newly created pods will be scheduled onto it.
In addition, different nodes may carry different loads, whether because their resources differ or because their workloads differ: some nodes end up very busy while others sit nearly idle.
To deal with both situations, besides manual rebalancing by an administrator, we can use a tool called DeScheduler to rebalance workloads automatically.
2. About DeScheduler
The DeScheduler project is hosted on GitHub:
https://github.com/kubernetes-incubator/descheduler
If you are interested, it is worth reading through its configuration details first, especially the Policy and Strategies:
- RemoveDuplicates
- LowNodeUtilization
- RemovePodsViolatingInterPodAntiAffinity
- RemovePodsViolatingNodeAffinity
We will not go into the details of these four basic strategies here; all we need to do is configure the strategies we want as a ConfigMap (a single ConfigMap can hold multiple strategies).
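As a minimal sketch of what such a policy looks like, enabling a single strategy is enough (the full, multi-strategy ConfigMap actually used in this guide is built in section 3.3 below):

apiVersion: descheduler/v1alpha1
kind: DeschedulerPolicy
strategies:
  RemoveDuplicates:
    enabled: true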
3. Deploying DeScheduler
3.1 Create an empty working directory:
mkdir descheduler-yaml
cd descheduler-yaml
3.2 Create the ClusterRole & ServiceAccount:
cat > cluster_role.yaml << END
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: descheduler
  namespace: kube-system
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
END
kubectl apply -f cluster_role.yaml
kubectl create sa descheduler -n kube-system
kubectl create clusterrolebinding descheduler \
-n kube-system \
--clusterrole=descheduler \
--serviceaccount=kube-system:descheduler
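Optionally, you can verify that the RBAC objects above were created and that the new ServiceAccount really has the expected permissions. These checks are not part of the original procedure, just a quick sanity test:

kubectl get clusterrole descheduler
kubectl get clusterrolebinding descheduler
kubectl get serviceaccount descheduler -n kube-system
# Should print "yes" if the ClusterRoleBinding took effect
kubectl auth can-i list nodes --as=system:serviceaccount:kube-system:descheduler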
3.3 Configure the ConfigMap:
cat > config_map.yaml << END
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler
  namespace: kube-system
data:
  policy.yaml: |-
    apiVersion: descheduler/v1alpha1
    kind: DeschedulerPolicy
    strategies:
      RemoveDuplicates:
        enabled: true
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 20
              memory: 20
              pods: 20
            targetThresholds:
              cpu: 50
              memory: 50
              pods: 50
      RemovePodsViolatingInterPodAntiAffinity:
        enabled: true
      RemovePodsViolatingNodeAffinity:
        enabled: true
        params:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
END
kubectl apply -f config_map.yaml
# Note: here the RemoveDuplicates strategy is enabled (enabled: true); if you do not want pods to be spread automatically onto newly joined nodes, disable it (enabled: false). If your cluster has plenty of spare resources and the trigger condition is never reached, you can lower the thresholds of the LowNodeUtilization strategy (cpu, memory and pod count; all three conditions must be met at the same time for the strategy to act).
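If you want to confirm that the policy was stored exactly as intended, you can dump the ConfigMap back out (a simple check, not part of the original steps):

kubectl -n kube-system get configmap descheduler -o yaml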
3.4 Configure the CronJob:
cat > cron_job.yaml << END
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    metadata:
      name: descheduler
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: "true"
    spec:
      template:
        spec:
          serviceAccountName: descheduler
          containers:
          - name: descheduler
            image: komljen/descheduler:v0.6.0
            volumeMounts:
            - mountPath: /policy-dir
              name: policy-volume
            command:
            - /bin/descheduler
            - --v=4
            - --max-pods-to-evict-per-node=10
            - --policy-config-file=/policy-dir/policy.yaml
          restartPolicy: "OnFailure"
          volumes:
          - name: policy-volume
            configMap:
              name: descheduler
END
kubectl apply -f cron_job.yaml
# Note: the CronJob above runs every 30 minutes; if you want to verify the effect more quickly while testing, you can shorten the schedule.
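Alternatively, instead of shortening the schedule, reasonably recent kubectl releases can fire a one-off Job straight from the CronJob; the Job name below is arbitrary:

# Trigger one descheduler run immediately (Job name "descheduler-manual" is just an example)
kubectl -n kube-system create job descheduler-manual --from=cronjob/descheduler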
4. Verification
4.1 Check the CronJob:
kubectl get cronjobs -n kube-system
You should see a result similar to the following:
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
descheduler */30 * * * * False 0 2m 32m
4.2 Check the pods of the completed jobs:
kubectl get pods -n kube-system | grep Completed
You should see something like this:
descheduler-1564670400-67tqx 0/1 Completed 0 1H
descheduler-1564670700-2vwhv 0/1 Completed 0 32m
descheduler-1564671000-g69nc 0/1 Completed 0 2m
4.3 Check the logs:
kubectl -n kube-system logs descheduler-1564671000-g69nc
If nothing was triggered, the last line will look something like this:
...
I0505 11:55:08.160964 1 node_affinity.go:72] Evicted 0 pods
4.4 Drain a test node:
kubectl drain worker03.localdomain --ignore-daemonsets --delete-local-data --grace-period=0 --force
kubectl get nodes worker03.localdomain
Confirm that the node status looks like this:
NAME STATUS ROLES AGE VERSION
worker03.localdomain Ready,SchedulingDisabled <none> 71d v1.15.0
Wait a moment, then confirm that all running pods are now running only on the other workers:
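A command along the following lines produces the listing below (assumed here; the original does not show the exact command, and -o wide is what adds the NODE column):

kubectl get pods -o wide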
web-6fc4fb46d-k5pzr 1/1 Running 0 88s 10.47.0.23 worker01.localdomain <none> <none>
web-6fc4fb46d-l9h4j 1/1 Running 0 52s 10.39.0.24 worker02.localdomain <none> <none>
web-6fc4fb46d-mqwqv 1/1 Running 0 85s 10.47.0.26 worker01.localdomain <none> <none>
web-6fc4fb46d-phr8r 1/1 Running 0 71s 10.47.0.8 worker01.localdomain <none> <none>
web-6fc4fb46d-t5ct7 1/1 Running 0 80s 10.47.0.27 worker01.localdomain <none> <none>
web-6fc4fb46d-vq4mk 1/1 Running 0 5m38s 10.39.0.8 worker02.localdomain <none> <none>
web-6fc4fb46d-ww2nq 1/1 Running 0 6m8s 10.47.0.10 worker01.localdomain <none> <none>
web-6fc4fb46d-wz8vl 1/1 Running 0 58s 10.39.0.22 worker02.localdomain <none> <none>
web-6fc4fb46d-xvk48 1/1 Running 0 5m25s 10.39.0.11 worker02.localdomain <none> <none>
web-6fc4fb46d-xxr5q 1/1 Running 0 5m56s 10.47.0.16 worker01.localdomain <none> <none>
web-6fc4fb46d-zcg6l 1/1 Running 0 2m29s 10.47.0.18 worker01.localdomain <none> <none>
web-6fc4fb46d-zh7zv 1/1 Running 0 11m 10.39.0.19 worker02.localdomain <none> <none>
web-6fc4fb46d-zldt7 1/1 Running 0 5m31s 10.39.0.4 worker02.localdomain <none> <none>
web-6fc4fb46d-zxrxw 1/1 Running 0 31m 10.39.0.7 worker02.localdomain <none> <none>
4.5 Bring the node back online:
kubectl uncordon worker03.localdomain
kubectl get nodes
Confirm that all nodes are in the Ready state:
NAME STATUS ROLES AGE VERSION
master01.localdomain Ready master 71d v1.15.0
master02.localdomain Ready master 71d v1.15.0
master03.localdomain Ready master 71d v1.15.0
worker01.localdomain Ready <none> 71d v1.15.0
worker02.localdomain Ready <none> 71d v1.15.0
worker03.localdomain Ready <none> 71d v1.15.0
Then check that the CronJob has just run its most recent job:
kubectl get pods -n kube-system | grep Completed
You should see something like this:
descheduler-1564671600-spl42 0/1 Completed 0 1h
descheduler-1564671900-2sn9j 0/1 Completed 0 30m
descheduler-1564672200-sq5zw 0/1 Completed 0 77s
Check the logs again:
kubectl -n kube-system logs descheduler-1564672200-sq5zw
This time you will find that a considerable number of pods have been evicted:
...
I0801 15:11:00.012104 1 node_affinity.go:72] Evicted 20 pods
Finally, confirm that pods have been rescheduled back onto the recovered node:
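As in step 4.4, a listing along the lines of the command below (again assumed, not shown in the original) reveals which node each pod landed on:

kubectl get pods -o wide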
web-6fc4fb46d-n687t 1/1 Running 0 87s 10.42.0.17 worker03.localdomain <none> <none>
web-6fc4fb46d-nzdrs 1/1 Running 0 91s 10.42.0.16 worker03.localdomain <none> <none>
web-6fc4fb46d-qrn6n 1/1 Running 0 2m8s 10.47.0.14 worker01.localdomain <none> <none>
web-6fc4fb46d-qxd8v 1/1 Running 0 2m1s 10.39.0.15 worker02.localdomain <none> <none>
web-6fc4fb46d-rpw8b 1/1 Running 0 70s 10.42.0.11 worker03.localdomain <none> <none>
web-6fc4fb46d-rxxrn 1/1 Running 0 2m3s 10.47.0.19 worker01.localdomain <none> <none>
web-6fc4fb46d-svts8 1/1 Running 0 2m6s 10.47.0.15 worker01.localdomain <none> <none>
web-6fc4fb46d-v9q9c 1/1 Running 0 2m4s 10.47.0.17 worker01.localdomain <none> <none>
web-6fc4fb46d-x5vrs 1/1 Running 0 110s 10.39.0.21 worker02.localdomain <none> <none>
web-6fc4fb46d-xfrnh 1/1 Running 0 76s 10.42.0.8 worker03.localdomain <none> <none>
web-6fc4fb46d-xmz64 1/1 Running 0 7m11s 10.42.0.4 worker03.localdomain <none> <none>
web-6fc4fb46d-z2xhw 1/1 Running 0 7m9s 10.42.0.7 worker03.localdomain <none> <none>
web-6fc4fb46d-zkv95 1/1 Running 0 7m12s 10.42.0.2 worker03.localdomain <none> <none>
web-6fc4fb46d-zltxl 1/1 Running 0 105s 10.47.0.6 worker01.localdomain <none> <none>
5. Conclusion
With DeScheduler we can rebalance application workloads dynamically and automatically, spreading the load across all worker nodes. This keeps cluster resource consumption sensible and helps improve fault tolerance and service stability. It is well worth a try.