Affinity in K8s is a set of scheduling rules applied to Pod resources. It controls how Pods are scheduled, for example which nodes a Pod must or should be placed on, and which Pods must not run together, enabling more flexible resource management and resource isolation.
K8s supports three types of affinity: node affinity (NodeAffinity), Pod affinity (PodAffinity), and Pod anti-affinity (PodAntiAffinity). Each type also comes in a hard (required) and a soft (preferred) scheduling policy, which determines whether the rule must be satisfied or only satisfied on a best-effort basis (a skeleton showing where these fields sit in a Pod spec follows the table below).
Name | Description |
---|---|
NodeAffinity (node affinity) | Specifies which nodes a Pod should be scheduled to, or which nodes it should avoid |
PodAffinity (Pod affinity) | Schedules a new Pod into the same topology domain as existing Pods that carry the specified labels |
PodAntiAffinity (Pod anti-affinity) | Keeps a new Pod out of the topology domains occupied by existing Pods that carry the specified labels |
requiredDuringSchedulingIgnoredDuringExecution | The condition must be satisfied, otherwise the Pod cannot be scheduled (it stays in the Pending state) |
preferredDuringSchedulingIgnoredDuringExecution | The condition is preferred, but scheduling is still allowed when it cannot be met (it only affects scheduling priority) |
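To show where these rules live, here is a minimal sketch of the affinity block in a Pod spec. The field names are the standard Kubernetes API fields, while the label keys and values (disktype, app: affinity-demo) and the image are placeholders for illustration only:
# Minimal sketch of the affinity block in a Pod spec (label keys/values are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    nodeAffinity:                                          # node-level rules
      requiredDuringSchedulingIgnoredDuringExecution:      # hard rule: must match
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values: ["ssd"]
    podAntiAffinity:                                       # Pod-level rules
      preferredDuringSchedulingIgnoredDuringExecution:     # soft rule: best effort
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: affinity-demo
          topologyKey: kubernetes.io/hostname
  containers:
  - name: affinity-demo
    image: nginx:1.15                                      # placeholder image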
In K8s, a topology domain (Topology Domain) identifies a group of nodes with similar properties or similar network characteristics, typically nodes located in the same physical location or the same network subnet. Topology domains are mainly used in affinity configuration to optimize resource allocation and improve availability.
In a very large cluster, topology domains can be used to record which machine room, subnet, and so on a node belongs to. Dividing nodes into topology domains is very simple: just add labels to the nodes. Nodes carrying the same label (identical key and identical value) belong to the same topology domain.
Topology key (Topology Key): specifies which label defines the topology domain. For example, setting the Topology Key to failure-domain.beta.kubernetes.io/zone means the domains formed by that label are used as the scope of the rule (in current Kubernetes versions the equivalent standard label is topology.kubernetes.io/zone).
In a very large cluster, topology can be divided by region and availability zone, and even refined down to a specific data center, machine room, or rack.
# For example, divide topology domains by region:
[root@k8s-master01 ~]# kubectl label node k8s-master01 k8s-node01 region=beijing
[root@k8s-master01 ~]# kubectl label node k8s-node02 region=nanjing
# If a region contains multiple data centers, it can be further divided by availability zone. For example, the Beijing region has two data centers, one in Haidian and one in Chaoyang:
[root@k8s-master01 ~]# kubectl label node k8s-master01 k8s-node02 zone=beijing-haidian
[root@k8s-master01 ~]# kubectl label node k8s-node01 zone=beijing-chaoyang
# If an even finer split is needed, topology domains can be divided by machine room. For example, the machines in the Haidian zone sit in different machine rooms:
[root@k8s-master01 ~]# kubectl label node k8s-master01 engineroom=beijing-haidian-c1
Node | Region | Zone (data center) | Machine room |
---|---|---|---|
k8s-master01 | beijing | beijing-haidian | beijing-haidian-c1 |
k8s-node01 | beijing | beijing-chaoyang | |
k8s-node02 | nanjing | beijing-haidian | |
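The labels applied above can be checked in one view with the -L (label columns) option of kubectl get; the command below is only a sketch of the check, with its output omitted:
[root@k8s-master01 ~]# kubectl get nodes -L region,zone,engineroom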
# Using podAntiAffinity with requiredDuringSchedulingIgnoredDuringExecution forces an application not to share a topology domain with other applications:
[root@k8s-master01 ~]# vim diff-nodes.yaml
[root@k8s-master01 ~]# cat diff-nodes.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: diff-nodes
name: diff-nodes
spec:
replicas: 2
selector:
matchLabels:
app: diff-nodes
template:
metadata:
labels:
app: diff-nodes
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- diff-nodes
topologyKey: kubernetes.io/hostname
namespaces: []
containers:
- image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
name: diff-nodes
imagePullPolicy: IfNotPresent
[root@k8s-master01 ~]# kubectl create -f diff-nodes.yaml
[root@k8s-master01 ~]# kubectl get po
NAME READY STATUS RESTARTS AGE
diff-nodes-65758d8878-hbp78 1/1 Running 0 104s
diff-nodes-65758d8878-sqh5v 1/1 Running 0 104s
- labelSelector: Pod selector configuration; multiple selectors can be configured; it supports both key=value matching (matchLabels) and expression matching (matchExpressions)
- matchExpressions: the expressions used to select matching Pods
  - operator: the operators are the same as for node affinity
    - In: equivalent to key = value
    - NotIn: equivalent to key != value
    - Exists: matches Pods that carry a label with this key
    - DoesNotExist: matches Pods that do not carry a label with this key
- topologyKey: the key of the topology domain label, i.e. the key of a label on the nodes; nodes with the same key and the same value belong to the same domain, which can be used to mark different machine rooms and regions
- namespaces: the namespaces whose Pods are matched; if empty, only the current namespace is matched
- namespaceSelector: selects namespaces by label; if empty, only the current namespace is matched, while {} matches all namespaces; it supports key=value and expression matching, written the same way as labelSelector above (see the sketch after this list)
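As a sketch of namespaceSelector (the team=frontend label and its value are hypothetical), the following anti-affinity term would keep a Pod off nodes that already run app=diff-nodes Pods in any namespace labeled team=frontend:
# Hypothetical sketch: match Pods in namespaces selected by label instead of by name
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: diff-nodes
      namespaceSelector:                 # hypothetical: namespaces labeled team=frontend
        matchLabels:
          team: frontend
      topologyKey: kubernetes.io/hostname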
# Scale to four replicas (the cluster currently has only three nodes)
[root@k8s-master01 ~]# kubectl scale deploy diff-nodes --replicas=4
# The fourth Pod cannot start
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
diff-nodes-65758d8878-hbp78 1/1 Running 0 11m 172.16.85.197 k8s-node01
diff-nodes-65758d8878-lf2gk 1/1 Running 0 34s 172.16.58.226 k8s-node02
diff-nodes-65758d8878-p4b66 0/1 Pending 0 34s
diff-nodes-65758d8878-sqh5v 1/1 Running 0 11m 172.16.32.129 k8s-master01
# There are not enough nodes available
[root@k8s-master01 ~]# kubectl describe po diff-nodes-65758d8878-p4b66
....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 26s default-scheduler 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
# Using podAntiAffinity with preferredDuringSchedulingIgnoredDuringExecution makes an application try to avoid sharing a topology domain with other applications:
[root@k8s-master01 ~]# vim diff-nodes.yaml
[root@k8s-master01 ~]# cat diff-nodes.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: diff-nodes
name: diff-nodes
spec:
replicas: 2
selector:
matchLabels:
app: diff-nodes
template:
metadata:
labels:
app: diff-nodes
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- diff-nodes
topologyKey: kubernetes.io/hostname
namespaces: []
weight: 100
containers:
- image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
name: diff-nodes
imagePullPolicy: IfNotPresent
[root@k8s-master01 ~]# kubectl create -f diff-nodes.yaml
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
diff-nodes-64cddf94f8-ccv2l 1/1 Running 0 5s 172.16.32.130 k8s-master01
diff-nodes-64cddf94f8-x4mzl 1/1 Running 0 5s 172.16.85.198 k8s-node01
[root@k8s-master01 ~]# kubectl scale deploy diff-nodes --replicas=3
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
diff-nodes-64cddf94f8-ccv2l 1/1 Running 0 26s 172.16.32.130 k8s-master01
diff-nodes-64cddf94f8-gbnkb 1/1 Running 0 5s 172.16.58.227 k8s-node02
diff-nodes-64cddf94f8-x4mzl 1/1 Running 0 26s 172.16.85.198 k8s-node01
# After a rollout restart, Pods may also land on the same node (the rule is only preferred)
[root@k8s-master01 ~]# kubectl rollout restart deploy diff-nodes
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
diff-nodes-64cddf94f8-ccv2l 1/1 Running 0 80s 172.16.32.130 k8s-master01
diff-nodes-64cddf94f8-gbnkb 1/1 Running 0 59s 172.16.58.227 k8s-node02
diff-nodes-64cddf94f8-x4mzl 1/1 Terminating 0 80s 172.16.85.198 k8s-node01
diff-nodes-7478fb9978-fhfwq 1/1 Running 0 4s 172.16.85.199 k8s-node01
diff-nodes-7478fb9978-rrbcz 0/1 ContainerCreating 0 1s k8s-master01
[root@k8s-master01 ~]# kubectl scale deploy diff-nodes --replicas=4
deployment.apps/diff-nodes scaled
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
diff-nodes-7478fb9978-dfss8 1/1 Running 0 7s 172.16.85.200 k8s-node01
diff-nodes-7478fb9978-dtb7d 1/1 Running 0 3m44s 172.16.58.228 k8s-node02
diff-nodes-7478fb9978-fhfwq 1/1 Running 0 3m51s 172.16.85.199 k8s-node01
diff-nodes-7478fb9978-rrbcz 1/1 Running 0 3m48s 172.16.32.131 k8s-master01
# If the cluster spans multiple availability zones, an application can be spread across different zones to improve its availability:
[root@k8s-master01 ~]# vim zone.yaml
[root@k8s-master01 ~]# cat zone.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: diff-zone
name: diff-zone
spec:
replicas: 2
selector:
matchLabels:
app: diff-zone
template:
metadata:
labels:
app: diff-zone
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- diff-zone
topologyKey: zone
namespaces: []
containers:
- image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
name: diff-zone
imagePullPolicy: IfNotPresent
[root@k8s-master01 ~]# kubectl create -f zone.yaml
# k8s-node01 and k8s-master01 are in different zones (k8s-node02 shares a zone with k8s-master01), so the two replicas land on them
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
diff-zone-85bc879b8c-nlkf6 1/1 Running 0 6s 172.16.85.203 k8s-node01
diff-zone-85bc879b8c-nx58l 1/1 Running 0 6s 172.16.32.134 k8s-master01
# Scale up by one more Pod; it stays in the Pending state (there are only two zones)
[root@k8s-master01 ~]# kubectl scale deploy diff-zone --replicas=3
deployment.apps/diff-zone scaled
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
diff-zone-85bc879b8c-d6gff 0/1 Pending 0 2s
diff-zone-85bc879b8c-nlkf6 1/1 Running 0 24s 172.16.85.203 k8s-node01
diff-zone-85bc879b8c-nx58l 1/1 Running 0 24s 172.16.32.134 k8s-master01
If the cluster is spread across multiple availability zones, an application can be deployed into the same zone as its cache service whenever possible, to improve the performance of accessing such infrastructure components.
# First deploy a cache service:
[root@k8s-master01 ~]# kubectl create deploy cache --image=crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/redis:7.2.5
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cache-7fc4bfff65-jmp58 1/1 Running 0 9s 172.16.85.205 k8s-node01
[root@k8s-master01 ~]# vim my-app.yaml
[root@k8s-master01 ~]# cat my-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: my-app
name: my-app
spec:
replicas: 1
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
affinity:
podAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- cache
topologyKey: zone
namespaces: []
weight: 100
containers:
- image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
name: my-app
imagePullPolicy: IfNotPresent
[root@k8s-master01 ~]# kubectl create -f my-app.yaml
# Only k8s-node01 is in the same zone as the cache Pod (it is the only node in its zone), so my-app lands there
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cache-7fc4bfff65-4kxlb 1/1 Running 0 8m29s 172.16.85.206 k8s-node01
my-app-777f8b559c-k5djf 1/1 Running 0 5s 172.16.85.207 k8s-node01
# No matter how many replicas are added, they all land on k8s-node01
[root@k8s-master01 ~]# kubectl scale deploy my-app --replicas=3
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cache-7fc4bfff65-4kxlb 1/1 Running 0 8m37s 172.16.85.206 k8s-node01
my-app-777f8b559c-7vgth 1/1 Running 0 2s 172.16.85.209 k8s-node01
my-app-777f8b559c-k5djf 1/1 Running 0 13s 172.16.85.207 k8s-node01
my-app-777f8b559c-tqzfc 1/1 Running 0 2s 172.16.85.208 k8s-node01
Suppose a subset of machines in the cluster are high-performance machines, and some compute-intensive services need to run on them for better performance. Node affinity can be used so that these Pods are preferably, or mandatorily, scheduled to those nodes.
# Add a label to k8s-node01
[root@k8s-master01 ~]# kubectl label node k8s-node01 disktype=ssd
# For example, the compute service may only run on nodes with ssd or nvme disks:
[root@k8s-master01 ~]# vim compute.yaml
[root@k8s-master01 ~]# cat compute.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: compute
name: compute
spec:
replicas: 2
selector:
matchLabels:
app: compute
template:
metadata:
labels:
app: compute
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: disktype
operator: In
values:
- ssd
- nvme
containers:
- image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
name: compute
imagePullPolicy: IfNotPresent
[root@k8s-master01 ~]# kubectl create -f compute.yaml
# All replicas land on k8s-node01
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
compute-b95b9f746-rccrq 1/1 Running 0 4s 172.16.85.210 k8s-node01
compute-b95b9f746-xnqf5 1/1 Running 0 4s 172.16.85.211 k8s-node01
# Add a label to k8s-node02
[root@k8s-master01 ~]# kubectl label node k8s-node02 disktype=nvme
# If it is not a hard requirement, the compute service can be made to prefer the high-performance machines:
[root@k8s-master01 ~]# vim compute.yaml
[root@k8s-master01 ~]# cat compute.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: compute
name: compute
spec:
replicas: 3
selector:
matchLabels:
app: compute
template:
metadata:
labels:
app: compute
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
preference:
matchExpressions:
- key: disktype
operator: In
values:
- ssd
- weight: 100
preference:
matchExpressions:
- key: disktype
operator: In
values:
- nvme
containers:
- image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
name: compute
imagePullPolicy: IfNotPresent
[root@k8s-master01 ~]# kubectl create -f compute.yaml
# k8s-node02 (nvme, weight 100) is weighted 2:1 against k8s-node01 (ssd, weight 50), and the Pods spread accordingly
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
compute-5d6d9595b8-29rlb 1/1 Running 0 4s 172.16.58.251 k8s-node02
compute-5d6d9595b8-rtg7f 1/1 Running 0 4s 172.16.85.215 k8s-node01
compute-5d6d9595b8-vgxfr 1/1 Running 0 4s 172.16.58.252 k8s-node02
If some machines in the cluster are known to perform poorly, or other factors make them undesirable, and a service should avoid them as much as possible, simply change the operator to NotIn.
# Add a label to k8s-master01
[root@k8s-master01 ~]# kubectl label node k8s-master01 performance=low
[root@k8s-master01 ~]# vim compute.yaml
[root@k8s-master01 ~]# cat compute.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: compute
name: compute
spec:
replicas: 3
selector:
matchLabels:
app: compute
template:
metadata:
labels:
app: compute
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
preference:
matchExpressions:
- key: performance
operator: NotIn
values:
- low
containers:
- image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
name: compute
imagePullPolicy: IfNotPresent
[root@k8s-master01 ~]# kubectl create -f compute.yaml
# None of the replicas land on k8s-master01
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
compute-8648d6cfcf-dbsjr 1/1 Running 0 3s 172.16.85.217 k8s-node01
compute-8648d6cfcf-n57vt 1/1 Running 0 3s 172.16.85.218 k8s-node01
compute-8648d6cfcf-z6bzp 1/1 Running 0 3s 172.16.58.254 k8s-node02
Kubernetes topologySpreadConstraints (topology spread constraints) are an advanced scheduling policy used to ensure that a workload's replicas are distributed evenly across different topology domains in the cluster (such as nodes, availability zones, or regions).
[root@k8s-master01 ~]# vim example.yaml
[root@k8s-master01 ~]# cat example.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: example
name: example
spec:
replicas: 5
selector:
matchLabels:
app: example
template:
metadata:
labels:
app: example
spec:
topologySpreadConstraints:
- maxSkew: 1
whenUnsatisfiable: DoNotSchedule
topologyKey: kubernetes.io/hostname
labelSelector:
matchLabels:
app: example
containers:
- image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
name: example
imagePullPolicy: IfNotPresent
[root@k8s-master01 ~]# kubectl create -f example.yaml
# The distribution satisfies a maximum skew of 1
[root@k8s-master01 ~]# kubectl get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
example-8695bf5cbb-5fqc2 1/1 Running 0 8s 172.16.32.142 k8s-master01
example-8695bf5cbb-cz4tn 1/1 Running 0 8s 172.16.58.255 k8s-node02
example-8695bf5cbb-hw7zm 1/1 Running 0 8s 172.16.85.219 k8s-node01
example-8695bf5cbb-j8w2x 1/1 Running 0 8s 172.16.85.220 k8s-node01
example-8695bf5cbb-w28hk 1/1 Running 0 8s 172.16.32.143 k8s-master01
- topologySpreadConstraints: topology spread constraint configuration; it spreads replicas evenly across different domains; when multiple constraints are configured, all of them must be satisfied
- maxSkew: the maximum allowed skew. For example, with maxSkew set to 1, the number of replicas in any topology domain may differ by at most 1 from the others
- whenUnsatisfiable: what to do when the constraint cannot be satisfied
  - DoNotSchedule: do not schedule the new Pod until the constraint can be satisfied
  - ScheduleAnyway: schedule the new Pod even if the constraint is not satisfied (see the zone-based sketch after this list)
- topologyKey: the label key that defines the topology domain
- labelSelector: the label selector for the Pods the constraint applies to, usually set to the labels of the current Pod
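As a sketch (reusing the custom zone label created earlier; the Deployment name and replica count are hypothetical), the same mechanism can spread replicas across zones instead of hosts, with ScheduleAnyway as a soft fallback:
# Hypothetical sketch: spread replicas across the custom "zone" label, but never block scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: example-zone-spread
  name: example-zone-spread
spec:
  replicas: 4
  selector:
    matchLabels:
      app: example-zone-spread
  template:
    metadata:
      labels:
        app: example-zone-spread
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone                      # the custom zone label added earlier
        whenUnsatisfiable: ScheduleAnyway      # soft constraint: prefer balance, never block
        labelSelector:
          matchLabels:
            app: example-zone-spread
      containers:
      - image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
        name: example-zone-spread
        imagePullPolicy: IfNotPresent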
In practice, a service is usually deployed with multiple replicas, and affinity or other policies are used to spread those replicas across different nodes as much as possible.
Topology domains can be used to divide the cluster into availability zones, affinity can then spread the service across those zones, and, where needed, certain nodes can be scheduled preferentially.
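As a closing sketch that combines the pieces above (the Deployment name and replica count are hypothetical; disktype and zone are the labels created in this article), one Deployment can prefer the high-performance nodes, avoid stacking replicas on one host, and balance replicas across zones:
# Hypothetical sketch combining node affinity, Pod anti-affinity, and topology spread
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: web-combined
  name: web-combined
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-combined
  template:
    metadata:
      labels:
        app: web-combined
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:   # prefer fast disks
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd", "nvme"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:   # avoid stacking on one host
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web-combined
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:                             # balance across zones
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web-combined
      containers:
      - image: crpi-q1nb2n896zwtcdts.cn-beijing.personal.cr.aliyuncs.com/ywb01/nginx:1.15
        name: web-combined
        imagePullPolicy: IfNotPresent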
This blog post is based on: https://edu.51cto.com/lecturer/11062970.html