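When you reduce the replica count of a Deployment or ReplicaSet, the controller has to choose among the Pods it manages: which one gets deleted? This happens frequently during application updates and autoscaling. Unlike the "passive" eviction triggered by a node running short of resources, the controller's "active" scale-down follows its own explicit set of priority rules. This article walks through the internal decision logic a controller uses to pick Pods for deletion when scaling down.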
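Suppose a Deployment manages 5 identical Pods and needs to shrink to 3. Because all 5 Pods come from the same Pod Template, their PriorityClass and QoS class are normally identical, so the controller cannot use either of them to tell the Pods apart.
So what does the controller actually sort and select on?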
Talk is cheap, show me the code!
Let's read the Kubernetes source code and walk through it step by step:
// ActivePodsWithRanks is a sortable list of pods and a list of corresponding
// ranks which will be considered during sorting. The two lists must have equal
// length. After sorting, the pods will be ordered as follows, applying each
// rule in turn until one matches:
//
// 1. If only one of the pods is assigned to a node, the pod that is not
// assigned comes before the pod that is.
// 2. If the pods' phases differ, a pending pod comes before a pod whose phase
// is unknown, and a pod whose phase is unknown comes before a running pod.
// 3. If exactly one of the pods is ready, the pod that is not ready comes
// before the ready pod.
// 4. If controller.kubernetes.io/pod-deletion-cost annotation is set, then
// the pod with the lower value will come first.
// 5. If the pods' ranks differ, the pod with greater rank comes before the pod
// with lower rank.
// 6. If both pods are ready but have not been ready for the same amount of
// time, the pod that has been ready for a shorter amount of time comes
// before the pod that has been ready for longer.
// 7. If one pod has a container that has restarted more than any container in
// the other pod, the pod with the container with more restarts comes
// before the other pod.
// 8. If the pods' creation times differ, the pod that was created more recently
// comes before the older pod.
//
// In 6 and 8, times are compared in a logarithmic scale. This allows a level
// of randomness among equivalent Pods when sorting. If two pods have the same
// logarithmic rank, they are sorted by UUID to provide a pseudorandom order.
//
// If none of these rules matches, the second pod comes before the first pod.
//
// The intention of this ordering is to put pods that should be preferred for
// deletion first in the list.
type ActivePodsWithRanks struct {
	// Pods is a list of pods.
	Pods []*v1.Pod

	// Rank is a ranking of pods. This ranking is used during sorting when
	// comparing two pods that are both scheduled, in the same phase, and
	// having the same ready status.
	Rank []int

	// Now is a reference timestamp for doing logarithmic timestamp comparisons.
	// If zero, comparison happens without scaling.
	Now metav1.Time
}

func (s ActivePodsWithRanks) Len() int {
	return len(s.Pods)
}

func (s ActivePodsWithRanks) Swap(i, j int) {
	s.Pods[i], s.Pods[j] = s.Pods[j], s.Pods[i]
	s.Rank[i], s.Rank[j] = s.Rank[j], s.Rank[i]
}

// Less compares two pods with corresponding ranks and returns true if the first
// one should be preferred for deletion.
func (s ActivePodsWithRanks) Less(i, j int) bool {
	// 1. Unassigned < assigned
	// If only one of the pods is unassigned, the unassigned one is smaller
	if s.Pods[i].Spec.NodeName != s.Pods[j].Spec.NodeName && (len(s.Pods[i].Spec.NodeName) == 0 || len(s.Pods[j].Spec.NodeName) == 0) {
		return len(s.Pods[i].Spec.NodeName) == 0
	}
	// 2. PodPending < PodUnknown < PodRunning
	if podPhaseToOrdinal[s.Pods[i].Status.Phase] != podPhaseToOrdinal[s.Pods[j].Status.Phase] {
		return podPhaseToOrdinal[s.Pods[i].Status.Phase] < podPhaseToOrdinal[s.Pods[j].Status.Phase]
	}
	// 3. Not ready < ready
	// If only one of the pods is not ready, the not ready one is smaller
	if podutil.IsPodReady(s.Pods[i]) != podutil.IsPodReady(s.Pods[j]) {
		return !podutil.IsPodReady(s.Pods[i])
	}
	// 4. lower pod-deletion-cost < higher pod-deletion cost
	if utilfeature.DefaultFeatureGate.Enabled(features.PodDeletionCost) {
		pi, _ := helper.GetDeletionCostFromPodAnnotations(s.Pods[i].Annotations)
		pj, _ := helper.GetDeletionCostFromPodAnnotations(s.Pods[j].Annotations)
		if pi != pj {
			return pi < pj
		}
	}
	// 5. Doubled up < not doubled up
	// If one of the two pods is on the same node as one or more additional
	// ready pods that belong to the same replicaset, whichever pod has more
	// colocated ready pods is less
	if s.Rank[i] != s.Rank[j] {
		return s.Rank[i] > s.Rank[j]
	}
	// TODO: take availability into account when we push minReadySeconds information from deployment into pods,
	//       see https://github.com/kubernetes/kubernetes/issues/22065
	// 6. Been ready for empty time < less time < more time
	// If both pods are ready, the latest ready one is smaller
	if podutil.IsPodReady(s.Pods[i]) && podutil.IsPodReady(s.Pods[j]) {
		readyTime1 := podReadyTime(s.Pods[i])
		readyTime2 := podReadyTime(s.Pods[j])
		if !readyTime1.Equal(readyTime2) {
			if !utilfeature.DefaultFeatureGate.Enabled(features.LogarithmicScaleDown) {
				return afterOrZero(readyTime1, readyTime2)
			} else {
				if s.Now.IsZero() || readyTime1.IsZero() || readyTime2.IsZero() {
					return afterOrZero(readyTime1, readyTime2)
				}
				rankDiff := logarithmicRankDiff(*readyTime1, *readyTime2, s.Now)
				if rankDiff == 0 {
					return s.Pods[i].UID < s.Pods[j].UID
				}
				return rankDiff < 0
			}
		}
	}
	// 7. Pods with containers with higher restart counts < lower restart counts
	if res := compareMaxContainerRestarts(s.Pods[i], s.Pods[j]); res != nil {
		return *res
	}
	// 8. Empty creation time pods < newer pods < older pods
	if !s.Pods[i].CreationTimestamp.Equal(&s.Pods[j].CreationTimestamp) {
		if !utilfeature.DefaultFeatureGate.Enabled(features.LogarithmicScaleDown) {
			return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
		} else {
			if s.Now.IsZero() || s.Pods[i].CreationTimestamp.IsZero() || s.Pods[j].CreationTimestamp.IsZero() {
				return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
			}
			rankDiff := logarithmicRankDiff(s.Pods[i].CreationTimestamp, s.Pods[j].CreationTimestamp, s.Now)
			if rankDiff == 0 {
				return s.Pods[i].UID < s.Pods[j].UID
			}
			return rankDiff < 0
		}
	}
	return false
}
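The Less method above calls logarithmicRankDiff without showing its body. The following is a minimal standalone sketch of the idea, assuming that ages are bucketed by the integer part of log2 of their duration in nanoseconds. It uses time.Time in place of metav1.Time, and the names logRank and demoDiffs are illustrative only; the upstream helper may differ in detail.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// logRank buckets an age by the integer part of log2 of its nanosecond count.
// Non-positive ages get rank -1. (Assumed bucketing scheme for this sketch.)
func logRank(age time.Duration) int64 {
	if age <= 0 {
		return -1
	}
	return int64(math.Log2(float64(age)))
}

// logarithmicRankDiff returns rank(now-t1) - rank(now-t2).
// A result of 0 means both timestamps fall into the same bucket, in which
// case the caller falls back to comparing UIDs.
func logarithmicRankDiff(t1, t2, now time.Time) int64 {
	return logRank(now.Sub(t1)) - logRank(now.Sub(t2))
}

// demoDiffs compares a 10-minute-old timestamp against a 12-minute-old one
// and against a 24-hour-old one.
func demoDiffs() (sameBucket, crossBucket int64) {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	tenMin := now.Add(-10 * time.Minute)
	twelveMin := now.Add(-12 * time.Minute)
	oneDay := now.Add(-24 * time.Hour)
	return logarithmicRankDiff(tenMin, twelveMin, now),
		logarithmicRankDiff(tenMin, oneDay, now)
}

func main() {
	same, cross := demoDiffs()
	fmt.Println("10m vs 12m:", same)
	fmt.Println("10m vs 24h:", cross)
}
```

Under this bucketing, Pods of roughly similar age compare as equal and are ordered by UID instead, while a much newer Pod gets a negative difference and therefore sorts ahead of the older one.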
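A Deployment's scale-down is actually carried out by the ReplicaSet it controls. When picking Pods to delete, the ReplicaSet controller follows a carefully designed sort whose goal is to delete the "least valuable" Pods first and so minimize the impact on the service.
The comparison itself is the ActivePodsWithRanks type shown above, which lives in the shared pkg/controller package; the ReplicaSet controller computes the ranks and runs the sort. The type defines an explicit priority order: the earlier a Pod sorts, the sooner it is deleted.
According to the implementation of ActivePodsWithRanks.Less(), there are 8 decision rules, applied in order until one of them distinguishes the two Pods: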
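1. Node assignment: if only one of the two Pods is assigned to a node, the unassigned Pod is deleted before the assigned one.
2. Pod phase: if the phases differ, the order is Pending < Unknown < Running. A Pending Pod is deleted before an Unknown one, and an Unknown Pod before a Running one.
3. Readiness: if only one of the two Pods is ready, the Pod that is not ready is deleted before the ready one.
4. Pod deletion cost annotation (controller.kubernetes.io/pod-deletion-cost): this is the most important and most direct means of manual intervention. It became a Beta feature in Kubernetes v1.22. You add this annotation to a Pod to influence the deletion order, and the Pod with the lower value is deleted first. The value is a string that parses as an int32.
5. Pod rank: if the ranks differ, the Pod with the higher rank is deleted before the Pod with the lower rank. Per the source comments, the rank reflects how many other ready Pods of the same ReplicaSet share the Pod's node, so "doubled up" Pods go first.
6. Ready time: if both Pods are ready but have been ready for different lengths of time, the Pod that has been ready for the shorter time is deleted first (times are compared on a logarithmic scale, which adds some randomness).
7. Container restarts: if one Pod has a container that has restarted more often than any container in the other Pod, that Pod is deleted first, because frequent restarts usually indicate an unstable Pod.
8. Creation time: if the creation times differ, the more recently created (newest) Pod is deleted first, again compared on a logarithmic scale to add randomness. This protects long-running older Pods that may hold important state or caches.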
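To make the "apply each rule in turn" structure concrete, here is a simplified sketch of such a comparator over a toy Pod type. Everything in it is an assumption for illustration: toyPod, deleteFirst, deletionOrder and samplePods are hypothetical names, the phase and ready-time rules are left out, and timestamps are compared directly rather than on a logarithmic scale.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// toyPod is a hypothetical, flattened stand-in for v1.Pod used only here.
type toyPod struct {
	Name         string
	Node         string // empty means not yet assigned to a node
	Ready        bool
	DeletionCost int32 // pod-deletion-cost annotation value, 0 if unset
	Rank         int   // colocated ready pods of the same ReplicaSet
	Restarts     int32 // highest restart count among the pod's containers
	Created      time.Time
}

// deleteFirst reports whether a should be deleted before b. Each rule only
// decides when the two pods differ on it; otherwise the next rule is tried.
func deleteFirst(a, b toyPod) bool {
	if (a.Node == "") != (b.Node == "") { // unassigned before assigned
		return a.Node == ""
	}
	if a.Ready != b.Ready { // not ready before ready
		return !a.Ready
	}
	if a.DeletionCost != b.DeletionCost { // lower cost first
		return a.DeletionCost < b.DeletionCost
	}
	if a.Rank != b.Rank { // higher rank first
		return a.Rank > b.Rank
	}
	if a.Restarts != b.Restarts { // more restarts first
		return a.Restarts > b.Restarts
	}
	if !a.Created.Equal(b.Created) { // newer first
		return a.Created.After(b.Created)
	}
	return false
}

// deletionOrder returns pod names, earliest deletion candidate first.
func deletionOrder(pods []toyPod) []string {
	sorted := append([]toyPod(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return deleteFirst(sorted[i], sorted[j])
	})
	names := make([]string, len(sorted))
	for i, p := range sorted {
		names[i] = p.Name
	}
	return names
}

// samplePods builds five pods that each trip a different rule.
func samplePods() []toyPod {
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return []toyPod{
		{Name: "old-stable", Node: "n1", Ready: true, Created: base},
		{Name: "new-ready", Node: "n2", Ready: true, Created: base.Add(time.Hour)},
		{Name: "cheap", Node: "n3", Ready: true, DeletionCost: -100, Created: base},
		{Name: "not-ready", Node: "n1", Ready: false, Created: base},
		{Name: "unscheduled", Created: base.Add(2 * time.Hour)},
	}
}

func main() {
	// Prints: [unscheduled not-ready cheap new-ready old-stable]
	fmt.Println(deletionOrder(samplePods()))
}
```

Scaling this toy set from 5 down to 3 would remove the first two names in the list, which mirrors how the earlier rules take precedence over deletion cost and age.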
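Unlike ReplicaSet, StatefulSet scale-down is simple and direct: Pods are deleted strictly in reverse ordinal order. For example, when a 3-replica StatefulSet (ss-0, ss-1, ss-2) scales down to 2 replicas, ss-2 is always deleted first. This preserves its ordered, stable guarantees.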
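The reverse-ordinal rule is simple enough to sketch in a few lines. The function name scaleDownOrder is hypothetical, and the sketch only models the naming and ordering described above, not the StatefulSet controller itself.

```go
package main

import "fmt"

// scaleDownOrder lists the Pods a StatefulSet called name would remove when
// going from current to target replicas, highest ordinal first.
func scaleDownOrder(name string, current, target int) []string {
	var victims []string
	for ord := current - 1; ord >= target; ord-- {
		victims = append(victims, fmt.Sprintf("%s-%d", name, ord))
	}
	return victims
}

func main() {
	fmt.Println(scaleDownOrder("ss", 3, 2)) // [ss-2]
	fmt.Println(scaleDownOrder("ss", 5, 2)) // [ss-4 ss-3 ss-2]
}
```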
pod-deletion-cost
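Suppose you have an application in which some Pods are handling live requests while others may be idle because load has dropped. An external monitoring system can give a Pod a lower deletion cost whenever it becomes idle.
Example: marking a Pod as easy to delete
# Give my-pod-xyz a very low deletion cost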
kubectl annotate pod my-pod-xyz controller.kubernetes.io/pod-deletion-cost="-100"
The next time a scale-down happens, this Pod will be picked first because of its very low deletion cost.
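For intuition about how such an annotation turns into a number, here is a minimal sketch of reading it from a Pod's annotation map. It assumes a missing annotation counts as 0 and that the value must parse as a base-10 int32; deletionCost is a hypothetical helper, and the upstream GetDeletionCostFromPodAnnotations may validate more strictly.

```go
package main

import (
	"fmt"
	"strconv"
)

const deletionCostKey = "controller.kubernetes.io/pod-deletion-cost"

// deletionCost returns the deletion cost stored in annotations.
// A missing annotation yields 0; a value that does not fit an int32 is an error.
func deletionCost(annotations map[string]string) (int32, error) {
	raw, ok := annotations[deletionCostKey]
	if !ok {
		return 0, nil
	}
	v, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid %s value %q: %w", deletionCostKey, raw, err)
	}
	return int32(v), nil
}

func main() {
	for _, raw := range []string{"-100", "42", "not-a-number"} {
		cost, err := deletionCost(map[string]string{deletionCostKey: raw})
		fmt.Println(raw, cost, err)
	}
	cost, _ := deletionCost(nil)
	fmt.Println("unset:", cost)
}
```

Note that the Less method shown earlier discards the parse error, so in that code path an unparsable value is compared as whatever the helper returns alongside the error.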
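When scaling down, the Deployment/ReplicaSet controllers do not use PriorityClass or QoS as their primary decision criteria. The controller.kubernetes.io/pod-deletion-cost annotation is the most direct and most powerful tool for steering scale-down behavior, because it lets you feed application-level state back to Kubernetes.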