k8s Source Code Analysis: The Event Handling Mechanism in Kubernetes

Kubernetes Events are rarely talked about, yet they matter a great deal. They are stored in etcd and record the significant events the cluster runs into during operation. This series of articles lifts the veil on Kubernetes Events step by step.

When a node or pod in the cluster misbehaves, most users run kubectl to look at the corresponding events. So where do these events come from? Each Kubernetes component reports the events it produces at runtime to the apiserver, and for every describable resource you can see its related events with kubectl describe. Which components, then, actually report events?

A brute-force search under the k8s.io/kubernetes/cmd directory is enough to find out which components produce events:

$ grep -R -n -i "EventRecorder" .

The results show that controller-manager, kube-proxy, kube-scheduler, and kubelet all use an EventRecorder. This article only covers how the kubelet uses Events.

The kubelet Event Mechanism

Kubernetes has a distributed architecture: the apiserver is the interaction hub of the whole cluster and is what clients mostly talk to, while the kubelet is the worker on each node, responsible for carrying out concrete tasks. When users create a resource, they want to see not only its final state (usually Running) but also the process it went through and the intermediate steps. This feedback is invaluable for debugging: some tasks fail or get stuck at a certain step, and with this information we can pinpoint the problem accurately.

The kubelet therefore sends events for the key steps it executes to the apiserver, so that clients can learn what happened throughout the process simply by querying, without having to log in to the kubelet's node to inspect log files or container state.

Definition of Events

Events are defined in k8s.io/api/core/v1/types.go; the struct looks like this:

// Event is a report of an event somewhere in the cluster.
type Event struct {
	metav1.TypeMeta `json:",inline"`
	// Standard object's metadata.
	// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
	metav1.ObjectMeta `json:"metadata" protobuf:"bytes,1,opt,name=metadata"`

	// The object that this event is about.
	InvolvedObject ObjectReference `json:"involvedObject" protobuf:"bytes,2,opt,name=involvedObject"`

	// This should be a short, machine understandable string that gives the reason
	// for the transition into the object's current status.
	// TODO: provide exact specification for format.
	// +optional
	Reason string `json:"reason,omitempty" protobuf:"bytes,3,opt,name=reason"`

	// A human-readable description of the status of this operation.
	// TODO: decide on maximum length.
	// +optional
	Message string `json:"message,omitempty" protobuf:"bytes,4,opt,name=message"`

	// The component reporting this event. Should be a short machine understandable string.
	// +optional
	Source EventSource `json:"source,omitempty" protobuf:"bytes,5,opt,name=source"`

	// The time at which the event was first recorded. (Time of server receipt is in TypeMeta.)
	// +optional
	FirstTimestamp metav1.Time `json:"firstTimestamp,omitempty" protobuf:"bytes,6,opt,name=firstTimestamp"`

	// The time at which the most recent occurrence of this event was recorded.
	// +optional
	LastTimestamp metav1.Time `json:"lastTimestamp,omitempty" protobuf:"bytes,7,opt,name=lastTimestamp"`

	// The number of times this event has occurred.
	// +optional
	Count int32 `json:"count,omitempty" protobuf:"varint,8,opt,name=count"`

	// Type of this event (Normal, Warning), new types could be added in the future
	// +optional
	Type string `json:"type,omitempty" protobuf:"bytes,9,opt,name=type"`

	// Time when this Event was first observed.
	// +optional
	EventTime metav1.MicroTime `json:"eventTime,omitempty" protobuf:"bytes,10,opt,name=eventTime"`

	// Data about the Event series this event represents or nil if it's a singleton Event.
	// +optional
	Series *EventSeries `json:"series,omitempty" protobuf:"bytes,11,opt,name=series"`

	// What action was taken/failed regarding to the Regarding object.
	// +optional
	Action string `json:"action,omitempty" protobuf:"bytes,12,opt,name=action"`

	// Optional secondary object for more complex actions.
	// +optional
	Related *ObjectReference `json:"related,omitempty" protobuf:"bytes,13,opt,name=related"`

	// Name of the controller that emitted this Event, e.g. `kubernetes.io/kubelet`.
	// +optional
	ReportingController string `json:"reportingComponent" protobuf:"bytes,14,opt,name=reportingComponent"`

	// ID of the controller instance, e.g. `kubelet-xyzf`.
	// +optional
	ReportingInstance string `json:"reportingInstance" protobuf:"bytes,15,opt,name=reportingInstance"`
}

InvolvedObject is the object the event is about, and Source is the component that emitted it. The events shown by kubectl usually include the Type, Reason, Age, From, and Message fields.

Kubernetes currently has only two event types, "Normal" and "Warning":

// Valid values for event types (new types could be added in future)
const (
	// Information only and will not cause any problems
	EventTypeNormal string = "Normal"
	// These events are to warn that something might go wrong
	EventTypeWarning string = "Warning"
)

Source Code Analysis of the Event Mechanism

In this part we go straight into the kubelet source code to understand how the event mechanism is implemented from end to end.

Who sends events?

Kubernetes revolves around the pod: deployments, statefulSets, and replicaSets all ultimately create pods, so the event mechanism revolves around pods as well, and events are emitted at the key steps of a pod's lifecycle. For example, the Controller Manager records node registration and deletion events as well as Deployment scaling and rollout events; the kubelet records image garbage collection events, volume mount failures, and so on; the Scheduler records scheduling events. This article only looks at the kubelet, but the other components work the same way.

Looking at pkg/kubelet/kubelet.go, you will see code like the following:

kl.recorder.Eventf(kl.nodeRef, api.EventTypeWarning, events.ContainerGCFailed, err.Error())

This line runs when container GC fails: it sends an event telling the apiserver why container garbage collection failed. Besides the kubelet itself, the kubelet's sub-components (such as imageManager and probeManager) also hold this field and use it to record important events; you can search the source code to find all the places where the kubelet emits events.

recorder is a field of the Kubelet struct:

type Kubelet struct {
	...
	// The EventRecorder to use
	recorder record.EventRecorder
	...
}

Its type is record.EventRecorder, an interface whose core methods are shown below; the code lives in k8s.io/client-go/tools/record/event.go:

type EventRecorder interface {
    Event(object runtime.Object, eventtype, reason, message string)

    Eventf(object runtime.Object, eventtype, reason, messageFmt string, args ...interface{})

    PastEventf(object runtime.Object, timestamp metav1.Time, eventtype, reason, messageFmt string, args ...interface{})
}

The concrete implementation of EventRecorder is recorderImpl:

type recorderImpl struct {
	scheme *runtime.Scheme
	source v1.EventSource
	*watch.Broadcaster
	clock clock.Clock
}

The methods of EventRecorder all produce events in a given format: Event() and Eventf() are analogous to fmt.Println() and fmt.Printf(), and the various kubelet modules call the EventRecorder to generate events. recorderImpl is the concrete object behind EventRecorder; each of its methods calls generateEvent, which initializes the event.
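To make the analogy concrete, here is a minimal sketch (not taken from the kubelet source) of how a component holding a record.EventRecorder would typically call these methods; the helper name, pod variable, reasons, and messages are invented for illustration:

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// reportImagePull is a hypothetical helper showing the two call styles.
func reportImagePull(recorder record.EventRecorder, pod *v1.Pod, err error) {
	if err != nil {
		// Event takes an already formatted message.
		recorder.Event(pod, v1.EventTypeWarning, "FailedPull", err.Error())
		return
	}
	// Eventf formats the message like fmt.Printf.
	recorder.Eventf(pod, v1.EventTypeNormal, "Pulled", "image for pod %s/%s pulled successfully", pod.Namespace, pod.Name)
}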

The following is the function that generates events:

func (recorder *recorderImpl) generateEvent(object runtime.Object, annotations map[string]string, timestamp metav1.Time, eventtype, reason, message string) {
	ref, err := ref.GetReference(recorder.scheme, object)
	if err != nil {
		klog.Errorf("Could not construct reference to: '%#v' due to: '%v'. Will not report event: '%v' '%v' '%v'", object, err, eventtype, reason, message)
		return
	}

	if !validateEventType(eventtype) {
		klog.Errorf("Unsupported event type: '%v'", eventtype)
		return
	}

	event := recorder.makeEvent(ref, annotations, eventtype, reason, message)
	event.Source = recorder.source

	go func() {
		// NOTE: events should be a non-blocking operation
		defer utilruntime.HandleCrash()
		recorder.Action(watch.Added, event)
	}()
}

func (recorder *recorderImpl) makeEvent(ref *v1.ObjectReference, annotations map[string]string, eventtype, reason, message string) *v1.Event {
	t := metav1.Time{Time: recorder.clock.Now()}
	namespace := ref.Namespace
	if namespace == "" {
		namespace = metav1.NamespaceDefault
	}
	return &v1.Event{
		ObjectMeta: metav1.ObjectMeta{
			Name:        fmt.Sprintf("%v.%x", ref.Name, t.UnixNano()),
			Namespace:   namespace,
			Annotations: annotations,
		},
		InvolvedObject: *ref,
		Reason:         reason,
		Message:        message,
		FirstTimestamp: t,
		LastTimestamp:  t,
		Count:          1,
		Type:           eventtype,
	}
}

// Action distributes the given event among all watchers.
func (m *Broadcaster) Action(action EventType, obj runtime.Object) {
	m.incoming <- Event{action, obj}
}

Once the event has been built, recorder.Action() is called to push it onto the Broadcaster's incoming event queue; Action() is a method of the Broadcaster.

In short, the EventRecorder's job is to build an Event object and write it into a channel.

Initialization of the EventBroadcaster

The entire lifecycle of events revolves around the EventBroadcaster. In the kubelet, the EventBroadcaster is initialized in k8s.io/kubernetes/cmd/kubelet/app/server.go:


func RunKubelet(kubeServer *options.KubeletServer, kubeDeps *kubelet.Dependencies, runOnce bool) error {
	...
	// initialize the event recorder
	makeEventRecorder(kubeDeps, nodeName)
	...
}

func makeEventRecorder(kubeDeps *kubelet.Dependencies, nodeName types.NodeName) {
	if kubeDeps.Recorder != nil {
		return
	}
	// Initialize the EventBroadcaster; it starts a loop goroutine that reads
	// events from the Broadcaster's channel and broadcasts them to all watchers.
	eventBroadcaster := record.NewBroadcaster()
	// Initialize the EventRecorder, which is used to produce Events.
	kubeDeps.Recorder = eventBroadcaster.NewRecorder(legacyscheme.Scheme, v1.EventSource{Component: componentKubelet, Host: string(nodeName)})
	// Log events to the local log.
	eventBroadcaster.StartLogging(glog.V(3).Infof)
	if kubeDeps.EventClient != nil {
		glog.V(4).Infof("Sending events to api server.")
		// Report events to the apiserver.
		eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeDeps.EventClient.Events("")})
	} else {
		glog.Warning("No api server defined - no events will be sent to API server.")
	}
}
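
To see the pipeline in isolation, the following is a minimal, self-contained sketch of the same wiring built directly on client-go. It is not kubelet code: the component name, host, and pod are made up, and only the logging sink is attached (nothing is sent to an apiserver); the import paths assume a klog-era client-go and may differ across Kubernetes versions:

package main

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/record"
	"k8s.io/klog"
)

func main() {
	// The broadcaster starts a background goroutine that fans events out to watchers.
	broadcaster := record.NewBroadcaster()
	// Register a watcher that simply logs every event it receives.
	broadcaster.StartLogging(klog.Infof)

	// The recorder is what components call to produce events.
	recorder := broadcaster.NewRecorder(scheme.Scheme,
		v1.EventSource{Component: "demo-component", Host: "node-1"})

	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "demo-pod", Namespace: "default"}}
	recorder.Eventf(pod, v1.EventTypeNormal, "Demo", "observed pod %s", pod.Name)

	// The pipeline is asynchronous; give it a moment to flush before exiting.
	time.Sleep(time.Second)
}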

Broadcasting Events

As mentioned above, initializing an EventBroadcaster also initializes a Broadcaster, whose job is to receive all events and broadcast them. The Broadcaster is implemented in k8s.io/apimachinery/pkg/watch/mux.go. Once created, it runs a goroutine in the background that receives every event sent by the EventRecorder. The Broadcaster keeps a map of all registered watchers and distributes each event to every one of them; each watcher has its own receive channel and can consume events from it through its ResultChan() method.

Here is how the Broadcaster broadcasts events:

// Create a new eventBroadcaster instance
...
eventBroadcaster := record.NewBroadcaster()
...

// Creates a new event broadcaster.
func NewBroadcaster() EventBroadcaster {
	return &eventBroadcasterImpl{watch.NewBroadcaster(maxQueuedEvents, watch.DropIfChannelFull), defaultSleepDuration}
}

func NewBroadcaster(queueLength int, fullChannelBehavior FullChannelBehavior) *Broadcaster {
	m := &Broadcaster{
		watchers:            map[int64]*broadcasterWatcher{},
		incoming:            make(chan Event, incomingQueueLength),
		watchQueueLength:    queueLength,
		fullChannelBehavior: fullChannelBehavior,
	}
	m.distributing.Add(1)
	go m.loop()
	return m
}

// loop receives from m.incoming and distributes to all watchers.
func (m *Broadcaster) loop() {
	// Deliberately not catching crashes here. Yes, bring down the process if there's a
	// bug in watch.Broadcaster.
	for event := range m.incoming {
		if event.Type == internalRunFunctionMarker {
			event.Object.(functionFakeRuntimeObject)()
			continue
		}
		m.distribute(event)
	}
	m.closeAll()
	m.distributing.Done()
}

Note that watcher registration happens inside StartEventWatcher, via watcher := eventBroadcaster.Watch(). In other words, right after an eventBroadcaster instance is initialized, the watchers map in the Broadcaster struct is empty; watchers are only registered and added to the Broadcaster later, when event handling is wired up, namely by calling StartLogging and StartRecordingToSink (both covered in the next section, Handling Events).

Handling Events

So where do the watchers come from? Every client that wants to process events initializes a watcher of its own, and the event-processing methods are defined on the EventBroadcaster. The following are the three EventBroadcaster functions involved in handling events:

// StartEventWatcher starts sending events received from this EventBroadcaster to the given event handler function.
// The return value can be ignored or used to stop recording, if desired.
func (eventBroadcaster *eventBroadcasterImpl) StartEventWatcher(eventHandler func(*v1.Event)) watch.Interface {
	watcher := eventBroadcaster.Watch() // register a watcher and add it to the Broadcaster
	go func() {
		defer utilruntime.HandleCrash()
		for watchEvent := range watcher.ResultChan() {
			event, ok := watchEvent.Object.(*v1.Event)
			if !ok {
				// This is all local, so there's no reason this should
				// ever happen.
				continue
			}
			eventHandler(event)
		}
	}()
	return watcher
}

StartEventWatcher() first creates a watcher, and every watcher is added to the Broadcaster's watcher list. The watcher reads events from the channel provided by the Broadcaster and passes them to eventHandler for processing. StartLogging() and StartRecordingToSink() are both wrappers around StartEventWatcher(), each supplying its own handler function.

// StartLogging starts sending events received from this EventBroadcaster to the given logging function.
// The return value can be ignored or used to stop recording, if desired.
func (eventBroadcaster *eventBroadcasterImpl) StartLogging(logf func(format string, args ...interface{})) watch.Interface {
	return eventBroadcaster.StartEventWatcher(
		func(e *v1.Event) {
			logf("Event(%#v): type: '%v' reason: '%v' %v", e.InvolvedObject, e.Type, e.Reason, e.Message)
		})
}

The eventHandler passed in by StartLogging() simply writes the events to the log.
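
Nothing limits the handlers to these two built-in ones: any consumer can register its own handler through StartEventWatcher(). Below is a small sketch (not from the Kubernetes source) of a hypothetical consumer that prints every Warning event it sees; the function name and output format are invented:

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/record"
)

// watchWarnings registers an extra handler on the broadcaster and prints
// Warning events; stopping the returned watcher unregisters it again.
func watchWarnings(broadcaster record.EventBroadcaster) watch.Interface {
	return broadcaster.StartEventWatcher(func(e *v1.Event) {
		if e.Type == v1.EventTypeWarning {
			fmt.Printf("warning on %s/%s: %s: %s\n",
				e.InvolvedObject.Namespace, e.InvolvedObject.Name, e.Reason, e.Message)
		}
	})
}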

func (eventBroadcaster *eventBroadcasterImpl) StartRecordingToSink(sink EventSink) watch.Interface {
	// The default math/rand package functions aren't thread safe, so create a
	// new Rand object for each StartRecording call.
	randGen := rand.New(rand.NewSource(time.Now().UnixNano()))
	eventCorrelator := NewEventCorrelator(clock.RealClock{})
	return eventBroadcaster.StartEventWatcher( // essentially a wrapper around StartEventWatcher
		func(event *v1.Event) {
			recordToSink(sink, event, eventCorrelator, randGen, eventBroadcaster.sleepDuration) // the eventHandler that actually gets invoked
		})
}


func recordToSink(sink EventSink, event *v1.Event, eventCorrelator *EventCorrelator, randGen *rand.Rand, sleepDuration time.Duration) {
	// Make a copy before modification, because there could be multiple listeners.
	// Events are safe to copy like this.
	eventCopy := *event
	event = &eventCopy
	result, err := eventCorrelator.EventCorrelate(event)
	if err != nil {
		utilruntime.HandleError(err)
	}
	if result.Skip {
		return
	}
	tries := 0
	for {
		if recordEvent(sink, result.Event, result.Patch, result.Event.Count > 1, eventCorrelator) {
			break
		}
		tries++
		if tries >= maxTriesPerEvent {
			klog.Errorf("Unable to write event '%#v' (retry limit exceeded!)", event)
			break
		}
		// Randomize the first sleep so that various clients won't all be
		// synced up if the master goes down.
		if tries == 1 {
			time.Sleep(time.Duration(float64(sleepDuration) * randGen.Float64()))
		} else {
			time.Sleep(sleepDuration)
		}
	}
}

StartRecordingToSink() first creates a random number generator, randGen, seeded with the current time; the randomness is used during retries so that, if the apiserver restarts, all clients do not resend their events at exactly the same moment. It then instantiates an EventCorrelator, which preprocesses events: filtering, aggregation, caching, and so on (the details are not analyzed here). Finally it passes recordToSink() as the handler function; recordToSink() sends the processed events to the apiserver. That is the complete workflow built on top of StartEventWatcher().

func recordEvent(sink EventSink, event *v1.Event, patch []byte, updateExistingEvent bool, eventCorrelator *EventCorrelator) bool {
	var newEvent *v1.Event
	var err error
	if updateExistingEvent {
		newEvent, err = sink.Patch(event, patch)
	}
	// Update can fail because the event may have been removed and it no longer exists.
	if !updateExistingEvent || (updateExistingEvent && isKeyNotFoundError(err)) {
		// Making sure that ResourceVersion is empty on creation
		event.ResourceVersion = ""
		newEvent, err = sink.Create(event)
	}
	if err == nil {
		// we need to update our event correlator with the server returned state to handle name/resourceversion
		eventCorrelator.UpdateState(newEvent)
		return true
	}

	// If we can't contact the server, then hold everything while we keep trying.
	// Otherwise, something about the event is malformed and we should abandon it.
	switch err.(type) {
	case *restclient.RequestConstructionError:
		// We will construct the request the same next time, so don't keep trying.
		klog.Errorf("Unable to construct event '%#v': '%v' (will not retry!)", event, err)
		return true
	case *errors.StatusError:
		if errors.IsAlreadyExists(err) {
			klog.V(5).Infof("Server rejected event '%#v': '%v' (will not retry!)", event, err)
		} else {
			klog.Errorf("Server rejected event '%#v': '%v' (will not retry!)", event, err)
		}
		return true
	case *errors.UnexpectedObjectError:
		// We don't expect this; it implies the server's response didn't match a
		// known pattern. Go ahead and retry.
	default:
		// This case includes actual http transport errors. Go ahead and retry.
	}
	klog.Errorf("Unable to write event: '%v' (may retry after sleeping)", err)
	return false
}

sink.Create and sink.Patch come from the auto-generated apiserver client.
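
For reference, here is a short sketch, assuming a client-go clientset has already been constructed, of how such a sink is wired up outside the kubelet; startEventSink and its parameters are hypothetical names, but EventSinkImpl and StartRecordingToSink mirror what makeEventRecorder does above:

import (
	"k8s.io/client-go/kubernetes"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// startEventSink connects an existing broadcaster to the apiserver through the
// core/v1 events client, so that recorded events end up in etcd.
func startEventSink(broadcaster record.EventBroadcaster, clientset kubernetes.Interface) {
	sink := &typedcorev1.EventSinkImpl{Interface: clientset.CoreV1().Events("")}
	broadcaster.StartRecordingToSink(sink)
}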

Summary of the Event Mechanism

This article has traced the event mechanism from beginning to end. To wrap up, here is the full journey of an event:

  1. The kubelet produces specific events through the Event, Eventf, and PastEventf methods of its recorder object.
  2. The recorder builds an Event object from the supplied arguments and sends it to the EventBroadcaster's channel.
  3. The EventBroadcaster's background goroutine reads events from the channel and broadcasts them to the previously registered handlers.
  4. The kubelet registers two handlers: one writes events to the log, the other sends them to the apiserver. Logging is trivial; it just prints.
  5. The handler that talks to the apiserver does so through an EventSink, and it preprocesses each event before sending it.
  6. Preprocessing is done by the EventCorrelator, which filters, aggregates, and deduplicates events and returns the processed event (either the original one or a newly created one).
  7. Finally the corresponding method of the restclient (eventClient) is called to send the request to the apiserver; if this fails, the request is retried.
  8. The apiserver receives the event request and persists the data to etcd.

That is how events are produced; what are they good for? They are mainly used for debugging: users can query the events of the whole cluster or of a single pod with kubectl. kubectl get events shows all events, while kubectl describe pod PODNAME shows the events of one pod. The former simply reads the event resources from the apiserver; for the latter, kubectl additionally filters by the pod's name, keeping only the events whose InvolvedObject matches the pod.
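
If you want to do that filtering yourself from the command line, the field selector on the events resource achieves the same thing; PODNAME and the namespace below are placeholders:

$ kubectl get events --field-selector involvedObject.kind=Pod,involvedObject.name=PODNAME -n default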

Stepping back, what design ideas can we borrow from this event framework? The most important one, I think, is that requirements determine the implementation.

Unlike other Kubernetes resources, an Event has one important property: it is allowed to be lost. If an event goes missing, the cluster keeps working normally. Events matter far less than cluster stability, which is why, whenever an error occurs anywhere in the event pipeline, the event is simply dropped.

Another property of events is their sheer volume: compared with pods or deployments there are far more of them, and every one means a write to etcd. If the whole cluster wrote events to etcd without restraint, it would put enormous pressure on etcd, and etcd's availability underpins the entire cluster. That is why every component aggregates and deduplicates events before writing them, reducing the number of writes that actually happen.


 
