Monitoring Lecture Series (10): Common System Monitoring Metrics, CPU

1. Introduction

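As mentioned earlier, we evaluate system architecture from the standpoint of operational complexity, and templates are one of the things we have to consider. Some traditional monitoring software ships with a rich set of built-in templates, so that once it is installed we can get a good-looking graph with minimal effort. Zabbix, for example, has a large number of built-in templates that make visualization very convenient. Newer monitoring software takes a different approach. Grafana, for example, provides very few official templates, and most come from the community. Grafana collects these community templates, applies some unified validation and quality control, and publishes them at grafana.com/grafana/dashboards and plugins, extending the software's functionality through add-ons. We will cover this in detail when we get to Grafana. Today we will take the commonly used metrics from the following sources as examples:

System commands
Grafana dashboard (Grafana 7 + dashboard 1860); I used a template that has been downloaded more than 4.8 million times as a reference
Zabbix dashboard (Zabbix 5)
node_exporter (1.0.1)

and use them to explain the metrics we most often care about, such as: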

CPU
Memory
Disk
Network
Processes

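2. Monitoring the CPU

The CPU is the brain of the computer, and it involves a great many metrics. The first thing to know is that CPU time is consumed mainly in two spaces: user space and system/kernel space.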

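2.1. Viewing CPU metrics on the system

When troubleshooting, we usually log in to the console to locate the problem, so we need some system tools or commands to view CPU metrics.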

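Through commands, for example top, vmstat, and sar
Through files. Everything is a file: the CPU metrics are actually stored in files, and the commands and tools merely take a snapshot of them, process it, and present it in a form that is comfortable to read. The data usually comes from /proc/stat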
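To make the "everything is a file" point concrete, here is a minimal Python sketch that parses the cpu lines of /proc/stat. The function names and the SAMPLE text are made up for illustration; the column order is the one documented for /proc/stat on modern Linux kernels, and the values are cumulative ticks since boot rather than percentages.

```python
# Column order of the "cpu" lines in /proc/stat on modern Linux kernels.
# Values are cumulative ticks (USER_HZ units, commonly 1/100 s) since boot.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Turn one 'cpu ...' line into (name, {field: ticks})."""
    parts = line.split()
    name = parts[0]
    values = [int(v) for v in parts[1:]]
    # Older kernels expose fewer columns; zip() simply stops early.
    return name, dict(zip(CPU_FIELDS, values))


def read_cpu_counters(stat_text):
    """Collect the aggregate 'cpu' line and every 'cpuN' line."""
    counters = {}
    for line in stat_text.splitlines():
        if line.startswith("cpu"):
            name, fields = parse_cpu_line(line)
            counters[name] = fields
    return counters


# Hypothetical sample in /proc/stat format, used so the sketch runs anywhere.
SAMPLE = """cpu  1000 0 300 9000 50 0 20 0 0 0
cpu0 500 0 150 4500 25 0 10 0 0 0
cpu1 500 0 150 4500 25 0 10 0 0 0
intr 123456 10 20
ctxt 654321
"""

if __name__ == "__main__":
    for name, fields in read_cpu_counters(SAMPLE).items():
        print(name, fields)
```

On a Linux host, the same function can be fed the real file, for example read_cpu_counters(open("/proc/stat").read()).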

Taking vmstat as an example, its output looks like this:

root@node5:~# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 3316576  25828 290740    0    0     9    10   91  111  3  1 96  0  0

Let's start with the CPU-related part, which is split into the system and cpu column groups. Here is what the man page says:

   System
       # Number of interrupts per second, including clock interrupts
       in: The number of interrupts per second, including the clock.
       # Number of context switches per second
       cs: The number of context switches per second.

   CPU
       These are percentages of total CPU time.
       # Time spent running user-space code
       us: Time spent running non-kernel code.  (user time, including nice time)
       # Time spent running kernel-space code
       sy: Time spent running kernel code.  (system time)
       # Time during which nothing was running
       id: Time spent idle.  Prior to Linux 2.5.41, this includes IO-wait time.
       # Time the CPU sat idle while waiting for I/O to complete
       wa: Time spent waiting for IO.  Prior to Linux 2.5.41, included in idle.
       # Time this virtual machine wanted to run but the hypervisor gave the CPU to something else
       st: Time stolen from a virtual machine.  Prior to Linux 2.6.11, unknown.

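The system columns mainly measure interrupts and context switches. As we know, a CPU achieves multitasking by dividing its time into slices and switching between programs; it switches fast enough that the system appears to be running many programs at once. A switch can be roughly broken into two steps.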

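Either a program knows it should give up the CPU and yields voluntarily, or the system decides for some reason to force a switch.
At the switch, if the next time slice goes to the same program that ran in the previous one, no context switch is needed. If a different program gets to run, the CPU has to load that program's instructions and state from memory or the CPU cache, and that is when a context switch happens.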

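There are two abnormal patterns to watch for in these metrics

The first is a very high number of switches, which means the system is very busy.
The second is a very high in count with a very low cs count, which suggests that a single program is occupying most of the CPU's computing capacity.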
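The in and cs columns can be reproduced by hand, which helps when judging the two patterns above. The sketch below assumes two snapshots of /proc/stat text taken a known number of seconds apart; it relies on the intr line (whose first number is the total interrupt count) and the ctxt line (total context switches since boot). The function names are hypothetical, and what counts as "high" depends on the workload, so no thresholds are hard-coded.

```python
def read_counter(stat_text, key):
    """Return the first numeric field of the /proc/stat line starting with key."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == key:
            return int(parts[1])
    raise KeyError(key)


def rates_per_second(before, after, interval_seconds):
    """Per-second interrupt ('in') and context-switch ('cs') rates
    between two /proc/stat snapshots taken interval_seconds apart."""
    d_intr = read_counter(after, "intr") - read_counter(before, "intr")
    d_ctxt = read_counter(after, "ctxt") - read_counter(before, "ctxt")
    return {"in": d_intr / interval_seconds, "cs": d_ctxt / interval_seconds}


if __name__ == "__main__":
    # Hypothetical snapshots taken 10 seconds apart.
    first = "intr 1000 1 2\nctxt 5000\n"
    second = "intr 1910 3 4\nctxt 6110\n"
    print(rates_per_second(first, second, 10))
```

Comparing the two rates over time, rather than looking at a single reading, is what makes a spike in switches or an in/cs imbalance visible.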

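The cpu columns measure where the CPU's own time goes, covering both kernel space and user space. The unit is a percentage of total CPU time, so the five columns us, sy, id, wa, and st should add up to about 100% (allowing for rounding). In the sample above wa and st are both 0, so us, sy, and id alone add up to 100.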
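The percentages are derived from differences between two readings of the cumulative counters in /proc/stat. The following Python sketch shows the arithmetic. The grouping (nice counted under us, irq and softirq counted under sy) is an assumption about how vmstat folds the raw columns together, and the function name is made up for illustration.

```python
def cpu_percentages(before, after):
    """Compute vmstat-style percentages from two tick-counter dicts.

    before/after map /proc/stat field names to cumulative ticks,
    e.g. {"user": 100, "nice": 0, "system": 30, ...}.
    """
    # guest and guest_nice are already counted inside user and nice,
    # so they are left out of the total to avoid double counting.
    modes = ("user", "nice", "system", "idle",
             "iowait", "irq", "softirq", "steal")
    deltas = {m: after[m] - before[m] for m in modes}
    total = sum(deltas.values())
    if total == 0:
        return {}
    # Assumed grouping: nice goes with user time, irq/softirq with system time.
    us = deltas["user"] + deltas["nice"]
    sy = deltas["system"] + deltas["irq"] + deltas["softirq"]
    return {
        "us": 100.0 * us / total,
        "sy": 100.0 * sy / total,
        "id": 100.0 * deltas["idle"] / total,
        "wa": 100.0 * deltas["iowait"] / total,
        "st": 100.0 * deltas["steal"] / total,
    }


if __name__ == "__main__":
    zero = dict.fromkeys(("user", "nice", "system", "idle",
                          "iowait", "irq", "softirq", "steal"), 0)
    later = dict(zero, user=250, nice=50, system=80,
                 irq=10, softirq=10, idle=9600)
    print(cpu_percentages(zero, later))
```

Because every tick lands in exactly one of the five groups, the five results always sum to 100 regardless of the load.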

2.2. CPU metrics in Zabbix

Zabbix's CPU metrics are found under the items of templates/Operating Systems/Template OS Linux by Zabbix agent. There is a CPU group there containing 17 items in total.

image-20200721201156652.png


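Here the interrupt time is broken down further

CPU softirq time (soft interrupts): The amount of time the CPU has been servicing software interrupts. (This is deferred interrupt work that the kernel finishes in software, such as network packet processing.)
CPU interrupt time (hard interrupts): The amount of time the CPU has been servicing hardware interrupts. (These are raised by hardware devices such as disks, network cards, and timers.)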

There is also a parameter called niced time. It is the share of user-mode CPU time given to processes whose priority has been adjusted with nice

The time the CPU has spent running users' processes that have been niced.

Similarly, guest niced time is the share of CPU time spent running virtual guests whose priority has been adjusted with nice

Time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel)

The rest should be fairly self-explanatory.

2.3. CPU metrics in Grafana

The metrics here do not seem as comprehensive as Zabbix's, but the dashboard shows what we want to see very clearly

image-20200721211732297.png


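These are all metrics we have already seen, so I will not go through them in detail. The data behind this dashboard actually comes from node_exporter, which we cover next, so let's see whether node_exporter's metrics are more complete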

2.4. CPU metrics in node_exporter

We can curl the /metrics endpoint on port 9100 of the target machine and then look for the metrics that contain cpu

# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 18064.83
node_cpu_seconds_total{cpu="0",mode="iowait"} 34
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 18.34
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 80.34
node_cpu_seconds_total{cpu="0",mode="user"} 277.38
node_cpu_seconds_total{cpu="1",mode="idle"} 18080.73
node_cpu_seconds_total{cpu="1",mode="iowait"} 23.76
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 0
node_cpu_seconds_total{cpu="1",mode="softirq"} 6.62
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 71.46
node_cpu_seconds_total{cpu="1",mode="user"} 313.63
node_cpu_seconds_total{cpu="2",mode="idle"} 18077
node_cpu_seconds_total{cpu="2",mode="iowait"} 30.14
node_cpu_seconds_total{cpu="2",mode="irq"} 0
node_cpu_seconds_total{cpu="2",mode="nice"} 0
node_cpu_seconds_total{cpu="2",mode="softirq"} 7.69
node_cpu_seconds_total{cpu="2",mode="steal"} 0
node_cpu_seconds_total{cpu="2",mode="system"} 71.42
node_cpu_seconds_total{cpu="2",mode="user"} 307.3
node_cpu_seconds_total{cpu="3",mode="idle"} 18084.29
node_cpu_seconds_total{cpu="3",mode="iowait"} 24.86
node_cpu_seconds_total{cpu="3",mode="irq"} 0
node_cpu_seconds_total{cpu="3",mode="nice"} 0
node_cpu_seconds_total{cpu="3",mode="softirq"} 7.04
node_cpu_seconds_total{cpu="3",mode="steal"} 0
node_cpu_seconds_total{cpu="3",mode="system"} 72.85
node_cpu_seconds_total{cpu="3",mode="user"} 304.74

# TYPE node_cpu_guest_seconds_total counter
node_cpu_guest_seconds_total{cpu="0",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="0",mode="user"} 0
node_cpu_guest_seconds_total{cpu="1",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="1",mode="user"} 0
node_cpu_guest_seconds_total{cpu="2",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="2",mode="user"} 0
node_cpu_guest_seconds_total{cpu="3",mode="nice"} 0
node_cpu_guest_seconds_total{cpu="3",mode="user"} 0
...

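This is exactly what we saw on the Grafana dashboard, except that node_exporter labels the four CPUs 0 to 3 and the dashboard aggregates them. The content is roughly the same as what Zabbix provides. In other words, the dashboard does not display every parameter, and as node_exporter is upgraded the metrics it exposes become more detailed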
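The aggregation across CPUs can be sketched in a few lines of Python. This is only an illustration of the idea, not how Grafana or Prometheus implement it: the function names are made up, and the regular expression assumes the cpu label comes before the mode label, as in the output above. Since node_cpu_seconds_total is a counter, the sums are seconds since boot; a dashboard would normally look at the rate of change over a time window instead (in PromQL, something like rate(node_cpu_seconds_total[5m])).

```python
import re
from collections import defaultdict

# Matches lines such as: node_cpu_seconds_total{cpu="0",mode="idle"} 18064.83
# Assumes the label order cpu, mode shown in the exporter output.
LINE_RE = re.compile(
    r'^node_cpu_seconds_total\{cpu="(\d+)",mode="(\w+)"\}\s+(\S+)$'
)


def seconds_by_mode(metrics_text):
    """Sum node_cpu_seconds_total over all CPUs, grouped by mode."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            _cpu, mode, value = match.groups()
            totals[mode] += float(value)
    return dict(totals)


def mode_shares(totals):
    """Fraction of total CPU seconds since boot spent in each mode."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {mode: value / grand_total for mode, value in totals.items()}


if __name__ == "__main__":
    sample = (
        'node_cpu_seconds_total{cpu="0",mode="idle"} 18064.83\n'
        'node_cpu_seconds_total{cpu="0",mode="user"} 277.38\n'
        'node_cpu_seconds_total{cpu="1",mode="idle"} 18080.73\n'
        'node_cpu_seconds_total{cpu="1",mode="user"} 313.63\n'
    )
    print(mode_shares(seconds_by_mode(sample)))
```

Feeding it the full text returned by the exporter gives one number per mode for the whole machine, which is the same kind of view the dashboard panels present.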