Linux: Perf

Concept

PMU : Performance Monitoring Unit

perf_events

perf_events is an event-oriented observability tool that can help you with advanced performance analysis and troubleshooting.

perf_events is part of the Linux kernel, under tools/perf. While it uses many Linux tracing features, some are not yet exposed via the perf command, and need to be used via the ftrace interface instead.

perf_events Multiplexing

You may wish to measure more events simultaneously than the hardware can support (the NMI watchdog may take one counter too).

If there are more events than counters, the kernel uses time multiplexing (switch frequency = HZ, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies to PMU events. With multiplexing, an event is not measured all the time. At the end of the run, the tool scales the count based on total time enabled vs time running. The actual formula is:

final_count = raw_count * time_enabled/time_running
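As a rough illustration of the scaling formula above, here is a minimal Python sketch. The function name scale_count and the sample numbers are made up for this example; they are not part of perf.

```python
def scale_count(raw_count, time_enabled, time_running):
    """Estimate the full-run count of a multiplexed event.

    time_enabled and time_running must use the same unit (e.g. ns).
    """
    if time_running == 0:
        # The event never got a counter, so there is nothing to scale.
        return 0.0
    return raw_count * time_enabled / time_running


# Hypothetical event that sat on a counter for 8 of 10 seconds.
print(scale_count(1_000_000, 10_000_000_000, 8_000_000_000))  # 1250000.0
```

The result is an estimate: it assumes the event rate while the event was off the counter matches the rate while it was on.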

Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance to run. If there are N counters, then up to the first N events on the round-robin list are programmed into the PMU. In certain situations it may be less than that because some events may not be measured together or they compete for the same counter. Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the same time. Each event is added to the same round-robin list. There is no guarantee that all events of a tool are stored sequentially in the list.

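A simple experiment shows that this machine has four counters: with five cycles events each one runs about 80% (4/5) of the time, and with six events about 66.7% (4/6).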

perf stat -e cycles,cycles,cycles,cycles,cycles
^C
 Performance counter stats for 'system wide':

    49,606,709,293      cycles                                                        (80.04%)
    49,542,219,044      cycles                                                        (80.04%)
    49,512,868,392      cycles                                                        (80.03%)
    49,528,694,848      cycles                                                        (80.02%)
    49,539,412,782      cycles                                                        (80.01%)

perf stat -e cycles,cycles,cycles,cycles,cycles,cycles
^C
 Performance counter stats for 'system wide':

    22,251,237,320      cycles                                                        (66.86%)
    22,232,407,299      cycles                                                        (66.81%)
    22,232,703,664      cycles                                                        (66.77%)
    22,233,091,623      cycles                                                        (66.74%)
    22,193,679,137      cycles                                                        (66.70%)
    22,163,799,967      cycles                                                        (66.70%)
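
The counter count can be read off the percentages in the two runs above. The sketch below is only a back-of-the-envelope check: it assumes identical events share the counters equally under round-robin, and the function name estimate_counters is hypothetical.

```python
def estimate_counters(running_pct):
    """Guess the number of usable counters from perf stat percentages.

    Assumes every event gets an equal round-robin share, so
    n_events * mean_running_fraction is roughly the counter count.
    """
    n_events = len(running_pct)
    mean_fraction = sum(running_pct) / n_events / 100.0
    return round(n_events * mean_fraction)


five_events = [80.04, 80.04, 80.03, 80.02, 80.01]
six_events = [66.86, 66.81, 66.77, 66.74, 66.70, 66.70]
print(estimate_counters(five_events), estimate_counters(six_events))  # 4 4
```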
The types of events are:
  • Hardware Events: CPU performance monitoring counters.
  • Software Events: These are low level events based on kernel counters.
    For example, CPU migrations, minor faults, major faults, etc.
  • Kernel Tracepoint Events: These are static kernel-level
    instrumentation points that are hardcoded in interesting and logical
    places in the kernel.
  • User Statically-Defined Tracing (USDT): These are static tracepoints
    for user-level programs and applications.
  • Dynamic Tracing: Software can be dynamically instrumented, creating
    events in any location. For kernel software, this uses the kprobes
    framework. For user-level software, uprobes.
  • Timed Profiling: Snapshots can be collected at an arbitrary
    frequency, using perf record -F <Hz>. This is commonly used for CPU
    usage profiling, and works by creating custom timed interrupt events.
ppp
Skew and PEBS

There’s a problem with event profiling that you don’t really encounter with CPU profiling (timed sampling). With timed sampling, it doesn’t matter if there was a small sub-microsecond delay between the interrupt and reading the instruction pointer (IP). Some CPU profilers introduce this jitter on purpose, as another way to avoid lockstep sampling. But for event profiling, it does matter: if you’re trying to capture the IP on some PMC event, and there’s a delay between the PMC overflow and capturing the IP, then the IP will point to the wrong address. This is skew. Another contributing problem is that micro-ops are processed in parallel and out-of-order, while the instruction pointer points to the resumption instruction, not the instruction that caused the event. I’ve talked about this before.

The solution is “precise sampling”, which on Intel is PEBS (Precise Event-Based Sampling), and on AMD it is IBS (Instruction-Based Sampling). These use CPU hardware support to capture the real state of the CPU at the time of the event. perf can use precise sampling by adding a :p modifier to the PMC event name, e.g. “-e instructions:p”. The more p’s, the more precise (less skid). Here are the docs from tools/perf/Documentation/perf-list.txt:

The 'p' modifier can be used for specifying how precise the instruction
address should be. The 'p' modifier can be specified multiple times:

 0 - SAMPLE_IP can have arbitrary skid
 1 - SAMPLE_IP must have constant skid
 2 - SAMPLE_IP requested to have 0 skid
 3 - SAMPLE_IP must have 0 skid
perf.data

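perf record creates a perf.data file in the current directory to hold the collected data (if the file already exists, the old one is renamed to perf.data.old). Running perf report afterwards prints the aggregated results. perf.data contains only raw data: to build the report, perf report needs access to the local symbol tables, the pid-to-process mapping, and similar information. So perf.data cannot simply be copied to another machine and used there.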

Perf

perf record supports three stack unwinding methods: fp, dwarf, and lbr. The method is selected with the --call-graph option, and -g is equivalent to --call-graph fp.

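fp is the frame pointer, i.e. the EBP register on x86 (RBP on x86-64). It points to the base of the current stack frame, and that address holds the previous frame's EBP value, so the call stack can be walked back one frame at a time by following fp. However, the frame pointer can be optimized away, and GCC does this by default: unless -fno-omit-frame-pointer is given explicitly, EBP is used as an ordinary general-purpose register, and unwinding based on it is clearly wrong. Even after adding -fno-omit-frame-pointer I still could not get a correct call stack. According to the GCC manual, the option does not guarantee that every function uses a frame pointer, so it looks like fp-based unwinding has to be given up.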
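
To make the frame-pointer walk concrete, here is a toy Python simulation. It is only a sketch under assumptions: memory is modeled as a dict, slots are 8 bytes as on x86-64, the chain ends when the saved frame pointer is 0, and all addresses are invented.

```python
def walk_frame_pointers(memory, fp, max_depth=64):
    """Return the return addresses found by following the saved-fp chain."""
    return_addresses = []
    while fp != 0 and len(return_addresses) < max_depth:
        saved_fp = memory[fp]         # [fp]     holds the caller's frame pointer
        return_addr = memory[fp + 8]  # [fp + 8] holds the return address
        return_addresses.append(return_addr)
        fp = saved_fp                 # step up to the caller's frame
    return return_addresses


# Toy stack: leaf() called from middle(), called from main().
memory = {
    0x7000: 0x7020, 0x7008: 0x401150,  # leaf's frame, returns into middle
    0x7020: 0x7040, 0x7028: 0x401190,  # middle's frame, returns into main
    0x7040: 0x0,    0x7048: 0x401020,  # main's frame, end of the chain
}
print([hex(a) for a in walk_frame_pointers(memory, 0x7000)])
```

If a function in the middle uses the register for something else, memory[fp] no longer holds a valid frame address and the walk produces garbage, which is the failure described above.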

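dwarf is a debugging information format. The -g flag passed to GCC at compile time produces debug information in DWARF format, which contains everything needed for stack unwinding, and libunwind can unwind using it. For more about DWARF see “About DWARF”. It is worth noting that GDB uses exactly this DWARF debug information for its backtraces. In practice, dwarf gave accurate call stacks in my tests.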

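lbr stands for Last Branch Records, a set of hardware registers in newer Intel CPUs that record the addresses of the most recent branches. Their main purpose is to support profiling tools such as perf. For details see “An introduction to last branch records” and “Advanced usage of last branch records”. This method offers the best performance and accuracy, but it has one big limitation: the hardware ring buffer has a fixed size, so the stack depth lbr can record is limited too. The exact value depends on the CPU implementation and is typically 32 entries. Beyond that limit the call stack comes out wrong.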
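
The depth limit can be pictured with a bounded queue. This is a loose model rather than how the hardware is implemented: the depth of 32 is the typical value mentioned above, and record_calls and the function names f0, f1, ... are hypothetical.

```python
from collections import deque

LBR_DEPTH = 32  # assumed depth; the real value depends on the CPU


def record_calls(call_chain, depth=LBR_DEPTH):
    """Keep only the most recent `depth` (caller, callee) branch records."""
    lbr = deque(maxlen=depth)  # the oldest record is dropped when full
    for caller, callee in call_chain:
        lbr.append((caller, callee))
    return list(lbr)


# 40 nested calls: f0 -> f1 -> ... -> f40
chain = [(f"f{i}", f"f{i + 1}") for i in range(40)]
recorded = record_calls(chain)
print(len(recorded), recorded[0])  # 32 ('f8', 'f9')
```

The outermost eight callers are gone, so a profiler reconstructing the stack from these records would see it start at f8 instead of f0.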

sample_period

Software events may have a default period. This means that when you use them for sampling, you’re sampling a subset of events, not tracing every event. You can check with perf record -vv.

Sampling a subset by default may be a good thing, especially for high frequency events like context switches.

To specify a sampling period instead, use the -c option. For instance, to collect a sample every 10000 occurrences of the cycles event, with the highest precision level (:ppp), on CPU 0 only:

perf record -c 10000 -e cycles:ppp -C 0 sleep 5
[ perf record: Woken up 257 times to write data ]
[ perf record: Captured and wrote 66.366 MB perf.data (1682147 samples) ]

Samples: 1M of event 'cycles:ppp', Event count (approx.): 16821470000
Overhead  Command          Shared Object             Symbol
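
With a fixed period set by -c, the approximate event count in the report header is the number of samples times the period. A quick check in Python, using the numbers from the output above (the helper name approx_event_count is made up):

```python
def approx_event_count(samples, period):
    """Each sample stands for `period` occurrences of the event."""
    return samples * period


# 1682147 samples at -c 10000, as reported by perf record above.
print(approx_event_count(1_682_147, 10_000))  # 16821470000
```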

perf report
    -n, --show-nr-samples
                          Show a column with the number of samples
    -C, --cpu
           Collect samples only on the list of CPUs provided. Multiple
           CPUs can be provided as a comma-separated list with no space:
           0,1. Ranges of CPUs are specified with -: 0-2. In per-thread
           mode with inheritance mode on (default), samples are captured
           only when the thread executes on the designated CPUs. Default
           is to monitor all CPUs.


Default event

Default event: cycle counting
By default, perf record uses the cycles event as the sampling event. This is a generic hardware event that is mapped to a hardware-specific PMU event by the kernel. For Intel, it is mapped to UNHALTED_CORE_CYCLES. This event does not maintain a constant correlation to time in the presence of CPU frequency scaling. Intel provides another event, called UNHALTED_REFERENCE_CYCLES, which counts at a constant reference rate. Older perf_events versions did not expose it; current kernels do, as the generic ref-cycles event.

On AMD systems, the event is mapped to CPU_CLK_UNHALTED and this event is also subject to frequency scaling. On any Intel or AMD processor, the cycle event does not count when the processor is idle, i.e., when it calls mwait().

Sampling mechanism

Hardware event sampling is performed in the following way. PERF configures the hardware performance counters to count the selected events. It also configures each counter to generate an interrupt after the occurrence of the number of events specified by the sampling period. PERF starts the counters and launches the workload. When the sampling period expires for a counter, it generates an interrupt. PERF then reads the restart program counter value from the interrupt stack and writes the PC value (and other information) to a sample buffer. PERF writes the sample buffer to the profile data file when the buffer is full. Finally, PERF re-arms (no pun intended) the performance counter and sets the sampling period after making any needed adjustments for sampling frequency. This whole process continues until the workload completes and PERF disables the hardware performance counters.

When we decrease the sampling period, we increase the number of sampling interrupts (samples) to be handled. Decreasing the sampling period increases workload perturbation. Further, the core cannot execute the workload while it is handling an interrupt. More CPU cycles are “stolen” from the workload when the sampling period is shortened because more interrupts must be handled. A shorter sampling period increases statistical accuracy, but it comes at the cost of longer workload elapsed time.
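
The trade-off can be put into numbers with a very simple cost model. Everything here is an assumption for illustration: the workload size, the per-interrupt cost of 5000 cycles, and the helper name sampling_overhead are invented, and real interrupt costs vary by machine and by what is recorded per sample.

```python
def sampling_overhead(total_events, period, cycles_per_interrupt):
    """Return (interrupts, stolen_cycles) for a given sampling period."""
    interrupts = total_events // period
    stolen_cycles = interrupts * cycles_per_interrupt
    return interrupts, stolen_cycles


TOTAL_CYCLES = 10_000_000_000  # hypothetical workload length in cycles
INTERRUPT_COST = 5_000         # hypothetical cost of one sampling interrupt

for period in (1_000_000, 100_000, 10_000):
    interrupts, stolen = sampling_overhead(TOTAL_CYCLES, period, INTERRUPT_COST)
    share = 100 * stolen / TOTAL_CYCLES
    print(f"period={period} interrupts={interrupts} overhead={share:.1f}%")
```

Under this model, each tenfold reduction of the period means ten times as many samples and ten times as many cycles taken from the workload.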

perf stat
l2_rqsts.all_demand_miss
l2_rqsts.all_pf
l2_rqsts.demand_data_rd_hit
l2_rqsts.demand_data_rd_miss
l2_rqsts.miss
l2_rqsts.pf_miss
l2_rqsts.rfo_miss
l2_trans.l2_wb
l2_rqsts.miss                                     
       [All requests that miss L2 cache]
# perf stat -e l2_rqsts.miss -I 1000 -C 0 sleep 10
#           time             counts unit events
     1.000356222            213,888      l2_rqsts.miss
     2.001538827            169,999      l2_rqsts.miss
     3.001849629            840,332      l2_rqsts.miss
     4.002167643            182,219      l2_rqsts.miss
     5.002446657            217,416      l2_rqsts.miss
     6.002611179          1,258,836      l2_rqsts.miss
     7.002893517          1,123,108      l2_rqsts.miss
     8.003138038          1,014,948      l2_rqsts.miss
     9.003384195          1,173,721      l2_rqsts.miss
    10.003566874            418,776      l2_rqsts.miss
    10.004632920             13,491      l2_rqsts.miss
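
Interval output like the above is easy to post-process. The following is a minimal Python sketch that parses the text and summarizes it; the regular expression and the name parse_intervals are my own and assume the three-column layout shown here (time, count, event name).

```python
import re

INTERVAL_OUTPUT = """\
#           time             counts unit events
     1.000356222            213,888      l2_rqsts.miss
     2.001538827            169,999      l2_rqsts.miss
     3.001849629            840,332      l2_rqsts.miss
     4.002167643            182,219      l2_rqsts.miss
     5.002446657            217,416      l2_rqsts.miss
     6.002611179          1,258,836      l2_rqsts.miss
     7.002893517          1,123,108      l2_rqsts.miss
     8.003138038          1,014,948      l2_rqsts.miss
     9.003384195          1,173,721      l2_rqsts.miss
    10.003566874            418,776      l2_rqsts.miss
    10.004632920             13,491      l2_rqsts.miss
"""

# timestamp, comma-grouped count, event name; the "#" header does not match.
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s+([\d,]+)\s+(\S+)")


def parse_intervals(text):
    """Return a list of (time_seconds, count, event) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            count = int(match.group(2).replace(",", ""))
            rows.append((float(match.group(1)), count, match.group(3)))
    return rows


rows = parse_intervals(INTERVAL_OUTPUT)
counts = [count for _, count, _ in rows]
print(len(rows), sum(counts), max(counts))  # 11 6626734 1258836
```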

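Using events to measure performance metrics

Take sw_prefetch_access.t1_t2 as an example.
Insert the following instruction at the relevant place in the code, typically inside a for loop:

_mm_prefetch(memory_address_, _MM_HINT_T2);

Then use perf stat to capture the counter values.
In my experiments this worked quite well.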

# perf stat -e inst_retired.any,inst_retired.any_p,sw_prefetch_access.t0,sw_prefetch_access.t1_t2    -p $!  sleep 5

 Performance counter stats for process id '144495':

    32,733,406,923      inst_retired.any
    32,733,698,727      inst_retired.any_p
           806,260      sw_prefetch_access.t0
       902,886,959      sw_prefetch_access.t1_t2

       5.005418935 seconds time elapsed

Versus the run with AVX removed for fixed-width types:

 Performance counter stats for process id '169232':

    16,805,056,337      inst_retired.any
    16,805,127,576      inst_retired.any_p
           808,181      sw_prefetch_access.t0
       391,681,924      sw_prefetch_access.t1_t2

       5.005929544 seconds time elapsed
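
Raw counts from two runs are easier to compare after normalizing by instruction count. A small Python sketch using the numbers from the two outputs above; the helper name per_kilo_instructions and the labels run_a and run_b are made up for this example.

```python
def per_kilo_instructions(event_count, instructions):
    """Events per 1000 retired instructions."""
    return 1000.0 * event_count / instructions


# sw_prefetch_access.t1_t2 and inst_retired.any from the two runs above.
run_a = per_kilo_instructions(902_886_959, 32_733_406_923)
run_b = per_kilo_instructions(391_681_924, 16_805_056_337)
print(f"run A: {run_a:.2f}  run B: {run_b:.2f} T1/T2 prefetches per 1000 instructions")
```

Both runs covered about 5 seconds, so the instruction counts also show that the first run retired roughly twice as many instructions in the same time.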

instructions event

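inst_retired.any counts retired instructions. A low value suggests the instruction pipeline is not being fully used, for example because the task is frequently switched out.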

pidstat

pidstat can report page-fault statistics: minflt/s (minor faults per second) and majflt/s (major faults per second).

 taskset -c 2 ../debug/src/benchmarks/GoogleBenchmarkColumnarToRow & pidstat -r -u -p $! -t 1 &
Events

The cpu-cycles event is mapped to the hardware event UNHALTED_CORE_CYCLES on Intel processors. On AMD processors, however, it is mapped to the hardware event CPU_CLK_UNHALTED.

Example
perf record --call-graph lbr ./GoogleBenchmarkColumnarToRow
perf report -v -n  -g
# or just -g
perf report -g


perf list shows the events that perf supports:

# perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  ...
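perf commands
  • perf record: sampling; writes the binary perf.data file
  • perf annotate / perf report / perf script: analyze perf.data. annotate shows the code, report gives a statistical summary, and script converts the data directly to text
  • perf stat: counting; reports how many times each event occurred
  • perf top: whole-system analysis, similar to the top command but down to the function level, and the event can be specified

plt

The stub code in the code segment is called the Procedure Linkage Table (PLT), and the slots in the data segment that get filled in later are called the Global Offset Table (GOT).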

#include <stdio.h>

int main()
{
    printf("hello");
    printf("hello again");
    return 0;
}

Inspect it with objdump:

0000000000400526 <main>:
  400526:  55                    push   %rbp
  400527:  48 89 e5              mov    %rsp,%rbp
  40052a:  bf d4 05 40 00        mov    $0x4005d4,%edi
  40052f:  b8 00 00 00 00        mov    $0x0,%eax
  400534:  e8 c7 fe ff ff        callq  400400 <printf@plt>
  400539:  bf da 05 40 00        mov    $0x4005da,%edi
  40053e:  b8 00 00 00 00        mov    $0x0,%eax
  400543:  e8 b8 fe ff ff        callq  400400 <printf@plt>
  400548:  b8 00 00 00 00        mov    $0x0,%eax
  40054d:  5d                    pop    %rbp
  40054e:  c3                    retq
  40054f:  90                    nop

The linker sees that printf is defined in a shared library, so it points the printf calls at printf@plt, located at address 0x400400.
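
The target 0x400400 can be recomputed from the instruction bytes in the disassembly. Opcode e8 is a call with a signed 32-bit little-endian displacement relative to the address of the next instruction. A minimal Python sketch (the helper name call_target is hypothetical):

```python
def call_target(insn_addr, insn_bytes):
    """Decode the destination of a 5-byte x86 `call rel32` (opcode 0xE8)."""
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE8:
        raise ValueError("not a 5-byte call rel32 instruction")
    # Displacement is a signed 32-bit little-endian integer.
    rel32 = int.from_bytes(insn_bytes[1:5], "little", signed=True)
    next_insn = insn_addr + len(insn_bytes)
    return next_insn + rel32


# The two calls from the objdump listing above.
print(hex(call_target(0x400534, bytes.fromhex("e8c7feffff"))))  # 0x400400
print(hex(call_target(0x400543, bytes.fromhex("e8b8feffff"))))  # 0x400400
```

The two call instructions carry different displacements because they sit at different addresses, yet both resolve to the same PLT entry.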

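x86-64 assembly instructions

callq

This is the call instruction. The q suffix gives the operand size: q is 8 bytes, l is 4 bytes, w is 2 bytes. call pushes the return address and tells the CPU to jump to an address and start executing a new function.

jmpq

jmpq is the jmp instruction. The q suffix is GNU assembler syntax: q means a jump to a 64-bit address, and l a jump to a 32-bit address.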

References

https://blog.csdn.net/hknaruto/article/details/126778632
