Callchain Capture and Analysis with PmcTools
Callchain capture is an experimental feature in the PmcTools toolset. You will need to apply the following patch against -CURRENT, before you can try it out: http://people.freebsd.org/~jkoshy/download/pmctools-callgraph-patch.gz
Introduction
With this patch, at the time of a PMC interrupt, hwpmc(4) will record a caller information from the stack in addition to the interrupted PC value.
In this wiki page we will take a brief look at the new information this feature can provide. We will also examine the shortcomings of the current set of heuristics used by tools.
How to capture callchains
pmcstat(8) will configure sampling PMCs to capture callchains by default, so you do not need to do anything special to turn this feature on.
The new -N option to pmcstat(8) turns off callchain capture for the times you prefer 'flat' profiles.
For example, to allocate system-wide sampling PMCs that count "instructions retired" on all CPUs of your system, use:
# pmcstat -S instructions -O sample.out
Analysis
The information captured in the log can be post-processed by pmcstat. Two kinds of summaries can be generated:
A callchain summary that summarizes frequently sampled functions and their callers. Callchain summaries are requested using the new -G option.
Traditional gprof(1) compatible gmon.out files can also be generated. gmon.out output is specified using the -g option.
An Example
The samples for this example were collected during a short interval (10 seconds) while the system was executing a make buildworld.
The Callchain Summary
The command pmcstat -R sample.out -G cc-summary-file generates a callchain summary.
# cat cc-summary-file @ p4-instr-retired [95282 samples] 03.10% [2951] fnmatch1 @ /usr/bin/ld 98.58% [2909] fnmatch 01.42% [42] fnmatch1
The first line of the summary informs us that the log file had 95282 valid samples for PMC event "p4-instr-retired".
The most frequently sampled function was fnmatch1 in executable /usr/bin/ld, with 2951 samples out of the total. Function fnmatch, in the same executable object, was the most frequent caller of function fnmatch1.
The next most frequent samples fell in function Buf_AddByte in executable /home/obj/home/fcpi/src/make.i386/make:
02.30% [2194] Buf_AddByte @ /home/obj/home/fcpi/src/make.i386/make 45.81% [1005] Parse_File 17.01% [171] ReadMakefile 87.72% [150] TryReadMakefile 12.28% [21] main 26.66% [585] Var_SubstOnly 22.22% [130] For_Run 100.0% [130] parse_for 12.40% [272] parse_for 99.26% [270] Parse_File 100.0% [270] ReadMakefile 04.79% [105] CondGetArg 89.52% [94] CondToken 100.0% [94] CondT ...output snipped for brevity...
The callchain summary is sorted with the most frequently sampled locations and their locations first.
Traditional gprof(1) style profiles
The -g option to pmcstat(8) turns on gmon.out generation:
# pmcstat -R sample.out -g
You would then run gprof(1) on the generated gmon.out files as before:
# gprof /usr/bin/ld p4-instr-retired/ld.gmon > ld.gprof # gprof /boot/kernel/kernel p4-instr-retired/kernel.gmon > kernel.gprof # gprof /lib/libc.so.7 p4-instr-retired/libc.so.7.gmon > libc.gprof
Lets take a look at what the kernel was doing:
# cat kernel.gprof ...header snipped... called/total parents index %time self descendents called+self name index called/total children <spontaneous> [1] 24.7 64.00 2696.05 syscall [1] 1.00 1476.19 37/37 stat [7] 1.00 655.74 1/1 open [18] 0.00 204.05 72/72 execve [39] 2.00 129.17 2/2 read [57] 0.00 61.12 4/4 lstat [102] 0.00 32.95 1/1 mkdir [155] 0.00 27.99 2/2 unlink [165] 1.00 20.70 2/2 write [193] 0.00 16.52 1/1 fstatfs [231] 1.00 14.20 2/2 munmap [238] 1.00 12.31 1/1 sigaction [248] 12.60 0.00 7/10 generic_copyout [215] 2.00 8.00 1/1 fstat [269] 7.64 0.88 2/11 userret [126] 4.86 0.00 1/7 sched_userret [149] 0.00 2.13 1/1 getdirentries [422]
As to be expected, gprof(1) has propagated most of the samples to function syscall.
<spontaneous> [2] 21.0 2348.00 0.00 sse2_pagezero [2]
2348 samples fell into sse2_pagezero. However, hwpmc(4) was not able to trace the call stack further.
0.60 25.96 1/87 kern_rmdir [166] 0.60 25.96 1/87 kern_execve [60] 0.60 25.96 1/87 kern_unlink [164] 0.60 25.96 1/87 kern_lstat [159] 0.60 25.96 1/87 lstat [102] 0.60 25.96 1/87 vn_open [21] 0.60 25.96 1/87 kern_chown [171] 0.60 25.96 1/87 elf32_load_file [170] 0.60 25.96 1/87 kern_mkdir [154] 1.20 51.92 2/87 kern_readlink [118] 1.20 51.92 2/87 kern_access [108] 12.55 545.17 21/87 vn_open_cred [22] 31.68 1375.90 53/87 kern_stat [8] [3] 20.7 52.00 2258.56 87 namei [3] 282.00 1747.26 132/132 lookup [4] 2.12 69.18 3/41 VOP_LOOKUP_APV [10] 3.53 33.70 7/131 vput [16] 7.43 25.85 4/7 vrele [110] 27.33 5.28 18/27 uma_zfree_arg [123] 19.11 5.46 8/18 uma_zalloc_arg [115] 2.60 13.92 1/52 _vn_lock [12] 0.00 9.50 1/1 VOP_READLINK_APV [272] 0.29 1.31 1/17 vfs_busy [168] 1.50 0.00 1/2 compute_cn_lkflags [375] 1.18 0.00 1/39 critical_enter [127]
Here's what ld was doing:
# cat ld.gprof ...header snipped... 18 map_input_to_output_sections [1] [1] 86.9 1.00 6079.64 0+18 map_input_to_output_sections [1] 0.00 6079.64 67/67 walk_wild [2] 18 map_input_to_output_sections [1] ----------------------------------------------- 0.00 6079.64 67/67 map_input_to_output_sections [1] [2] 86.9 0.00 6079.64 67 walk_wild [2] 12.00 6012.36 1742/1742 walk_wild_file [3] 16.88 31.33 17/1840 fnmatch [5] 0.33 3.49 1/21 bfd_check_format [20] 0.08 3.17 1/1818 walk_wild_section [4] ----------------------------------------------- 12.00 6012.36 1742/1742 walk_wild [2] [3] 86.1 12.00 6012.36 1742 walk_wild_file [3] 138.92 5765.95 1817/1818 walk_wild_section [4] 6.67 69.78 20/21 bfd_check_format [20] 22.18 0.00 4/11 strcmp [25] 5.00 0.00 5/610 wildcardp [7] 0.95 2.91 1/20 bfd_check_format_matches [21] ...output snipped...
And libc.so.7:
# cat libc.so.7.gmon ...header snipped... <spontaneous> [1] 95.0 1.00 3167.77 regcomp [1] 382.00 2785.77 1/1 regexec [2] ----------------------------------------------- 1 regexec [2] 382.00 2785.77 1/1 regcomp [1] [2] 95.0 382.00 2785.77 1+1 regexec [2] 2643.00 25.37 1618/1618 fgetwc [3] 4.00 49.24 9/9 malloc [5] 32.00 0.00 1/1 malloc_usable_size [8] 17.33 9.54 7/21 free [4] 5.29 0.00 5/35 __i686.get_pc_thunk.bx [6] 1 regexec [2]
Caveats
hwpmc(4) currently uses a simple heuristic to determine the initial stack frame given an interrupted PC value. This heuristic doesn't always work correctly in the presence of aggressive instruction scheduling that is the default with the current GCC compiler. Thus a small percentage of samples could appear to be missing call frames. For example, a call chain foo->bar->buzz may be collected as foo->buzz, if the interrupted PC value fell in function buzz when it was just about to return to bar.
This has the following undesirable side-effects:
- The callchain summary could show 'call chains' that do not happen in practice.
Caller/callee information tracked by gprof can be similarly incorrect.