- BPF is Linux-only
- traditional tools such as iostat predate BPF
- BPF allows the kernel to run mini programs
- tracing = recording events and computing statistics from them
- BCC and bpftrace are front ends that compile down to BPF instructions
- BCC = BPF Compiler Collection
- no need to restart the system or applications; you can immediately start collecting traces and statistics on many kernel internals
- dynamic instrumentation - insert instructions in live code
- kprobes = dynamic instrumentation for the kernel
- uprobes = dynamic instrumentation for user-level functions
- static instrumentation may be needed if code is compiler-optimized (e.g. inlined) or changes between versions
- why BPF? efficient, production-safe, and already included in the Linux kernel.
- BPF computes statistics in the kernel. Previously you would copy all event data to user space to compute statistics there; imagine copying 10,000 disk I/O events every second.
- Android uses pinning to load and pin BPF programs
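The in-kernel aggregation can be pictured with a small sketch: BPF's hist() keeps power-of-2 bucket counters per event, so only a handful of counters (not the raw events) ever crosses to user space. This is plain Python for illustration; the sample values are invented.

```python
def log2_bucket(value: int) -> int:
    """Return the power-of-2 bucket index for a value (0 for value <= 1)."""
    bucket = 0
    while value > 1:
        value >>= 1
        bucket += 1
    return bucket

def aggregate(latencies_us):
    """Aggregate raw latency samples into log2 buckets, as hist() would."""
    counts = {}
    for lat in latencies_us:
        b = log2_bucket(lat)
        counts[b] = counts.get(b, 0) + 1
    return counts

# 10,000 events collapse to three bucket counters:
samples = [128] * 6000 + [1024] * 3500 + [16384] * 500
print(aggregate(samples))  # {7: 6000, 10: 3500, 14: 500}
```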
- CO-RE = Compile Once - Run Everywhere
- “BPF: universal in-kernel virtual machine”
- frame pointer (RBP register) based stack walking is disabled by default on many platforms. Netflix re-enabled frame pointers with a very small performance cost. There are other ways to walk stacks: LBR, ORC.
- flame chart != flame graph. In a flame graph the x-axis is an alphabetical sort of merged stacks (width = percentage of samples); in a flame chart the x-axis is the passage of time.
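Flame graphs are built from "folded" stacks: each sampled stack becomes one semicolon-joined line, and identical stacks merge with a count (this is the input format flamegraph.pl consumes). A minimal sketch, with invented stack traces:

```python
from collections import Counter

def fold(stacks):
    """Merge identical stacks into (folded_line, count) pairs."""
    folded = Counter(";".join(frames) for frames in stacks)
    return sorted(folded.items())

samples = [
    ["main", "read_file", "vfs_read"],
    ["main", "read_file", "vfs_read"],
    ["main", "compute"],
]
for line, count in fold(samples):
    print(f"{line} {count}")
# main;compute 1
# main;read_file;vfs_read 2
```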
- kprobes = substitute address with breakpoint instruction. this is done in live kernel text.
- uprobes = breakpoint is inserted in target instruction.
- uprobes have significant overhead; they can slow down an application 10x or more.
- USDT (user-level statically defined tracing). folly has static tracing macros; the application must compile them in for tracers to use them. BPF can attach to USDT probes.
- PMC (performance monitoring counters) = programmable hardware counters from CPU, e.g. branch miss
- UNIX philosophy: "do one thing, and do it well". Focus on single-purpose tools.
Basics of performance
- first pick what you want to optimize; better yet, start from an actual performance problem. Don't chase "interesting" metrics.
- latency = how long to accomplish request
- rate = number of operations per second
- throughput = data size per second
- utilization = how busy resource is over time with percentage
- cost = price performance ratio
- USE method = Utilization, Saturation, Errors. Apply it to every resource.
- "start with a question, not an answer". Instead of looking at all metrics, ask a question (e.g. via the USE method) and drill down. Some questions may not be answerable with the metrics you have.
- there are checklists for different skill levels; a typical organization has its own checklists.
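The metric definitions above (latency, rate, throughput, utilization) can be made concrete with a worked example; all numbers below are invented, for a disk serving requests over a one-second interval.

```python
interval_s = 1.0
ops = 800                 # completed I/O requests in the interval
bytes_moved = 100 * 10**6
busy_time_s = 0.45        # time the device was servicing requests
total_latency_s = 0.4     # sum of per-request service times

rate = ops / interval_s                  # operations per second
throughput = bytes_moved / interval_s    # bytes per second
utilization = busy_time_s / interval_s   # fraction of the interval busy
avg_latency = total_latency_s / ops      # seconds per request

print(rate, throughput, f"{utilization:.0%}", avg_latency * 1000)
# 800.0 100000000.0 45% 0.5  (last value is average latency in ms)
```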
60-second checklist for a poorly performing Linux system
uptime - check CPU load moving averages in 1min, 5min, 15min
dmesg | tail - check the last 10 kernel messages
vmstat 1 - check 1 second virtual memory stats
mpstat -P ALL 1 - check per-CPU time; see whether all CPUs are utilized or a single hot thread is the bottleneck. High %iowait means I/O is slow; high %sys means heavy syscall time.
pidstat 1 - check CPU usage per process
iostat -xz 1 - check I/O device metrics.
avgqu-sz larger than 1 indicates saturation; %util larger than 60% is likely a problem.
free -m - check that available memory is not near zero
sar -n DEV 1 - check network device limits
sar -n TCP,ETCP 1 - check the number of TCP accepts per second
top - check overall metrics
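The load averages that `uptime` prints come from /proc/loadavg on Linux. A minimal sketch that reads them directly, falling back to os.getloadavg() where /proc is unavailable:

```python
import os

def load_averages():
    """Return the (1min, 5min, 15min) load averages."""
    try:
        with open("/proc/loadavg") as f:
            one, five, fifteen = f.read().split()[:3]
            return float(one), float(five), float(fifteen)
    except OSError:
        return os.getloadavg()

one, five, fifteen = load_averages()
print(f"load average: {one:.2f}, {five:.2f}, {fifteen:.2f}")
```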
BCC checklist for poorly performing Linux
execsnoop - check for short-lived processes
opensnoop - check files being opened
ext4slower - check for slow common operations on the ext4 filesystem
biolatency - check disk I/O latency histogram
biosnoop - check individual disk I/O requests
cachestat - check page cache hit ratio every second
tcpconnect - check TCP connect source and destination
tcpaccept - check TCP accept source and destination
tcpretrans - check tcp retransmit source and destination
runqlat - check histogram of time threads spend waiting for a CPU
profile - check which code paths consuming most CPU
funccount counts invocations of any function system-wide; rates can reach millions per second.
stackcount with a flame graph is useful for seeing how a function is most frequently reached
trace shows individual function invocations with arguments and return value; can filter by argument values
argdist - histogram of a function's arguments
- confirm that CPU is the bottleneck with, say, perf stat -d gzip file1 to show Performance Monitoring Counters
runqlat - see how long threads wait for a CPU; long wait times in the queue indicate saturation
cpudist distribution of on-CPU time for each thread wake up
- p.212 on ML optimization: the longer a process runs without interruption, the better its performance
cpufreq - CPU frequency can change throughout the lifetime of a process
- memory events can be millions per second
- (files) file systems are cached in memory (page cache)
- (swap) memory paged out to a swap device or file
- Netflix runs hosts swap-_less_. It can be better to kill a process and wait for another host than to pay the cost of swapping memory.
- OOM Killer. The system uses a heuristic to pick which process to kill to free memory; it usually picks the process using the most memory.
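The kernel exposes its per-process heuristic as /proc/&lt;pid&gt;/oom_score (higher means more likely to be killed). A minimal sketch reading it for the current process; returns None on non-Linux systems:

```python
import os

def oom_score(pid="self"):
    """Return the kernel's oom_score for a pid, or None where /proc is absent."""
    path = f"/proc/{pid}/oom_score"
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())

print(oom_score())
```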
ps aux shows memory by process
pmap shows memory for a process by segment (which mappings account for the memory)
sar -B - check page fault rates
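sar reports system-wide fault rates; per-process minor/major fault counts are also visible in /proc/&lt;pid&gt;/stat (fields 10 and 12). A sketch reading them for the current process; returns None on non-Linux systems:

```python
import os

def fault_counts(pid="self"):
    """Return (minor_faults, major_faults) for a pid, or None where /proc is absent."""
    path = f"/proc/{pid}/stat"
    if not os.path.exists(path):
        return None
    with open(path) as f:
        data = f.read()
    # comm may contain spaces; fields are fixed only after the closing ')'
    fields = data.rsplit(")", 1)[1].split()
    # fields[0] is stat field 3 (state), so minflt/majflt land at 7 and 9
    return int(fields[7]), int(fields[9])

print(fault_counts())
```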