Ftrace events are a mechanism that allows different pieces of code in the kernel to ‘broadcast’ events of interest, such as the scheduler context-switch event sched_switch. In the scheduler core’s __schedule function, you’ll see something like:
trace_sched_switch(preempt, prev, next);
Once the event is enabled, this immediately results in a write to a per-CPU ring buffer storing info about what the previous task was, what the next one is, and whether the switch is happening as a result of kernel preemption (versus happening for other reasons such as a task waiting for I/O completion).
Under the hood, these ftrace events are actually implemented using tracepoints. All ftrace events are tracepoints, but not all tracepoints are events.
Let’s discuss a bit about how a tracepoint works. Tracepoints are hooks that are inserted at points of interest in the code and call a function of your choice (also known as a function probe). In order for the tracepoint to do anything, you have to register a function using tracepoint_probe_register. Multiple functions can be registered on a single hook. Once your tracepoint is hit, all functions registered to the tracepoint are executed. Also note that if no function is registered to the tracepoint, then the tracepoint is essentially a NOP with zero overhead. Actually that’s a lie: there is a small branch (and some space) overhead, although it is negligible.
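To make the idea concrete, here is a tiny userspace model of a hook with registered probes. It is only a sketch: every name in it (toy_tracepoint, toy_probe_register, toy_trace_switch, toy_count_probe) is made up for illustration and none of it is kernel code. A plain int stands in for the kernel's static key.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace toy model; none of these names are real kernel APIs. */
#define TOY_MAX_PROBES 8

/* Every probe on this toy tracepoint takes the same arguments. */
typedef void (*toy_probe_fn)(void *data, int prev_pid, int next_pid);

struct toy_tracepoint {
    int enabled;                        /* stand-in for the kernel's static key */
    size_t nr_probes;
    toy_probe_fn funcs[TOY_MAX_PROBES];
    void *data[TOY_MAX_PROBES];
};

/* Attach a probe to the hook. Returns 0 on success, -1 if the table is full. */
static int toy_probe_register(struct toy_tracepoint *tp, toy_probe_fn fn, void *data)
{
    assert(tp != NULL && fn != NULL);
    if (tp->nr_probes == TOY_MAX_PROBES)
        return -1;
    tp->funcs[tp->nr_probes] = fn;
    tp->data[tp->nr_probes] = data;
    tp->nr_probes++;
    tp->enabled = 1;                    /* any registration arms the hook */
    return 0;
}

/* The hook itself: this is what you would drop into the code of interest. */
static void toy_trace_switch(struct toy_tracepoint *tp, int prev_pid, int next_pid)
{
    if (!tp->enabled)                   /* nothing registered: one cheap branch */
        return;
    for (size_t i = 0; i < tp->nr_probes; i++)
        tp->funcs[i](tp->data[i], prev_pid, next_pid);
}

/* Example probe: counts how many times the hook fired. */
static void toy_count_probe(void *data, int prev_pid, int next_pid)
{
    (void)prev_pid;
    (void)next_pid;
    (*(int *)data)++;
}
```

Calling toy_trace_switch on a zero-initialized toy_tracepoint does nothing. After registering toy_count_probe twice with two different counters, each call bumps both counters.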
The heart of the code that executes when a tracepoint is hit is a static_key_false check guarding a call to __DO_TRACE. static_key_false evaluates to false if there’s no probe registered to the tracepoint.
__DO_TRACE is the macro that does the real work. There’s a lot going on in it, but the main part is the loop that goes through all function pointers (probes) that were registered to the tracepoint and calls them one after the other.
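The shape of that loop can be sketched in userspace C. This is a simplified model, not the kernel macro: it assumes the probes sit in an array terminated by an entry whose function pointer is NULL, and it leaves out all the locking and RCU details. The names toy_tp_func, toy_do_trace, toy_count and toy_remember_next are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one slot of a tracepoint's probe table. */
struct toy_tp_func {
    void (*func)(void);   /* probe, stored under a generic signature */
    void *data;           /* private pointer handed back to the probe */
};

/* The real prototype shared by all probes of this toy tracepoint. */
typedef void (*toy_switch_probe)(void *data, int prev_pid, int next_pid);

/* Walk the table until the NULL terminator, calling each probe in order.
 * Returns how many probes were called. */
static int toy_do_trace(const struct toy_tp_func *funcs, int prev_pid, int next_pid)
{
    int called = 0;

    if (!funcs)           /* no table at all: nothing was ever registered */
        return 0;
    for (const struct toy_tp_func *it = funcs; it->func; it++) {
        /* Cast back to the real prototype before calling. */
        ((toy_switch_probe)it->func)(it->data, prev_pid, next_pid);
        called++;
    }
    return called;
}

/* Probe 1: counts events. */
static void toy_count(void *data, int prev_pid, int next_pid)
{
    (void)prev_pid;
    (void)next_pid;
    assert(data != NULL);
    (*(int *)data)++;
}

/* Probe 2: remembers the pid that was switched in last. */
static void toy_remember_next(void *data, int prev_pid, int next_pid)
{
    (void)prev_pid;
    assert(data != NULL);
    *(int *)data = next_pid;
}
```

Each probe gets its own data pointer back along with the event arguments, which is what lets two unrelated tracers share one hook without knowing about each other.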
Now, here are some secrets. Since all ftrace events are tracepoints under the hood, you can piggyback onto interesting events in your kernel with your own probes. This allows you to write interesting tracers. In fact, this is precisely how blktrace works, and it is also how SystemTap hooks into ftrace events.
Check out a module I wrote that hooks onto sched_switch to build some histograms. The code there is still buggy, but if you mess with it and improve it, please share your work.
Now that we know a good amount about tracepoints, ftrace events are easy.
An ftrace event, being based on a tracepoint, makes full use of it but has to do more: of course, it has to write events out to the ring buffer.
When you enable an ftrace event using debugfs, at that instant the ftrace events framework registers an “event probe” function at the tracepoint that represents the event. How? Using tracepoint_probe_register, just as we discussed.
The code for this is in the file kernel/trace/trace_events.c.
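From the userspace side, enabling an event boils down to writing 1 to the event's enable file. Here is a minimal sketch in C, assuming debugfs is mounted at /sys/kernel/debug and that you have permission to write there. The helper set_event_enabled and the SCHED_SWITCH_ENABLE macro are names I made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Assumes debugfs is mounted at /sys/kernel/debug; adjust for your system. */
#define SCHED_SWITCH_ENABLE \
    "/sys/kernel/debug/tracing/events/sched/sched_switch/enable"

/* Write "1" or "0" to an event's enable file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
static int set_event_enabled(const char *enable_path, int on)
{
    assert(enable_path != NULL);

    FILE *f = fopen(enable_path, "w");
    if (!f)
        return -1;

    int ok = fputs(on ? "1" : "0", f) >= 0;
    if (fclose(f) != 0)   /* write errors can surface only at close time */
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling set_event_enabled(SCHED_SWITCH_ENABLE, 1) turns sched_switch on and passing 0 turns it off again. It returns -1 if the file can't be opened, which usually means debugfs isn't mounted or you aren't root.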
The probe function call->class->probe for trace events is defined in the file include/trace/trace_events.h and does the job of writing to the ring buffer. In a nutshell, the code gets a handle into the ring buffer, assigns the values to the entry structure and writes it out. There is some magic going on here to accommodate an arbitrary number of arguments, but I have yet to figure that out.
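The reserve, assign, commit sequence can be modelled in a few lines of userspace C. This is a toy sketch only: the ring here is a fixed array that overwrites its oldest slot, the entry layout is an assumption loosely based on what sched_switch reports, and all the names (toy_ring, toy_switch_entry, toy_event_probe and friends) are hypothetical. The argument "magic" mentioned above is not modelled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_RING_SLOTS 4    /* tiny on purpose so wrap-around is easy to see */
#define TOY_COMM_LEN   16

/* Hypothetical entry layout for a context-switch event. */
struct toy_switch_entry {
    char prev_comm[TOY_COMM_LEN];
    int  prev_pid;
    char next_comm[TOY_COMM_LEN];
    int  next_pid;
};

struct toy_ring {
    struct toy_switch_entry slots[TOY_RING_SLOTS];
    size_t head;            /* index of the next slot to hand out */
    size_t committed;       /* total entries committed so far */
};

/* Step 1: get a handle into the ring buffer (the oldest slot is reused). */
static struct toy_switch_entry *toy_ring_reserve(struct toy_ring *rb)
{
    struct toy_switch_entry *e = &rb->slots[rb->head];
    rb->head = (rb->head + 1) % TOY_RING_SLOTS;
    return e;
}

/* Step 3: publish the entry. */
static void toy_ring_commit(struct toy_ring *rb)
{
    rb->committed++;
}

/* The "event probe": reserve a slot, assign the fields, commit. */
static void toy_event_probe(void *data, const char *prev_comm, int prev_pid,
                            const char *next_comm, int next_pid)
{
    struct toy_ring *rb = data;
    assert(rb != NULL);

    struct toy_switch_entry *e = toy_ring_reserve(rb);

    /* Step 2: assign values to the entry structure (names are truncated). */
    snprintf(e->prev_comm, sizeof(e->prev_comm), "%s", prev_comm);
    e->prev_pid = prev_pid;
    snprintf(e->next_comm, sizeof(e->next_comm), "%s", next_comm);
    e->next_pid = next_pid;

    toy_ring_commit(rb);
}
```

The ring is handed to the probe through its data pointer, which is the same trick that lets a probe registered on a tracepoint carry its own private state.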
Let me know any comments you have or any other ftrace event behavior you’d like explained.