LinuxInternals.org

by Joel Fernandes

ARMv8: Flamegraph and NMI Support


Non-maskable interrupts (NMIs) are a really useful debugging feature that hardware can provide. Unfortunately, ARM doesn’t provide an out-of-the-box NMI mechanism. This post shows a flamegraph issue caused by missing NMI support, and the upstream work being done to simulate NMIs on ARMv8.

Some great Linux kernel features that rely on NMI to work properly are:

  • Backtrace from all CPUs: A number of places in the kernel rely on dumping the stacks of all CPUs at the time of a failure to determine what was going on. Among them are hung task detection, the hard/soft lockup detector, and the spinlock debugging code.

  • Perf profiling and flamegraphs: To profile code that runs in interrupt handlers, or in sections of code that disable interrupts, perf relies on NMI support in the architecture. Flamegraphs are a great visual representation of perf profile output. Below is a flamegraph I generated from perf profile output that shows just what happens on an architecture like ARMv8 that lacks NMI support; perf is using maskable interrupts on this platform for profiling:

As you can see in the area of the flamegraph the arrow points to, a large amount of time is spent in _raw_spin_unlock_irqrestore. This can baffle anyone looking at the data for the first time and make them think that most of the time is spent in the unlock function. What’s actually happening is this: because perf is using a maskable interrupt on ARMv8 to do its profiling, any section of code that disables interrupts will not be seen in the flamegraph (it simply isn’t profiled). In other words, perf is unable to peek into sections of code where interrupts are disabled. As a result, when interrupts are re-enabled during _raw_spin_unlock_irqrestore, the perf interrupt routine finally kicks in and records the large number of samples that elapsed in the interrupt-disabled section, but falsely accounts them to _raw_spin_unlock_irqrestore, the function during which the perf interrupt got a chance to run. Hence the flamegraph anomaly. It is indeed quite sad that ARM still doesn’t have a true NMI, which perf would love to make use of.

BUT! Daniel Thompson has been hard at work trying to simulate non-maskable interrupts on ARMv8. The idea is based on using interrupt priorities, and is the subject of the rest of this post.

NMI Simulation using priorities

To simulate an NMI, Daniel creates two groups of interrupts in his patchset. One group is for all ‘normal’ interrupts, and the other is for non-maskable interrupts (NMIs). The NMI group is assigned a higher priority than the normal interrupt group. In order to ‘mask’ interrupts in this approach, Daniel replaces the kernel’s regular interrupt masking scheme, which happens at the CPU-core level, with writes to the interrupt controller’s PMR (priority mask register). When the PMR is set to a certain value, only interrupts with a higher priority than that value are signaled to the CPU core; all other interrupts are silenced (masked). With this technique, it is possible to mask normal interrupts while keeping the NMI unmasked at all times.

Just how does he do this? First, a small primer on interrupts in the ARM world. ARM uses the GIC (Generic Interrupt Controller) to prioritize and route interrupts to CPU cores. GIC interrupt priorities go from 0 to 255, 0 being the highest and 255 the lowest. By default, the kernel assigns priority 0xa0 (160) to all interrupts. Daniel changes this default priority from 0xa0 to 0xc0 (you’ll see why). He then defines which PMR values are considered “masked” vs. “unmasked”: masked is 0xb0 and unmasked is 0xf0. This results in the following priorities (greater numbers mean lower priority):

0xf0 (240 decimal)  (11110000 binary) - Interrupts Unmasked (enabled)
0xc0 (192 decimal)  (11000000 binary) - Normal interrupt priority
0xb0 (176 decimal)  (10110000 binary) - Interrupts masked   (disabled)
0x80 (128 decimal)  (10000000 binary) - Non-maskable interrupts

In this new scheme, when interrupts are to be masked (disabled), the PMR is set to 0xb0, and when they are unmasked (enabled), the PMR is set to 0xf0. As you can see, setting the PMR to 0xb0 indeed masks normal interrupts, because 0xc0 (normal) > 0xb0 (PMR); however, non-maskable interrupts stay unmasked, since 0x80 (NMI) < 0xb0 (PMR). Also notice that in order to mask or unmask interrupts, all that needs to be done is flip a single bit in the PMR (0xb0 –> 0xf0). Daniel largely uses this bit as the mask bit in the patchset.
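
To make this concrete, here is a minimal sketch of what masking and unmasking could look like under this scheme. This is my own illustration based on the description above; the constant and helper names (GIC_PRIO_IRQON, GIC_PRIO_IRQOFF, gic_write_pmr) are assumptions, not necessarily what the patchset uses:

#define GIC_PRIO_IRQON   0xf0   /* normal interrupts (0xc0) get through     */
#define GIC_PRIO_IRQOFF  0xb0   /* normal masked; NMIs (0x80) still fire    */

/* "Disable" interrupts: drop the priority mask below normal priority. */
static inline void pmr_irq_disable(void)
{
        gic_write_pmr(GIC_PRIO_IRQOFF);
}

/* "Enable" interrupts: raise the mask so normal interrupts are signaled. */
static inline void pmr_irq_enable(void)
{
        gic_write_pmr(GIC_PRIO_IRQON);
}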

Quirk 1: Saving of the PMR context during traps

The patchset suggests that during traps, the priority value set in the PMR needs to be saved, because it may change during the trap. To facilitate this, Daniel found a spare bit in the PSTATE register (PSR). During any exception, the PMR’s mask bit is saved into that PSR bit (he calls it the G bit) and restored on return from the exception. Look at the changes to the kernel_entry macro in the set for this code.
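
Conceptually, the save and restore work something like the C sketch below. The real code is arm64 assembly in the entry macros; PSR_G_BIT is a hypothetical name I’m using for the spare PSTATE bit, and the constants come from my earlier sketch:

/* On exception entry: record whether normal interrupts were PMR-masked. */
static inline u64 save_pmr_into_pstate(u64 pstate)
{
        if (gic_read_pmr() == GIC_PRIO_IRQOFF)
                pstate |= PSR_G_BIT;
        return pstate;
}

/* On exception return: put the PMR back the way we found it. */
static inline void restore_pmr_from_pstate(u64 pstate)
{
        gic_write_pmr((pstate & PSR_G_BIT) ? GIC_PRIO_IRQOFF : GIC_PRIO_IRQON);
}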

Quirk 2: Ack of masked interrupts

Note that interrupts are masked before the GIC interrupt-controller code can even identify the source of the interrupt. When the GIC code eventually runs, it is tasked with identifying the interrupt source, which it does by reading the IAR register. This read also has the effect of “acking” the interrupt; in other words, it tells the GIC that the kernel has acknowledged the interrupt request for that particular source. Daniel points out that because the new scheme uses the PMR for interrupt masking, it’s no longer possible to ack interrupts without first unmasking them (by resetting the PMR), so he temporarily resets the PMR, does the IAR read, and restores it. Look for the code in gic_read_iar_common in his patchset to see how this case is handled.
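
A sketch of how that could look (the register accessor names here are illustrative; see gic_read_iar_common in the patchset for the real thing):

static u64 gic_read_iar_common(void)
{
        u64 irqstat;

        /*
         * Briefly unmask at the PMR so that reading the IAR can ack the
         * pending interrupt, then restore the mask right away.
         */
        gic_write_pmr(GIC_PRIO_IRQON);
        irqstat = read_sysreg_s(SYS_ICC_IAR1_EL1);  /* ack + get IRQ number */
        gic_write_pmr(GIC_PRIO_IRQOFF);

        return irqstat;
}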

Open questions I have

  • Where in the patchset does Daniel mask NMIs once an NMI is in progress, or is this even needed?

Future work

Daniel has so far tested his patchset only on the Foundation model, but it appears that the series, with modifications, should work on the newer Qualcomm chipsets whose GICs give the core the access needed to mess with IRQ priorities. Also, Daniel has currently implemented only the CPU backtrace; more work needs to be done for perf support, which I’ll look into if I can get backtraces working properly on real silicon first.

Ftrace Events Mechanism


Ftrace events are a mechanism that allows different pieces of code in the kernel to ‘broadcast’ events of interest, such as the scheduler context switch sched_switch. In the scheduler core’s __schedule function, you’ll see something like trace_sched_switch(preempt, prev, next); This immediately results in a write to a per-CPU ring buffer recording what the previous task was, what the next one is, and whether the switch is happening as a result of kernel preemption (versus for other reasons, such as the task waiting for I/O completion).
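
For reference, events like sched_switch are declared with the TRACE_EVENT macro. Here is a trimmed-down sketch of such a declaration (the real sched_switch definition in include/trace/events/sched.h records more fields):

TRACE_EVENT(sched_switch,

        TP_PROTO(bool preempt, struct task_struct *prev,
                 struct task_struct *next),

        TP_ARGS(preempt, prev, next),

        /* Layout of the record that lands in the ring buffer. */
        TP_STRUCT__entry(
                __field(pid_t, prev_pid)
                __field(pid_t, next_pid)
        ),

        /* How to fill in that record when the event fires. */
        TP_fast_assign(
                __entry->prev_pid = prev->pid;
                __entry->next_pid = next->pid;
        ),

        /* How to pretty-print the record in the trace output. */
        TP_printk("prev_pid=%d next_pid=%d",
                  __entry->prev_pid, __entry->next_pid)
);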

Under the hood, these ftrace events are actually implemented using tracepoints. The terms ‘events’ and ‘tracepoints’ appear to be used interchangeably, but one could use a tracepoint without involving ftrace at all; events, on the other hand, use ftrace.

Let’s discuss a bit about how a tracepoint works. Tracepoints are hooks that are inserted into points of code of interest and call a function of your choice (also known as a probe). In order for the tracepoint to do anything, you have to register a probe using tracepoint_probe_register. Multiple probes can be registered at a single tracepoint; once the tracepoint is hit, all of them are executed, as in the module sketch below. Also note that if no probe is registered, the tracepoint is essentially a NOP with zero overhead. Actually, that’s a lie: there is a small branch (and some space) overhead, although it’s negligible.
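
For example, a kernel module can attach its own probe to the sched_switch tracepoint like this (a minimal sketch; the probe’s signature must match the tracepoint’s TP_PROTO, shown here as it looks in kernels around v4.4):

#include <linux/module.h>
#include <trace/events/sched.h>

/* Runs on every context switch; keep it fast. */
static void my_probe(void *data, bool preempt,
                     struct task_struct *prev, struct task_struct *next)
{
        /* e.g. count switches, build histograms, ... */
}

static int __init my_init(void)
{
        /* Convenience wrapper around tracepoint_probe_register(). */
        return register_trace_sched_switch(my_probe, NULL);
}

static void __exit my_exit(void)
{
        unregister_trace_sched_switch(my_probe, NULL);
        tracepoint_synchronize_unregister();
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");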

Here is the heart of the code that executes when a tracepoint is hit:

#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
        extern struct tracepoint __tracepoint_##name;                   \
        static inline void trace_##name(proto)                          \
        {                                                               \
                if (static_key_false(&__tracepoint_##name.key))         \
                        __DO_TRACE(&__tracepoint_##name,                \
                                TP_PROTO(data_proto),                   \
                                TP_ARGS(data_args),                     \
                                TP_CONDITION(cond),,);                  \
                if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {             \
                        rcu_read_lock_sched_notrace();                  \
                        rcu_dereference_sched(__tracepoint_##name.funcs);\
                        rcu_read_unlock_sched_notrace();                \
                }                                                       \
        }                                                   

The static_key_false in the above code will evaluate to false if there’s no probe registered to the tracepoint.

Digging further, __DO_TRACE (in include/linux/tracepoint.h) does the following:

#define __DO_TRACE(tp, proto, args, cond, prercu, postrcu)              \
        do {                                                            \
                struct tracepoint_func *it_func_ptr;                    \
                void *it_func;                                          \
                void *__data;                                           \
                                                                        \
                if (!(cond))                                            \
                        return;                                         \
                prercu;                                                 \
                rcu_read_lock_sched_notrace();                          \
                it_func_ptr = rcu_dereference_sched((tp)->funcs);       \
                if (it_func_ptr) {                                      \
                        do {                                            \
                                it_func = (it_func_ptr)->func;          \
                                __data = (it_func_ptr)->data;           \
                                ((void(*)(proto))(it_func))(args);      \
                        } while ((++it_func_ptr)->func);                \
                }                                                       \
                rcu_read_unlock_sched_notrace();                        \
                postrcu;                                                \
        } while (0)

There’s a lot going on there, but the main part is the loop that goes through all the function pointers (probes) registered with the tracepoint and calls them one after the other.

Now, here’s a secret: since all ftrace events are tracepoints under the hood, you can piggyback onto interesting events in your kernel with your own probes. This allows you to write interesting tracers. In fact, this is precisely how blktrace works, and it’s also how SystemTap hooks into ftrace events. Check out a module I wrote that hooks onto sched_switch to build some histograms. The code there is still buggy, but if you mess with it and improve it, please share your work.

Now that we know a good amount about tracepoints, ftrace events are easy.

An ftrace event is based on tracepoints and makes full use of them, but it has to do more: of course, it has to write events out to the ring buffer. When you enable an ftrace event through debugfs (for example, echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable), at that instant the ftrace events framework registers an “event probe” function at the tracepoint that represents the event. How? Using tracepoint_probe_register, just as we discussed.

The code for this is in the file kernel/trace/trace_events.c, in the function trace_event_reg:

int trace_event_reg(struct trace_event_call *call,
                    enum trace_reg type, void *data)
{
        struct trace_event_file *file = data;

        WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
        switch (type) {
        case TRACE_REG_REGISTER:
                return tracepoint_probe_register(call->tp,
                                                 call->class->probe,
                                                 file);
...

The probe function call->class->probe for trace events is defined in the file include/trace/trace_events.h and does the job of writing to the ring buffer. In a nutshell, the code gets a handle into the ring buffer, assigns values to the entry structure, and writes it out. There is some magic going on here to accommodate an arbitrary number of arguments, but I have yet to figure that out.

static notrace void                                                     \
trace_event_raw_event_##call(void *__data, proto)                       \
{                                                                       \
        struct trace_event_file *trace_file = __data;                   \
        struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
        struct trace_event_buffer fbuffer;                              \
        struct trace_event_raw_##call *entry;                           \
        int __data_size;                                                \
                                                                        \
        if (trace_trigger_soft_disabled(trace_file))                    \
                return;                                                 \
                                                                        \
        __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
                                                                        \
        entry = trace_event_buffer_reserve(&fbuffer, trace_file,        \
                                 sizeof(*entry) + __data_size);         \
                                                                        \
        if (!entry)                                                     \
                return;                                                 \
                                                                        \
        tstruct                                                         \
                                                                        \
        { assign; }                                                     \
                                                                        \
        trace_event_buffer_commit(&fbuffer);                            \
}

Let me know any comments you have or any other ftrace event behavior you’d like explained.

TIF_NEED_RESCHED: Why Is It Needed


TIF_NEED_RESCHED is one of the many “thread information flags” stored alongside every task in the Linux kernel, and it is vital to the working of preemption. In order to explain why it’s important and how it works, I will go over two cases where TIF_NEED_RESCHED is used.

Preemption

Preemption is the process of forcibly grabbing the CPU from a user or kernel context and giving it to someone else (user or kernel). It is the means of timesharing a CPU between competing tasks (I will use ‘task’ as the terminology for a process). In Linux, the way it works is that a timer interrupt (called the tick) interrupts the running task, and a decision is made about whether a task, or a kernel code path executing on behalf of a task (as in a syscall), is to be preempted. This decision is based on whether the task has been running long enough and something of higher priority has woken up and needs the CPU now, or is ready to run.

These things happen in scheduler_tick(); the exact path is: TIMER HARDWARE INTERRUPT –> scheduler_tick –> task_tick_fair –> entity_tick –> check_preempt_tick. entity_tick() updates various runtime statistics of the task, and check_preempt_tick() is where TIF_NEED_RESCHED is set.

Here’s a small bit of code from check_preempt_tick:

check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
                resched_curr(rq_of(cfs_rq));
                /*
                 * The current task ran long enough, ensure it doesn't get
                 * re-elected due to buddy favours.
                 */
                clear_buddies(cfs_rq, curr);
                return;
        }

Here you see that a decision is made, based on its runtime, that the process has run long enough, and if so, resched_curr is called. It turns out resched_curr sets TIF_NEED_RESCHED for the current task! This informs whoever looks at the flag that this process should be scheduled out soon.
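
resched_curr() (in kernel/sched/core.c) handles a few more cases, such as poking a remote CPU with a rescheduling IPI when the task to be rescheduled is running elsewhere, but for the local CPU it boils down to setting the flag via this helper from include/linux/sched.h:

static inline void set_tsk_need_resched(struct task_struct *tsk)
{
        set_tsk_thread_flag(tsk, TIF_NEED_RESCHED);
}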

Even though the flag is set at this point, the task is not going to be preempted yet. This is because preemption happens only at specific points, such as on exit from interrupts. If the flag was set because the timer interrupt (that is, the scheduler) decided that something of higher priority needs the CPU now, then at the exit of the timer interrupt (the interrupt exit path), TIF_NEED_RESCHED is checked, and because it is set, schedule() is called, causing a context switch to the higher-priority process instead of simply returning to the process the timer interrupted, as the exit path normally would. Let’s examine where this happens.

For return from interrupt to user-mode:

If the tick interrupt arrived while user-mode code was running, then somewhere in the interrupt exit path for x86, this call chain calls schedule: ret_from_intr –> retint_user –> prepare_exit_to_usermode. Here the need_resched flag is checked, and if true, schedule() is called.
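
Simplified from arch/x86/entry/common.c (the real loop also handles signals and several other TIF flags), the check looks roughly like this:

static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
        while (true) {
                local_irq_enable();

                if (cached_flags & _TIF_NEED_RESCHED)
                        schedule();

                /* ... handle signals, notify-resume callbacks, etc. ... */

                /* Re-check: the work above may have set more flags. */
                local_irq_disable();
                cached_flags = READ_ONCE(current_thread_info()->flags);
                if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
                        break;
        }
}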

For return from interrupt to kernel mode, things are a bit different (skip this para if you think it’ll confuse you).

This feature requires kernel preemption to be enabled. The call chain doing the preemption is ret_from_intr –> retint_kernel –> preempt_schedule_irq (see arch/x86/entry/entry_64.S), which calls schedule. Note that for return to kernel mode, I see that preempt_schedule_irq calls schedule whether the need_resched flag is set or not. This is probably OK, but I wonder whether need_resched should be checked here before schedule is called; perhaps that would be an optimization to avoid unnecessarily calling schedule. One reason for not doing so: suppose some interrupt other than the timer tick is returning to the interrupted kernel code, and the timer tick never got a chance to run (because all other local interrupts are disabled in Linux until an interrupt finishes, in this case our non-timer interrupt). Then we’d want the exit path of the non-timer interrupt to behave just like the exit path of the timer tick would, whether need_resched is set or not.
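
For reference, preempt_schedule_irq() itself (kernel/sched/core.c) looks roughly like this; note that its loop does check need_resched(), but only after the first __schedule() call:

asmlinkage __visible void __sched preempt_schedule_irq(void)
{
        /* Catch callers that shouldn't be here. */
        BUG_ON(preempt_count() || !irqs_disabled());

        do {
                preempt_disable();
                local_irq_enable();
                __schedule(true);       /* true: we are preempting */
                local_irq_disable();
                sched_preempt_enable_no_resched();
        } while (need_resched());
}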

Critical sections in kernel code where preemption is off

One nice example of a code path where preemption is off is the mutex_lock path in the kernel. There, an optimization exists: if a mutex is already locked and unavailable, but the lock owner (the task currently holding the lock) is running on another CPU, then the mutex temporarily behaves like a spinlock (spinning until it can get the lock) instead of like a mutex (sleeping until the lock is available). The pseudocode looks like this:

mutex_lock() {
  disable_preempt();
  if (lock can't be acquired and the lock-holding task is currently running) {
    while (lock_owner_running && !need_resched()) {
      cpu_relax();
    }
  }
  enable_preempt();
  acquire_lock_or_sleep();
}

The lock path does exactly what I described. cpu_relax() is an arch-specific function that is called when the CPU has nothing to do but wait; it hints to the CPU that it can put itself into a lower-power state or let a sibling hardware thread use its resources. For x86, it executes the PAUSE instruction (encoded as rep; nop).
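
Here is (roughly) x86’s definition, from arch/x86/include/asm/processor.h:

/* REP;NOP (PAUSE) is a good thing to insert into busy-wait loops. */
static __always_inline void rep_nop(void)
{
        asm volatile("rep; nop" ::: "memory");
}

static __always_inline void cpu_relax(void)
{
        rep_nop();
}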

I noticed the Ftrace latency tracer complaining about a long delay in the preempt-disabled path of mutex_lock during one of my tests, and I made some noise about it on the mailing list. Disabling preemption for long periods is generally a bad thing to do, because for that duration no other task can be scheduled on the CPU. However, Steven pointed out that for this particular case, since we’re checking need_resched() and breaking out of the loop, we should be OK. What happens is this: the scheduling timer interrupt (which calls scheduler_tick(), mentioned earlier) comes in and checks whether higher-priority tasks need the CPU, and if they do, it sets TIF_NEED_RESCHED. Once the timer interrupt returns to our tightly spinning loop in mutex_lock, we break out of the loop, having noticed need_resched(), and re-enable preemption, as shown in the code above. Thus the long preempt-disabled section doesn’t turn out to be a problem, as long as tasks that need the CPU are prioritized correctly; need_resched() achieves that fairness here.
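
And need_resched() itself is just a test of the very flag we have been discussing (include/linux/sched.h):

static __always_inline bool need_resched(void)
{
        return unlikely(tif_need_resched());
}

where tif_need_resched() boils down to testing TIF_NEED_RESCHED in the current task’s thread-info flags.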

Next time you see if (need_resched()) in kernel code, you’ll have a better idea why it’s there :). Let me know your comments if any.