LinuxInternals.org

by Joel Fernandes

Ftrace Events Mechanism


Ftrace events are a mechanism that allows different pieces of code in the kernel to ‘broadcast’ events of interest, such as a scheduler context switch (sched_switch). In the scheduler core’s __schedule function, you’ll see something like: trace_sched_switch(preempt, prev, next); This immediately results in a write to a per-cpu ring buffer storing information about what the previous task was, what the next one is, and whether the switch is happening as a result of kernel preemption (versus happening for other reasons, such as a task waiting for I/O completion).

Under the hood, these ftrace events are actually implemented using tracepoints. The terms events and tracepoints appear to be used interchangeably, but one could use a tracepoint on its own without doing anything with ftrace; events, on the other hand, use ftrace.

Let’s discuss a bit about how a tracepoint works. Tracepoints are hooks that are inserted into points of code of interest and call a function of your choice (also known as a function probe). In order for the tracepoint to do anything, you have to register a function using tracepoint_probe_register. Multiple functions can be registered on a single hook. Once your tracepoint is hit, all functions registered to the tracepoint are executed. Also note that if no function is registered to the tracepoint, the tracepoint is essentially a NOP with zero overhead. Actually, that’s a small lie: there is a branch (and some space) overhead, although it is negligible.
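For reference, here is the shape of the registration API in recent (~4.x) kernels, from include/linux/tracepoint.h. The data pointer is handed back to the probe as its first argument, and each successful registration adds another entry to the tracepoint’s funcs array:

int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
int tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);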

Here is the heart of the code that executes when a tracepoint is hit:

#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
        extern struct tracepoint __tracepoint_##name;                   \
        static inline void trace_##name(proto)                          \
        {                                                               \
                if (static_key_false(&__tracepoint_##name.key))         \
                        __DO_TRACE(&__tracepoint_##name,                \
                                TP_PROTO(data_proto),                   \
                                TP_ARGS(data_args),                     \
                                TP_CONDITION(cond),,);                  \
                if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {             \
                        rcu_read_lock_sched_notrace();                  \
                        rcu_dereference_sched(__tracepoint_##name.funcs);\
                        rcu_read_unlock_sched_notrace();                \
                }                                                       \
        }                                                   

The static_key_false in the above code will evaluate to false if there’s no probe registered to the tracepoint.
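The same static-key pattern can be used outside of tracepoints too. Here is a minimal sketch (my_key and slow_path are made-up names for illustration); on architectures with jump-label support, the branch compiles down to a NOP until the key is enabled:

#include <linux/jump_label.h>

extern void slow_path(void);

static struct static_key my_key = STATIC_KEY_INIT_FALSE;

void hot_path(void)
{
        /* Patched to a NOP in the instruction stream while my_key is off. */
        if (static_key_false(&my_key))
                slow_path();
}

/*
 * Flipping the key at runtime patches every branch site:
 *   static_key_slow_inc(&my_key);    enable
 *   static_key_slow_dec(&my_key);    disable
 */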

Digging further, __DO_TRACE does the following in include/linux/tracepoint.h

#define __DO_TRACE(tp, proto, args, cond, prercu, postrcu)              \
        do {                                                            \
                struct tracepoint_func *it_func_ptr;                    \
                void *it_func;                                          \
                void *__data;                                           \
                                                                        \
                if (!(cond))                                            \
                        return;                                         \
                prercu;                                                 \
                rcu_read_lock_sched_notrace();                          \
                it_func_ptr = rcu_dereference_sched((tp)->funcs);       \
                if (it_func_ptr) {                                      \
                        do {                                            \
                                it_func = (it_func_ptr)->func;          \
                                __data = (it_func_ptr)->data;           \
                                ((void(*)(proto))(it_func))(args);      \
                        } while ((++it_func_ptr)->func);                \
                }                                                       \
                rcu_read_unlock_sched_notrace();                        \
                postrcu;                                                \
        } while (0)

There’s a lot going on there, but the main part is the loop that goes through all the function pointers (probes) registered to the tracepoint and calls them one after the other.

Now, here’s a secret: since all ftrace events are tracepoints under the hood, you can piggyback onto interesting events in your kernel with your own probes. This allows you to write interesting tracers. In fact, this is precisely how blktrace works, and it is also how SystemTap hooks into ftrace events. Check out a module I wrote that hooks onto sched_switch to build some histograms (a sketch of the basic registration pattern follows below). The code there is still buggy, but if you mess with it and improve it, please share your work.
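As a minimal sketch of what such a module looks like (assuming a ~4.x kernel, where sched_switch’s prototype is (preempt, prev, next) and the tracepoint is exported to modules), the register_trace_sched_switch helper generated for the event wraps tracepoint_probe_register for you:

#include <linux/module.h>
#include <trace/events/sched.h>

/* Probes receive the 'data' pointer given at registration as their first arg. */
static void probe_sched_switch(void *data, bool preempt,
                               struct task_struct *prev,
                               struct task_struct *next)
{
        /* Called on every context switch, so keep this fast. */
        trace_printk("switch: %s -> %s\n", prev->comm, next->comm);
}

static int __init hook_init(void)
{
        return register_trace_sched_switch(probe_sched_switch, NULL);
}

static void __exit hook_exit(void)
{
        unregister_trace_sched_switch(probe_sched_switch, NULL);
        tracepoint_synchronize_unregister();    /* wait for in-flight probes */
}

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");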

Now that we know a good amount about tracepoints, ftrace events are easy.

An ftrace event, being based on tracepoints, makes full use of them, but it has to do more: of course, it has to write events out to the ring buffer. When you enable an ftrace event using debugfs, at that instant the ftrace events framework registers an “event probe” function at the tracepoint that represents the event. How? Using tracepoint_probe_register, just as we discussed.

The code for this is in the file kernel/trace/trace_events.c in the function trace_event_reg:

int trace_event_reg(struct trace_event_call *call,
                    enum trace_reg type, void *data)
{
        struct trace_event_file *file = data;

        WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
        switch (type) {
        case TRACE_REG_REGISTER:
                return tracepoint_probe_register(call->tp,
                                                 call->class->probe,
                                                 file);
...

The probe function call->class->probe for trace events is defined in the file include/trace/trace_events.h and does the job of writing to the ring buffer. In a nutshell, the code gets a handle into the ring buffer, assigns the values to the entry structure, and writes it out. There is some macro magic going on here to accommodate an arbitrary number of arguments, but I have yet to figure that out.

static notrace void                                                     \
trace_event_raw_event_##call(void *__data, proto)                       \
{                                                                       \
        struct trace_event_file *trace_file = __data;                   \
        struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
        struct trace_event_buffer fbuffer;                              \
        struct trace_event_raw_##call *entry;                           \
        int __data_size;                                                \
                                                                        \
        if (trace_trigger_soft_disabled(trace_file))                    \
                return;                                                 \
                                                                        \
        __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
                                                                        \
        entry = trace_event_buffer_reserve(&fbuffer, trace_file,        \
                                 sizeof(*entry) + __data_size);         \
                                                                        \
        if (!entry)                                                     \
                return;                                                 \
                                                                        \
        tstruct                                                         \
                                                                        \
        { assign; }                                                     \
                                                                        \
        trace_event_buffer_commit(&fbuffer);                            \
}

Let me know any comments you have or any other ftrace event behavior you’d like explained.

TIF_NEED_RESCHED: Why Is It Needed


TIF_NEED_RESCHED is one of the many “thread information flags” stored alongside every task in the Linux kernel, and it is vital to the working of preemption. In order to explain why it’s important and how it works, I will go over 2 cases where TIF_NEED_RESCHED is used.

Preemption

Preemption is the process of forcibly grabbing the CPU from a user or kernel context and giving it to someone else (user or kernel). It is the means for timesharing a CPU between competing tasks (I will use “task” as the terminology for process). In Linux, the way it works is that a timer interrupt (called the tick) interrupts the running task, and a decision is made about whether a task, or a kernel code path executing on behalf of a task (as in a syscall), is to be preempted. This decision is based on whether the task has been running long enough and whether something of higher priority has woken up and needs the CPU now or is ready to run.

These things happen in scheduler_tick(); the exact path is: TIMER HARDWARE INTERRUPT –> scheduler_tick –> task_tick_fair –> entity_tick –> check_preempt_tick. entity_tick() updates various runtime statistics of the task, and check_preempt_tick() is where TIF_NEED_RESCHED is set.

Here’s a small bit of code from check_preempt_tick:

check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
                resched_curr(rq_of(cfs_rq));
                /*
                 * The current task ran long enough, ensure it doesn't get
                 * re-elected due to buddy favours.
                 */
                clear_buddies(cfs_rq, curr);
                return;
        }

Here you see that a decision is made, based on the task’s runtime, that it has run long enough, and if so, resched_curr is called. It turns out resched_curr sets TIF_NEED_RESCHED for the current task! This informs whoever looks at the flag that this process should be scheduled out soon.
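Here is roughly what resched_curr does (condensed from kernel/sched/core.c of ~4.x kernels; the lockdep assertion and the idle-polling trace details are trimmed):

void resched_curr(struct rq *rq)
{
        struct task_struct *curr = rq->curr;
        int cpu = cpu_of(rq);

        if (test_tsk_need_resched(curr))        /* flag already set, nothing to do */
                return;

        if (cpu == smp_processor_id()) {
                set_tsk_need_resched(curr);     /* sets TIF_NEED_RESCHED */
                set_preempt_need_resched();
                return;
        }

        /* Remote CPU: set the flag, then kick it with a rescheduling IPI. */
        if (set_nr_and_not_polling(curr))
                smp_send_reschedule(cpu);
}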

Even though the flag is set at this point, the task is not going to be preempted yet. This is because preemption happens only at specific points, such as the exit of interrupts. If the flag was set because the timer interrupt (that is, the scheduler) decided that something of higher priority needs the CPU now, then at the exit of the timer interrupt (the interrupt exit path), TIF_NEED_RESCHED is checked. Because it is set, schedule() is called, causing a context switch to a higher-priority process instead of simply returning to the process the timer interrupted. Let’s examine where this happens.

For return from interrupt to user-mode:

If the tick interrupt happened while user-mode code was running, then somewhere in the interrupt exit path for x86, the call chain ret_from_intr –> retint_user –> prepare_exit_to_usermode ends up calling schedule. Here the need_resched flag is checked, and if it is set, schedule() is called.
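A condensed sketch of that check, based on the exit-to-usermode work loop in arch/x86/entry/common.c of recent kernels (signal delivery and notify-resume work elided):

static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
        while (true) {
                /* Exit-work processing runs with interrupts enabled. */
                local_irq_enable();

                if (cached_flags & _TIF_NEED_RESCHED)
                        schedule();     /* preempt before returning to user */

                /* _TIF_SIGPENDING, _TIF_NOTIFY_RESUME, etc. handled here */

                local_irq_disable();
                cached_flags = READ_ONCE(current_thread_info()->flags);
                if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
                        break;
        }
}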

For return from interrupt to kernel mode, things are a bit different (skip this para if you think it’ll confuse you).

This feature requires kernel preemption to be enabled. The call chain doing the preemption is: ret_from_intr –> retint_kernel –> preempt_schedule_irq (see arch/x86/entry/entry_64.S), which calls schedule. Note that, for return to kernel mode, preempt_schedule_irq appears to call schedule whether the need_resched flag is set or not. This is probably OK, but I wonder if need_resched should be checked there before schedule is called; perhaps that would be an optimization to avoid calling schedule unnecessarily. One reason for not doing so: say some interrupt other than the timer tick is returning to the interrupted kernel space. If the timer tick didn’t get a chance to run (because all other local interrupts are disabled in Linux until an interrupt finishes, in this case our non-timer interrupt), then we’d want the exit path of the non-timer interrupt to behave just as the exit path of the timer tick would, whether need_resched is set or not.
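For reference, here is roughly what preempt_schedule_irq looks like (condensed from kernel/sched/core.c of ~4.x kernels, with the context-tracking enter/exit calls trimmed). Note that after the first pass it does re-check need_resched() before going around again:

asmlinkage __visible void __sched preempt_schedule_irq(void)
{
        /* Catch callers which need to be fixed. */
        BUG_ON(preempt_count() || !irqs_disabled());

        do {
                preempt_disable();
                local_irq_enable();
                __schedule(true);       /* true: this is a preemption */
                local_irq_disable();
                sched_preempt_enable_no_resched();
        } while (need_resched());
}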

Critical sections in kernel code where preemption is off

One nice example of a code path where preemption is off is the mutex_lock path in the kernel. There, an optimization applies: if a mutex is already locked and unavailable, but the lock owner (the task currently holding the lock) is running on another CPU, then the mutex temporarily behaves like a spinlock (spinning until it can get the lock) instead of like a mutex (sleeping until the lock is available). The pseudo code looks like this:

mutex_lock() {
  disable_preempt();
  if (lock can't be acquired and the lock-holding task is currently running) {
    while (lock_owner_running && !need_resched()) {
      cpu_relax();
    }
  }
  enable_preempt();
  acquire_lock_or_sleep();
}

The lock path does exactly what I described. cpu_relax() is arch-specific and is called when the CPU has nothing to do but wait; it hints to the CPU that it can put itself into a lower-power state or lend its resources to someone else. For x86, it maps to the pause instruction (rep; nop), which tells the CPU it is in a spin-wait loop.

What I noticed is that the ftrace latency tracer complained about a long delay in the preempt-disabled path of mutex_lock in one of my tests, and I made some noise about it on the mailing list. Disabling preemption for long periods is generally a bad thing to do, because during that time no other task can be scheduled on the CPU. However, Steven pointed out that for this particular case, since we check need_resched() and break out of the loop, we should be OK. What happens is this: the scheduling timer interrupt (which calls scheduler_tick(), mentioned earlier) comes in and checks whether higher-priority tasks need the CPU; if they do, it sets TIF_NEED_RESCHED. Once the timer interrupt returns to our tightly spinning loop in mutex_lock, we notice need_resched() and break out of the loop, re-enabling preemption as shown in the code above. Thus the long duration of disabled preemption doesn’t turn out to be a problem, as long as tasks that need the CPU are prioritized correctly; need_resched() achieves this fairness.

Next time you see if (need_resched()) in kernel code, you’ll have a better idea why it’s there :). Let me know your comments if any.

Tying 2 Voltage Sources/signals Together


Recently I asked a question on StackExchange about what happens when 2 voltage signals are tied together. What’s the resultant voltage and what decides this voltage? The whole train of thought started when I was trying to contemplate what happens when you use pull-ups on signals that are not Open Drain.

I created and simulated a circuit with the same scenario in LTspice. “V” is the voltage between the “+” terminals of V1 and V2, and it’s shown on the right of the simulation. We will confirm the simulation result by doing some math later.

The question is: what is the voltage across the load after hooking them up together? And what do the currents look like? Is there a current flowing between the 2 sources as well (apart from the current flowing to the load), because 5v > 1.8v? The simulator refuses to run without your author first adding an internal resistance to the voltage sources. All voltage sources have some internal resistance, so that’s fair. This can be considered analogous to a voltage signal with a certain resistance along its path, which limits its current sourcing (or sinking) capability.

So I added 1k resistances internally; normally the resistance of a voltage source is far less than this (AA batteries have just 0.1-0.2 ohms). Now the circuit looks something like this:

One can simply apply Kirchhoff’s current law to the above circuit; the directions of the currents are as shown. I1 and I2 are the currents flowing through R2 and R1 respectively.

By Kirchhoff’s current law, all the current entering the node labeled V equals the current exiting it, even if the actual directions differ from those shown above. From this we see:

I1 = (1.8 - V) / 1k
I2 = (5 - V)   / 1k
I3 = (V - 0)   / 10k

I3 = I2 + I1
V / 10k  = ((1.8 - V) / 1k) + ((5 - V) / 1k)
V        = 10 * (1.8 - V) + 10 * (5 - V)
21 * V   = 68
V        = 3.2381v

From this we see the voltage at V is somewhere between 1.8v and 5v. In fact, where it lands between them depends on how strong or weak the resistances associated with the sources are: the lower a source’s resistance, the more influence it has, and vice versa. An interesting observation is that I1 comes out negative if you plug V = 3.24v into the equations above (I1 = (1.8 - 3.24) / 1k, which is about -1.44mA). This means the current for voltage source V2 (the 1.8v voltage source) is actually flowing into it rather than out of it (it is sinking current), and so I1 is actually opposite in direction to the picture shown above.

A simpler case is having 2 voltage sources of the exact same voltage value; in this case the circuit would look like:

Thevenin’s theorem provides an easy simplification: the equivalent voltage source value stays the same, but the series resistance is halved, since the two 1k internal resistances in parallel give (1k * 1k) / (1k + 1k) = 0.5k. This results in the following circuit:

Now you can use the Voltage divider concept and easily solve this:

V = V2 * (R2 / (R1 + R2) )
  = 1.8v * ( 10k / (10k + 0.5k) )
  = 1.7142v

As you can see, the 0.5k equivalent source resistance drops around 0.086v before the signal gets to the 10k load. Thanks for reading. Please leave your comments or inputs below.