LinuxInternals.org

Tinkering with Embedded Linux

Ftrace Events Mechanism


Ftrace events are a mechanism that allows different pieces of code in the kernel to ‘broadcast’ events of interest, such as a scheduler context switch (sched_switch). In the scheduler core’s __schedule function, you’ll see something like: trace_sched_switch(preempt, prev, next); This immediately results in a write to a per-CPU ring buffer recording what the previous task was, what the next one is, and whether the switch is happening as a result of kernel preemption (versus happening for other reasons, such as a task waiting for I/O completion).

Under the hood, these ftrace events are actually implemented using tracepoints. All ftrace events are tracepoints, but not all tracepoints are events.

Let’s discuss a bit about how a tracepoint works. Tracepoints are hooks inserted at points of interest in the code that call a function of your choice (also known as a function probe). In order for the tracepoint to do anything, you have to register a function using tracepoint_probe_register. Multiple functions can be registered at a single tracepoint. Once the tracepoint is hit, all functions registered to it are executed. Also note that if no function is registered, the tracepoint is essentially a NOP with zero overhead. Actually that’s a slight lie: there is a branch (and some space) overhead, but it is negligible.

Here is the heart of the code that executes when a tracepoint is hit:

#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
        extern struct tracepoint __tracepoint_##name;                   \
        static inline void trace_##name(proto)                          \
        {                                                               \
                if (static_key_false(&__tracepoint_##name.key))         \
                        __DO_TRACE(&__tracepoint_##name,                \
                                TP_PROTO(data_proto),                   \
                                TP_ARGS(data_args),                     \
                                TP_CONDITION(cond),,);                  \
                if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {             \
                        rcu_read_lock_sched_notrace();                  \
                        rcu_dereference_sched(__tracepoint_##name.funcs);\
                        rcu_read_unlock_sched_notrace();                \
                }                                                       \
        }                                                   

The static_key_false in the above code will evaluate to false if there’s no probe registered to the tracepoint.

Digging further, __DO_TRACE does the following in include/linux/tracepoint.h

#define __DO_TRACE(tp, proto, args, cond, prercu, postrcu)              \
        do {                                                            \
                struct tracepoint_func *it_func_ptr;                    \
                void *it_func;                                          \
                void *__data;                                           \
                                                                        \
                if (!(cond))                                            \
                        return;                                         \
                prercu;                                                 \
                rcu_read_lock_sched_notrace();                          \
                it_func_ptr = rcu_dereference_sched((tp)->funcs);       \
                if (it_func_ptr) {                                      \
                        do {                                            \
                                it_func = (it_func_ptr)->func;          \
                                __data = (it_func_ptr)->data;           \
                                ((void(*)(proto))(it_func))(args);      \
                        } while ((++it_func_ptr)->func);                \
                }                                                       \
                rcu_read_unlock_sched_notrace();                        \
                postrcu;                                                \
        } while (0)

There’s a lot going on there, but the main part is the loop that goes through all the function pointers (probes) registered to the tracepoint and calls them one after another.

Now, here’s a secret. Since all ftrace events are tracepoints under the hood, you can piggyback onto interesting events in your kernel with your own probes. This allows you to write interesting tracers. In fact, this is precisely how blktrace works, and it is also how SystemTap hooks into ftrace events. Check out a module I wrote that hooks onto sched_switch to build some histograms. The code there is still buggy, but if you mess with it and improve it, please share your work.
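To give a flavor of what such piggybacking looks like, here is a minimal sketch of a module that attaches its own probe to sched_switch (this is my illustration, not the module linked above; the probe prototype mirrors the trace_sched_switch(preempt, prev, next) call quoted earlier and may differ between kernel versions):

/* Sketch: look up the sched_switch tracepoint and attach a probe to it.
 * The probe signature is an assumption based on the trace_sched_switch()
 * call quoted earlier; check your kernel's include/trace/events/sched.h. */
#include <linux/module.h>
#include <linux/tracepoint.h>
#include <linux/sched.h>
#include <linux/string.h>

static struct tracepoint *sched_switch_tp;

static void probe_sched_switch(void *data, bool preempt,
                               struct task_struct *prev,
                               struct task_struct *next)
{
        /* Runs in the context of the switching CPU every time the
         * tracepoint fires. */
        trace_printk("%s[%d] -> %s[%d]\n",
                     prev->comm, prev->pid, next->comm, next->pid);
}

static void find_tp(struct tracepoint *tp, void *priv)
{
        if (!strcmp(tp->name, "sched_switch"))
                sched_switch_tp = tp;
}

static int __init hook_init(void)
{
        for_each_kernel_tracepoint(find_tp, NULL);
        if (!sched_switch_tp)
                return -ENODEV;
        /* The same registration call discussed above. */
        return tracepoint_probe_register(sched_switch_tp,
                                         probe_sched_switch, NULL);
}

static void __exit hook_exit(void)
{
        if (sched_switch_tp)
                tracepoint_probe_unregister(sched_switch_tp,
                                            probe_sched_switch, NULL);
        tracepoint_synchronize_unregister();
}

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");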

Now that we know a good amount about tracepoints, ftrace events are easy.

An ftrace event is based on tracepoints and makes full use of them, but it has to do more: of course, it has to write events out to the ring buffer. When you enable an ftrace event through debugfs, at that instant the ftrace events framework registers an “event probe” function at the tracepoint that represents the event. How? Using tracepoint_probe_register, just as we discussed.
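For example, enabling the sched_switch event from user space amounts to writing “1” to its enable file (a sketch; the path assumes tracing is mounted at the usual debugfs location):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Adjust the path if your system mounts tracing elsewhere
         * (e.g. /sys/kernel/tracing on newer kernels). */
        const char *path =
            "/sys/kernel/debug/tracing/events/sched/sched_switch/enable";
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* This write is what makes the framework register the event probe
         * on the sched_switch tracepoint, as described above. */
        if (write(fd, "1", 1) != 1)
                perror("write");
        close(fd);
        return 0;
}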

The code for this is in the file kernel/trace/trace_events.c in function trace_event_reg.

int trace_event_reg(struct trace_event_call *call,
                    enum trace_reg type, void *data)
{
        struct trace_event_file *file = data;

        WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
        switch (type) {
        case TRACE_REG_REGISTER:
                return tracepoint_probe_register(call->tp,
                                                 call->class->probe,
                                                 file);
...

The probe function call->class->probe for trace events is defined in the file include/trace/trace_events.h and does the job of writing to the ring buffer. In a nutshell, the code gets a handle into the ring buffer, assigns the values to the entry structure, and writes it out. There is some magic going on here to accommodate an arbitrary number of arguments, but I am yet to figure that out.

static notrace void                                                     \
trace_event_raw_event_##call(void *__data, proto)                       \
{                                                                       \
        struct trace_event_file *trace_file = __data;                   \
        struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\
        struct trace_event_buffer fbuffer;                              \
        struct trace_event_raw_##call *entry;                           \
        int __data_size;                                                \
                                                                        \
        if (trace_trigger_soft_disabled(trace_file))                    \
                return;                                                 \
                                                                        \
        __data_size = trace_event_get_offsets_##call(&__data_offsets, args); \
                                                                        \
        entry = trace_event_buffer_reserve(&fbuffer, trace_file,        \
                                 sizeof(*entry) + __data_size);         \
                                                                        \
        if (!entry)                                                     \
                return;                                                 \
                                                                        \
        tstruct                                                         \
                                                                        \
        { assign; }                                                     \
                                                                        \
        trace_event_buffer_commit(&fbuffer);                            \
}

Let me know any comments you have or any other ftrace event behavior you’d like explained.

TIF_NEED_RESCHED: Why Is It Needed


TIF_NEED_RESCHED is one of the many “thread information flags” stored alongside every task in the Linux kernel, and it is vital to the working of preemption. In order to explain why it’s important and how it works, I will go over two cases where TIF_NEED_RESCHED is used.

Preemption

Preemption is the process of forcibly grabbing the CPU from a user or kernel context and giving it to someone else (user or kernel). It is the means of time-sharing a CPU between competing tasks (I will use task as the terminology for process). In Linux, the way it works is that a timer interrupt (called the tick) interrupts the running task and a decision is made about whether a task, or a kernel code path executing on behalf of a task (such as in a syscall), is to be preempted. This decision is based on whether the task has been running long enough and something of higher priority has woken up and needs the CPU now, or is ready to run.

These things happen in scheduler_tick(); the exact path is TIMER HARDWARE INTERRUPT –> scheduler_tick –> task_tick_fair –> entity_tick –> check_preempt_tick. entity_tick() updates various run-time statistics of the task, and check_preempt_tick() is where TIF_NEED_RESCHED is set.

Here’s a small bit of code in check_preempt_tick

check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
        unsigned long ideal_runtime, delta_exec;
        struct sched_entity *se;
        s64 delta;

        ideal_runtime = sched_slice(cfs_rq, curr);
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        if (delta_exec > ideal_runtime) {
                resched_curr(rq_of(cfs_rq));
                /*
                 * The current task ran long enough, ensure it doesn't get
                 * re-elected due to buddy favours.
                 */
                clear_buddies(cfs_rq, curr);
                return;
        }

Here you see a decision being made that the process has run long enough, based on its runtime, and if so resched_curr is called. It turns out resched_curr sets TIF_NEED_RESCHED for the current task! This informs whoever looks at the flag that this process should be scheduled out soon.
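For reference, here is roughly what setting and testing the flag boils down to (simplified from the kernel’s headers; exact definitions vary by kernel version and architecture):

/* Simplified view of the helpers involved; not a literal copy. */
static inline void set_tsk_need_resched(struct task_struct *tsk)
{
        set_tsk_thread_flag(tsk, TIF_NEED_RESCHED);
}

static __always_inline bool need_resched(void)
{
        /* true once TIF_NEED_RESCHED has been set for the current task */
        return unlikely(tif_need_resched());
}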

Even though the flag is set at this point, the task is not going to be preempted yet, because preemption happens only at specific points such as interrupt exits. If the flag was set because the timer interrupt (that is, the scheduler) decided that something of higher priority needs the CPU now, then at the exit of the timer interrupt (the interrupt exit path) TIF_NEED_RESCHED is checked, and because it is set, schedule() is called. This causes a context switch to the higher-priority process, instead of simply returning to the process that the timer interrupted, as would normally happen. Let’s examine where this happens.

For return from interrupt to user-mode:

If the tick interrupt arrived while user-mode code was running, then somewhere in the interrupt exit path for x86, this call chain calls schedule: ret_from_intr –> retint_user –> prepare_exit_to_usermode. Here the need_resched flag is checked, and if true, schedule() is called.

For return from interrupt to kernel mode, things are a bit different (skip this para if you think it’ll confuse you).

This feature requires kernel preemption to be enabled. The call chain doing the preemption is: ret_from_intr –> retint_kernel –> preempt_schedule_irq (see arch/x86/entry/entry_64.S), which calls schedule. Note that, for return to kernel mode, I see that preempt_schedule_irq calls schedule regardless of whether the need_resched flag is set. This is probably OK, but I wonder whether need_resched should be checked here before schedule is called; perhaps that would be an optimization to avoid unnecessarily calling schedule. One reason for not doing so: say an interrupt other than the timer tick is returning to the interrupted kernel space, and the timer tick didn’t get a chance to run (because all other local interrupts are disabled in Linux until an interrupt finishes, in this case our non-timer interrupt). Then we’d want the exit path of the non-timer interrupt to behave just like the exit path of the timer tick interrupt would, whether need_resched is set or not.

Critical sections in kernel code where preemption is off

One nice example of a code path where preemption is off is the mutex_lock path in the kernel. There is an optimization where, if a mutex is already locked and thus not available, but the lock owner (the task currently holding the lock) is running on another CPU, the mutex temporarily behaves like a spinlock (spinning until it can get the lock) instead of behaving like a mutex (sleeping until the lock is available). The pseudocode looks like this:

mutex_lock() {
  disable_preempt();
  if (lock can't be acquired and the lock-holding task is currently running) {
    while (lock_owner_running && !need_resched()) {
      cpu_relax();
    }
  }
  enable_preempt();
  acquire_lock_or_sleep();
}

The lock path does exactly what I described. cpu_relax() is an arch-specific function called when the CPU has nothing to do but wait; it hints to the CPU that it can put itself into a lower-power state or lend its resources to someone else. On x86 it boils down to the pause instruction (rep; nop).

What I noticed is that the ftrace latency tracer complained about a long delay in the preempt-disabled path of mutex_lock for one of my tests, and I made some noise about it on the mailing list. Disabling preemption for long periods is generally a bad thing to do, because during this duration no other task can be scheduled on the CPU. However, Steven pointed out that for this particular case, since we’re checking for need_resched() and breaking out of the loop, we should be OK. What would happen is this: the scheduling timer interrupt (which calls scheduler_tick(), as I mentioned earlier) comes in and checks whether higher-priority tasks need the CPU, and if they do, it sets TIF_NEED_RESCHED. Once the timer interrupt returns to our tightly spinning loop in mutex_lock, we break out of the loop having noticed need_resched(), and re-enable preemption as shown in the code above. Thus the long duration of disabled preemption doesn’t turn out to be a problem, as long as tasks that need the CPU are prioritized correctly; need_resched() provides this fairness.

Next time you see if (need_resched()) in kernel code, you’ll have a better idea why it’s there :). Let me know your comments if any.

Resolving Voltage Conflicts: Theory and Simulation


Recently I asked a question on StackExchange about what happens when two voltage signals are tied together: what is the resultant voltage, and what decides it? The whole train of thought started when I was trying to contemplate what happens when you use pull-ups on signals that are not open drain.

I created and simulated a circuit for this scenario in LTspice. “V” is the voltage between the “+” terminals of V1 and V2, and it’s shown on the right of the simulation. We will confirm the simulation result by doing some math later.

The question is: what is the voltage across the load after hooking the sources up together, and what do the currents look like? Is there a current flowing between the two sources as well (apart from the current flowing to the load), because 5V > 1.8V? The simulator refuses to run the simulation without your author first adding an internal resistance to the voltage sources. All voltage sources have some internal resistance, so that’s fair. This can be considered analogous to a voltage signal with a certain resistance along its path, which limits its current sourcing (or sinking) capability.

So I added 1k internal resistances; normally the resistance of a voltage source is far less than this (AA batteries have just 0.1-0.2 ohms). Now the circuit looks something like this:

One can simply apply Kirchhoff’s current law to the above circuit; the directions of the currents would be as in the circuit. I1 and I2 are the currents flowing through R2 and R1 respectively.

By Kirchhoff’s law, all the current entering the node labeled V will be equal to the current exiting it, even if the currents are not in the actual direction shown above. From this we see:

I1 = (1.8 - V) / 1k
I2 = (5 - V)   / 1k
I3 = (V - 0)   / 10k

I3 = I2 + I1
V / 10k  = ((1.8 - V) / 1k) + ((5 - V) / 1k)
V = 3.2381v

From this we see the voltage at V is somewhere between 1.8V and 5V. In fact, where it lies between 1.8 and 5 depends on how strong or weak the resistances associated with the sources are: if the resistances are lower, the sources have more influence, and vice versa. An interesting observation is that I1 is negative if you plug V = 3.24V into the above equations. This means the current for voltage source V2 (the 1.8V source) is actually flowing into it rather than out of it (it is sinking current), so I1 is actually opposite in direction to the picture shown above.
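As a quick numeric sanity check of the node equation (plain C, values taken from the circuit above):

#include <stdio.h>

int main(void)
{
        double v1 = 5.0, v2 = 1.8;           /* source voltages        */
        double r1 = 1e3, r2 = 1e3, rl = 1e4; /* internal + load (ohms) */

        /* KCL at the node: (v1 - V)/r1 + (v2 - V)/r2 = V/rl */
        double v = (v1 / r1 + v2 / r2) / (1.0 / r1 + 1.0 / r2 + 1.0 / rl);

        printf("node voltage V = %.4f V\n", v);                   /* ~3.2381  */
        printf("I1 (1.8V source) = %.4f mA\n", (v2 - v) / r2 * 1e3); /* negative */
        return 0;
}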

A simpler case is having 2 voltage sources of the exact same voltage values, in this case the circuit would look like:

Thevenin’s theorem provides an easy simplification: the equivalent voltage source value is the same, but the series resistance is now halved. This results in the following circuit:

Now you can use the Voltage divider concept and easily solve this:

V = V2 * (R2 / (R1 + R2) )
  = 1.8v * ( 10k / (10k + 0.5k) )
  = 1.7142v

As you would notice, the 1k resistance dropped around 0.085V before the signal gets to the 10k load. Thanks for reading. Please leave your comments or inputs below.

MicroSD Card Remote Switch


Recently, I’ve been wanting to remotely program a MicroSD card with a new bootloader or filesystem without removing the card from its embedded target board (such as a Beaglebone or Pandaboard). Due to the lack of any such existing tools, I decided to design my own board. I finally have it working; below are some pictures and a screencast demo video of the switcher in action! I sprinkled on some power and status LEDs to show the user what’s going on.

The base board requires two SparkFun MicroSD sniffers. The card cage on the sniffer is unused for my purposes. The switcher is controlled through an FTDI cable. I also wrote a switch program to control the switcher with libftdi: you just pass it the FTDI’s serial number and whether you want to switch to host mode (for programming the card) or target mode (for booting the programmed card). Hardware design files are available under a CC-BY-NC-SA 3.0 license.

Screencast

Pictures

Hope you enjoyed it, let me know what you think in the comments :)

Linux Spinlock Internals


This article tries to clarify how spinlocks are implemented in the Linux kernel and how they should be used correctly in the face of preemption and interrupts. The focus is more on basic concepts than on details, as details tend to be forgotten more easily and shouldn’t be too hard to look up, although they are covered to the extent that they help understanding.

Fundamentally, somewhere in include/linux/spinlock.h, a decision is made on which spinlock header to pull in based on whether SMP is enabled:

#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
# include <linux/spinlock_api_smp.h>
#else
# include <linux/spinlock_api_up.h>
#endif

We’ll go over how things are implemented in both the SMP (Symmetric Multi-Processor) and UP (Uni-Processor) cases.

For the SMP case, the __raw_spin_lock* functions in kernel/locking/spinlock.c are called when one calls some version of spin_lock.

Following is the definition of the most basic version defined with a macro:

#define BUILD_LOCK_OPS(op, locktype)                                    \
void __lockfunc __raw_##op##_lock(locktype##_t *lock)                   \
{                                                                       \
        for (;;) {                                                      \
                preempt_disable();                                      \
                if (likely(do_raw_##op##_trylock(lock)))                \
                        break;                                          \
                preempt_enable();                                       \
                                                                        \
                if (!(lock)->break_lock)                                \
                        (lock)->break_lock = 1;                         \
                while (!raw_##op##_can_lock(lock) && (lock)->break_lock)\
                        arch_##op##_relax(&lock->raw_lock);             \
        }                                                               \
        (lock)->break_lock = 0;                                         \
}                                                                       \

The function has several important bits. First it disables preemption, then it tries to atomically acquire the spinlock with do_raw_spin_trylock. If it succeeds, it breaks out of the for loop, leaving preemption disabled for the duration of the critical section protected by the lock. If it didn’t succeed in acquiring the lock (maybe some other CPU grabbed it already), it enables preemption again and spins until it can acquire the lock, keeping preemption enabled during this period. Each time the while loop detects that the lock can’t be acquired, it calls an architecture-specific relax function, which executes some variant of a no-operation instruction so the CPU can wait efficiently in a lower-power state. We’ll talk about the break_lock usage in a bit. As soon as it knows the lock is free, say because raw_spin_can_lock(lock) returned 1, it goes back to the beginning of the for loop and tries to acquire the lock again.

What’s important to note here is the reason for keeping preemption enabled while spinning (we’ll see in a bit that for UP configurations this is not done). While the kernel is spinning on a lock, other processes shouldn’t be kept from preempting the spinning thread. The lock in this case was acquired on a different CPU, because (assuming bug-free code) it’s impossible for the current CPU, which is trying to grab the lock, to have already acquired it: preemption is disabled on acquisition. So it makes sense for the spinning kernel thread to be preempted, giving others CPU time. It is also possible that more than one process on the current CPU is trying to acquire the same lock and spinning on it; in this case the CPU keeps getting switched between the threads fighting for the lock, while some other CPU in the system happily holds it, hopefully not for too long.

That’s where the break_lock element in the lock structure comes in. It’s used to signal to the lock-holding processor that someone else is trying to acquire the lock. This can cause the lock to be released early by the holder if required.

Now let’s see what happens in the UP (Uni-Processor) case.

Believe it or not, it’s really this simple:

#define ___LOCK(lock) \
  do { __acquire(lock); (void)(lock); } while (0)

#define __LOCK(lock) \
  do { preempt_disable(); ___LOCK(lock); } while (0)

// ....skipped some lines.....
#define _raw_spin_lock(lock)                    __LOCK(lock)

All that needs to be done is to disable preemption and acquire the lock. The code really doesn’t do anything other than disable preemption; the references to the lock variable are just there to suppress compiler warnings, as mentioned in comments in the source file.

There’s no spinning at all here, unlike the SMP case, and the reason is simple: remember we agreed that while a lock is held by a particular CPU (here there is just the one CPU), no other process on that CPU can have acquired the lock. How could it have gotten a chance to, with preemption disabled on that CPU to begin with?

Even if the code is buggy (say the same process tries to acquire the lock twice), it’s still impossible for two different processes to try to acquire the same lock on a uniprocessor system, considering that preemption is disabled on lock acquisition. Following that idea, in the uniprocessor case, since we are running on only one CPU, all that needs to be done is to disable preemption: the very fact that we are allowed to disable preemption means that no one else holds the lock. Works really well!

Sharing spinlocks between interrupt and process-context

It is possible that a critical section needs to be protected by the same lock in both interrupt and non-interrupt (process) execution context in the kernel. In this case the spin_lock_irqsave and spin_unlock_irqrestore variants have to be used to protect the critical section; these disable interrupts on the executing CPU. Imagine what would happen if you just used spin_lock in the process context?

Picture the following:

  1. Process context kernel code acquires lock A using spin_lock.
  2. While the lock is held, an interrupt comes in on the same CPU and executes.
  3. The Interrupt Service Routine (ISR) tries to acquire lock A, and spins continuously waiting for it.
  4. For the duration of the ISR, the process context is blocked and never gets a chance to run and free the lock.
  5. Hard lockup condition on the CPU!

To prevent this, the process-context code needs to call spin_lock_irqsave, which disables interrupts on that particular CPU, along with the regular disabling of preemption we saw earlier, before trying to grab the lock.

Note that the ISR can still just call spin_lock instead of spin_lock_irqsave, because interrupts are already disabled during ISR execution. Often people use spin_lock_irqsave in an ISR; that’s not necessary.

Also note that during the execution of the critical section protected by spin_lock_irqsave, interrupts are only disabled on the executing CPU. The same interrupt can come in on a different CPU and the ISR will be executed there, but that will not trigger the hard lockup condition I talked about: the process-context code is not blocked, so it can finish executing the locked critical section and release the lock while the ISR spins on the lock on the other CPU waiting for it. The process context gets a chance to finish and free the lock, so there is no hard lockup.
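To make this concrete, here is a small sketch of the pattern (the names my_lock, my_count and my_irq_handler are made up for illustration):

/* Sketch: shared data touched from both process context and an ISR. */
#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(my_lock);
static unsigned long my_count;

/* Process context: must also block local interrupts, or the ISR below
 * could interrupt us on this CPU and spin on my_lock forever. */
void my_update(void)
{
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);
        my_count++;
        spin_unlock_irqrestore(&my_lock, flags);
}

/* Interrupt context: interrupts are already off here, so plain
 * spin_lock is enough, as noted above. */
static irqreturn_t my_irq_handler(int irq, void *dev)
{
        spin_lock(&my_lock);
        my_count++;
        spin_unlock(&my_lock);
        return IRQ_HANDLED;
}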

Following is what the spin_lock_irqsave code looks like for the SMP case (the UP case is similar, look it up). By the way, the only difference here compared to the regular spin_lock described in the beginning is the local_irq_save and local_irq_restore that accompany the preempt_disable and preempt_enable in the lock code:

#define BUILD_LOCK_OPS(op, locktype)                                    \
unsigned long __lockfunc __raw_##op##_lock_irqsave(locktype##_t *lock)  \
{                                                                       \
        unsigned long flags;                                            \
                                                                        \
        for (;;) {                                                      \
                preempt_disable();                                      \
                local_irq_save(flags);                                  \
                if (likely(do_raw_##op##_trylock(lock)))                \
                        break;                                          \
                local_irq_restore(flags);                               \
                preempt_enable();                                       \
                                                                        \
                if (!(lock)->break_lock)                                \
                        (lock)->break_lock = 1;                         \
                while (!raw_##op##_can_lock(lock) && (lock)->break_lock)\
                        arch_##op##_relax(&lock->raw_lock);             \
        }                                                               \
        (lock)->break_lock = 0;                                         \
        return flags;                                                   \
}                                                                       \

Hope this post made a few things clearer; there’s a lot more to spinlocking. A good reference is Rusty’s Unreliable Guide To Locking.

Studying Cache-line Sharing Effects on SMP Systems


Having read the chapter on counting and per-CPU counters in Paul McKenney’s recent book, I thought I would do a small experiment to check how good or bad it would be if those per-CPU counters were close to each other in memory.

Paul talks about using one global counter shared by N threads on N CPUs, and the effects this can have on the cache. Each CPU core’s cache in an SMP system needs exclusive rights to a specific cache line of memory before it can write to it. This means that, at any given time, only one CPU can and should write to that part of memory.

This is typically accomplished by an invalidate protocol, where each CPU needs to do some inter-processor communication before it can assume it has exclusive access to the cache line, and also read any copies that may still be in some other core’s cache and not yet in main memory. This is an expensive operation that is to be avoided at all costs!

Then Paul goes on to say: OK, let’s have a per-thread counter, have each core increment it independently, and when we need a readout, grab a lock and add all the individual counters together. This works great, assuming each per-thread counter is separated by at least a cache line. That’s guaranteed when one uses the __thread primitive, which nicely separates the memory to reduce cache-line sharing effects.

So I decided to flip this around and have per-thread counters that are closely spaced, and do some counting with them. Instead of using __thread, I created an array of counters, each element belonging to some thread. The counters are still separate and not shared, but they may sit in a shared cache line, causing the nasty effects we talked about, which I wanted to measure.

My program sets up N counting threads and runs each of them on its own core of a typical multicore system. Various iterations of per-thread counting are done, with the counters separated by increasing powers of 2 each iteration. After each iteration, I stop all threads, add the per-thread counter values, and report the result.
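Here is a minimal sketch of such an experiment (my reconstruction, not the author’s exact program; the thread count, iteration count, and CPU pinning are simplifying assumptions):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define NTHREADS 2
#define ITERS    (100UL * 1000 * 1000)

/* one flat array; thread i's counter lives at offset i * stride */
static unsigned char counters[NTHREADS * 256] __attribute__((aligned(64)));
static size_t stride;

static void *counter_thread(void *arg)
{
        long id = (long)arg;
        volatile uint64_t *ctr = (volatile uint64_t *)&counters[id * stride];
        cpu_set_t set;

        /* pin each thread to its own CPU so the two caches fight for the line */
        CPU_ZERO(&set);
        CPU_SET((int)id, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (uint64_t i = 0; i < ITERS; i++)
                (*ctr)++;
        return NULL;
}

int main(void)
{
        pthread_t t[NTHREADS];
        struct timespec a, b;

        for (stride = 8; stride <= 256; stride *= 2) {
                memset(counters, 0, sizeof(counters));
                clock_gettime(CLOCK_MONOTONIC, &a);
                for (long i = 0; i < NTHREADS; i++)
                        pthread_create(&t[i], NULL, counter_thread, (void *)i);
                for (long i = 0; i < NTHREADS; i++)
                        pthread_join(t[i], NULL);
                clock_gettime(CLOCK_MONOTONIC, &b);
                printf("separation %3zu bytes: %.2f s\n", stride,
                       (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
        }
        return 0;
}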

Below are the results of running the program on 3 different SMP systems (2 threads on 2 CPUs; sorry, I don’t have better multi-core hardware ATM):

Effect of running on a reference ARM dual-core Cortex-A9 system:

Notice the jump in throughput once the separation changes from 16 to 32 bytes. That gives us a good idea that the L1 cache-line size on Cortex-A9 systems is 32 bytes (8 words), something the author didn’t know for sure in advance (I initially thought it was 64 bytes).

Effect of running on a reference ARM dual-core Cortex-A15 system:

The L1 cache-line size of the Cortex-A15 is 64 bytes (16 words), hence the expected jump at a separation of 64 bytes.

Effect of running on an x86-64 i7-3687U dual-core CPU:

The L1 cache-line size of this CPU is 64 bytes too, hence the expected jump at a separation of 64 bytes.

This shows that your parallel programs need to take care of cache-line alignment to avoid false-sharing effects. Also, doing something like this in your program is an indirect way to find out what the cache-line size of your CPU is, or a direct way to get fired, whichever way you want to look at it. ;)
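One common way to take care of it, sketched below, is to pad or align each per-thread slot to a cache-line boundary (the 64-byte figure is an assumption; match it to your CPU):

#define NTHREADS 2

/* each element occupies (at least) a full cache line, so neighbouring
 * counters can never share one */
struct padded_counter {
        unsigned long count;
} __attribute__((aligned(64)));

static struct padded_counter counters[NTHREADS];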

Design of Fork Followed by Exec in Linux


A lot of folks ask: why do you have to do a fork and then an exec to execute a new program, and why can’t it be done in one step? Or: why does fork create a copy-on-write address space, only to have it thrown away later when you do an exec? So I decided to do a small write-up about this topic.

On a separate note, it is important to remember that fork is not used for threading; its primary use is to create a separate process that is a child of the parent process that called fork.

Normally one might think that doing a fork and a separate exec could be combined into one step, and it probably should be. But there are applications of maintaining this separation. Here’s a post that explains why you might need to do just a fork call. Summarizing the post: you may need to set up some initial data and “fork” a bunch of workers. All these workers are supposed to execute in their own address space and share only the initial data. In this case, copy-on-write is extremely useful since the initial data can be shared in physical memory, and forking this way is extremely cheap. The kernel marks all these shared pages as read-only, and makes writable copies of shared data only when it is written to.
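A tiny sketch of that fork-only worker pattern (illustrative names, error handling trimmed):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4

int main(void)
{
        /* Imagine this is a large, expensively-built table. */
        static int shared_table[1 << 20];
        shared_table[0] = 42;

        for (int i = 0; i < NWORKERS; i++) {
                pid_t pid = fork();
                if (pid == 0) {
                        /* Child: reads are served from the parent's pages;
                         * a write would fault and get a private copy. */
                        printf("worker %d sees %d\n", i, shared_table[0]);
                        _exit(0);
                }
        }
        while (wait(NULL) > 0)
                ;
        return 0;
}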

There is a small overhead if fork is followed immediately by an exec system call, since the copy-on-write shared address space is of no use and is thrown away anyway. Combining fork and exec in this case might have some advantages, reducing this overhead.

Linux Implementation of Copy-on-write (COW) for shared Virtual Memory Areas

Some of the COW code that executes on a fork can be found in mm/memory.c. There is an is_cow_mapping function to detect whether a virtual memory area (a region of virtual memory, see /proc/self/maps) is copy-on-write.

static inline bool is_cow_mapping(vm_flags_t flags)
{
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

A VMA (Virtual Memory Area) is a contiguous segment of virtual memory belonging to a particular process. Every VMA has a bunch of VM_ flags associated with it. VM_MAYWRITE, relevant to the above code, marks that a mapped region can be changed to writable by the mprotect system call. It is possible that a memory region is initially read-only and the user wants to make it writable; VM_MAYWRITE gives that permission. Note that if the kernel doesn’t set VM_MAYWRITE, then the region is automatically not COW, because there is no question of writing to it.

When a memory mapping is created via the mmap system call and MAP_SHARED is passed in the flags, the VM_SHARED bit is set for the VMA, and as a result the region is not copy-on-write (the is_cow_mapping function above returns false). By definition, shared memory regions are just that, shared, so there is no need for copy-on-write. In other words, if the VMA is a shared mapping or a read-only mapping, it isn’t a COW mapping.

Let’s take the example of mapping a file using mmap.

By default on VMA creation, the kernel sets the VMA flags to VM_SHARED = 0 and VM_MAYWRITE = 1. Now if mmap is asked to create a shared mapping of a file by passing the MAP_SHARED flag, for example one that can be shared with forked processes, then the VM_SHARED bit is set to 1 for that VMA. Additionally, if the file is opened read-only, the kernel clears VM_MAYWRITE (along with VM_SHARED) for the mapping. Either way, is_cow_mapping returns false; of course, a shared mapping doesn’t need to be COW.

On the other hand, if MAP_PRIVATE is passed in the flags to mmap, then the VM_SHARED bit is set to 0 and VM_MAYWRITE remains 1 (regardless of whether the file is opened read-only or read-write, since writes will not be carried through to the underlying file). This makes is_cow_mapping return true. Indeed, private mappings should be copy-on-write.
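A small user-space sketch that shows the difference (it assumes a file named test.dat of at least one byte already exists; error handling trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("test.dat", O_RDWR);
        size_t len = 4096;

        char *shared  = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
        char *private = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);

        private[0] = 'P';   /* faults, gets a private copy: file untouched */
        shared[0]  = 'S';   /* dirties the page cache: file will see 'S'   */

        printf("private mapping reads '%c'\n", private[0]); /* 'P' */
        printf("shared  mapping reads '%c'\n", shared[0]);  /* 'S' */

        munmap(shared, len);
        munmap(private, len);
        close(fd);
        return 0;
}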

You can see the code I’m talking about conveniently here.

The important point here is that every mapping either is a COW mapping or is not. During the clone system call, which the fork library call uses internally, if the CLONE_VM flag is not passed to clone (as is the case within fork), then all the VMA mappings of the parent process are copied to the child, including the page table entries. In this case, any write to a COW mapping should trigger a copy-on-write. The main thing to note is that the children inherit the COW property of all the copied VMA mappings of their parent and don’t need to be explicitly marked as COW.

However, if CLONE_VM is passed, then the VMAs are not copied and the memory descriptors of the child and the parent process are the same; in this case the child and parent share the same address space and are thus threads. See for yourself. COW or no COW doesn’t matter here.

So here’s a question for you: for N clone system calls with !CLONE_VM passed, spawning N separate processes, we can just create as many VMA copies as we want each time, and the COW mappings will take care of themselves, right? Almost! There’s more work: the physical pages of both the original VMA and the copied VMA have to be marked read-only. That’s the only way copy-on-write will be triggered by the CPU when those pages are written to. Here’s the code in copy_one_pte that sets this up:

     /*
      * If it's a COW mapping, write protect it both
      * in the parent and the child
      */
     if (is_cow_mapping(vm_flags)) {
             ptep_set_wrprotect(src_mm, addr, src_pte);
             pte = pte_wrprotect(pte);
     }

There you go: now when the COW memory region is written to, a page fault happens, and the page fault handler knows that the VMA of the faulting page is COW and that this is what triggered the fault. It can then create a copy of the page and restart the faulting instruction, this time removing the write protection if no one else is sharing the page. So in short, fork followed by exec can be expensive if you have done lots of fork calls on a process with lots of large files mapped, since all this copying business is wasted by the subsequent exec system call.

There is one optimization, however: why should you have to do this marking for pages that are not physically present in memory? Those will fault anyway. So the above code is not run if the page is not present, which is nicely done by checking !pte_present(pte) before the preceding code.

Please share any comments you may have in the comments section.