A lot of folks ask: why do you have to do a fork and then an exec to execute a new program? Why can’t it be done in one step? And why does fork create a copy-on-write address space, only to have it thrown away later when you do an exec? So I decided to do a small write-up about this topic.
On a separate note, it is important to remember first that fork is not used for threading. Its primary use is to create a separate process that is a child of the parent process that called fork.
Normally one might think that a fork and a separate exec can be combined in one step, and it probably should be. But there are uses for keeping them separate. Here’s a post that explains why you might need to do just a fork call. Summarizing the post, you may need to set up some initial data and “fork” a bunch of workers. All these workers are supposed to execute in their own address space and share only the initial data. In this case, copy-on-write is extremely useful, since the initial data can be shared in physical memory and forking this way is extremely cheap. The kernel marks all these shared pages as read-only and makes a writable copy of a shared page when it is written to.
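To make the worker scenario concrete, here is a minimal sketch of a parent that sets up some data and forks one worker that writes to it. The array `initial_data`, the function name and the `DEMO_MAIN` build switch are all made up for illustration. The point is only that the child's write lands in its own copy of the page while the parent's data stays untouched.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* "Initial data" set up by the parent before forking (hypothetical). */
static int initial_data[4] = {1, 2, 3, 4};

/*
 * Fork one worker that overwrites initial_data[0] in its own address space.
 * Returns the value the parent still sees afterwards, or -1 on error.
 */
int parent_value_after_worker_write(void)
{
    initial_data[0] = 1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: the page is shared copy-on-write, so this write is private. */
        initial_data[0] = 99;
        _exit(initial_data[0] == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;
    return initial_data[0]; /* the parent's copy was never modified */
}

#ifdef DEMO_MAIN /* hypothetical switch: build with -DDEMO_MAIN to run */
int main(void)
{
    int seen = parent_value_after_worker_write();
    printf("parent still sees %d\n", seen);
    assert(seen == 1);
    return 0;
}
#endif
```

A real worker pool would fork several children and mostly read the shared data. Pages that are only read are never copied at all.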
There is a small overhead if fork is followed immediately by an exec system call, since the copy-on-write shared address space is of no use and is thrown away anyway. Combining fork and exec in this case might have some advantages by reducing this overhead.
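For comparison, here is a hedged sketch of both styles side by side. The first function does the classic fork followed by exec. The second uses posix_spawnp, a POSIX call that creates the child and loads the new program in one step. How much of the fork overhead it actually avoids depends on the C library, so treat it as an illustration rather than a benchmark. The command `sh -c "exit 7"` and the function names are arbitrary choices for the demo.

```c
#include <assert.h>
#include <spawn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Wait for pid and return its exit status, or -1 on error. */
static int wait_exit_status(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Two steps: fork a COW copy of this process, then replace it with sh. */
int run_with_fork_exec(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execlp("sh", "sh", "-c", "exit 7", (char *)NULL);
        _exit(127); /* only reached if exec failed */
    }
    return wait_exit_status(pid);
}

/* One call: posix_spawnp creates the child and loads the program. */
int run_with_posix_spawn(void)
{
    char *argv[] = {"sh", "-c", "exit 7", NULL};
    pid_t pid;
    if (posix_spawnp(&pid, "sh", NULL, NULL, argv, environ) != 0)
        return -1;
    return wait_exit_status(pid);
}

#ifdef DEMO_MAIN /* hypothetical switch: build with -DDEMO_MAIN to run */
int main(void)
{
    int a = run_with_fork_exec();
    int b = run_with_posix_spawn();
    printf("fork+exec: %d, posix_spawn: %d\n", a, b);
    assert(a == 7 && b == 7);
    return 0;
}
#endif
```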
Linux Implementation of Copy-on-write (COW) for shared Virtual Memory Areas
Some of this COW code that executes on a fork can be found in
mm/memory.c. There is an is_cow_mapping function to detect whether a virtual memory area (a region of virtual memory, see
/proc/self/maps) is copy-on-write.
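The check itself is tiny. Below is a user-space sketch of it that compiles outside the kernel. The two flag values are assumed to match include/linux/mm.h and the function body follows the kernel's is_cow_mapping as best I can tell, so verify both against your own kernel tree. `DEMO_MAIN` is a made-up build switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed to match include/linux/mm.h; check your kernel tree. */
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

typedef unsigned long vm_flags_t;

/* A mapping is COW when it is not shared but may become writable. */
static inline bool is_cow_mapping(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

#ifdef DEMO_MAIN /* hypothetical switch: build with -DDEMO_MAIN to run */
int main(void)
{
    printf("private, may write: %d\n", is_cow_mapping(VM_MAYWRITE));
    printf("shared, may write:  %d\n", is_cow_mapping(VM_SHARED | VM_MAYWRITE));
    printf("never writable:     %d\n", is_cow_mapping(0));
    assert(is_cow_mapping(VM_MAYWRITE));
    return 0;
}
#endif
```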
A VMA (Virtual Memory Area) is a contiguous segment of virtual memory belonging to a particular process. Every VMA has a bunch of VM_ flags associated with it.
VM_MAYWRITE, which is_cow_mapping checks, marks that a mapped region can be changed to writable by the mprotect system call. It is possible that a memory region is initially read-only and the user wants to make it writable. VM_MAYWRITE gives that permission. Note that if the kernel doesn’t set VM_MAYWRITE, then the region is automatically not COW because there is no question of writing to it.
When a memory mapping is created via the mmap system call and MAP_SHARED is passed in flags, the VM_SHARED bit is set for the VMA and as a result the region is not copy-on-write (the is_cow_mapping function returns false). By definition, shared memory regions are just that: shared. So there is no need for copy-on-write. In other words, if the VMA is a shared mapping or can never be made writable, then it isn’t a COW mapping.
Let’s take the example of mapping a file using mmap.
By default in the kernel on VMA creation, the VMA flags are set to VM_SHARED = 0 and VM_MAYWRITE = 1. Now if mmap is asked to create a shared mapping of a file by passing it the MAP_SHARED flag (for example, a mapping that can be shared with other processes that are being forked), then the VM_SHARED bit is set to 1 for that VMA. Additionally, if the file is opened in read-only mode, then VM_MAYWRITE is cleared to 0. Either way, is_cow_mapping returns false. Of course, a shared mapping doesn’t need to be COW.
On the other hand, if MAP_PRIVATE is passed in the flags to mmap, then the VM_SHARED bit stays at 0 and VM_MAYWRITE remains at 1 (regardless of whether the file was opened for reading or writing, since writes will not be carried to the underlying file). This makes is_cow_mapping return true. Indeed, private mappings should be copy-on-write enabled.
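The difference between the two flags is easy to observe from user space. The sketch below uses an anonymous mapping instead of a file to keep it self-contained, which is a substitution on my part. The function name and the `DEMO_MAIN` switch are made up. A forked child writes to the mapping and the parent then checks whether it can see the write.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one anonymous int with the given sharing flag (MAP_SHARED or
 * MAP_PRIVATE), let a forked child write 42 to it, and return what the
 * parent reads afterwards. Returns -1 on error.
 */
int parent_view_after_child_write(int sharing_flag)
{
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     sharing_flag | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        *cell = 42; /* shared: same page; private: the child's COW copy */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = *cell;

    munmap(cell, sizeof *cell);
    return result;
}

#ifdef DEMO_MAIN /* hypothetical switch: build with -DDEMO_MAIN to run */
int main(void)
{
    int shared = parent_view_after_child_write(MAP_SHARED);
    int private_view = parent_view_after_child_write(MAP_PRIVATE);
    printf("MAP_SHARED: %d, MAP_PRIVATE: %d\n", shared, private_view);
    assert(shared == 42 && private_view == 0);
    return 0;
}
#endif
```

With MAP_SHARED the parent reads the child's 42. With MAP_PRIVATE it still reads its own 0, because the child's write went to a private copy of the page.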
You can see the code I’m talking about conveniently here.
The important point here is that every mapping is either a COW mapping or not a COW mapping. During the clone system call, which the fork library call uses internally, if the CLONE_VM flag is not passed to clone (as is the case within fork), then all the VMA mappings of the parent process are copied to the child, including the page table entries. In this case, any write to a COW mapping should trigger a copy-on-write. The main thing to note is that a child inherits the COW property of all the VMA mappings copied from its parent, so they don’t need to be explicitly marked as COW.
If CLONE_VM is passed, then the VMAs are not copied and the memory descriptor of the child and the parent process is the same. In this case the child and parent share the same address space and are thus threads. See for yourself. COW or no COW doesn’t matter here.
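You can watch the effect of CLONE_VM directly with the glibc clone wrapper. This is a Linux-only sketch. The stack size, the names and the `DEMO_MAIN` switch are arbitrary choices for the demo. The child sets a global variable and the parent checks whether the change is visible.

```c
#define _GNU_SOURCE /* for clone() and CLONE_VM */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024) /* arbitrary size for the demo */

static volatile int shared_counter;

static int child_fn(void *arg)
{
    (void)arg;
    shared_counter = 42;
    return 0;
}

/*
 * Run child_fn via clone() with the given extra flags and return the value
 * of shared_counter that the parent sees afterwards (-1 on error).
 */
int counter_after_clone(int extra_flags)
{
    shared_counter = 0;
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows down on most architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                      extra_flags | SIGCHLD, NULL);
    int result = -1;
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = shared_counter;

    free(stack);
    return result;
}

#ifdef DEMO_MAIN /* hypothetical switch: build with -DDEMO_MAIN to run */
int main(void)
{
    int with_vm = counter_after_clone(CLONE_VM);
    int without_vm = counter_after_clone(0);
    printf("CLONE_VM: %d, no CLONE_VM: %d\n", with_vm, without_vm);
    assert(with_vm == 42 && without_vm == 0);
    return 0;
}
#endif
```

With CLONE_VM the parent sees the child's write because both run in the same address space. Without it the child wrote to its own COW copy.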
So here’s a question for you: for N clone system calls without CLONE_VM, spawning N child processes, we can just create as many VMA copies as we want each time and the COW mappings will take care of themselves. Right? Almost! There’s more work: the page table entries for the physical pages of both the original VMA and the copy VMA have to be marked read-only. That’s the only way the CPU will trigger a copy-on-write when those pages are written to. This is set up in copy_one_pte.
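The kernel code can’t be run on its own, so here is a toy user-space model of the write-protect step. It is not the kernel’s copy_one_pte. The `toy_` names, the PTE bit layout and the `DEMO_MAIN` switch are invented for illustration, and the VM_ flag values are assumed to match the kernel headers. What it shows is the rule described above: for a COW mapping, the write bit is dropped in both the parent’s entry and the child’s copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy PTE: just a flags word. These bit positions are made up. */
typedef unsigned long toy_pte_t;
#define TOY_PTE_PRESENT 0x1UL
#define TOY_PTE_WRITE   0x2UL

/* Assumed to match include/linux/mm.h; check your kernel tree. */
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

static bool is_cow_mapping(unsigned long vm_flags)
{
    return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

static toy_pte_t toy_pte_wrprotect(toy_pte_t pte)
{
    return pte & ~TOY_PTE_WRITE;
}

/*
 * Copy the parent's PTE for the child. For a COW mapping, drop the write
 * bit in BOTH entries so that the next write from either side faults.
 */
toy_pte_t toy_copy_one_pte(toy_pte_t *parent_pte, unsigned long vm_flags)
{
    toy_pte_t child_pte = *parent_pte;
    if (is_cow_mapping(vm_flags)) {
        *parent_pte = toy_pte_wrprotect(*parent_pte);
        child_pte = toy_pte_wrprotect(child_pte);
    }
    return child_pte;
}

/* Count how many of the two entries are still writable after the copy. */
int writable_entries_after_copy(unsigned long vm_flags)
{
    toy_pte_t parent_pte = TOY_PTE_PRESENT | TOY_PTE_WRITE;
    toy_pte_t child_pte = toy_copy_one_pte(&parent_pte, vm_flags);
    return ((parent_pte & TOY_PTE_WRITE) != 0) +
           ((child_pte & TOY_PTE_WRITE) != 0);
}

#ifdef DEMO_MAIN /* hypothetical switch: build with -DDEMO_MAIN to run */
int main(void)
{
    int cow = writable_entries_after_copy(VM_MAYWRITE);
    int shared = writable_entries_after_copy(VM_SHARED | VM_MAYWRITE);
    printf("writable entries, COW: %d, shared: %d\n", cow, shared);
    assert(cow == 0 && shared == 2);
    return 0;
}
#endif
```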
There you go. Now when the COW memory region is written to, a page fault happens, and the page fault handler knows that the VMA of the faulting page is COW and that this is what triggered the fault. It can then create a copy of the page and restart the faulting instruction, or simply remove the write protection if nobody else is sharing the page anymore. So in short, fork+exec can be expensive if you do lots of fork calls on a process with a lot of large mapped files, since all this copying business is wasted on a subsequent exec system call.
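If you want to see these faults happen, one rough way is to count minor page faults in the child with getrusage. This is a sketch under assumptions: the page count, the names and the `DEMO_MAIN` switch are made up, and the exact number depends on the kernel and its configuration. On a typical Linux system I would expect roughly one fault per page written, but the code does not rely on that.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPAGES 64 /* arbitrary size for the demo */

static long minor_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return 0;
    return ru.ru_minflt;
}

/* Write one byte per page; volatile keeps the compiler from dropping it. */
static void touch_pages(volatile char *buf, long page_size, char value)
{
    for (long i = 0; i < NPAGES; i++)
        buf[i * page_size] = value;
}

/*
 * Touch NPAGES pages in the parent, fork, then write to each page in the
 * child. Returns the number of minor faults the child took while writing,
 * or -1 on error.
 */
long child_faults_on_cow_writes(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;
    char *buf = malloc((size_t)NPAGES * (size_t)page_size);
    int fds[2];
    if (buf == NULL || pipe(fds) < 0) {
        free(buf);
        return -1;
    }
    touch_pages(buf, page_size, 1); /* parent: make the pages present */

    long delta = -1;
    pid_t pid = fork();
    if (pid == 0) {
        long before = minor_faults();
        touch_pages(buf, page_size, 2); /* child: write to COW pages */
        long faults = minor_faults() - before;
        ssize_t n = write(fds[1], &faults, sizeof faults);
        _exit(n == (ssize_t)sizeof faults ? 0 : 1);
    }
    close(fds[1]);
    if (pid > 0) {
        if (read(fds[0], &delta, sizeof delta) != (ssize_t)sizeof delta)
            delta = -1;
        waitpid(pid, NULL, 0);
    }
    close(fds[0]);
    free(buf);
    return delta;
}

#ifdef DEMO_MAIN /* hypothetical switch: build with -DDEMO_MAIN to run */
int main(void)
{
    long faults = child_faults_on_cow_writes();
    printf("child took %ld minor faults writing %d pages\n", faults, NPAGES);
    assert(faults >= 0);
    return 0;
}
#endif
```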
There is one optimization, however: why should you have to do this marking for pages that are not physically present in memory? Those will fault anyway. So the write-protect step is skipped if the page is not present, which is nicely done by checking !pte_present(pte) before the code that does the marking.
Please share any comments you may have in the comments section.