LinuxInternals.org

by Joel Fernandes

USDT for Reliable Userspace Event Tracing


Userspace programs, when compiled to native code, can place tracepoints in themselves using USDT (User Statically Defined Tracing); the details of the tracepoints (such as address and arguments) are placed as a note in the ELF binary for other tools to interpret. This at first seems much better than using Uprobes directly on userspace functions, since the latter not only needs symbol information from the binary but is also at the mercy of compiler optimizations and function inlining. In Android we write directly into tracing_mark_write for userspace tracepoints. While this has its advantages (simplicity), it also means that the only way to “process” the trace information for ALL tracing usecases is by reading the trace buffer back into userspace and processing it offline, thus needing a full user-kernel-user round trip. Further, it’s not possible to easily create triggers on these events (dump the stack or stop tracing when an event fires, for example). Uprobe event tracing, on the other hand, gets the full benefit that ftrace events do. Also, all userspace-emitted trace data is a string which has to be parsed and post-processed. USDT seems a nice way to solve some of these problems, since we can then use the user-provided data to create Uprobes at the correct locations, and process these on the fly using BCC without anything ever hitting the trace buffer or returning to userspace.

This simple article is based on some notes I made while playing with USDT probes. To build a C program with an SDT probe, you can just include sdt.h and sdt-config.h from the Ubuntu systemtap-sdt-devel package; this works for both arm64 and x86.

The C program can be as simple as (note that the provider and probe names are plain identifiers, not strings):

#include "sdt.h"

int main() {
  DTRACE_PROBE(test, probe1);
}

On compiling this, a new .note.stapsdt ELF section is created, which can be read with: readelf -n <bin-path>

Displaying notes found in: .note.stapsdt
  Owner                 Data size Description
  stapsdt              0x00000029 NT_STAPSDT (SystemTap probe descriptors)
    Provider: "test"
    Name: "probe1"
    Location: 0x0000000000000664, Base: 0x00000000000006f4, Semaphore: 0x0000000000000000
    Arguments: 

Here there are no arguments; however, we can use DTRACE_PROBE2 to pass more, for example two of them:

#include "sdt.h"

int main() {
  int a = 1;
  int b = 2;

  DTRACE_PROBE2(test, probe1, a, b);
}

The readelf tool now reads:

Displaying notes found in: .note.stapsdt
  Owner                 Data size Description
  stapsdt              0x00000040 NT_STAPSDT (SystemTap probe descriptors)
    Provider: "test"
    Name: "probe1"
    Location: 0x0000000000000672, Base: 0x0000000000000704, Semaphore: 0x0000000000000000
    Arguments: -4@-4(%rbp) -4@-8(%rbp)

Notice how the arguments show exactly how to access the parameters at the probe location. In this case, we know the arguments are on the stack, at the noted offsets from the base pointer.

Compiling with -fomit-frame-pointer shows the following in readelf:

Displaying notes found in: .note.stapsdt
  Owner                 Data size Description
  stapsdt              0x00000040 NT_STAPSDT (SystemTap probe descriptors)
    Provider: "test"
    Name: "probe1"
    Location: 0x0000000000000670, Base: 0x0000000000000704, Semaphore: 0x0000000000000000
    Arguments: -4@-4(%rsp) -4@-8(%rsp)

Without the base pointer, the compiler relies on the stack pointer to locate the arguments. However, notice that even though the C program is identical to the previous example, the “Arguments” string in the ELF note section has changed. This dynamic nature is one of the key reasons why SDT probes are so much better than, say, using perf probe directly to install Uprobes in the wild. With some help from userspace and the compiler, we have reliable information for accessing arguments without needing any DWARF debug info.
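To make this concrete, here is a toy parser (my own sketch, not BCC’s actual code) for the x86 register-relative form of these argument strings. The part before the @ is the signed size in bytes (negative means signed), and the part after is the location in the target’s assembler syntax:

```python
import re

def parse_stapsdt_arg(arg):
    """Parse one stapsdt argument descriptor, e.g. '-4@-8(%rbp)'.

    Only handles the x86 register-relative form; a real parser
    (such as BCC's) must also handle bare registers, constants
    and other addressing modes.
    """
    size, loc = arg.split("@", 1)
    size = int(size)
    m = re.match(r"^(-?\d+)\((%\w+)\)$", loc)
    if not m:
        raise ValueError("unhandled location: " + loc)
    return {"signed": size < 0, "size": abs(size),
            "offset": int(m.group(1)), "register": m.group(2)}

print(parse_stapsdt_arg("-4@-4(%rbp)"))
```

The arm64 form shown further below uses a different location syntax and would need another branch in the match.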

Compiling with ARM64 gcc shows a slightly different output for the Arguments. Note that the Arguments field is just a string which tools can post-process to fetch the probe data.

Displaying notes found in: .note.stapsdt
  Owner                 Data size Description
  stapsdt              0x0000003f NT_STAPSDT (SystemTap probe descriptors)
    Provider: "test"
    Name: "probe1"
    Location: 0x000000000000077c, Base: 0x0000000000000820, Semaphore: 0x0000000000000000
    Arguments: -4@[sp, 12] -4@[sp, 8]

USDT limitations as I see them:

  • No information about types is stored. This is kind of sad, since in order to know what to do with the values, one needs more information. These tracepoints were used with DTrace and SystemTap, and it turns out the scripts that probe these tracepoints are where the type information is stored or assumed. The Uprobes tracer supports “string”, but without knowing that a USDT argument is a string value, there’s no way a Uprobe can be created for it, since all the stapsdt note tells us is that there’s a pointer there (who knows if it’s a 64-bit integer or a character pointer, for example).

  • Argument names are also not stored. This means arguments have to be in the same order in the debug script as they are in the program being debugged.

It seems that with a little bit of work, both of these could be added. Does that warrant a new section, or can the stapsdt section be augmented without breaking existing tools? I don’t know yet.

SDT parsing logic

BCC has a USDT parser written to extract probes from the ELF. Read parse_stapsdt_note in src/cc/bcc_elf.c in the BCC tree for details.

Dynamic programming languages

Programs that are interpreted can’t provide this information ahead of time. The libstapsdt library tries to solve this by creating a shared library on the fly and linking to it from the dynamic program. This seems a bit fragile but appears to have users. There are wrappers for Python and Node.js. Check this article for more details.

Open question I have:

  • Do any existing Linux tools handle USDT strings? The Uprobe tracer does support strings, so the infrastructure seems to be there. I didn’t see any hints of this in BCC, nor does libstapsdt seem to have it.

Other ideas

  • Creating Uprobes on the fly when a process is loaded: Ideally, if the ELF note section had all the information the kernel needed, we could create the Uprobe trace events at load time and keep them disabled, without needing userspace to do anything else. This seems crude at first, but in the “default” case it would still have almost no overhead. It does mean that all the information Uprobe tracing needs would have to be stored in the note section. The other nice thing is that you would no longer need to know the PID of every process with USDTs in it. EDIT: This idea is flawed. Uprobes are created before a process is loaded, AFAIU, using the binary path of the executable and libraries. What’s more useful is to maintain a cache of all executables in the file system and their respective instrumentation points. Then on boot up, perhaps we can create all the necessary uprobes from early userspace.

  • For dynamic languages, libstapsdt seems great, but it feels a bit hackish since it creates a temporary file for the stub. Perhaps uprobes could be created after the temporary file is dlopen’ed, and the file then unlinked so that there aren’t any more references to it in the file system. Such a temporary file could also probably live on a RAM-based file system.

References

  1. Brendan Gregg’s USDT ftrace page
  2. Uprobe tracer in the kernel
  3. USDT for dynamic languages
  4. Sasha Goldstein’s “Next Generation Linux Tracing With BPF” article
  5. SystemTap SDT implementation

BPFd- Running BCC Tools Remotely Across Systems and Architectures


This article (with some edits) also appeared on LWN.

Introduction

BCC (BPF Compiler Collection) is a toolkit and a suite of kernel tracing tools that allow systems engineers to efficiently and safely get a deep understanding of the inner workings of a Linux system. Because they can’t crash the kernel, they are safer than kernel modules and can be run in production environments. Brendan Gregg has written nice tools and given talks showing the full power of eBPF based tools. Unfortunately, BCC has no support for a cross-development workflow. I define “cross-development” as a development workflow in which the development machine and the target machine running the developed code are different. Cross-development is very typical among embedded-systems kernel developers, who often develop on a powerful x86 host and then flash and test their code on SoCs (System on Chips) based on the ARM architecture. Not having a cross-development flow gives rise to several complications; let’s go over them and discuss a solution called BPFd that cleverly addresses the issue.

In the Android kernel team, we work mostly on ARM64 systems, since most Android devices are on this architecture. BCC tools support on ARM64 systems stayed broken for years. One of the reasons for this difficulty is ARM64 inline assembler statements: unavoidably, kernel header includes in BCC tools result in the inclusion of asm headers, which in the case of ARM64 can spew inline ARM64 assembly instructions, causing major pains to LLVM’s BPF backend. Recently this issue got fixed by BPF inline asm support (these LLVM commits), and folks could finally run BCC tools on arm64, but…

In order for BCC tools to work at all, they need kernel sources. This is because most tools need to register callbacks on the ever-changing kernel API in order to get their data. Such callbacks are registered using the kprobe infrastructure. When a BCC tool is run, BCC switches its current directory into the kernel source directory before compilation starts, and compiles the C program that embodies the BCC tool’s logic. The C program is free to include kernel headers for kprobes to work and to use kernel data structures.

Even if one were not to use kprobes, BCC also implicitly adds a common helpers.h include directive whenever an eBPF C program is being compiled, found in src/cc/export/helpers.h in the BCC sources. This helpers.h header uses the LINUX_VERSION_CODE macro to create a “version” section in the compiled output. LINUX_VERSION_CODE is available only in the sources of the specific kernel being targeted and is used during eBPF program loading to make sure the BPF program is being loaded into a kernel with the right version. As you can see, kernel sources quickly become mandatory for compiling eBPF programs.
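The version check itself is simple arithmetic: LINUX_VERSION_CODE is the value of the kernel’s KERNEL_VERSION(major, minor, patch) macro, which packs the three numbers into one integer. A quick sketch of the encoding:

```python
def kernel_version_code(major, minor, patch):
    # Mirrors the kernel's KERNEL_VERSION() macro:
    # (major << 16) + (minor << 8) + patch
    return (major << 16) | (minor << 8) | patch

# 4.14.0 encodes to 0x040e00; this is the value the "version"
# section must match for the load to succeed on that kernel.
assert kernel_version_code(4, 14, 0) == 0x040e00
```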

In some sense this build process is similar to how external kernel modules are built. Kernel sources are large in size and often can take up a large amount of space on the system being debugged. They can also get out of sync, which may make the tools misbehave.

The other issue is that Clang and LLVM libraries need to be available on the target being traced. This is because the tools compile the needed BPF bytecode, which is then loaded into the kernel. These libraries take up a lot of space. It seems overkill to need a full-blown compiler infrastructure on a system when the BPF code can be compiled elsewhere, and maybe even compiled just once. Further, these libraries need to be cross-compiled to run on the architecture you’re tracing. That’s possible, but why would anyone want to do that if they didn’t need to? Cross-compiling compiler toolchains can be tedious and stressful.

BPFd: A daemon for running eBPF BCC tools across systems

Sources for BPFd can be downloaded here.

Instead of loading up all the tools, compiler infrastructure and kernel sources onto the remote targets being traced and running BCC that way, I decided to write a proxy program named BPFd that receives commands and performs them on behalf of whoever is requesting them. All the heavy lifting (compilation, parsing of user input, reading the hash maps, presentation of results, etc.) is done by BCC tools on the host machine, with BPFd running on the target as the interface to the target kernel. BPFd encapsulates everything BCC needs and performs it – loading a BPF program; creating, deleting and looking up maps; attaching an eBPF program to a kprobe; polling for new data that the eBPF program may have written into a perf buffer; and so on. If it’s woken up because the perf buffer contains new data, it informs the BCC tools on the host; it can also return map data whenever requested, which may contain information updated by the target eBPF program.

Simple design

Before this work, the BCC tools architecture was as follows: [diagram: BCC architecture]

BPFd-based invocations partition this, making it possible to do cross-development and execution of the tools across machine and architecture boundaries. For instance, the kernel sources that the BCC tools depend on can be on the development machine, with the eBPF code being loaded onto a remote machine. This partitioning is illustrated in the following diagram: [diagram: BCC architecture with BPFd]

The design of BPFd is quite simple: it expects commands on stdin (standard input) and provides the results over stdout (standard output). Every command is always a single line, no matter how big. This allows easy testing using cat: one can simply cat a file of commands and check whether BPFd’s stdout contains the expected results. The results of a command, however, can span multiple lines.

BPF maps are data structures that a BPF program uses to store data which can be retrieved at a later time. Maps are represented by a file descriptor returned by the bpf system call once the map has been successfully created. For example, the following is a command to BPFd for creating a BPF hashtable map.

BPF_CREATE_MAP 1 count 8 40 10240 0

And the result from BPFd is:

bpf_create_map: ret=3

Since BPFd is proxying the map creation, the file descriptor (3 in this example) is mapped into BPFd's file descriptor table. The file descriptor can be used later to look up entries that the BPF program in the kernel may have created, or to clear all entries in the map, as is done by tools that periodically clear the accounting done by a BPF program.

The BPF_CREATE_MAP command in this example tells BPFd to create a map named count with map type 1 (a hashtable map), a key size of 8 bytes, a value size of 40 bytes, a maximum of 10240 entries, and no special flags. BPFd created the map, identified by file descriptor 3.
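As a sketch of how such a line is taken apart (the field names here are my own, not BPFd’s), parsing boils down to whitespace splitting:

```python
def parse_create_map(line):
    """Split a BPF_CREATE_MAP command line into its fields.

    Illustrative sketch based on the example above; the dict
    keys are mine, not BPFd's.
    """
    cmd, map_type, name, key_size, value_size, max_entries, flags = line.split()
    assert cmd == "BPF_CREATE_MAP"
    return {"type": int(map_type), "name": name,
            "key_size": int(key_size), "value_size": int(value_size),
            "max_entries": int(max_entries), "flags": int(flags)}

print(parse_create_map("BPF_CREATE_MAP 1 count 8 40 10240 0"))
```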

With the simple standard input/output design, it’s possible to write wrappers around BPFd to handle more advanced communication methods such as USB or Networking. As a part of my analysis work in the Android kernel team, I am communicating these commands over the Android Debug Bridge which interfaces with the target device over either USB or TCP/IP. I have shared several demos below.

Changes to the BCC project for working with BPFd

BCC needed several changes to be able to talk to BPFd over a remote connection. All these changes are available here and will be pushed upstream soon.

Following are all the BCC modifications that have been made:

Support for remote communication with BPFd such as over the network

A new remotes module has been added to BCC with an abstraction that different remote types, such as networking or USB, must implement. This keeps code duplication to a minimum. By implementing the functions needed for a remote, a new communication method can easily be added. Currently, an adb remote and a process remote are provided. The adb remote is for communication with a target device over USB or TCP/IP using the Android Debug Bridge. The process remote is probably useful just for local testing: with it, BPFd is forked on the same machine running BCC and communicates over stdin and stdout.
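The shape of that abstraction can be sketched roughly like this (class and method names are illustrative, not BCC’s actual remotes API). A process remote reduces to speaking the one-line protocol over a child process’s pipes:

```python
import subprocess

class Remote:
    """Interface each remote type (adb, process, networking, ...) implements."""
    def send_command(self, cmd):
        raise NotImplementedError

class ProcessRemote(Remote):
    """Forks a local program and speaks the line protocol over its pipes."""
    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def send_command(self, cmd):
        # One command per line out; read one line of the reply back.
        self.proc.stdin.write(cmd + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

# 'cat' echoes every line, so it stands in here for a BPFd that
# replies with one line per command:
r = ProcessRemote(["cat"])
print(r.send_command("BPF_CREATE_MAP 1 count 8 40 10240 0"))
```

An adb remote would implement the same send_command interface but tunnel the line over the Android Debug Bridge instead of local pipes.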

Changes to BCC to send commands to the remote BPFd

libbpf.c is the main C file in the BCC project that talks to the kernel for all things BPF. This is illustrated in the diagram above. In order to make BCC perform BPF operations on the remote machine instead of the local one, the parts of BCC that call into the local libbpf.c are now channeled to the remote BPFd on the target. BPFd on the target then performs the commands on behalf of the BCC tools running locally, by calling into its own copy of libbpf.c.

One of the tricky parts of making this work is that not only calls to libbpf.c but certain other paths need to be channeled to the remote machine as well. For example, to attach to a tracepoint, BCC needs a list of all available tracepoints on the system. This list has to be obtained from the remote system, not the local one, which is exactly why the GET_TRACE_EVENTS command exists in BPFd.

Making the kernel build for correct target processor architecture

When BCC compiles the C program encapsulated in a BCC tool into eBPF instructions, it assumes that the eBPF program will run on the same processor architecture that BCC is running on. This assumption doesn’t hold when building the eBPF program for a different target.

Some time ago, before I started this project, I changed this when building the in-kernel eBPF samples (which are simple standalone samples and unrelated to BCC). Now, I have had to make a similar change to BCC so that it compiles the C program correctly for the target architecture.

Installation

Try it out for yourself! Follow the Detailed or Simple instructions. Also, apply this kernel patch to make it faster to run tools like offcputime. I am submitting this patch to LKML as we speak.

BPF Demos: examples of BCC tools running on Android

Running filetop

filetop is a BCC tool which shows all read/write I/O operations, with an experience similar to the top tool. It refreshes every few seconds, giving you a live view of these operations. Go to your bcc directory and set the needed environment variables. For Android running on Hikey960, I run:

joel@ubuntu:~/bcc# source arm64-adb.rc

which basically sets the following environment variables:

  export ARCH=arm64
  export BCC_KERNEL_SOURCE=/home/joel/sdb/hikey-kernel/
  export BCC_REMOTE=adb

You could also use the convenient bcc-set script provided in BPFd sources to set these environment variables for you. Check INSTALL.md file in BPFd sources for more information.

Next I start filetop:

joel@ubuntu:~/bcc# ./tools/filetop.py 5

This tells the tool to monitor file I/O every 5 seconds.

While filetop is running, I start the stock email app in Android and the output looks like:

  Tracing... Output every 5 secs. Hit Ctrl-C to end
  13:29:25 loadavg: 0.33 0.23 0.15 2/446 2931
 
  TID    COMM             READS  WRITES R_Kb    W_Kb    T FILE
  3787   Binder:2985_8    44     0      140     0       R profile.db
  3792   m.android.email  89     0      130     0       R Email.apk
  3813   AsyncTask #3     29     0      48      0       R EmailProvider.db
  3808   SharedPreferenc  1      0      16      0       R AndroidMail.Main.xml
  3792   m.android.email  2      0      16      0       R deviceName
  3815   SharedPreferenc  1      0      16      0       R MailAppProvider.xml
  3813   AsyncTask #3     8      0      12      0       R EmailProviderBody.db
  2434   WifiService      4      0      4       0       R iface_stat_fmt
  3792   m.android.email  66     0      2       0       R framework-res.apk

Notice Email.apk being read by Android to load the email application, followed by various other reads related to the email app. Finally, WifiService continuously reads iface_stat_fmt to get network statistics for Android accounting.

Running biosnoop

biosnoop is another great tool that shows block-level I/O operations (bios) happening on the system, along with the latency and size of each operation. The following is sample output from running tools/biosnoop.py while doing random things in the Android system.

  TIME(s)        COMM           PID    DISK    T  SECTOR    BYTES   LAT(ms)
  0.000000000    jbd2/sdd13-8   2135   sdd     W  37414248  28672      1.90
  0.001563000    jbd2/sdd13-8   2135   sdd     W  37414304  4096       0.43
  0.003715000    jbd2/sdd13-8   2135   sdd     R  20648736  4096       1.94
  5.119298000    kworker/u16:1  3848   sdd     W  11968512  8192       1.72
  5.119421000    kworker/u16:1  3848   sdd     W  20357128  4096       1.80
  5.448831000    SettingsProvid 2415   sdd     W  20648752  8192       1.70

Running hardirqs

This tool measures the total time taken by different hardirqs in the system. Excessive time spent in hardirq context can result in poor real-time performance of the system.

joel@ubuntu:~/bcc# ./tools/hardirqs.py

Output:

  Tracing hard irq event time... Hit Ctrl-C to end.
  HARDIRQ                    TOTAL_usecs
  wl18xx                             232
  dw-mci                            1066
  e82c0000.mali                     8514
  kirin                             9977
  timer                            22384

Running biotop

Run biotop while launching the Android Gallery app and doing random stuff:

joel@ubuntu:~/bcc# ./tools/biotop.py

Output:

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
4524   droid.gallery3d  R 8   48  ?           33    1744   0.51
2135   jbd2/sdd13-8     W 8   48  ?           15     356   0.32
4313   kworker/u16:4    W 8   48  ?           26     232   1.61
4529   Jit thread pool  R 8   48  ?            4     184   0.27
2135   jbd2/sdd13-8     R 8   48  ?            7      68   2.19
2459   LazyTaskWriterT  W 8   48  ?            3      12   1.77

Open issues as of this writing

While most issues have been fixed, a few remain. Please check the issue tracker and contribute patches or help by testing.

Other usecases for BPFd

While the main usecase at the moment is easier use of BCC tools in cross-development models, another potential usecase gaining interest is easy loading of a BPF program. The compiled BPF program can be stored on disk in base64 format and sent to BPFd with something as simple as:

joel@ubuntu:~/bpfprogs# cat my_bpf_prog.base64 | bpfd

In the Android kernel team, for certain usecases that need eBPF, we are also experimenting with loading a program with a forked BPFd instance, creating maps, pinning them for use at a later time once BPFd exits, and then killing the BPFd fork since it’s done. Creating a separate process (fork/exec of BPFd) and having it load the eBPF program for you has the distinct advantage that run-time fixing-up of map file descriptors in the loaded eBPF machine instructions isn’t needed. In other words, the eBPF program’s instructions can be pre-determined and statically loaded. The reason for this convenience is that BPFd starts with the same number of file descriptors each time, before the first map is created.

Conclusion

Building code for instrumentation on a different machine than the one actually running the debugging code is beneficial, and BPFd makes this possible. Alternatively, one could write tracing code in a kernel module on a development machine, copy it over to a remote target, and do similar tracing/debugging. However, this is quite unsafe, since kernel modules can crash the kernel. eBPF programs, on the other hand, are verified before they’re run and are guaranteed to be safe when loaded into the kernel, unlike kernel modules. Furthermore, the BCC project offers great support for parsing the output of maps, processing it and presenting results, all using the friendly Python programming language. BCC tools are quite promising and could be the future of easier and safer deep tracing endeavours. BPFd can hopefully make it even easier to run these tools for folks such as embedded-system and Android developers, who typically compile their kernels on a local machine and run them on a non-local target machine.

If you have any questions, feel free to reach out to me or drop me a note in the comments section.

ARMv8: Flamegraph and NMI Support


Non-maskable interrupts (NMIs) are a really useful debugging feature that hardware can provide. Unfortunately, ARM doesn’t provide an out-of-the-box NMI mechanism. This post shows a flamegraph issue caused by missing NMI support, and the upstream work being done to simulate NMIs on ARMv8.

Some great Linux kernel features that rely on NMI to work properly are:

  • Backtrace from all CPUs: A number of places in the kernel rely on dumping the stacks of all CPUs at the time of a failure to determine what was going on. Some of them are Hung Task detection, Hard/soft lockup detector and spinlock debugging code.

  • Perf profiling and flamegraphs: To be able to profile code that runs in interrupt handlers, or in sections of code that disable interrupts, perf relies on NMI support in the architecture. Flamegraphs are a great visual representation of perf profile output. Below is a flamegraph I generated from perf profile output that shows just what happens on an architecture like ARMv8 with missing NMI support; perf is using maskable interrupts on this platform for profiling:

As you can see in the area of the flamegraph where the arrow points, a large amount of time is spent in _raw_spin_unlock_irqrestore. This can baffle anyone looking at the data for the first time, and make them think that most of the time is spent in the unlock function. What’s actually happening is that, because perf is using a maskable interrupt on ARMv8 to do its profiling, any section of code that disables interrupts will not be seen in the flamegraph (not be profiled). In other words, perf is unable to peek into sections of code where interrupts are disabled. As a result, when interrupts are re-enabled during _raw_spin_unlock_irqrestore, the perf interrupt routine kicks in and records the large number of samples that elapsed in the interrupt-disabled section, but falsely accounts them to _raw_spin_unlock_irqrestore, during which the perf interrupt finally got a chance to run. Hence the flamegraph anomaly. It is indeed quite sad that ARM still doesn’t have a true NMI, which perf would love to make use of.

BUT! Daniel Thompson has been hard at work trying to simulate non-maskable interrupts on ARMv8. The idea is based on using interrupt priorities and is the subject of the rest of this post.

NMI Simulation using priorities

To simulate an NMI, Daniel creates two groups of interrupts in his patchset: one group for all ‘normal’ interrupts, and the other for non-maskable interrupts (NMIs). Non-maskable interrupts are assigned a higher priority than the normal interrupt group. In order to ‘mask’ interrupts in this approach, Daniel replaces the regular interrupt-masking scheme in the kernel, which happens at the CPU-core level, with setting of the interrupt controller’s PMR (priority mask register). When the PMR is set to a certain value, only interrupts with a higher priority than the PMR value will be signaled to a CPU core; all other interrupts are silenced (masked). With this technique, it is possible to mask normal interrupts while keeping the NMI unmasked at all times.

Just how does he do this? First, a small primer on interrupts in the ARM world. ARM uses the GIC (Generic Interrupt Controller) to prioritize and route interrupts to CPU cores. GIC interrupt priorities go from 0 to 255, 0 being the highest and 255 the lowest. By default, the kernel assigns priority 0xa0 (160) to all interrupts. Daniel changes this default priority from 0xa0 to 0xc0 (you’ll see why), and then defines which values of the PMR are considered “unmasked” vs “masked”: masked is 0xb0 and unmasked is 0xf0. This results in the following priorities (greater numbers mean lower priority):

0xf0 (240 decimal)  (11110000 binary) - Interrupts Unmasked (enabled)
0xc0 (192 decimal)  (11000000 binary) - Normal interrupt priority
0xb0 (176 decimal)  (10110000 binary) - Interrupts masked   (disabled)
0x80 (128 decimal)  (10000000 binary) - Non-maskable interrupts

In this new scheme, when interrupts are to be masked (disabled), the PMR is set to 0xb0, and when they are unmasked (enabled), the PMR is set to 0xf0. As you can see, setting the PMR to 0xb0 indeed masks normal interrupts, because 0xb0 (PMR) < 0xc0 (normal); however, non-maskable interrupts still stay unmasked, as 0x80 (NMI) < 0xb0 (PMR). Also notice that in order to mask/unmask interrupts, all that needs to be done is flip bit 7 in the PMR (0xb0 –> 0xf0). Daniel largely uses Bit 7 as the mask bit in the patchset.
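The arithmetic can be checked in a few lines (values from the table above; recall the GIC signals an interrupt only when its priority value is numerically lower than the PMR):

```python
NMI_PRIO    = 0x80  # non-maskable group
MASKED      = 0xb0  # PMR value when "interrupts disabled"
NORMAL_PRIO = 0xc0  # priority for normal interrupts
UNMASKED    = 0xf0  # PMR value when "interrupts enabled"

def signalled(prio, pmr):
    # Lower value means higher priority; the GIC only signals
    # interrupts whose priority beats the current mask.
    return prio < pmr

# With the PMR at the "masked" value, normal interrupts are
# silenced but the NMI group still gets through:
assert not signalled(NORMAL_PRIO, MASKED)
assert signalled(NMI_PRIO, MASKED)
# Masking/unmasking really is a single-bit flip of the PMR:
assert MASKED ^ 0x40 == UNMASKED
```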

Quirk 1: Saving of the PMR context during traps

It’s suggested in the patchset that during traps, the priority value set in the PMR needs to be saved, because it may change during the trap. To facilitate this, Daniel found an unused bit in the PSTATE register (PSR). During any exception, Bit 7 of the PMR is saved into a PSR bit (he calls it the G bit) and restored on return from the exception. Look at the changes to the kernel_entry macro in the series for this code.

Quirk 2: Ack of masked interrupts

Note that interrupts are masked before the GIC interrupt controller code can even identify the source of the interrupt. When the GIC code eventually runs, it is tasked with identifying the interrupt source. It does so by reading the IAR register. This read also has the effect of “Acking” the interrupt – in other words, telling the GIC that the kernel has acknowledged the interrupt request for that particular source. Daniel points out that, because the new scheme uses the PMR for interrupt masking, it’s no longer possible to Ack interrupts without first unmasking them (by resetting the PMR), so he temporarily resets the PMR, does the IAR read, and then restores it. Look for the code in gic_read_iar_common in his patchset for this case.

Open questions I have

  • Where in the patchset does Daniel mask NMIs once an NMI is in progress, or is this even needed?

Future work

Daniel has tested his patchset only on the Foundation model so far, but it appears that the series, with modifications, should work on the newer Qualcomm chipsets that have the necessary GIC access from the core to adjust IRQ priorities. Also, currently Daniel has only implemented CPU backtrace; more work needs to be done for perf support, which I’ll look into if I can get backtraces working properly on real silicon first.