Debugging Android* OS running on Intel® Atom™ Processor via JTAG


For the system integrator and device manufacturer it may however very well be necessary to work on the device driver and system software stack layer as well. This is especially true if additional platform specific peripheral device support needs to be implemented or if the first operating system port to a new Intel® Atom™ processor based device is undertaken.

In the following chapters we will look at IEEE 1149.1 (JTAG) standard based debug solutions for this purpose as well as architectural differences between ARM* architecture and Intel® Architecture that may impact system level debugging.

1.1. JTAG Debugging

For true firmware, OS level system and device driver debug, using a JTAG interface is the most commonly used method in the embedded intelligent systems world. The Joint Test Action Group (JTAG) IEEE 1149.1 standard defines a “Standard Test Access Port and Boundary-Scan Architecture for test access ports used for testing printed circuit boards.” This standard is commonly simply referred to as the JTAG debug interface. From its beginnings as a standard for circuit board testing it has developed into the de facto interface standard for OS independent and OS system level platform debug.

More background information on JTAG and its usage in modern system software stack debugging is available in the article “JTAG 101; IEEE 1149.x and Software Debug” by Randy Johnson and Stewart Christie.

From the OEM’s perspective and that of their partner application and driver developers, understanding the interaction between the driver and software stack components running on the different parts of the system-on-chip (SoC) integrated intelligent system or smartphone form factor device is critical for determining platform stability. From a silicon validator’s perspective, the low level software stack provides the test environment that exposes the kind of stress factors the platform will be exposed to in real-world use cases. In short, modern SoCs require understanding the complete package and its complex real-world interactions, not just positive unit test results for individual hardware components. This is the level of insight a JTAG-based system software debug approach can provide. This can be achieved by merging the in-depth hardware awareness JTAG inherently provides with the ability to export state information of the Android OS running on the target.

Especially for device driver debug, it is important to understand both the exact state of the peripheral device on the chipset and the interaction of the device driver with the OS layer and the rest of the software stack.

If you are looking at Android from the perspective of system debugging, looking at device drivers and the OS kernel, it is really just a specialized branch of Linux. Thus it can be treated like any 2.6.3x or higher Linux.

The Intel® Atom™ Processor Z2460 supports IEEE-1149.1 and IEEE-1149.7 (JTAG) Boundary Scan and MIPI Parallel Trace Interface (PTI) as well as Branch Trace Storage (BTS), Last Branch Record (LBR) and Architectural Event Trace (AET) based instruction tracing through Intel’s JTAG-compliant eXtended Debug Port (XDP).

Various JTAG vendors offer system debug solutions with Android support including:
• Wind River (http://www.windriver.com/products/JTAG-debugging/)
• Lauterbach (http://www.lauterbach.com)
• Intel (http://software.intel.com - access restrictions may apply)

1.2. Android OS Debugging

What complicates debugging an Android-based platform is that Android usually very aggressively takes advantage of low power idle states and sleep states to optimize for power consumption. Thus the real challenge becomes debugging through low power states and

• either maintaining JTAG functionality through some of the low power states
• or, where this is not possible, reattaching JTAG as soon as the chipset power domain for JTAG is re-enabled.

Many OS level issues on these types of platforms tend to center around power mode changes and sleep/wake-up sequences.

A system debugger, whether debug agent based or using a JTAG device interface, is a very useful tool to help satisfy several of the key objectives of OS development.

The debugger can be used to validate the boot process and to analyze and correct stability issues like runtime errors, segmentation faults, or services not being started correctly during boot.

It can also be used to identify and correct OS configuration issues by providing detailed access and representations of page tables, descriptor tables, and also instruction trace. The combination of instruction trace and memory table access can be a very powerful tool to identify the root causes for stack overflow, memory leak, or even data abort scenarios.

Figure 1 shows the page table translation from physical to virtual memory addresses. With the high level of flexibility that is available on x86 in defining the depth of translation tables and granularity of the addressed memory blocks, this level of easy access and visibility of the memory layout becomes even more important for system development on the OS level.


Figure 1: Logical Address to Linear Address Translation

To translate a logical address into a linear address, the processor does the following:

  1. Uses the offset in the segment selector to locate the segment descriptor for the segment in the GDT or LDT and reads it into the processor. (This step is needed only when a new segment selector is loaded into a segment register.)
  2. Examines the segment descriptor to check the access rights and range of the segment to ensure that the segment is accessible and that the offset is within the limits of the segment.
  3. Adds the base address of the segment from the segment descriptor to the offset to form a linear address.
  4. If paging is not used, the processor maps the linear address directly to a physical address (that is, the linear address goes out on the processor’s address bus). If the linear address space is paged, a second level of address translation is used to translate the linear address into a physical address.
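The first three steps above can be sketched in a few lines of Python. This is a simplified model and not a description of any debugger's internals: the SegmentDescriptor class and the sample table contents are hypothetical, access-rights checks are reduced to a single "present" flag, and the table indicator and privilege bits of the selector are ignored.

```python
from dataclasses import dataclass

@dataclass
class SegmentDescriptor:
    base: int             # linear base address of the segment
    limit: int            # highest valid offset within the segment
    present: bool = True  # simplified stand-in for the access-rights check

def logical_to_linear(table, selector, offset):
    """Translate selector:offset into a 32-bit linear address."""
    index = selector >> 3        # bits 15..3 of the selector index the table
    descriptor = table[index]    # step 1: locate the segment descriptor
    # Step 2: check accessibility and that the offset is within the limit.
    if not descriptor.present or offset > descriptor.limit:
        raise ValueError("segment not accessible or offset beyond limit")
    # Step 3: add the segment base to the offset.
    return (descriptor.base + offset) & 0xFFFFFFFF

# Hypothetical descriptor table: null entry, flat 4 GByte segment, 64 KByte segment.
gdt = [
    SegmentDescriptor(0, 0, present=False),
    SegmentDescriptor(0x00000000, 0xFFFFFFFF),
    SegmentDescriptor(0x00100000, 0x0000FFFF),
]

print(hex(logical_to_linear(gdt, 0x10, 0x1234)))  # selector 0x10 selects entry 2
```

With a flat segment such as entry 1, the linear address equals the offset, which is the configuration a Linux-based kernel typically presents to the debugger.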

When operating in protected mode, the Intel Architecture permits the linear address space to be mapped directly into a large physical memory (for example, 4 GBytes of RAM) or indirectly (using paging) into a smaller physical memory and disk storage. This latter method of mapping the linear address space is commonly referred to as virtual memory or demand-paged virtual memory.

When paging is used, the processor divides the linear address space into fixed-size pages (generally 4 KBytes in length) that can be mapped into physical memory and/or disk storage. When a program (or task) references a logical address in memory, the processor translates the address into a linear address and then uses its paging mechanism to translate the linear address into a corresponding physical address. If the page containing the linear address is not currently in physical memory, the processor generates a page-fault exception (#PF). The exception handler for the page-fault exception typically directs the operating system or executive to load the page from disk storage into physical memory (perhaps writing a different page from physical memory out to disk in the process). When the page has been loaded in physical memory, a return from the exception handler causes the instruction that generated the exception to be restarted. The information that the processor uses to map linear addresses into the physical address space and to generate page-fault exceptions (when necessary) is contained in page directories and page tables stored in memory.

Paging is different from segmentation through its use of fixed-size pages. Unlike segments, which usually are the same size as the code or data structures they hold, pages have a fixed size. If segmentation is the only form of address translation used, a data structure present in physical memory will have all of its parts in memory. If paging is used, a data structure can be partly in memory and partly in disk storage.

To minimize the number of bus cycles required for address translation, the most recently accessed page-directory and page-table entries are cached in the processor in devices called translation lookaside buffers (TLBs). The TLBs satisfy most requests for reading the current page directory and page tables without requiring a bus cycle. Extra bus cycles occur only when the TLBs do not contain a page-table entry, which typically happens when a page has not been accessed for a long time.
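The paging and TLB behavior described above can be illustrated with a small model. The sketch assumes classic 32-bit paging with 4-KByte pages (10-bit directory index, 10-bit table index, 12-bit offset). The dictionaries standing in for the page directory, page tables, and TLB, as well as the PageFault class, are illustrative assumptions rather than real hardware structures.

```python
PAGE_PRESENT = 0x1  # bit 0 of an entry: page is present in physical memory

class PageFault(Exception):
    """Stands in for the page-fault exception (#PF)."""

def split_linear(addr):
    """Split a 32-bit linear address into directory index, table index, offset."""
    return (addr >> 22) & 0x3FF, (addr >> 12) & 0x3FF, addr & 0xFFF

def translate(page_directory, addr, tlb):
    """Translate a linear address to a physical address, consulting the TLB first."""
    page = addr >> 12
    if page in tlb:                       # TLB hit: no table walk needed
        return tlb[page] | (addr & 0xFFF)
    dir_index, table_index, offset = split_linear(addr)
    page_table = page_directory.get(dir_index)
    entry = page_table.get(table_index, 0) if page_table else 0
    if not entry & PAGE_PRESENT:          # page not in physical memory
        raise PageFault(hex(addr))
    frame = entry & 0xFFFFF000            # upper 20 bits hold the page frame
    tlb[page] = frame                     # cache the translation
    return frame | offset

# Hypothetical mapping: directory entry 1, table entry 3 -> frame 0x12345000.
page_directory = {1: {3: 0x12345000 | PAGE_PRESENT}}
tlb = {}
print(hex(translate(page_directory, 0x00403004, tlb)))
```

A second access to the same page is served from the tlb dictionary without touching the page tables, which mirrors the saved bus cycles described in the text.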

This highlights two key differences between developing and configuring the Android OS software stack on Intel architecture and many other architectures. The selector base and offset addressing model, combined with the local descriptor table (LDT) and global descriptor table (GDT) allow for deep, multilayered address translation from physical to virtual memory with variable address chunk granularity as well. This is a powerful capability for custom memory configuration in a compartmentalized environment with protected isolated memory spaces. If used incorrectly it can, however, also increase memory access times. Thus the good visibility of memory page translation is desirable.

One other difference between Intel architecture and others is the handling of system interrupts. On ARM, for instance, you have a predefined set of hardware interrupts in the reserved address space from 0x0 through 0x20. These locations then contain jump instructions to the interrupt handler. On Intel architecture a dedicated hardware interrupt controller is employed. The hardware interrupts are not accessed directly through memory space, but by accessing the Intel® 8259 interrupt controller. The advantage of this approach is that the interrupt controller already allows for direct handling of I/O interrupts for attached devices. In architectures that don’t use a dedicated interrupt controller, the IRQ interrupt usually has to be overloaded with a more complex interrupt handler routine to accomplish this.

2.1. Device Driver Debugging

A good JTAG debugger solution for OS level debug should furthermore provide visibility of kernel threads and active kernel modules along with other information exported by the kernel. To allow for debugging dynamically loaded services and device drivers, a kernel patch or a kernel module that exports the memory location of a driver’s initialization method and destruction method may be used.

Especially for system configuration and device driver debugging, it is also important to be able to directly access and check the contents of device configuration registers. These registers and their contents can be simply listed with their register hex values or visualized as bitfields as shown in Figure 2. A bitwise visualization makes it easier to catch and understand changes to a device state during debug, while the associated device driver is interacting with it.

Figure 2: Device Register Bitfield View

Analyzing the code after the Android compressed zImage kernel image has been unpacked into memory is possible by simply releasing run control in the debugger until start_kernel is reached. This implies of course that the vmlinux file that contains the kernel symbol information has been loaded. At this point the use of software breakpoints is possible. Prior to this point in the boot process only breakpoint-register–based hardware breakpoints should be used, to avoid the debugger attempting to write breakpoint instructions into uninitialized memory. The operating system is then successfully booted once the idle loop mwait_idle has been reached.

Additionally, if your debug solution provides access to Last Branch Record (LBR) based instruction trace, this capability can be used, in conjunction with the regular run control features of a JTAG debugger, to force an execution stop at an exception and then analyze the execution flow in reverse to identify the root cause of runtime issues.

Last Branch Records can be used to trace code execution from target reset. Since discontinuities in code execution are stored in these MSRs, debuggers can reconstruct executed code by reading the ‘To’ and ‘From’ addresses, accessing the memory between these locations, and disassembling the code. The disassembly is usually displayed in a trace GUI in the debugger interface. This may be useful for seeing what code was executed before a System Management Interrupt (SMI) or other exception if a breakpoint is set on the interrupt.
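The reconstruction idea can be sketched as follows. The sketch assumes the debugger has already read the branch records as a list of (from, to) address pairs ordered oldest to newest; the function name and the sample addresses are hypothetical. Each straight-line block of executed code starts at the ‘To’ address of one record and ends at the ‘From’ address of the next.

```python
def executed_ranges(lbr_records):
    """Return (start, end) address ranges executed between recorded branches.

    lbr_records: list of (from_addr, to_addr) pairs, oldest first.
    Each range starts where one branch landed and ends at the address
    of the next branch instruction that was taken.
    """
    ranges = []
    for (_, to_addr), (next_from, _) in zip(lbr_records, lbr_records[1:]):
        ranges.append((to_addr, next_from))
    return ranges

# Hypothetical records: three taken branches.
records = [(0x1010, 0x2000), (0x2008, 0x3000), (0x3004, 0x1014)]
for start, end in executed_ranges(records):
    print(f"executed {start:#x} .. {end:#x}")
```

A debugger would read target memory for each range and disassemble it to populate the trace view.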

2.2. Hardware Breakpoints

Just as on ARM architecture, processors based on Intel architecture support breakpoint instructions for software breakpoints as well as hardware breakpoints for both data and code. On ARM architecture you usually have a set of dedicated registers for breakpoints and data breakpoints (watchpoints). The common implementation tends to provide two of each. When these registers contain a value, the processor checks for accesses to the set memory address by the program counter register or by a memory read/write. As soon as the access happens, execution is halted. Software breakpoints differ in that execution is halted as soon as a breakpoint instruction is encountered. Since the breakpoint instruction replaces the assembly instruction that would normally be at a given memory address, the execution effectively halts before the instruction that normally would be at the breakpoint location is executed.
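The software breakpoint mechanism described above can be modeled in a few lines. This is a minimal sketch: target memory is represented by a bytearray, the class name is hypothetical, and 0xCC is the single-byte x86 breakpoint instruction (INT3). A real debugger would perform the reads and writes through its JTAG memory access.

```python
INT3 = 0xCC  # single-byte x86 breakpoint instruction

class SoftwareBreakpoints:
    """Patch breakpoint instructions into memory and restore the originals."""

    def __init__(self, memory):
        self.memory = memory  # bytearray standing in for target memory
        self.saved = {}       # address -> original byte

    def insert(self, address):
        if address not in self.saved:
            self.saved[address] = self.memory[address]  # remember the original
            self.memory[address] = INT3                 # replace with INT3

    def remove(self, address):
        self.memory[address] = self.saved.pop(address)  # restore the original

memory = bytearray([0x55, 0x89, 0xE5, 0xC3])  # push ebp; mov ebp, esp; ret
breakpoints = SoftwareBreakpoints(memory)
breakpoints.insert(1)
print(memory.hex())
breakpoints.remove(1)
print(memory.hex())
```

Because the breakpoint is a memory write, it only works once the code is present in writable memory, which is why hardware breakpoints are preferred early in the boot process.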

The implementation of hardware breakpoints on Intel architecture is very similar to that on ARM, although it is a bit more flexible.

On all Intel Atom Processor cores, there are four DR registers that store addresses, which are compared against the fetched address on the memory bus before (sometimes after) a memory fetch.

You can use all four of these registers to provide addresses that trigger any of the following debug run control events:

  • 00 – break on instruction execution
  • 01 – break on data write only
  • 10 – Undefined OR (if architecture allows it) break on I/O reads or writes
  • 11 – break on data reads or writes but not instruction fetch

Thus, all four hardware breakpoints can be used to be either breakpoints or watchpoints. Watchpoints can be either Write-Only or Read-Write (or I/O) watchpoints.
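As an illustration, the following sketch builds the debug control register (DR7) value that enables one of the four breakpoints. It assumes the standard IA-32 DR7 layout, in which the local enable bit for breakpoint n is bit 2n and its R/W and LEN fields occupy the four bits starting at bit 16 + 4n. The function and constant names are hypothetical, and only the 1-, 2-, and 4-byte lengths are modeled.

```python
# R/W field encodings, matching the list above.
RW_EXECUTE = 0b00
RW_WRITE = 0b01
RW_IO = 0b10
RW_READ_WRITE = 0b11

# LEN field encodings for the watched region size in bytes.
LEN_ENCODING = {1: 0b00, 2: 0b01, 4: 0b11}

def dr7_enable(dr7, index, rw, length=1):
    """Return dr7 with hardware breakpoint `index` (0..3) locally enabled."""
    if not 0 <= index <= 3:
        raise ValueError("only DR0..DR3 hold breakpoint addresses")
    if rw == RW_EXECUTE and length != 1:
        raise ValueError("instruction breakpoints use a length of 1")
    shift = 16 + 4 * index
    dr7 &= ~(0xF << shift)                              # clear old R/W and LEN bits
    dr7 |= (rw | (LEN_ENCODING[length] << 2)) << shift  # set new R/W and LEN bits
    dr7 |= 1 << (2 * index)                             # set local enable bit
    return dr7

# Breakpoint 0: break on execution. Breakpoint 1: 4-byte write watchpoint.
value = dr7_enable(0, 0, RW_EXECUTE)
value = dr7_enable(value, 1, RW_WRITE, length=4)
print(f"{value:#010x}")
```

The breakpoint addresses themselves go into DR0 through DR3; a JTAG debugger normally writes these registers on the user's behalf.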

2.3. Cross-Debug: Intel® Atom™ Processor and ARM Architecture

Many developers targeting the Intel Atom processor may have experience developing primarily for RISC architectures with fixed instruction length. MIPS and ARM are prime examples of ISAs with a fixed length. In general, the cross-debug usage model between an Intel Atom processor and ARM architecture processor is very similar. Many of the conceptual debug methods and issues are the same.

Developing on an Intel architecture-based development host for an Intel Atom processor target does, however, offer two big advantages, especially when the embedded operating system of choice is a derivative of one of the common standard operating systems like Linux or Windows. The first advantage is the rich ecosystem of performance, power analysis, and debug tools available for the broader software development market on Intel architecture. The second advantage is that debugging functional correctness and multithreading behavior of the application may be accomplished locally. This advantage will be discussed later in the chapter.

There are a few differences between Intel Atom processors and ARM processors that developers should know. These differences are summarized in the next two subsections.

2.4. Variable Length Instructions

The IA-32 and Intel 64 instruction sets have variable instruction length. The impact on the debugger is that it cannot just inspect the code in fixed 32-bit intervals, but must interpret and disassemble the machine instructions of the application based on the context of these instructions; the location of the next instruction depends on the location, size, and correct decoding of the previous. In contrast, on ARM architecture all the debugger needs to monitor is the code sequence that switches from ARM mode to Thumb mode or enhanced Thumb mode and back. Once in a specific mode, all instructions and memory addresses are either 32-bit or 16-bit in size. Firmware developers and device driver developers who need to precisely align calls to specific device registers and may want to rely on understanding the debugger’s memory window printout should understand the potential impact of variable length instructions.
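A toy decoder shows why the starting point matters. The sketch below only knows the lengths of a handful of 32-bit-mode opcodes (an assumption made to keep the example short; a real disassembler handles prefixes, ModR/M bytes, and the full opcode map). Decoding the same bytes from two different offsets yields two different instruction streams.

```python
ONE_BYTE_OPCODES = {0x90, 0xC3, 0xCC}  # nop, ret, int3

def instruction_length(code, offset):
    """Return the length of the instruction at `offset` (tiny opcode subset)."""
    opcode = code[offset]
    if opcode in ONE_BYTE_OPCODES or 0x50 <= opcode <= 0x57:  # push r32
        return 1
    if opcode == 0xEB:                                        # jmp rel8
        return 2
    if opcode == 0xE9 or 0xB8 <= opcode <= 0xBF:              # jmp rel32, mov r32, imm32
        return 5
    raise ValueError(f"opcode {opcode:#x} not covered by this sketch")

def boundaries(code, start=0):
    """Return the offsets at which instructions begin when decoding from `start`."""
    offsets = []
    offset = start
    while offset < len(code):
        offsets.append(offset)
        offset += instruction_length(code, offset)
    return offsets

code = bytes.fromhex("b8 90 90 90 90 c3")  # mov eax, 0x90909090; ret
print(boundaries(code))      # decoding from the true start
print(boundaries(code, 1))   # decoding from the middle of the mov
```

Started one byte late, the decoder interprets the immediate operand of the mov as four nop instructions, which is the kind of misreading a developer can run into when interpreting a raw memory window.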

2.5. Hardware Interrupts

One other architectural difference that may be relevant when debugging system code is how hardware interrupts are handled. On ARM architecture the exception vectors

  • 0 Reset
  • 1 Abort
  • 2 Data Abort
  • 3 Prefetch Abort
  • 4 Undefined Instruction
  • 5 Interrupt (IRQ)
  • 6 Fast Interrupt (FIQ)

are mapped from address 0x0 to address 0x20. This memory area is protected and cannot normally be remapped. Commonly, all of the vector locations at 0x0 through 0x20 contain jumps to the memory address where the real exception handler code resides. For the reset vector that implies that at 0x0 will be a jump to the location of the firmware or platform boot code. This approach makes the implementation of hardware interrupts and OS signal handlers less flexible on ARM architecture, but also more standardized. It is easy to trap an interrupt in the debugger by simply setting a hardware breakpoint at the location of the vector in the 0x0 through 0x20 address range.

On Intel architecture a dedicated hardware interrupt controller is employed. The interrupts

  • 0 System timer
  • 1 Keyboard
  • 2 Cascaded second interrupt controller
  • 3 COM2 - serial interface
  • 4 COM1 - serial interface
  • 5 LPT - parallel interface
  • 6 Floppy disk controller
  • 7 Available
  • 8 CMOS real-time clock
  • 9 Sound card
  • 10 Network adapter
  • 11 Available
  • 12 Available
  • 13 Numeric processor
  • 14 IDE -- Hard disk interface
  • 15 IDE -- Hard disk interface

cannot be accessed directly through the processor memory address space, but are handled by accessing the Intel 8259 Interrupt Controller. As can be seen from the list of interrupts, the controller already allows for direct handling of hardware I/O interrupts of attached devices, which are handled through the IRQ interrupt or fast interrupt on an ARM platform. This feature makes the implementation of proper interrupt handling at the operating system level easier on Intel architecture, especially for device I/O. The mapping of software exceptions like data aborts or segmentation faults is more flexible on Intel architecture as well and corresponds to an interrupt controller port that is addressed via the Interrupt Descriptor Table (IDT). The mapping of the IDT to the hardware interrupts is definable by the software stack. In addition, trapping these exceptions cannot be done as easily from a software stack agnostic debug implementation. In order to trap software events that trigger hardware interrupts on Intel architecture, some knowledge of the OS layer is required. It is necessary to know how the OS signals for these exceptions map to the underlying interrupt controller. Most commonly, even a system level debugger will use a memory mapped signal table from the operating system to trap exceptions instead of attempting to trap them directly on the hardware level.
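The software-defined mapping can be illustrated with a short sketch. With a cascaded pair of 8259 controllers, system software programs a vector base into each controller during initialization, and the IDT vector for an interrupt line is that base plus the line number on the controller. The base values 0x20 and 0x28 used below are example values chosen for illustration; the values a given OS uses must be taken from its source or configuration.

```python
def irq_to_vector(irq, master_base=0x20, slave_base=0x28):
    """Map an 8259 interrupt line (0..15) to its IDT vector number.

    master_base and slave_base are the vector bases programmed into the
    two controllers by system software; the defaults are example values.
    """
    if not 0 <= irq <= 15:
        raise ValueError("a cascaded 8259 pair provides IRQ 0..15")
    if irq < 8:
        return master_base + irq        # lines 0..7 belong to the first controller
    return slave_base + (irq - 8)       # lines 8..15 belong to the second controller

for irq in (0, 1, 14):
    print(f"IRQ {irq} -> vector {irq_to_vector(irq):#x}")
```

A debugger that knows this mapping can locate the handler for a given device interrupt by reading the corresponding IDT entry on the target.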

2.6. Single Step

ARM architecture does not have an explicit single step instruction. On Intel architecture, an assembly level single step is commonly implemented in the debugger directly via such an instruction. On ARM, a single instruction step is implemented as a “run until break” command. The debugger is required to do some code inspection to ensure that all possible code paths (especially when stepping away from a branch instruction) are covered. From a debugger implementation standpoint this does generate a slight overhead, but it is not excessive, since this “run until break” implementation is frequently needed for high level language stepping anyway. Software developers in general should be aware of this difference since it can lead to slightly different stepping behavior.
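The code inspection needed for a “run until break” step can be sketched as follows. The function is a simplified, hypothetical model: it assumes the debugger has already decoded the current instruction into its size, an optional branch target, and whether the branch is conditional. The returned addresses are where temporary breakpoints would be placed before resuming execution.

```python
def step_targets(pc, size, branch_target=None, conditional=False):
    """Return the addresses that may execute next after the instruction at pc."""
    fall_through = pc + size
    if branch_target is None:
        return [fall_through]                         # ordinary instruction
    if conditional:
        return sorted({fall_through, branch_target})  # either path may be taken
    return [branch_target]                            # unconditional branch

print([hex(a) for a in step_targets(0x8000, 4)])
print([hex(a) for a in step_targets(0x8000, 4, branch_target=0x9000, conditional=True)])
```

Indirect branches, where the target is only known from register contents at run time, need additional handling that this sketch leaves out.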

2.7. Virtual Memory Mapping

The descriptor table and page translation implementation for virtual memory mapping is surprisingly similar, at least conceptually. On Intel architecture, the Global Descriptor Table (GDT) and Local Descriptor Table (LDT) enable nested coarseness adjustments to how memory pages are mapped into the virtual address space. Figure 3 illustrates the linear to physical address translation on Intel architecture.

Figure 3: Page Translation on Intel® Architecture

On ARM, the first level and second level page tables define a more direct and at maximum, a one- or two-level–deep page search for virtual memory. Figure 4 shows a sample linear address to physical address translation.

Figure 4: Page Translation on ARM

Intel architecture offers multiple levels of coarseness for the descriptor tables, page tables, 32-bit address space access in real mode, and 64-bit addressing in protected mode that’s dependent on the selector base:offset model. ARM does not employ base:offset in its various modes. On Intel architecture, the page table search can implicitly be deeper. On ARM, the defined set is two page tables. On Intel architecture, the descriptor tables can actually mask nested tables and thus the true depth of a page table run can easily reach twice or three times the depth on ARM.

The page translation mechanism on Intel architecture provides for greater flexibility in the system memory layout and mechanisms used by the OS layer to allocate specific memory chunks as protected blocks for application execution. However, it does add challenges for the developer to have a full overview of the memory virtualization and thus avoid memory leaks and memory access violations (segmentation faults). On a full featured OS with plenty of memory, this issue is less of a concern. Real-time operating systems with more visibility into memory handling may be more exposed to this issue.

Considerations for Intel® Hyper-Threading Technology

From a debugging perspective there is really no practical difference between a physical processor core and a logical core that has been enabled via Intel Hyper-Threading Technology. Enabling hyper-threading occurs as part of the platform initialization process in your BIOS. Therefore, there is no noticeable difference from the application standpoint between a true physical processor core and an additional logical processor core. Since this technology enables concurrent execution of multiple threads, the debugging challenges are similar to true multicore debug.

SoC and Interaction of Heterogeneous Multi-Core

Dozens of software components and hardware components interacting on SoCs increase the amount of time it takes to root-cause issues during debug. Interactions between the different software components are often timing sensitive. When trying to debug a code base with many interactions between components, single-stepping through one specific component is usually not a viable option. Traditional printf debugging is also not effective in this context because the debugging changes can adversely affect timing behavior and cause even worse problems (also known as “Heisenbugs”).

4.1. Event Trace Debugging

There are a variety of static software instrumentation based data event tracing technologies that help address this issue. The common principle is that they use a small DRAM buffer to capture event data as it is being created and then use some kind of logging mechanism to write the trace result into a log file. Data trace monitoring can be done in real time by interfacing directly with the trace logging API, or offline by using a variety of trace viewers for analyzing more complex software stack component interactions.
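The common principle can be illustrated with a minimal sketch. It is not modeled on the API of any specific tracing framework; the EventTrace class and its method names are hypothetical. Events are captured into a fixed-size in-memory ring buffer, which keeps the cost of each instrumentation point low, and are formatted into log lines only when the buffer is dumped.

```python
import time
from collections import deque

class EventTrace:
    """Fixed-size in-memory event buffer with deferred log formatting."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest event when the buffer is full.
        self.events = deque(maxlen=capacity)

    def record(self, component, event, **data):
        """Capture one event with a monotonic timestamp in nanoseconds."""
        self.events.append((time.monotonic_ns(), component, event, data))

    def dump(self):
        """Format the buffered events as log lines, oldest first."""
        return [f"{ts} {component}:{event} {data}"
                for ts, component, event, data in self.events]

trace = EventTrace(capacity=2)
trace.record("audio", "dma_start", channel=1)
trace.record("video", "frame_done", frame=42)
trace.record("audio", "dma_done", channel=1)  # evicts the oldest event
for line in trace.dump():
    print(line)
```

Deferring the formatting and file output is what keeps the timing impact small compared to printf-style debugging.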

LTTng*, Ftrace* and SVEN* are 3 of the most common such implementations.

Below is a comparison table of those 3 that are focused primarily on Linux* type operating systems and can thus be made applicable to Android* as well.

Figure 5: Different Data Event Tracing Solutions

