PMU (Performance Monitoring Unit) interrupt support for Exynos 4412

Recently I bought an ODROID-X development board powered by the Exynos 4412, a quad-core Cortex-A9 processor. It is incredibly cheap, costing well below $200, so it was a perfect fit for my purpose: doing some research (s/research/hacking/g).
 
For my purposes, I wanted to use hardware performance counter interrupts. Each Cortex-A9 core has six counters, each of which can be configured to count any of the 58 available events, including instruction and cache-miss events. Each counter can also be configured to generate an interrupt on overflow; for example, you can have your software handler invoked every 1,000 cache misses. My recent work uses this capability to control the memory access rate of each core.
 
Since my code uses the perf_event infrastructure to access the counters, it should, in theory, work without any modification. So I gave it a try on the board. Unfortunately, it did not work that way (as expected). I had no trouble compiling the kernel and my code on the board, because Ubuntu runs directly on it (thanks, Ubuntu and Linaro). The problems arose while running the code. There were two of them: one is a Cortex-A9-specific hardware limitation, and the other is a problem in the Exynos 4412 kernel implementation.

The first problem, the Cortex-A9-specific one, was easy. My kernel code accesses the performance counters through the perf_event infrastructure as shown below:

       struct perf_event_attr sched_perf_hw_attr = {
               .type           = PERF_TYPE_HARDWARE,
               .config         = PERF_COUNT_HW_CACHE_MISSES,
               .size           = sizeof(struct perf_event_attr),
               .pinned         = 1,
               .disabled       = 1,
               .exclude_kernel = 1,   /* <-- the troublesome line */
       };

       event = perf_event_create_kernel_counter(&sched_perf_hw_attr, cpu, NULL,
                                                event_overflow_callback, NULL);
The trouble was the .exclude_kernel line. Because I only want to count events in user mode, I set .exclude_kernel = 1 to instruct the counter not to count in kernel mode. Unfortunately, the Cortex-A9 does not support mode-dependent counting, so the counter creation failed. The solution in my case was simply not to set the flag (i.e., .exclude_kernel = 0), because it only means somewhat inaccurate counting. Not good, but not a deal breaker.
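In practice it is worth checking the return value and falling back when the PMU rejects the filter. A minimal sketch of that fallback (the retry logic is mine, not part of the original code):

       event = perf_event_create_kernel_counter(&sched_perf_hw_attr, cpu, NULL,
                                                event_overflow_callback, NULL);
       if (IS_ERR(event) && sched_perf_hw_attr.exclude_kernel) {
               /* The Cortex-A9 PMU cannot filter by privilege level, so
                * retry without the (unsupported) kernel-mode exclusion. */
               sched_perf_hw_attr.exclude_kernel = 0;
               event = perf_event_create_kernel_counter(&sched_perf_hw_attr,
                                                        cpu, NULL,
                                                        event_overflow_callback,
                                                        NULL);
       }
       if (IS_ERR(event))
               pr_err("counter creation failed: %ld\n", PTR_ERR(event));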

The second problem, the Exynos 4412-specific one, was rather more difficult. Although everything was configured correctly through the perf_event infrastructure, the processor simply did not generate interrupts. After digging through the ARM ARM (Architecture Reference Manual) and the Exynos 4412 user's manual, I figured out that the kernel from the official ODROID-X repository enables the wrong interrupt line. The Exynos 4412 has a Power Management Unit block whose abbreviation is also PMU; the problem is that each core's Performance Monitoring Unit is also a PMU. As confusing as that sounds, the kernel developer who wrote the following code seems to have been confused as well. Here, “arm-pmu”, the resource for the per-core performance monitor, uses the IRQ number of the Exynos 4412's power management unit.

static struct resource s5p_pmu_resource[] = {
       DEFINE_RES_IRQ(IRQ_PMU)         /* IRQ of the Power Management Unit */
};

static struct platform_device s5p_device_pmu = {
       .name           = "arm-pmu",    /* device for the Performance Monitoring Unit */
       .id             = ARM_PMU_DEVICE_CPU,
       .num_resources  = ARRAY_SIZE(s5p_pmu_resource),
       .resource       = s5p_pmu_resource,
};

As a result, the armpmu_reserve_hardware() function in arch/arm/kernel/perf_event.c requests the interrupt using IRQ_PMU, the power-management line, which of course never fires on counter overflow.

Because the per-core PMU (Performance Monitoring Unit) interrupts are wired up indirectly through an interrupt combiner, and the Exynos 4412's interrupt line mappings are somewhat odd, I had to add a special initialization routine to route each per-core PMU interrupt properly to its designated core.
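The shape of the fix is to give “arm-pmu” one interrupt resource per core, using the combiner-routed performance-monitor lines instead of IRQ_PMU. The IRQ names below are illustrative placeholders, not the real macros; the actual numbers come from the Exynos 4412 manual and are in the patch linked below.

/* Sketch only: IRQ_CPU_PMU0..3 are made-up names for the combiner-routed lines. */
static struct resource s5p_pmu_resource[] = {
       DEFINE_RES_IRQ(IRQ_CPU_PMU0),   /* performance monitor IRQ, core 0 */
       DEFINE_RES_IRQ(IRQ_CPU_PMU1),   /* performance monitor IRQ, core 1 */
       DEFINE_RES_IRQ(IRQ_CPU_PMU2),   /* performance monitor IRQ, core 2 */
       DEFINE_RES_IRQ(IRQ_CPU_PMU3),   /* performance monitor IRQ, core 3 */
};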

Here’s the patch. I hope it saves someone some time.

https://gist.github.com/4325369

ARM vs Intel

I think now is one of the most interesting times in computing history.

There’s the mighty Intel, which dominates the PC world. Most notebooks, desktops, and servers use Intel processors (sorry AMD, you are far behind).

On the other hand, ARM dominates the mobile computing world. Most, if not all, smartphones and tablets use a variety of ARM processors.

As the boundary between PC and mobile computing blurs, they now directly compete with each other.

The Chromebook is the first notebook based on the ARM Cortex-A15 [1]. It means that ARM processors now have performance comparable to Intel's low-end offerings (maybe not quite there yet, but good enough to be used in a notebook).

Notebooks are not the only category ARM is trying to enter. Servers are also a focus for ARM and its many licensees. Calxeda already has production server chips based on the Cortex-A9 [2]; Applied Micro showed off more serious server chips based on the 64-bit ARMv8 architecture [3]; and Nvidia is also preparing its own 64-bit ARM processors for servers [4].

Of course, Intel is not standing still. It is pouring resources into countering the challenge: keeping its turf while entering the smartphone/tablet category. With Clover Trail, the latest Atom line, Intel showed that x86 does not necessarily consume more power than ARM [5]. It is also pushing the power consumption of its more powerful mainstream Haswell cores down to levels comparable to ARM's offerings [6].

Computing systems are being redefined right now. Nobody knows how the computing landscape will look in a few years. But just watching it is exciting, and it would be even more exciting to be part of it.

[1] http://www.engadget.com/2012/10/18/google-launches-11-6-inch-arm-based-samsung-chromebook-249-ul/
[2] http://www.calxeda.com/
[3] http://www.apm.com/media/corporate_events/appliedmicro_to_host_technical_panel_at_arm/
[4] http://www.xbitlabs.com/news/cpu/display/20120921010327_Nvidia_Develops_High_Performance_ARM_Based_Boulder_Microprocessor_Report.html
[5] http://www.anandtech.com/show/6340/intel-details-atom-z2760-clovertrail-for-windows-8-tablets

Git + Dropbox

I like both Git and Dropbox. But syncing the .git directory through Dropbox eats up precious Dropbox space. Moreover, it is ugly.

So I searched for a way to avoid syncing .git directories under my Dropbox.

As always, Stack Overflow gave me the answer.

Just one issue, though: if you are using Windows, you should use the Windows path convention in the .git file, even if you are working in the Git Bash environment.

$ cat .git
gitdir: C:\path\to\repo.git
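For completeness, here is one way to set this up with the repository stored outside Dropbox (a sketch; the path C:\repos\project.git is just an example):

$ cd ~/Dropbox/project
$ git init --separate-git-dir "C:\repos\project.git" .
$ cat .git
gitdir: C:\repos\project.git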

 

Thanks, Stack Overflow.

 

Inspiration or steal?


The first commercial GUI, on the Xerox Star workstation. (Image source: Wikipedia)

 

The history of the GUI is full of “inspiration” from others. Apple's Macintosh GUI was inspired by Xerox PARC. Microsoft was then inspired by Apple. A British company called Acorn first created the taskbar concept now found in most desktop systems [1]. Many elements, such as pull-down menus, icons, drag-and-drop, the taskbar, pop-up menus, and so on, were developed by someone and then shared by everybody.

If all those user interface elements had to be protected by patents, would that really promote innovation? It seems the US court believes that GUI elements such as rounded square icons, bounce-back when scrolling, tap-to-zoom, and so on [2] should be protected by law; it ordered Samsung to pay one billion dollars to Apple for using them in its products [3].

Although I agree that Apple is innovative in designing such a nice UI and should be credited for it, I am not convinced that GUI elements should be patentable. If they are, only giants like Apple, who can afford the cost of patents, can improve user interfaces. I do not think that is any good for the public.

[1] http://theoligarch.com/microsoft_vs_apple_history.htm
[2] http://si.wsj.net/public/resources/images/MK-BW708C_SAMAP_G_20120823183614.jpg
[3] http://www.nytimes.com/2012/08/25/technology/jury-reaches-decision-in-apple-samsung-patent-trial.html?_r=1&hp

Everybody hates bugs

People hate bugs so much that there has been a great deal of research on finding, fixing, avoiding, and preventing them. In a recent article in ACM Queue, Dr. Berger, a professor at UMass, gives a great overview of the state of the art in debugging C/C++ applications. I briefly summarize it here, with occasional notes from my perspective.

 

The focus here is C/C++ applications. C and C++ are unsafe languages, meaning they are vulnerable to memory errors such as buffer overflows and dangling pointers (use after free).
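For instance, both kinds of error fit in a few lines of C (a toy example of mine, not from the article):

#include <stdlib.h>
#include <string.h>

int main(void)
{
        char buf[8];
        strcpy(buf, "this string is too long");  /* buffer overflow: writes past buf[7] */

        char *p = malloc(16);
        free(p);
        p[0] = 'x';                              /* dangling pointer: use after free */
        return 0;
}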

 

Using a memory-safe language is not always feasible, for both technical (performance) and economic reasons.

 

Using the right tools mitigates the problem. Static analyzers such as Coverity can find many bugs (but not all, of course); they accept some false negatives in exchange for fewer false positives. Randomized fuzz-testing tools are used on production systems; for example, Microsoft's Driver Verifier performs randomized concurrency stress testing. These tools focus on finding bugs before the product ships. However, no tool finds all bugs, and bugs still appear after deployment.

 

Debugging after deployment is hard because it is difficult to get all the necessary information. To remedy this, desktop and mobile systems are often configured to send limited information, such as a stack trace with thread context, to a centralized server. An excellent example is Microsoft's Dr. Watson (now called, rather uninterestingly, Windows Error Reporting). Google provides a similar system called Breakpad.

 

Logging is also useful for debugging. In particular, many server applications constantly produce logs that may contain valuable clues about a crash. However, tracing a bug from logs is typically very time consuming, and the effort can be wasted if the log does not contain enough information. SherLog automates bug diagnosis from logs, and LogEnhancer automatically extends log messages with more useful information.

 

The main focus of the article is fault tolerance and avoidance. The author uses a car analogy: software needs seatbelts and airbags.

 

Conservative garbage collection can be applied to C/C++. A conservative collector identifies pointers conservatively, tracks their use, and reclaims objects when they are no longer reachable, just like a normal garbage collector. Of course, this inherits all the general problems of garbage collection: it requires more memory, can be slow, and can introduce high, unpredictable latency.
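As a concrete example, the Boehm-Demers-Weiser conservative collector can be dropped into C code roughly like this (a sketch; the header location and link flag, typically -lgc, vary by distribution):

#include <gc.h>        /* Boehm-Demers-Weiser conservative GC */

int main(void)
{
        GC_INIT();
        for (int i = 0; i < 1000000; i++) {
                void *p = GC_MALLOC(64);   /* allocated from the collected heap */
                (void)p;                   /* never freed; the collector reclaims it */
        }
        return 0;
}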

 

Memory errors can be detected by Valgrind's Memcheck, but with huge overhead (on the order of 100x). Google's AddressSanitizer reduces the overhead to around 2x by combining compiler and runtime techniques, although it requires recompiling the entire code base.
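Using it is just a compiler flag away (a sketch; recent GCC and Clang both accept -fsanitize=address):

/* Build with:  gcc -fsanitize=address -g overflow.c -o overflow
 * Running the binary makes AddressSanitizer report a heap-buffer-overflow
 * with a stack trace pointing at the offending line. */
#include <stdlib.h>

int main(void)
{
        int *a = malloc(8 * sizeof(int));
        a[8] = 42;              /* one element past the end of the allocation */
        free(a);
        return 0;
}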

 

At this point, he argues that reproducible bugs are good for debugging, but that users actually prefer heisenbugs over reproducible bugs. I do not completely agree: my experience with consumer electronics suggests that users are active testers, happy to report reproducible bugs to the manufacturer and to avoid the problematic sequence of actions. If they feel the system sometimes just doesn't work, however, they are as frustrated as developers and form a negative impression of the system's quality. I do agree with his web-browsing example: when a browser fails to display a page correctly, the user simply hits refresh, and that usually solves the problem. My point is that it depends on the type of application and cannot be generalized to “users are happier with heisenbugs than with reproducible bugs.”

 

DieHard is a memory allocator that leaves extra space between adjacent allocations in order to tolerate “small” buffer overflows. Another aspect of DieHard is that instead of reusing freed objects immediately, its allocator chooses objects randomly from a bitmap of the entire heap; the intended effect is that the probability of damage from an overflow drops because the chance that two particular objects end up adjacent is reduced. I think this is a controversial approach at best. The former idea, adding padding to tolerate small buffer overflows, has, however, been adopted by Microsoft, which now ships the Fault Tolerant Heap (FTH).
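To illustrate just the padding idea (a toy of mine, not DieHard itself), an over-allocating wrapper might look like this; the slack size is arbitrary:

#include <stdlib.h>

#define SLACK 64   /* arbitrary padding so a small overflow lands in dead space */

/* Toy illustration only: real hardened allocators (DieHard, FTH) do much more,
 * e.g. randomized placement and protection of heap metadata. */
static void *padded_malloc(size_t n)
{
        return malloc(n + SLACK);
}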

 

Grace and Dthreads convert threads into processes and use a page-based protection mechanism, similar to the one used in software DSMs, to enforce a sequential or otherwise predictable execution order. These systems can be viewed as safe runtimes that trade performance for added safety.

 

Overall, the article nicely surveys recent research on debugging and is well worth reading.

 

 

64bit ARM architecture (AArch64)

Finally, the day has come. The first upstream patch set for the 64-bit ARM architecture hit the LKML [1], and, as always, LWN summarizes the matter nicely [2]. It is done by, of course, ARM folks, namely Catalin Marinas, Marc Zyngier, and Will Deacon. I'm happy that Catalin lists himself as the sole maintainer (he is good and polite; the second part is a hard-to-find trait in the kernel community). There is still no actual silicon, so they implemented the code on top of ARM's simulator (wait, what about Applied Micro's 64-bit FPGA demo? [3]).

Checking the documentation included in the patch (Documentation/aarch64/), I found that it requires device tree and supports a 39-bit address space (512GB) with both 4KB and 64KB pages. If I understand correctly, kernel and user space each get the entire 39-bit space (if the 39th bit is 1, the address is in kernel space, while 0 means user space). Simply put, we only need to care about a 40-bit address space (just two more hex digits than a 32-bit address).

So here it is: a 64-bit architecture with 512GB of addressable memory for both kernel and user space. It's ready for prime time.

[1] Catalin Marinas, AArch64 Linux kernel port, http://lwn.net/Articles/505682/
[2] Jonathan Corbet, Supporting 64-bit ARM systems, http://lwn.net/Articles/506148/
[3] http://www.apm.com/media/corporate_events/appliedmicro_to_host_technical_panel_at_arm/

Spinlock implementation in ARM Linux Kernel

A spinlock is a synchronization mechanism frequently used in kernel development to provide low-cost synchronization between multiple cores. Its implementation depends heavily on the hardware architecture. In this post, I describe how it is implemented on ARM (to be precise, ARMv6 and ARMv7).


The ARM architecture provides two instructions, LDREX and STREX, for implementing spinlocks. The basic semantics are simple. LDREX (load exclusive) loads from a memory location into a register and marks the location for exclusive access. Other processors can still execute LDREX on the same location at the same time, so there can be multiple concurrent would-be owners. The important part is done by STREX (store exclusive): if multiple processors execute STREX concurrently, only one of them succeeds. A successful store returns 0; a failed one returns 1. A processor whose STREX failed must retry the LDREX/STREX sequence to gain exclusive access to the location. The actual Linux kernel code is shown below; I will describe it line by line.

https://gist.github.com/3018642
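Since the gist is hosted externally, here is roughly what the locking side looked like in arch/arm/include/asm/spinlock.h at the time (a reconstruction from memory, so details such as the WFE macro may differ slightly from the gist):

static inline void arch_spin_lock(arch_spinlock_t *lock)
{
        unsigned long tmp;

        __asm__ __volatile__(
"1:     ldrex   %0, [%1]\n"             /* load lock->lock exclusively */
"       teq     %0, #0\n"               /* is the lock free (== 0)? */
"       wfene\n"                        /* if not, wait for an event */
"       strexeq %0, %2, [%1]\n"         /* try to store 1 into lock->lock */
"       teqeq   %0, #0\n"               /* did the exclusive store succeed? */
"       bne     1b"                     /* no: start over from the LDREX */
        : "=&r" (tmp)
        : "r" (&lock->lock), "r" (1)
        : "cc");

        smp_mb();                       /* barrier: critical-section accesses stay after the lock */
}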


Line 1 loads the value of lock->lock, and line 2 checks whether it is 0. At line 3, if lock->lock is not zero, the processor executes WFE, which puts the CPU into a power-saving state until it receives an interrupt or an event. At line 4, it tries to store 1 into lock->lock to indicate that the lock is now held. Line 5 checks whether the exclusive store succeeded; if not, the sequence repeats from line 1. If it succeeded, line 7 performs a memory barrier so that all previous memory updates from other processors become visible to the requesting processor.

https://gist.github.com/3018644
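The corresponding unlock path, again reconstructed from memory rather than copied from the gist:

static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
        smp_mb();                       /* make the critical section's writes visible first */

        __asm__ __volatile__(
"       str     %1, [%0]\n"             /* store 0: the lock is now free */
        :
        : "r" (&lock->lock), "r" (0)
        : "cc");

        dsb_sev();                      /* data barrier + SEV to wake processors waiting in WFE */
}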


Unlocking is much simpler. Line 1 performs a memory barrier so that all changes made inside the critical section become visible to the other processors. Line 2 then simply stores 0 into lock->lock to indicate that the lock is free. Finally, at line 3, the processor sends an event to the other processors waiting in WFE (line 3 of the arch_spin_lock function).

For more architectural detail (local and global monitor state transitions and so on), refer to section 3.4 of the ARM Architecture Reference Manual [1].

[1] ARM, ARM Architecture Reference Manual. ARMv7-A and ARMv7-R edition

Why do I still use C?

So, in 2012, why do I still use C?

The simple answer is that the Linux kernel, which is what I work on most of the time, is written in C.

Another, more educational, answer is the following quote I found here:

“Knowing C gets you closer to the hardware, to better understand how things work on the system level.”

The Linux Graphics Stack

There is one word that describes the Linux graphics system: complex. There are so many subsystems, and many different implementations of the same subsystem. You can easily be overwhelmed by the sheer number of acronyms used to describe it.

Here is a good overview of the modern Linux graphics stack that I found recently. It mainly describes the graphics components of a standard desktop environment. Although it leaves out the more embedded graphics subsystems (SDL, the framebuffer, and various alternative window systems), I found it very useful for understanding what does what in a standard desktop environment.