March 12, 2007

[Editor's note: Part 3 explains how to access DSP features like circular addressing from portable C. It also shows how to use pragmas and inline assembly. Part 5 shows how to optimize memory performance, and how to make speed vs. size tradeoffs.]

To most DSP programmers, "code optimization" implies optimization of inner loops. However, there can be more to application performance than tightly vectorized inner loops.

Firstly, embedded applications include an expanding range of algorithms, and a surprising variety of expressions are creeping into critical loops. In many cases, these new types of expressions cannot be easily vectorized.

Secondly, there is a tradition in the DSP world that the bulk of the application or "control code" only matters in so far as its code size, but sometimes the bulk of the application can account for a significant percentage of the execution time. "Control code" mainly refers to decision-making code (such as switch statements) but is can also be thought of as "generic" C code. This type of code is found in all applications—not just in DSP applications—so all the normal optimization techniques from the computing mainstream are applicable here.

These factors are becoming more important as DSP pipelines grow deeper. Deeper pipelines create longer branch delays, which makes decision-making code increasingly costly. In addition, DSP oriented processors often skimp in areas such as division and speculation. (For example, they often lack the speculative execution hardware found in general-purpose processors). All of this presents new opportunities for optimization.

The most important aspect of control code is decision making. Decision-making code is full of conditional expressions. The performance of this code depends on behavior of the branches associated with the conditional expressions. It is faster to not take branches. If the processor has branch-prediction hardware, a correctly-predicted branch is faster than an incorrectly predicted branch. Thus, control code executes faster if the processor can avoid taking branches—or at least correctly predicting branches—most of the time.

To achieve this goal, compilers use profile-guided optimization. With this approach, the toolset uses multiple builds and runs to analyze the application's behavior. Each run generates a complete trace of the program path showing every transfer of control. ("Compiled simulation" techniques are very useful here, as they offer speeds hundreds of times faster than conventional simulators.) Once the compiler knows (statistically) which way every branch went, the program is recompiled to use the most efficient branches. This optimization can easily result in a 10 to 15% improvement in control code performance.

Of course, the program path may depend upon the training data you provide. In general, this means that you want to use training data that represents a typical case. In some application areas, however, it is better to use specialized test data. For example, it may be desirable to accelerate the worst-case path rather than the average performance.

The results of control flow optimization can be disorienting because the compiler will aggressively rearrange the program. On many DSPs, the cheapest control flow decision is to not take the branch and drop through to the following instruction instead. On machines like Blackfin, even correctly predicted branches jumps cost a few stall cycles, so it is better to not take branches. To achieve this goal the compiler may have to substantially rearrange your code, so large blocks of assembly will move around after this optimization.

If you can't use simulation to build a trace then you can still intervene locally in key areas. You can insert "expected true" and "expected false" into your C to tell the compiler which way you expect branches to go:

Here we see it used with an error call. In most cases the code will proceed without executing the error routine. The "expected_false" tells the compiler that calling the error function is rare. As a result, the compiler rearranges the code as follows:

< Compare>
< Conditional Branch to X>
Y: the rest of the function.

X: < Call error routine>
    < Jump Y >

In most cases we have a zero cost drop through to Y. The rearrangement also benefits the cache. Because the most-frequently executed code is now lumped together with a linear control flow, cache misses are less likely.

It is also possible to removing some conditional expressions altogether. For instance, Blackfin offers min, max, and abs machine instructions. These instructions provide the same functionality as a simple "if" statements, as illustrated in Figure 1.

Figure 1. Simple conditional expressions (left) can often be replaced by single machine instructions (right).

In this example, we remove a conditional branch, which might have taken many cycles, and replace it with a single cycle machine instruction. It also helps that the code now takes the same number of cycles whether the condition is true or false.

When using this trick, make sure you know what data types the machine instructions support. For example, on the min and max instructions on Blackfin are defined for signed 32 bit values. The compiler will easily perform the transformation for 32-bit signed data, but it may require assistance from the programmer to handle 16-bit and/or unsigned data.