Memory Traffic Optimization to Improve Application Performance

Memory traffic optimization yields the greatest speedup compared to all the other optimization techniques to be deployed in optimizing an application. The other techniques being vectorization and multithreading. And if we want to really leverage the power of vectorization, then we have to optimize data re-use in caches. In addition to this it is important to understand that vector arithmetic in modern processors is cheap, it’s memory access that’s expensive. It’s therefore paramount that we optimize memory access for bandwidth bound applications.

But first before delving looking at locality of memory access in space and time, let’s have a quick refresher on vectorization and multithreading.

Vectorization is basically having a single instruction operating on multiple data elements. The speedup as a result of vectorizing in your application will depend on the instruction set on your hardware since the number of registers differ across various platforms. Vectorization and multithreading can easily be implemented using OpenMP pragmas within your application. You can also perform automatic vectorization using compilers. However the caveat with automatic vectorization is that the compilers will only try to vectorize the innermost loops in cases where you have multiple nested loops. You can override this with a #pragma omp simd in your code. You also need to take care of vector dependence within your code to ensure that the code is vectorized correctly. An example of true vector dependence is as shown below:

for( int i=1;i<n; i++){

a[i]+=a[i-1];

}

In the above snippet, the compiler does not have enough information due to the dependence of a[i] on  a[i-1], to enable it to implement vectorization. It is also important to understand that vectorization will only work if the compiler knows the number of iterations in a loop. This is the reason you should avoid using while loops because the number of iterations are not known during compilation, instead use the good ol’ for loops. If the compiler does not see any vectorization opportunities, you can provide this through a technique called strip mining. Strip mining is a programming technique that turns one loop into two nested loops. This technique not only presents compiler with vectorization opportunities but also achieves data locality in time.

If you really want to leverage vectorization, you will have to optimize data re-use in cache by achieving locality of memory access in both space and time. To achieve locality of memory address in space, ensure that your application has a unit stride! This is to enable optimal usage of the cache lines during data transfer and thus e.g if you implement unit stride access in a loop and you are accessing back to back memory elements, for one double precision memory access, you’ll end up getting 7 free accesses and for  single precision memory access, you end up with 15 free accesses. Also, another important aspect to unit stride access is if you have a 2D-container you should access it columns first, then move to the next rows and maintain this sequence to ensure that the cache line is loaded with back to back data neighbors. Spatial locality can also be improved by moving from an Array of Structures (AoS) to a Structure of Arrays (SoA) in your data containers.

The main aim of achieving data locality in time is to reduce the chances of cache misses. The most effective techniques to have an optimal cache hit rate within a loop is to by loop tiling. Loop tiling is quite complex but basically involves two steps; strip mining and permutation. If you have 2 nested loops and you detect that one of the data containers is being re-used but the the re-use in not optimal, then you strip mine the loop accessing the data container and permute the outer two loops.

The purpose of this blog post was to provide a quick primer into various performance optimization techniques and the procedures. I would recommend that you dig deeper into each of these areas: vectorization , multithreading and memory traffic optimization in order to get a better understanding and be able to apply these to your stencil code and applications.

Leave a Reply

Your email address will not be published. Required fields are marked *