ARM Guide to OpenCL Optimizing Convolution: Conclusion of the Optimization Process


This chapter describes the conclusions from the optimization process.


While a C or C++ implementation ported to OpenCL can already outperform the original code running on an application processor, there is usually still considerable room for improvement.

The following list summarizes techniques for optimizing this kernel and OpenCL kernels in general:

  • Find ways to parallelize sequential code, and keep work-items independent of each other wherever possible.
  • Compute more than one pixel, or more than one byte, per work-item.
  • Reduce the number of load and store operations by vectorizing.
  • Avoid conditional branches and loops. Use preprocessor macros or inline functions where you can.
  • Use the built-in OpenCL function library.
  • Use vector operations and data types of the right size, such as char, short, int, or float, depending on the data and the accuracy required.
  • Set constants at build time instead of at execution time.
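Several of these techniques can be combined in a single kernel. The sketch below is illustrative only, not the convolution kernel discussed in this guide: the kernel name, buffer names, the 16-pixel width per work-item, and the brightness offset are all assumptions made for the example. It shows one work-item processing 16 pixels with a single vector load and store, using a branch-free OpenCL built-in and the smallest data type that fits the data.

```c
/* Hypothetical OpenCL kernel: each work-item brightens 16 pixels,
 * so the NDRange is 1/16 the image size and far fewer load/store
 * operations are issued than in a one-pixel-per-work-item version. */
__kernel void brighten16(__global const uchar *src,
                         __global uchar *dst)
{
    const size_t i = get_global_id(0) * 16;

    uchar16 p = vload16(0, src + i);   /* one vectorized load          */
    p = add_sat(p, (uchar16)(16));     /* built-in, branch-free,
                                          saturating add on uchar data */
    vstore16(p, 0, dst + i);           /* one vectorized store         */
}
```

Using uchar16 rather than int16 here matters on vector hardware: with 8-bit pixel data, the narrower type lets each vector register carry four times as many elements per operation.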