Enabling your code for vector execution on multi-core architectures

Abstract

In the era of higher memory bandwidth, larger vector registers, and increasing core counts, have you ever asked where your performance lies? What is the maximum speed-up achievable on the underlying architecture? How does the code implementation affect your ability to reach the goals of high-performance computing?

In this tutorial, we will answer all these questions and you will have the opportunity to learn techniques, methods and solutions on how to improve your code, how to enable the new hardware features and how to use the roofline model to visualize the potential benefits of your optimization process.

We will start with an overview of the latest microprocessor architectures and how intrinsic parallelism is implemented in hardware, mainly through SIMD instructions and multi-threading. We then focus on how to define and measure processor and memory performance and how these relate to application-level performance. In particular, we describe the roofline model, which provides a visual estimate of the performance an application can attain and of the limits imposed by the underlying hardware.

Improving application performance requires re-architecting and/or tuning existing code to expose enough parallelism and vectorization. This is where the code modernization framework comes in: a systematic approach to follow in order to achieve the highest possible performance. With the help of examples and use cases, we point out possible inefficiencies in both threading and vectorization, and we explain remedies, hints and strategies to ensure that an application delivers great performance on today’s scalable hardware and on upcoming generations.

Furthermore, we will show, with examples and use cases, how performance analysis tools like Intel® Advisor and Intel® VTune Amplifier pinpoint inefficiencies in both threading and vectorization and give hints for remedies. The outline can be summarized as follows: modern computer architecture, cache and memory system, optimization process and vectorization, profiling tools.

Agenda

The total estimated time for the tutorial is 3 hours. The tutorial will consist of presentations and demo sessions. Below is the tutorial outline:

Introduction and Motivation
Modern Computer Architecture
- Moore's law
- SIMD instructions
- Simultaneous multithreading
- Cache and Memory System
Roofline model
- How to measure performance
- Arithmetic intensity
- Roofline chart
Code Optimization Process
- Introduction to code modernization approach
- Scalar and serial optimization
- Vectorization
- Parallelization
Profiling tools
- Intel® Advisor
- Intel® VTune Amplifier

Prerequisite knowledge

Attendees should be comfortable with either C/C++ or Fortran and with basic Linux commands, like make and ssh. No previous experience with vectorization, parallelization or profiling tools is required.

Presenter

Fabio Baruffa is a software technical consulting engineer in the Developer Products Division (DPD) of the Software and Services Group (SSG) at Intel. He works in the compiler team and provides customer support in the high performance computing (HPC) area. Prior to joining Intel, he worked as an HPC application specialist and developer at some of the largest supercomputing centers in Europe, mainly the Leibniz Supercomputing Centre and the Max Planck Computing and Data Facility in Munich, as well as Cineca in Italy. He has been involved in software development, analysis of scientific code and optimization for HPC systems. He holds a PhD in Physics from the University of Regensburg for his research in the area of spintronic devices and quantum computing.