Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time Louis-Noël Pouchet1 Cédric Bastoul1 Albert Cohen1 John Cavazos2
2
1 ALCHEMY group, INRIA Saclay / University of Paris-Sud 11, France Dept. of Computer & Information Sciences, University of Delaware, USA
June 9, 2008
ACM SIGPLAN 2008 Conference on Programming Languages Design and Implementation Tucson, Arizona
PLDI’08
Introduction: Situation
Motivation
I
New architecture → New high-performance libraries needed
I
New architecture → New optimization flow needed
I
Architecture complexity/diversity increases faster than optimization progress
I
Traditional approaches lose performance portability. . .
We want a portable optimization process!
INRIA Saclay / U. of Delaware
2 / 18
PLDI’08
Introduction: The Problem
The Optimization Problem Architectural characteristics
Compiler optimization interaction
Domain knowledge
ALU, SIMD, Caches, ...
GCC has 205 passes...
Linear algebra, FFT, ...
Optimizing compilation process
Code for architecture 1
INRIA Saclay / U. of Delaware
Code for architecture 2
.........
Code for architecture N
3 / 18
PLDI’08
Introduction: The Problem
The Optimization Problem Architectural characteristics
Compiler optimization interaction
Domain knowledge
ALU, SIMD, Caches, ...
GCC has 205 passes...
Linear algebra, FFT, ...
Optimizing compilation process
Code for architecture 1
INRIA Saclay / U. of Delaware
Code for architecture 2
locality improvement, = vectorization, parallelization, etc...
.........
Code for architecture N
3 / 18
PLDI’08
Introduction: The Problem
The Optimization Problem Architectural characteristics
Compiler optimization interaction
Domain knowledge
ALU, SIMD, Caches, ...
GCC has 205 passes...
Linear algebra, FFT, ...
Optimizing compilation process
Code for architecture 1
INRIA Saclay / U. of Delaware
Code for architecture 2
parameter tuning, = phase ordering, etc...
.........
Code for architecture N
3 / 18
PLDI’08
Introduction: The Problem
The Optimization Problem Architectural characteristics
Compiler optimization interaction
Domain knowledge
ALU, SIMD, Caches, ...
GCC has 205 passes...
Linear algebra, FFT, ...
Optimizing compilation process
Code for architecture 1
INRIA Saclay / U. of Delaware
Code for architecture 2
pattern recognition, = hand-tuned kernel codes, etc...
.........
Code for architecture N
3 / 18
PLDI’08
Introduction: The Problem
The Optimization Problem Architectural characteristics
Compiler optimization interaction
Domain knowledge
ALU, SIMD, Caches, ...
GCC has 205 passes...
Linear algebra, FFT, ...
Optimizing compilation process
Code for architecture 1
INRIA Saclay / U. of Delaware
Code for architecture 2
= Auto-tuning libraries
.........
Code for architecture N
3 / 18
PLDI’08
Introduction: The Problem
The Optimization Problem Architectural characteristics
Compiler optimization interaction
Domain knowledge
ALU, SIMD, Caches, ...
GCC has 205 passes...
Linear algebra, FFT, ...
In reality, there is a complex interplay between all components
Our approach: build an expressive set of program versions
Optimizing compilation process
Code for architecture 1
INRIA Saclay / U. of Delaware
Code for architecture 2
.........
Code for architecture N
3 / 18
PLDI’08
Generating Program Versions: Overview
Iterative Optimization Flow High-level transformations Input code
Optimization 1
Optimization 2
Target code
INRIA Saclay / U. of Delaware
.........
Optimization N
Compiler
4 / 18
PLDI’08
Generating Program Versions: Overview
Iterative Optimization Flow
Input code
Set of program versions
Target code
Compiler
Program version = result of a sequence of loop transformation INRIA Saclay / U. of Delaware
4 / 18
PLDI’08
Generating Program Versions: Overview
Iterative Optimization Flow
Input code
Final code
Run
Set of program versions
Space explorer
Target code
Compiler
Program version = result of a sequence of loop transformation INRIA Saclay / U. of Delaware
4 / 18
Generating Program Versions: Properties
PLDI’08
Set of Program Versions
What matters is the result of the application of optimizations, not the optimization sequence
All-in-one approach: I
Legality: semantics is always preserved
I
Uniqueness: all versions of the set are distinct
I
Expressiveness: a version is the result of an arbitrarily complex sequence of loop transformation
INRIA Saclay / U. of Delaware
5 / 18
PLDI’08
Generating Program Versions: The Representation
The Polyhedral Model in a Nutshell I
Arbitrarily complex sequence of loop transformations are modeled in a single optimization step: new scheduling matrix
I
Granularity: each executed instance of each statement
for (i = ...; i < ...; ++i)
Θ:
S1(i); for (i = ...; i < ...; ++i) S2(i);
I
First row → all outer-most loops
INRIA Saclay / U. of Delaware
6 / 18
PLDI’08
Generating Program Versions: The Representation
The Polyhedral Model in a Nutshell I
Arbitrarily complex sequence of loop transformations are modeled in a single optimization step: new scheduling matrix
I
Granularity: each executed instance of each statement
Θ:
for (i = ...; i < ...; ++i) for (j = ...; j < ...; ++j) S1(i,j); for (i = ...; i < ...; ++i) for (j = ...; j < ...; ++j) S2(i,j);
I
Second row → all next outer-most loops
INRIA Saclay / U. of Delaware
6 / 18
PLDI’08
Generating Program Versions: The Representation
The Polyhedral Model in a Nutshell I
Arbitrarily complex sequence of loop transformations are modeled in a single optimization step: new scheduling matrix
I
Granularity: each executed instance of each statement
Θ:
I
for (j = ...; j < ...; ++j) S2(...,j); for (i = ...; i < ...; ++i) for (j = ...; j < ...; ++j) S1(i,j); S2(i,j);
Minor change → significant impact
INRIA Saclay / U. of Delaware
6 / 18
PLDI’08
Generating Program Versions: The Representation
The Polyhedral Model in a Nutshell I
Arbitrarily complex sequence of loop transformations are modeled in a single optimization step: new scheduling matrix
I
Granularity: each executed instance of each statement
Θ:
~ı ~ı
~p ~p
Transformation
~ı ~p c INRIA Saclay / U. of Delaware
reversal skewing interchange fusion distribution peeling shifting
c
c
for (j = ...; j < ...; ++j) S2(...,j); for (i = ...; i < ...; ++i) for (j = ...; j < ...; ++j) S1(i,j); S2(i,j);
Description Changes the direction in which a loop traverses its iteration range Makes the bounds of a given loop depend on an outer loop counter Exchanges two loops in a perfectly nested loop, a.k.a. permutation Fuses two loops, a.k.a. jamming Splits a single loop nest into many, a.k.a. fission or splitting Extracts one iteration of a given loop Allows to reorder loops 6 / 18
Generating Program Versions: Contributions
PLDI’08
Previous Contributions
Previous work (CGO’07, Part I, One-Dimensional Time): I
Focus on Static Control Parts (SCoP) I
SCoP: Consecutive set of statements with affine control flow
I Complete framework for one-dimensional schedules I Efficient search space construction, efficient traversal I Drawbacks in applicability I Drawbacks in expressiveness We previously solved a simpler problem...
INRIA Saclay / U. of Delaware
7 / 18
Generating Program Versions: Contributions
PLDI’08
New Contributions
Dealing with multidimensional schedules: I
Applicability on any Static Control Parts
I
Increased expressiveness
I
Design of scalable traversal methods I
Dedicated genetic algorithm
I
Dedicated heuristic
INRIA Saclay / U. of Delaware
8 / 18
PLDI’08
Generating Program Versions: Looking Into Details
Deeper In The Method Multidimensional schedules: high expressiveness, complex problem Space construction
Distinct schedules
Set of program versions
Space traversal
- combinatorial expression of legality
- multiple polytopes to traverse
- heuristic needed: greedy selection of dependences + ordering (see Some Efficient Solutions to the Affine Scheduling Problem, Part II: Multidimensional Time, Feautrier, 1992)
- large and expressive spaces (up to 1050)
- Code generation friendly bounds on the schedule coefficients
INRIA Saclay / U. of Delaware
Tested versions
- partial enumeration (mandatory): completion mechanism+ subspace partitioning - shape the space: optimized polytope projection (required) + constrained dynamic scan
9 / 18
PLDI’08
Traversing the Search Space: Extensive Analysis
Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst
1.6
for (i = 0; i < for (j = 0; j tmp[i][j] = for (k = 0; tmp[i][j]
Performance improvement
1.4 1.2 1
M; i++) < M; j++) { 0.0; k < M; k++) += block[i][k] * cos1[j][k];
} for (i = 0; i < M; i++) for (j = 0; j < M; j++) { sum2 = 0.0; for (k = 0; k < M; k++) sum2 += cos1[i][k] * tmp[k][j]; block[i][j] = ROUND(sum2); }
0.8 0.6 0.4 0.2 0 10
20 30 40 50 Point index for the first schedule row
60
I
Extensive study of 8x8 Discrete Cosine Transform (UTDSP)
I
Search space analyzed: 66 × 19683 = 1.29 × 106 different legal program versions
INRIA Saclay / U. of Delaware
10 / 18
PLDI’08
Traversing the Search Space: Extensive Analysis
Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst
1.6
Θ:
Performance improvement
1.4 1.2 1 0.8 0.6 0.4 0.2 0
10
20 30 40 50 Point index for the first schedule row
60
I
Extensive study of 8x8 Discrete Cosine Transform (UTDSP)
I
Search space analyzed: 66 × 19683 = 1.29 × 106 different legal program versions
INRIA Saclay / U. of Delaware
10 / 18
PLDI’08
Traversing the Search Space: Extensive Analysis
Observations on the Performance Distribution Performance distribution - 8x8 DCT
I best I average I worst
Best Average Worst
1.6
Performance improvement
1.4 1.2 1 0.8 0.6 0.4 0.2 0 10
20 30 40 50 Point index for the first schedule row
60
I
Take one specific value for the first row
I
Try the 19863 possible values for the second row
INRIA Saclay / U. of Delaware
10 / 18
PLDI’08
Traversing the Search Space: Extensive Analysis
Observations on the Performance Distribution Performance distribution - 8x8 DCT
Performance improvement
Performance distribution (sorted) - 8x8 DCT
Best Average Worst
1.6
1.6
1.4
1.4
1.2
1.2
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2 0
0 10
20 30 40 50 Point index for the first schedule row
60
0
2000
4000 6000 8000 10000 12000 14000 16000 Point index of the second schedule dimension, first one fixed
I
Take one specific value for the first row
I
Try the 19863 possible values for the second row
I
Very low proportion of best points: < 0.02%
INRIA Saclay / U. of Delaware
18000
10 / 18
PLDI’08
Traversing the Search Space: Extensive Analysis
Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst
1.6
Large performance variation
Performance improvement
1.4 1.2 1 0.8 0.6 0.4 0.2 0 10
I
20 30 40 50 Point index for the first schedule row
60
Performance variation is large for good values of the first row
INRIA Saclay / U. of Delaware
10 / 18
PLDI’08
Traversing the Search Space: Extensive Analysis
Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst
1.6
Small performance variation
Performance improvement
1.4 1.2 1 0.8 0.6 0.4 0.2 0 10
20 30 40 50 Point index for the first schedule row
60
I
Performance variation is large for good values of the first row
I
It is usually reduced for bad values of the first row
INRIA Saclay / U. of Delaware
10 / 18
Traversing the Search Space: Extensive Analysis
PLDI’08
Scanning The Space of Program Versions
The search space: I
Performance variation indicates to partition the space
I
Non-uniform distribution of performance
I
No clear analytical property of the optimization function
→ Build dedicated heuristic and genetic operators aware of these static and dynamic characteristics
INRIA Saclay / U. of Delaware
11 / 18
Traversing the Search Space: Heuristic
PLDI’08
Dedicated Heuristic
I
Multidimensional version of the heuristic presented in Part I
I
Discover 80%+ of the performance improvement in less than 50 runs for small kernels
I
Feedback directed, yet deterministic
I
Leverages our knowledge about performance distribution
I
Relies on the completion algorithm to instantiate the full version
I
But unsatisfactory results for larger programs...
INRIA Saclay / U. of Delaware
12 / 18
PLDI’08
Traversing the Search Space: Genetic Operators
Dedicated GA Operators Mutation I
Performance distribution is not uniform
I
Tailored to focus on the most promising subspaces
I
Preserves legality (closed under affine constraints)
Cross-over I
Row cross-over !
I
I
Column cross-over !
+
!
=
!
+
!
=
!
Both preserve legality
INRIA Saclay / U. of Delaware
13 / 18
PLDI’08
Traversing the Search Space: Genetic Operators
Dedicated GA Results
GA versus Random - 8x8 DCT
Performance distribution (sorted) - 8x8 DCT Random GA
1.6
1.6 1.4 Performance improvement
Performance Improvement
1.4 1.2 1 0.8 0.6
1.2 1 0.8 0.6
0.4
0.4
0.2
0.2
0
0 50
I
100
150
200 250 300 Number of runs
350
400
450
500
0
2000
4000 6000 8000 10000 12000 14000 16000 Point index of the second schedule dimension, first one fixed
18000
GA converges towards the maximal space speedup
INRIA Saclay / U. of Delaware
14 / 18
PLDI’08
Traversing the Search Space: Experimental Results
Experimental Results [1/3] Performance improvement for AMD Athlon64 Heuristic GA
Performance improvement
1.7 1.6 1.5 1.4 1.3 1.2 1.1 1
er
av e ag
rm
r da p m dc
ra
lu
c
t
ul m at ir sf
tn
lp
la
m
lm
fir
iir ge ed t
dc
baseline: gcc -O3 -ftree-vectorize -msse2 INRIA Saclay / U. of Delaware
15 / 18
PLDI’08
Traversing the Search Space: Experimental Results
Experimental Results [2/3] Performance improvement for ST231
Performance improvement
1.35
Heuristic GA
1.3 1.25 1.2 1.15 1.1 1.05 1
er
av e ag
rm
r da ra p m dc
lu
c
t
ul m at ir sf
tn
lp
la
m
lm
fir
iir ge ed t
dc
baseline: st200cc -O3 -OPT:alias=restrict -mauto-prefetch INRIA Saclay / U. of Delaware
16 / 18
Traversing the Search Space: Experimental Results
PLDI’08
Experimental Results [3/3]
Looking into details (hardware counters+compilation trace): I
Better activity of the processing units
I
Best version may vary significantly for different architectures
I
Different source code may trigger different compiler optimizations
→ Our method is a portable optimization process
INRIA Saclay / U. of Delaware
17 / 18
PLDI’08
Conclusion:
Conclusion I
Scalable algorithms (GA and heuristic) to traverse the space, with dedicated pruning and search strategies
I
Part I + Part II: applicability observed on various compilers (GCC, ICC, Open64) and architectures (x86_32, x86_64, MIPS32, ST231 VLIW)
I
Tunable framework: open to other search space construction strategies
I
Take-home message: I I
INRIA Saclay / U. of Delaware
All-in-one: legality, uniqueness, expressiveness Generic and portable approach for high-level transformation selection
18 / 18
PLDI’08
Conclusion:
Tunuing: Distribute and Tile
I
Focus on fuse/distribute legality affine constraints (presented algorithm with additional constraints)
I
Use PLuTo as a tiling / parallel backend
I
Driven by program versions
I
Excellent performance gains (research report coming soon...)
INRIA Saclay / U. of Delaware
19 / 18