pldi slides - PDF Free Download

Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time Louis-Noël Pouchet1 Cédric Bastoul1 Albert Cohen1 John Cavazos2

2

1 ALCHEMY group, INRIA Saclay / University of Paris-Sud 11, France Dept. of Computer & Information Sciences, University of Delaware, USA

June 9, 2008

ACM SIGPLAN 2008 Conference on Programming Languages Design and Implementation Tucson, Arizona

PLDI’08

Introduction: Situation

Motivation

I

New architecture → New high-performance libraries needed

I

New architecture → New optimization flow needed

I

Architecture complexity/diversity increases faster than optimization progress

I

Traditional approaches lose performance portability. . .

We want a portable optimization process!

INRIA Saclay / U. of Delaware

2 / 18

PLDI’08

Introduction: The Problem

The Optimization Problem Architectural characteristics

Compiler optimization interaction

Domain knowledge

ALU, SIMD, Caches, ...

GCC has 205 passes...

Linear algebra, FFT, ...

Optimizing compilation process

Code for architecture 1

INRIA Saclay / U. of Delaware

Code for architecture 2

.........

Code for architecture N

3 / 18

PLDI’08

Introduction: The Problem

The Optimization Problem Architectural characteristics

Compiler optimization interaction

Domain knowledge

ALU, SIMD, Caches, ...

GCC has 205 passes...

Linear algebra, FFT, ...

Optimizing compilation process

Code for architecture 1

INRIA Saclay / U. of Delaware

Code for architecture 2

locality improvement, = vectorization, parallelization, etc...

.........

Code for architecture N

3 / 18

PLDI’08

Introduction: The Problem

The Optimization Problem Architectural characteristics

Compiler optimization interaction

Domain knowledge

ALU, SIMD, Caches, ...

GCC has 205 passes...

Linear algebra, FFT, ...

Optimizing compilation process

Code for architecture 1

INRIA Saclay / U. of Delaware

Code for architecture 2

parameter tuning, = phase ordering, etc...

.........

Code for architecture N

3 / 18

PLDI’08

Introduction: The Problem

The Optimization Problem Architectural characteristics

Compiler optimization interaction

Domain knowledge

ALU, SIMD, Caches, ...

GCC has 205 passes...

Linear algebra, FFT, ...

Optimizing compilation process

Code for architecture 1

INRIA Saclay / U. of Delaware

Code for architecture 2

pattern recognition, = hand-tuned kernel codes, etc...

.........

Code for architecture N

3 / 18

PLDI’08

Introduction: The Problem

The Optimization Problem Architectural characteristics

Compiler optimization interaction

Domain knowledge

ALU, SIMD, Caches, ...

GCC has 205 passes...

Linear algebra, FFT, ...

Optimizing compilation process

Code for architecture 1

INRIA Saclay / U. of Delaware

Code for architecture 2

= Auto-tuning libraries

.........

Code for architecture N

3 / 18

PLDI’08

Introduction: The Problem

The Optimization Problem Architectural characteristics

Compiler optimization interaction

Domain knowledge

ALU, SIMD, Caches, ...

GCC has 205 passes...

Linear algebra, FFT, ...

In reality, there is a complex interplay between all components

Our approach: build an expressive set of program versions

Optimizing compilation process

Code for architecture 1

INRIA Saclay / U. of Delaware

Code for architecture 2

.........

Code for architecture N

3 / 18

PLDI’08

Generating Program Versions: Overview

Iterative Optimization Flow High-level transformations Input code

Optimization 1

Optimization 2

Target code

INRIA Saclay / U. of Delaware

.........

Optimization N

Compiler

4 / 18

PLDI’08

Generating Program Versions: Overview

Iterative Optimization Flow

Input code

Set of program versions

Target code

Compiler

Program version = result of a sequence of loop transformation INRIA Saclay / U. of Delaware

4 / 18

PLDI’08

Generating Program Versions: Overview

Iterative Optimization Flow

Input code

Final code

Run

Set of program versions

Space explorer

Target code

Compiler

Program version = result of a sequence of loop transformation INRIA Saclay / U. of Delaware

4 / 18

Generating Program Versions: Properties

PLDI’08

Set of Program Versions

What matters is the result of the application of optimizations, not the optimization sequence

All-in-one approach: I

Legality: semantics is always preserved

I

Uniqueness: all versions of the set are distinct

I

Expressiveness: a version is the result of an arbitrarily complex sequence of loop transformation

INRIA Saclay / U. of Delaware

5 / 18

PLDI’08

Generating Program Versions: The Representation

The Polyhedral Model in a Nutshell I

Arbitrarily complex sequence of loop transformations are modeled in a single optimization step: new scheduling matrix

I

Granularity: each executed instance of each statement

for (i = ...; i < ...; ++i)





 Θ:

 

S1(i); for (i = ...; i < ...; ++i) S2(i);

I

First row → all outer-most loops

INRIA Saclay / U. of Delaware

6 / 18

PLDI’08

Generating Program Versions: The Representation

The Polyhedral Model in a Nutshell I

Arbitrarily complex sequence of loop transformations are modeled in a single optimization step: new scheduling matrix

I

Granularity: each executed instance of each statement





 Θ:

 

for (i = ...; i < ...; ++i) for (j = ...; j < ...; ++j) S1(i,j); for (i = ...; i < ...; ++i) for (j = ...; j < ...; ++j) S2(i,j);

I

Second row → all next outer-most loops

INRIA Saclay / U. of Delaware

6 / 18

PLDI’08

Generating Program Versions: The Representation

The Polyhedral Model in a Nutshell I

Arbitrarily complex sequence of loop transformations are modeled in a single optimization step: new scheduling matrix

I

Granularity: each executed instance of each statement





 Θ:

 

I

for (j = ...; j < ...; ++j) S2(...,j); for (i = ...; i < ...; ++i) for (j = ...; j < ...; ++j) S1(i,j); S2(i,j);

Minor change → significant impact

INRIA Saclay / U. of Delaware

6 / 18

PLDI’08

Generating Program Versions: The Representation

The Polyhedral Model in a Nutshell I

Arbitrarily complex sequence of loop transformations are modeled in a single optimization step: new scheduling matrix

I

Granularity: each executed instance of each statement

  Θ:

~ı ~ı

~p ~p

Transformation

~ı ~p c INRIA Saclay / U. of Delaware

reversal skewing interchange fusion distribution peeling shifting

c

  

c

for (j = ...; j < ...; ++j) S2(...,j); for (i = ...; i < ...; ++i) for (j = ...; j < ...; ++j) S1(i,j); S2(i,j);

Description Changes the direction in which a loop traverses its iteration range Makes the bounds of a given loop depend on an outer loop counter Exchanges two loops in a perfectly nested loop, a.k.a. permutation Fuses two loops, a.k.a. jamming Splits a single loop nest into many, a.k.a. fission or splitting Extracts one iteration of a given loop Allows to reorder loops 6 / 18

Generating Program Versions: Contributions

PLDI’08

Previous Contributions

Previous work (CGO’07, Part I, One-Dimensional Time): I

Focus on Static Control Parts (SCoP) I

SCoP: Consecutive set of statements with affine control flow

I Complete framework for one-dimensional schedules I Efficient search space construction, efficient traversal I Drawbacks in applicability I Drawbacks in expressiveness We previously solved a simpler problem...

INRIA Saclay / U. of Delaware

7 / 18

Generating Program Versions: Contributions

PLDI’08

New Contributions

Dealing with multidimensional schedules: I

Applicability on any Static Control Parts

I

Increased expressiveness

I

Design of scalable traversal methods I

Dedicated genetic algorithm

I

Dedicated heuristic

INRIA Saclay / U. of Delaware

8 / 18

PLDI’08

Generating Program Versions: Looking Into Details

Deeper In The Method Multidimensional schedules: high expressiveness, complex problem Space construction

Distinct schedules

Set of program versions

Space traversal

- combinatorial expression of legality

- multiple polytopes to traverse

- heuristic needed: greedy selection of dependences + ordering (see Some Efficient Solutions to the Affine Scheduling Problem, Part II: Multidimensional Time, Feautrier, 1992)

- large and expressive spaces (up to 1050)

- Code generation friendly bounds on the schedule coefficients

INRIA Saclay / U. of Delaware

Tested versions

- partial enumeration (mandatory): completion mechanism+ subspace partitioning - shape the space: optimized polytope projection (required) + constrained dynamic scan

9 / 18

PLDI’08

Traversing the Search Space: Extensive Analysis

Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst

1.6

for (i = 0; i < for (j = 0; j tmp[i][j] = for (k = 0; tmp[i][j]

Performance improvement

1.4 1.2 1

M; i++) < M; j++) { 0.0; k < M; k++) += block[i][k] * cos1[j][k];

} for (i = 0; i < M; i++) for (j = 0; j < M; j++) { sum2 = 0.0; for (k = 0; k < M; k++) sum2 += cos1[i][k] * tmp[k][j]; block[i][j] = ROUND(sum2); }

0.8 0.6 0.4 0.2 0 10

20 30 40 50 Point index for the first schedule row

60

I

Extensive study of 8x8 Discrete Cosine Transform (UTDSP)

I

Search space analyzed: 66 × 19683 = 1.29 × 106 different legal program versions

INRIA Saclay / U. of Delaware

10 / 18

PLDI’08

Traversing the Search Space: Extensive Analysis

Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst

1.6





  Θ:  

    

Performance improvement

1.4 1.2 1 0.8 0.6 0.4 0.2 0

10

20 30 40 50 Point index for the first schedule row

60

I

Extensive study of 8x8 Discrete Cosine Transform (UTDSP)

I

Search space analyzed: 66 × 19683 = 1.29 × 106 different legal program versions

INRIA Saclay / U. of Delaware

10 / 18

PLDI’08

Traversing the Search Space: Extensive Analysis

Observations on the Performance Distribution Performance distribution - 8x8 DCT

I best I average I worst

Best Average Worst

1.6

Performance improvement

1.4 1.2 1 0.8 0.6 0.4 0.2 0 10

20 30 40 50 Point index for the first schedule row

60

I

Take one specific value for the first row

I

Try the 19863 possible values for the second row

INRIA Saclay / U. of Delaware

10 / 18

PLDI’08

Traversing the Search Space: Extensive Analysis

Observations on the Performance Distribution Performance distribution - 8x8 DCT

Performance improvement

Performance distribution (sorted) - 8x8 DCT

Best Average Worst

1.6

1.6

1.4

1.4

1.2

1.2

1

1

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2 0

0 10

20 30 40 50 Point index for the first schedule row

60

0

2000

4000 6000 8000 10000 12000 14000 16000 Point index of the second schedule dimension, first one fixed

I

Take one specific value for the first row

I

Try the 19863 possible values for the second row

I

Very low proportion of best points: < 0.02%

INRIA Saclay / U. of Delaware

18000

10 / 18

PLDI’08

Traversing the Search Space: Extensive Analysis

Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst

1.6

Large performance variation

Performance improvement

1.4 1.2 1 0.8 0.6 0.4 0.2 0 10

I

20 30 40 50 Point index for the first schedule row

60

Performance variation is large for good values of the first row

INRIA Saclay / U. of Delaware

10 / 18

PLDI’08

Traversing the Search Space: Extensive Analysis

Observations on the Performance Distribution Performance distribution - 8x8 DCT Best Average Worst

1.6

Small performance variation

Performance improvement

1.4 1.2 1 0.8 0.6 0.4 0.2 0 10

20 30 40 50 Point index for the first schedule row

60

I

Performance variation is large for good values of the first row

I

It is usually reduced for bad values of the first row

INRIA Saclay / U. of Delaware

10 / 18

Traversing the Search Space: Extensive Analysis

PLDI’08

Scanning The Space of Program Versions

The search space: I

Performance variation indicates to partition the space

I

Non-uniform distribution of performance

I

No clear analytical property of the optimization function

→ Build dedicated heuristic and genetic operators aware of these static and dynamic characteristics

INRIA Saclay / U. of Delaware

11 / 18

Traversing the Search Space: Heuristic

PLDI’08

Dedicated Heuristic

I

Multidimensional version of the heuristic presented in Part I

I

Discover 80%+ of the performance improvement in less than 50 runs for small kernels

I

Feedback directed, yet deterministic

I

Leverages our knowledge about performance distribution

I

Relies on the completion algorithm to instantiate the full version

I

But unsatisfactory results for larger programs...

INRIA Saclay / U. of Delaware

12 / 18

PLDI’08

Traversing the Search Space: Genetic Operators

Dedicated GA Operators Mutation I

Performance distribution is not uniform

I

Tailored to focus on the most promising subspaces

I

Preserves legality (closed under affine constraints)

Cross-over I

Row cross-over !

I

I

Column cross-over !

+

!

=

!

+

!

=

!

Both preserve legality

INRIA Saclay / U. of Delaware

13 / 18

PLDI’08

Traversing the Search Space: Genetic Operators

Dedicated GA Results

GA versus Random - 8x8 DCT

Performance distribution (sorted) - 8x8 DCT Random GA

1.6

1.6 1.4 Performance improvement

Performance Improvement

1.4 1.2 1 0.8 0.6

1.2 1 0.8 0.6

0.4

0.4

0.2

0.2

0

0 50

I

100

150

200 250 300 Number of runs

350

400

450

500

0

2000

4000 6000 8000 10000 12000 14000 16000 Point index of the second schedule dimension, first one fixed

18000

GA converges towards the maximal space speedup

INRIA Saclay / U. of Delaware

14 / 18

PLDI’08

Traversing the Search Space: Experimental Results

Experimental Results [1/3] Performance improvement for AMD Athlon64 Heuristic GA

Performance improvement

1.7 1.6 1.5 1.4 1.3 1.2 1.1 1

er

av e ag

rm

r da p m dc

ra

lu

c

t

ul m at ir sf

tn

lp

la

m

lm

fir

iir ge ed t

dc

baseline: gcc -O3 -ftree-vectorize -msse2 INRIA Saclay / U. of Delaware

15 / 18

PLDI’08

Traversing the Search Space: Experimental Results

Experimental Results [2/3] Performance improvement for ST231

Performance improvement

1.35

Heuristic GA

1.3 1.25 1.2 1.15 1.1 1.05 1

er

av e ag

rm

r da ra p m dc

lu

c

t

ul m at ir sf

tn

lp

la

m

lm

fir

iir ge ed t

dc

baseline: st200cc -O3 -OPT:alias=restrict -mauto-prefetch INRIA Saclay / U. of Delaware

16 / 18

Traversing the Search Space: Experimental Results

PLDI’08

Experimental Results [3/3]

Looking into details (hardware counters+compilation trace): I

Better activity of the processing units

I

Best version may vary significantly for different architectures

I

Different source code may trigger different compiler optimizations

→ Our method is a portable optimization process

INRIA Saclay / U. of Delaware

17 / 18

PLDI’08

Conclusion:

Conclusion I

Scalable algorithms (GA and heuristic) to traverse the space, with dedicated pruning and search strategies

I

Part I + Part II: applicability observed on various compilers (GCC, ICC, Open64) and architectures (x86_32, x86_64, MIPS32, ST231 VLIW)

I

Tunable framework: open to other search space construction strategies

I

Take-home message: I I

INRIA Saclay / U. of Delaware

All-in-one: legality, uniqueness, expressiveness Generic and portable approach for high-level transformation selection

18 / 18

PLDI’08

Conclusion:

Tunuing: Distribute and Tile

I

Focus on fuse/distribute legality affine constraints (presented algorithm with additional constraints)

I

Use PLuTo as a tiling / parallel backend

I

Driven by program versions

I

Excellent performance gains (research report coming soon...)

INRIA Saclay / U. of Delaware

19 / 18