Dense Linear Algebra: History and Structure, Parallel Matrix Multiplication



Slides and text of this presentation


Slide 1

CS 267
Dense Linear Algebra:
History and Structure,
Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr11

Slide 2

Outline
History and motivation
Structure of the Dense Linear Algebra motif
Parallel Matrix-matrix multiplication
Parallel Gaussian Elimination (next time)

Slide 3

Outline
History and motivation
Structure of the Dense Linear Algebra motif
Parallel Matrix-matrix multiplication
Parallel Gaussian Elimination (next time)

Slide 4

Motifs

Slide 5

What is dense linear algebra?
Not just matmul!
Linear Systems:  Ax=b
Least Squares: choose x to minimize ||Ax-b||₂
Overdetermined or underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax = λx), Generalized (Ax=λBx)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Different matrix structures
Real, complex; Symmetric, Hermitian, positive definite; dense, triangular, banded …
Level of detail
Simple Driver
Expert Drivers with error bounds,  extra-precision, other options
Lower level routines (“apply certain kind of orthogonal transformation”, matmul…)

Slide 6

A brief history of (Dense) Linear Algebra software (1/7)
Libraries like EISPACK (for eigenvalue problems)
Then the BLAS (1) were invented (1973-1977)
Standard library of 15 operations (mostly) on vectors
“AXPY”  ( y = α·x + y ), dot product, scale (x = α·x ), etc
Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
Goals
Common “pattern” to ease programming, readability, self-documentation
Robustness, via careful coding (avoiding over/underflow)
Portability + Efficiency via machine specific implementations
Why BLAS 1?  They do O(n¹) ops on O(n¹) data
Used in libraries like LINPACK (for linear systems)
Source of the name “LINPACK Benchmark” (not the code!)
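As an illustrative sketch of a BLAS-1 operation (NumPy stands in here for the real Fortran routines), AXPY looks like this:

```python
import numpy as np

def axpy(alpha, x, y):
    """BLAS-1 style AXPY: y = alpha*x + y, updated in place.
    2n flops against 3n reads/writes, so computational intensity is 2/3."""
    y += alpha * x
    return y

print(axpy(3.0, np.ones(4), np.full(4, 2.0)))  # → [5. 5. 5. 5.]
```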

Slide 7

(slide image only; no extractable text)
Slide 8

A brief history of (Dense) Linear Algebra software (2/7)
But the BLAS-1 weren’t enough
Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes 
Computational intensity = (2n)/(3n) = 2/3
Too low to run near peak speed (read/write dominates)
Hard to vectorize (“SIMD’ize”) on supercomputers of the day (1980s)
So the BLAS-2 were invented (1984-1986)
Standard library of 25 operations (mostly) on matrix/vector pairs
“GEMV”: y = α·A·x + β·y, “GER”: A = A + α·x·yᵀ,  x = T⁻¹·x
Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
Why BLAS 2?  They do O(n²) ops on O(n²) data
So computational intensity still just ~(2n²)/(n²) = 2
OK for vector machines, but not for machines with caches
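For illustration, a GEMV-style operation can be sketched in Python (NumPy as a stand-in; the real BLAS-2 routines are Fortran/C with many more options):

```python
import numpy as np

def gemv(alpha, A, x, beta, y):
    """BLAS-2 style GEMV: y = alpha*A@x + beta*y.
    O(n^2) flops on O(n^2) data, so intensity stays around 2."""
    return alpha * (A @ x) + beta * y

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(gemv(1.0, A, np.array([1.0, 1.0]), 0.0, np.zeros(2)))  # → [3. 7.]
```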

Slide 9

A brief history of (Dense) Linear Algebra software (3/7)
The next step: BLAS-3 (1987-1988)
Standard library of 9 operations (mostly) on matrix/matrix pairs
“GEMM”: C = α·A·B + β·C, C = α·A·Aᵀ + β·C,  B = T⁻¹·B
Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
Why BLAS 3?  They do O(n³) ops on O(n²) data
So computational intensity (2n³)/(4n²) = n/2 – big at last!
Good for machines with caches, other mem. hierarchy levels
How much BLAS1/2/3 code so far  (all at www.netlib.org/blas)
Source: 142 routines, 31K LOC,    Testing:  28K LOC
Reference  (unoptimized) implementation only 
Ex: 3 nested loops for GEMM
Lots more optimized code (eg Homework 1)
Motivates “automatic tuning” of the BLAS
Part  of standard math libraries (eg AMD AMCL, Intel MKL)
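The "3 nested loops for GEMM" reference implementation mentioned above can be sketched as follows (pure Python for clarity; the netlib reference is Fortran):

```python
def gemm_reference(alpha, A, B, beta, C):
    """Reference (unoptimized) GEMM: C = alpha*A*B + beta*C,
    written as three nested loops over square matrices
    stored as lists of lists."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

An optimized BLAS reorders and blocks these loops for the memory hierarchy, which is exactly what the automatic-tuning work targets.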

Slide 10

(slide image only; no extractable text)

Slide 11

A brief history of (Dense) Linear Algebra software (4/7)
LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)
Ex: Obvious way to express Gaussian Elimination  (GE) is adding multiples of one row to other rows – BLAS-1
How do we reorganize GE to use BLAS-3 ? (details later)
Contents of LAPACK (summary)
Algorithms we can turn into (nearly) 100% BLAS 3
Linear Systems: solve Ax=b for x
Least Squares: choose x to minimize ||Ax-b||₂
Algorithms that are only 50% BLAS 3
Eigenproblems: Find λ and x where Ax = λx
Singular Value Decomposition (SVD)
Generalized problems (eg Ax = λBx)
Error bounds for everything
Lots of variants depending on A’s structure (banded, A=Aᵀ, etc)
How much code?  (Release 3.3, Nov 2010) (www.netlib.org/lapack)
Source: 1586 routines, 500K LOC,  Testing: 363K LOC
Ongoing development (at UCB and elsewhere) (class projects!)

Slide 12

A brief history of (Dense) Linear Algebra software (5/7)
Is LAPACK parallel?
Only if the BLAS are parallel (possible in shared memory)
ScaLAPACK – “Scalable LAPACK” (1995 – now)
For distributed memory – uses MPI
More complex data structures, algorithms than LAPACK
Only (small) subset of LAPACK’s functionality available
Details later (class projects!) 
All at www.netlib.org/scalapack

Slide 13

Success Stories for Sca/LAPACK (6/7)
Widely used
Adopted by Mathworks, Cray, Fujitsu, HP, IBM, IMSL, Intel, NAG, NEC, SGI, …
5.5M webhits/year @ Netlib (incl. CLAPACK, LAPACK95)
New Science discovered through the solution of dense matrix systems
Nature article on the flat universe used ScaLAPACK
Other articles in Physical Review B that also use it
1998 Gordon Bell Prize
www.nersc.gov/news/reports/newNERSCresults050703.pdf

Slide 14

Back to basics: 
Why avoiding communication is important (1/2)
Algorithms have two costs:
Arithmetic (FLOPS)
Communication: moving data between 
levels of a memory hierarchy (sequential case) 
processors over a network (parallel case).

Slide 15

Why avoiding communication is important (2/2)
Running time of an algorithm is sum of 3 terms:
# flops * time_per_flop
# words moved / bandwidth
# messages * latency
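The three-term model above can be written down directly (the function and parameter names here are illustrative, not from any library):

```python
def run_time(n_flops, n_words, n_msgs, time_per_flop, time_per_word, latency):
    """Three-term running-time model from the slide:
    time = #flops * time_per_flop
         + #words moved / bandwidth   (time_per_word = 1/bandwidth)
         + #messages * latency."""
    return n_flops * time_per_flop + n_words * time_per_word + n_msgs * latency
```

On real machines latency and inverse bandwidth dwarf the time per flop, which is why minimizing communication, not arithmetic, dominates performance.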

Slide 16

Review: Naïve Sequential MatMul:  C = C + A*B
for i = 1 to n
  {read row i of A into fast memory, n² reads}
   for j = 1 to n
       {read C(i,j) into fast memory, n² reads}
       {read column j of B into fast memory, n³ reads}
       for k = 1 to n
           C(i,j) = C(i,j) + A(i,k) * B(k,j)
       {write C(i,j) back to slow memory, n² writes}
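Adding up the slow-memory traffic annotated in the pseudocode above (a hypothetical helper, just to make the count concrete):

```python
def naive_matmul_words_moved(n):
    """Words moved between slow and fast memory by the naive algorithm:
    n^2 reads of A's rows + n^2 reads of C + n^3 reads of B's columns
    + n^2 writes of C. The n^3 term for B dominates."""
    return n * n + n * n + n**3 + n * n

# For n = 1000 that is about 10^9 words, essentially one slow-memory
# access per flop - the motivation for blocking.
```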

Slide 17

Less Communication with Blocked Matrix Multiply
Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size
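A minimal sketch of blocked matmul, assuming square n×n matrices stored as lists of lists and a block size b chosen so that three b×b blocks fit in fast memory:

```python
def blocked_matmul(A, B, C, b):
    """Blocked matmul C = C + A*B: the three innermost loops touch only
    one b-by-b subblock of each matrix, cutting slow-memory words moved
    from ~n^3 to O(n^3 / b)."""
    n = len(A)
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # multiply-accumulate one pair of subblocks
                for i in range(i0, min(i0 + b, n)):
                    for j in range(j0, min(j0 + b, n)):
                        for k in range(k0, min(k0 + b, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Picking b close to sqrt(M/3) for a fast memory of M words is what ties the block size to the cache size mentioned above.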

Slide 18

Blocked vs Cache-Oblivious Algorithms
Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

Slide 19

Communication Lower Bounds:    Prior Work on Matmul
Assume n³ algorithm (i.e. not Strassen-like)
Sequential case, with fast memory of size M
Lower bound on #words moved to/from slow memory = Ω(n³ / M^(1/2))    [Hong, Kung, 81]
Attained using blocked or cache-oblivious algorithms
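Numerically, the bound above says blocking can reduce traffic by at most a factor of ~√M; a small illustrative check (constant factors dropped):

```python
import math

def matmul_words_lower_bound(n, M):
    """[Hong, Kung, 81] bound, constants dropped: any n^3-style matmul
    with a fast memory of M words must move Omega(n^3 / sqrt(M)) words
    between slow and fast memory."""
    return n**3 / math.sqrt(M)

# The naive algorithm moves ~n^3 words; with M = 10^6 words of fast
# memory, the bound permits roughly a 1000x reduction, which blocked
# and cache-oblivious algorithms attain.
```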

Slide 20

New lower bound for all “direct” linear algebra
Holds for
BLAS, LU, QR, eig, SVD, tensor contractions, …
Some whole programs (sequences of these operations, no matter how they are interleaved, eg computing Aᵏ)
Dense and sparse matrices (where #flops << n³)
Sequential and parallel algorithms
Some graph-theoretic algorithms (eg Floyd-Warshall)

Slide 21

Can we attain these lower bounds?
Do conventional dense algorithms as implemented in  LAPACK and ScaLAPACK attain these bounds?
Mostly not 
If not, are there other algorithms that do?
Yes
Goals for algorithms:
Minimize #words = Ω(#flops / M^(1/2))
Minimize #messages = Ω(#flops / M^(3/2))
Need new data structures
Minimize for multiple memory hierarchy levels
Cache-oblivious algorithms would be simplest
Fewest flops when matrix fits in fastest memory
Cache-oblivious algorithms don’t  always attain  this
Attainable for nearly all dense linear algebra
Just a few prototype implementations so far (class projects!)
Only a few sparse algorithms so far (eg Cholesky)

Slide 22

A brief future look at (Dense) Linear Algebra software (7/7)
PLASMA and MAGMA (now)
Planned extensions to Multicore/GPU/Heterogeneous
Can one software infrastructure accommodate all    algorithms and platforms of current (future) interest?
How much code generation and tuning can we automate?
Details later (Class projects!)
Other related projects
BLAST Forum (www.netlib.org/blas/blast-forum)
Attempt to extend BLAS to other languages, add some new functions, sparse matrices, extra-precision, interval arithmetic
Only partly successful (extra-precise BLAS used in latest LAPACK)
FLAME (www.cs.utexas.edu/users/flame/)
Formal Linear Algebra Method Environment
Attempt to automate code generation across multiple platforms

Slide 23

Outline
History and motivation
Structure of the Dense Linear Algebra motif
Parallel Matrix-matrix multiplication
Parallel Gaussian Elimination (next time)

Slide 24

What could go into the linear algebra motif(s)?

Slide 25

For all linear algebra problems:
Ex: LAPACK Table of Contents
Linear Systems
Least Squares
Overdetermined, underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax = λx), Generalized (Ax=λBx)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Level of detail
Simple Driver
Expert Drivers with error bounds,  extra-precision, other options
Lower level routines (“apply certain kind of orthogonal transformation”)

Slide 26

For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general 
GG – general , pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP –  positive definite, packed
PT –  positive definite, tridiagonal

Slide 27

(content identical to Slide 26)

Slide 28

(content identical to Slide 26)

Slide 29

(content identical to Slide 26)

Slide 30

(content identical to Slide 26)

Слайд 31





For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general 
GG – general, pair 
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP –  positive definite, packed
PT –  positive definite, tridiagonal
Описание слайда:
For all matrix/problem structures: Ex: LAPACK Table of Contents BD – bidiagonal GB – general banded GE – general GG – general, pair GT – tridiagonal HB – Hermitian banded HE – Hermitian HG – upper Hessenberg, pair HP – Hermitian, packed HS – upper Hessenberg OR – (real) orthogonal OP – (real) orthogonal, packed PB – positive definite, banded PO – positive definite PP – positive definite, packed PT – positive definite, tridiagonal

Слайд 32





For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general 
GG – general , pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP –  positive definite, packed
PT –  positive definite, tridiagonal
Описание слайда:
For all matrix/problem structures: Ex: LAPACK Table of Contents BD – bidiagonal GB – general banded GE – general GG – general , pair GT – tridiagonal HB – Hermitian banded HE – Hermitian HG – upper Hessenberg, pair HP – Hermitian, packed HS – upper Hessenberg OR – (real) orthogonal OP – (real) orthogonal, packed PB – positive definite, banded PO – positive definite PP – positive definite, packed PT – positive definite, tridiagonal

Слайд 33





For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general 
GG – general, pair 
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP –  positive definite, packed
PT –  positive definite, tridiagonal
Описание слайда:
For all matrix/problem structures: Ex: LAPACK Table of Contents BD – bidiagonal GB – general banded GE – general GG – general, pair GT – tridiagonal HB – Hermitian banded HE – Hermitian HG – upper Hessenberg, pair HP – Hermitian, packed HS – upper Hessenberg OR – (real) orthogonal OP – (real) orthogonal, packed PB – positive definite, banded PO – positive definite PP – positive definite, packed PT – positive definite, tridiagonal

Слайд 34





Organizing Linear Algebra – in books
Slide 35





Outline
History and motivation
Structure of the Dense Linear Algebra motif
Parallel Matrix-matrix multiplication
Parallel Gaussian Elimination (next time)
Slide 36





Different Parallel Data Layouts for Matrices (not all!)
Slide 37





Parallel Matrix-Vector Product
Compute y = y + A*x, where A is a dense matrix
Layout:
1D row blocked
A(i) refers to the n/p by n block row that processor i owns,
x(i) and y(i) similarly refer to the segments of x and y owned by processor i
Algorithm:
Foreach processor i
   Broadcast x(i)
   Compute y(i) = A(i)*x
Algorithm uses the formula
y(i) = y(i) + A(i)*x = y(i) + Σj A(i,j)*x(j)
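The row-blocked algorithm can be checked serially; below is a minimal Python sketch (the function name and tiny test values are mine, not from the slides) in which the broadcast is modeled by every "processor" simply reading the full vector x.

```python
# Serial simulation of the 1D row-blocked matrix-vector product:
# "processor" i owns the n/p-by-n block row A(i) and the segments
# x(i), y(i); after the broadcast every processor holds all of x.

def matvec_1d_row_blocked(A, x, y, p):
    """Compute y = y + A*x with A split into p block rows."""
    n = len(A)
    assert n % p == 0
    blk = n // p
    x_full = x[:]  # models the broadcast of all the x(i) segments
    for i in range(p):  # each processor computes y(i) = y(i) + A(i)*x
        for r in range(i * blk, (i + 1) * blk):
            y[r] += sum(A[r][c] * x_full[c] for c in range(n))
    return y
```

For example, with A = [[1, 2], [3, 4]], x = [1, 1], y = [0, 0] and p = 2 this returns [3, 7].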
Slide 38





Matrix-Vector Product y = y + A*x
A column layout of the matrix eliminates the broadcast of x
But adds a reduction to update the destination y
A 2D blocked layout uses a broadcast and reduction, both on a subset of processors
sqrt(p) for square processor grid
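For contrast with the row layout, here is a serial sketch of the column-layout variant described above (function name mine): each "processor" forms a full-length partial product from its block column, so no broadcast of x is needed, and the reduction assembles y.

```python
# 1D column-blocked matrix-vector product: "processor" i owns block
# column A(:, i) and segment x(i); no broadcast of x is needed, but a
# reduction over the partial products is required to update y.

def matvec_1d_col_blocked(A, x, y, p):
    n = len(A)
    blk = n // p
    partials = []
    for i in range(p):  # local work: a full-length partial product
        part = [0.0] * n
        for c in range(i * blk, (i + 1) * blk):
            for r in range(n):
                part[r] += A[r][c] * x[c]
        partials.append(part)
    for r in range(n):  # the reduction that updates y
        y[r] += sum(part[r] for part in partials)
    return y
```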
Slide 39





Parallel Matrix Multiply
Computing C=C+A*B
Using basic algorithm: 2*n^3 flops
Variables are:
Data layout
Topology of machine 
Scheduling communication
Use of performance models for algorithm design
Message Time = “latency” + #words * time-per-word
                   = α + n*β
Efficiency (in any model):
serial time / (p * parallel time)
perfect (linear) speedup ⇔ efficiency = 1
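The cost model on this slide is easy to make executable; a small sketch (parameter names mine, α and β as on the slide):

```python
# The α-β performance model: time for one message of n words, and the
# efficiency definition used throughout these slides.

def message_time(alpha, beta, n_words):
    """Message time = "latency" + #words * time-per-word = alpha + n*beta."""
    return alpha + n_words * beta

def efficiency(serial_time, p, parallel_time):
    """serial time / (p * parallel time); 1.0 is perfect (linear) speedup."""
    return serial_time / (p * parallel_time)
```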
Slide 40





Matrix Multiply with 1D Column Layout
Assume matrices are n x n and n is divisible by p
A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
B(i,j) is the n/p by n/p subblock of B(i)
in rows j*n/p through (j+1)*n/p - 1
Algorithm uses the formula
C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)
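The block formula can be verified serially on a tiny case; the helper names below are mine, not from the slides.

```python
# Serial check of C(i) = C(i) + Sum_j A(j)*B(j,i) for the 1D column layout.

def block_col(M, i, blk):
    """Block column M(i): columns i*blk .. (i+1)*blk - 1."""
    return [row[i * blk:(i + 1) * blk] for row in M]

def matmul(X, Y):
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]

def c_block_via_formula(A, B, i, p):
    """C(i) = Sum_j A(j)*B(j,i), with B(j,i) = rows j*blk .. of block column B(i)."""
    n = len(A)
    blk = n // p
    Bi = block_col(B, i, blk)
    total = [[0.0] * blk for _ in range(n)]
    for j in range(p):
        prod = matmul(block_col(A, j, blk), Bi[j * blk:(j + 1) * blk])
        for r in range(n):
            for c in range(blk):
                total[r][c] += prod[r][c]
    return total
```

For A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]] and p = 2, block 0 of C comes out [[19], [43]], matching column 0 of A*B.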
Slide 41





Matrix Multiply: 1D Layout on Bus or Ring
Algorithm uses the formula
C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)
First consider a bus-connected machine without broadcast:  only one pair of processors can communicate at a time (ethernet)
Second consider a machine with processors on a ring: all processors may communicate with nearest neighbors simultaneously
Slide 42





MatMul: 1D layout on Bus without Broadcast
Naïve algorithm:
    C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
     for i = 0 to p-1
        for j = 0 to p-1 except i 
            if (myproc == i) send A(i) to processor j
            if (myproc == j) 
                 receive A(i) from processor i
                 C(myproc) = C(myproc) + A(i)*B(i,myproc)
            barrier

Cost of inner loop:
       computation: 2*n*(n/p)^2 = 2*n^3/p^2
       communication: α + β*n^2/p
Slide 43





Naïve MatMul (continued)
Cost of inner loop:
       computation: 2*n*(n/p)^2 = 2*n^3/p^2
       communication: α + β*n^2/p        … approximately

Only 1 pair of processors (i and j) are active on any iteration,
  and of those, only i is doing computation
                   => the algorithm is almost entirely serial

Running time: 
         = (p*(p-1) + 1)*computation + p*(p-1)*communication
         ≈ 2*n^3 + p^2*α + p*n^2*β

 This is worse than the serial time and grows with p.
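Plugging illustrative numbers into this model (the values of α, β, n, p below are mine) confirms both claims:

```python
# Running-time model of the naive bus algorithm versus the serial time.

def serial_matmul_time(n, flop_time=1.0):
    return 2 * n ** 3 * flop_time

def naive_bus_time(n, p, alpha, beta, flop_time=1.0):
    comp = 2 * n * (n / p) ** 2 * flop_time      # one inner-loop compute step
    comm = alpha + beta * n ** 2 / p             # one inner-loop message
    return (p * (p - 1) + 1) * comp + p * (p - 1) * comm
```

With n = 64, α = 100, β = 10 the total already exceeds the serial time at p = 4 and keeps growing with p.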
Slide 44





Matmul for 1D layout on a Processor Ring
Slide 45





Matmul for 1D layout on a Processor Ring
Time of inner loop = 2*(α + β*n^2/p) + 2*n*(n/p)^2
Total Time = 2*n*(n/p)^2 + (p-1) * Time of inner loop
                     ≈ 2*n^3/p + 2*p*α + 2*β*n^2
(Nearly) Optimal for 1D layout on Ring or Bus, even with Broadcast:
 Perfect speedup for arithmetic
 A(myproc) must move to each other processor, costs at least
               (p-1)*cost of sending n*(n/p) words


Parallel Efficiency = 2*n^3 / (p * Total Time)
                                = 1/(1 + α * p^2/(2*n^3) + β * p/(2*n))
                                = 1/(1 + O(p/n))
 Grows to 1 as n/p increases (or α and β shrink)

But far from communication lower bound
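Evaluating the efficiency directly from the per-step costs (parameter values below are illustrative, mine) shows the 1/(1 + O(p/n)) behavior:

```python
# Efficiency of the 1D ring algorithm, computed as 2*n^3 / (p * Total Time)
# with the inner-loop time from the slide.

def ring_inner_loop_time(n, p, alpha, beta):
    return 2 * (alpha + beta * n ** 2 / p) + 2 * n * (n / p) ** 2

def ring_efficiency(n, p, alpha, beta):
    total = (2 * n * (n / p) ** 2
             + (p - 1) * ring_inner_loop_time(n, p, alpha, beta))
    return 2 * n ** 3 / (p * total)
```

With α = 100, β = 1 and p = 4, efficiency rises from about 0.95 at n = 64 toward 1 at n = 256, as the formula predicts.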
Slide 46





MatMul with 2D Layout
Consider processors in 2D grid (physical or logical)
Processors can communicate with 4 nearest neighbors
Broadcast along rows and columns 
Assume p processors form square s x s grid, s = p^(1/2)
Slide 47





Cannon’s Algorithm
… C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
…  assume s = sqrt(p) is an integer
   forall  i=0 to s-1              …  “skew” A
         left-circular-shift row i of A by i
         … so that A(i,j) overwritten by A(i,(j+i)mod s)
   forall  i=0 to s-1               …  “skew” B
         up-circular-shift column i of B by i
          … so that B(i,j) overwritten by B((i+j) mod s, j)
   for k=0 to s-1        … sequential
          forall i=0 to s-1 and j=0 to s-1    … all processors in parallel
               C(i,j) = C(i,j) + A(i,j)*B(i,j)
               left-circular-shift each row of A by 1
               up-circular-shift each column of B by 1
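The pseudocode above translates almost line for line into a runnable sketch; here each "processor" holds a single scalar block (so A and B are s-by-s matrices of scalars) and the circular shifts are list rotations. The function name is mine.

```python
# Cannon's algorithm on an s-by-s grid, one scalar per processor block.

def cannon(A, B, s):
    """Return C = A*B via skew + s multiply-and-shift steps."""
    A = [row[:] for row in A]
    B = [row[:] for row in B]
    C = [[0.0] * s for _ in range(s)]
    for i in range(s):                        # "skew" A: shift row i left by i
        A[i] = A[i][i:] + A[i][:i]
    for j in range(s):                        # "skew" B: shift column j up by j
        col = [B[i][j] for i in range(s)]
        col = col[j:] + col[:j]
        for i in range(s):
            B[i][j] = col[i]
    for _ in range(s):
        for i in range(s):                    # all "processors" in parallel
            for j in range(s):
                C[i][j] += A[i][j] * B[i][j]
        for i in range(s):                    # left-circular-shift rows of A by 1
            A[i] = A[i][1:] + A[i][:1]
        for j in range(s):                    # up-circular-shift columns of B by 1
            col = [B[i][j] for i in range(s)]
            col = col[1:] + col[:1]
            for i in range(s):
                B[i][j] = col[i]
    return C
```

For example, cannon([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2) returns [[19.0, 22.0], [43.0, 50.0]], the ordinary product.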
Slide 48





Cannon’s Matrix Multiplication
Slide 49





Initial Step to Skew Matrices in Cannon
Initial blocked input
After skewing, before the initial block multiplies
Slide 50





Skewing Steps in Cannon
All blocks of A must multiply all like-colored blocks of B
First step
Second
Third
Slide 51





Cost of Cannon’s Algorithm
  forall i=0 to s-1              … recall s = sqrt(p)
         left-circular-shift row i of A by i    … cost ≤ s*(α + β*n^2/p)
   forall i=0 to s-1
         up-circular-shift column i of B by i   … cost ≤ s*(α + β*n^2/p)
   for k=0 to s-1
          forall i=0 to s-1 and j=0 to s-1
               C(i,j) = C(i,j) + A(i,j)*B(i,j)  … cost = 2*(n/s)^3 = 2*n^3/p^(3/2)
               left-circular-shift each row of A by 1      … cost = α + β*n^2/p
               up-circular-shift each column of B by 1     … cost = α + β*n^2/p
Slide 52





Cannon’s Algorithm is “optimal”
Optimal means
Considering only O(n3) matmul algs (not Strassen)
Considering only O(n2/p) storage per processor
Use communication lower bound #words = Ω(#flops / M^(1/2))
Sequential: a processor doing n^3 flops from matmul with a fast memory of size M must do Ω(n^3 / M^(1/2)) references to slow memory
Parallel: a processor doing f = #flops from matmul with a local memory of size M must do Ω(f / M^(1/2)) references to remote memory
Local memory = O(n^2/p), f ≥ n^3/p for at least one proc.
So f / M^(1/2) = Ω((n^3/p) / (n^2/p)^(1/2)) = Ω(n^2/p^(1/2))
#Messages = Ω(#words sent / max message size) = Ω((n^2/p^(1/2)) / (n^2/p)) = Ω(p^(1/2))
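A quick numeric check (the block-count bookkeeping is my approximation, not from the slide) that Cannon's traffic matches these bounds up to a constant:

```python
# Compare Cannon's per-processor communication volume with the lower bounds.
import math

def words_lower_bound(n, p):
    return n * n / math.sqrt(p)              # Omega(n^2 / p^(1/2))

def messages_lower_bound(p):
    return math.sqrt(p)                      # Omega(p^(1/2))

def cannon_words_per_proc(n, p):
    s = math.isqrt(p)                        # grid dimension, s = sqrt(p)
    # about s steps, each moving one n^2/p-word block of A and one of B
    return 2 * s * (n * n / p)
```

For n = 1024, p = 16 this gives twice the word lower bound, a constant factor from optimal.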
Slide 53





Pros and Cons of Cannon
So what if it’s “optimal”, is it fast?
Local computation is one call to (optimized) matrix-multiply
Hard to generalize for
p not a perfect square
A and B not square
Dimensions of A, B not perfectly divisible by s=sqrt(p)
A and B not “aligned” in the way they are stored on processors
block-cyclic layouts
Needs extra memory for copies of local matrices
Slide 54





SUMMA Algorithm
SUMMA = Scalable Universal Matrix Multiply 
Slightly less efficient, but simpler and easier to generalize
Presentation from van de Geijn and Watts
www.netlib.org/lapack/lawns/lawn96.ps
Similar ideas appeared many times
Used in practice in PBLAS = Parallel BLAS
www.netlib.org/lapack/lawns/lawn100.ps
Slide 55





SUMMA
Slide 56





SUMMA
Slide 57





SUMMA performance
Slide 58





SUMMA performance
Slide 59


Slide 60





Summary of Parallel Matrix Multiplication so far
1D Layout
Bus without broadcast - slower than serial
Nearest neighbor communication on a ring (or bus with broadcast): Efficiency = 1/(1 + O(p/n))
2D Layout – one copy of all matrices (O(n^2/p) per processor)
Cannon
Efficiency = 1/(1 + O(α * p^(3/2)/n^3 + β * sqrt(p)/n)) – optimal!
Hard to generalize for general p, n, block cyclic, alignment 
SUMMA
Efficiency = 1/(1 + O(α * log p * p / (b*n^2) + β * log p * sqrt(p)/n))
Very General
b small => less memory, lower efficiency
b large => more memory, high efficiency
Used in practice (PBLAS)
Slide 62





Beating #words_moved = Ω(n^2/P^(1/2))
Slide 63





2.5D algorithms – for c copies
Slide 64





2.5D matrix multiply
Slide 65





2.5D matrix multiply performance
Slide 66


Slide 67





Extra Slides
Slide 68





Recursive Layouts
For both cache hierarchies and parallelism, recursive layouts may be useful
Z-Morton, U-Morton, and X-Morton Layout
Also Hilbert layout and others
What about the user’s view?
Fortunately, many problems can be solved on a permutation
Never need to actually change the user’s layout
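The Z-Morton layout mentioned above is just bit interleaving; a small sketch (function name mine):

```python
# Z-Morton layout: the 1-D position of element (i, j) interleaves the bits
# of i and j, so each level of recursion stores its sub-blocks contiguously.

def z_morton_index(i, j, bits):
    """Interleave the low `bits` bits of row i and column j."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)   # row bits at odd positions
        idx |= ((j >> b) & 1) << (2 * b)       # column bits at even positions
    return idx
```

A 2x2 block lays out as 0, 1, 2, 3 in a Z shape, and the same pattern repeats recursively at every scale.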
Slide 69





Gaussian Elimination
Slide 70





Gaussian Elimination via a Recursive Algorithm
Slide 71





Recursive Factorizations
Just as accurate as conventional method
Same number of operations
Automatic variable blocking
Level 1 and 3 BLAS only !
Extreme clarity and simplicity of expression
Highly efficient
The recursive formulation is just a rearrangement of the point-wise LINPACK algorithm
The standard error analysis applies (assuming the matrix operations are computed the “conventional” way).
Slide 72






Recursive LU
Slide 73





Review: BLAS 3 (Blocked) GEPP
Slides 74-83 (images only)

Slide 84





A small software project ...
Slide 85





Work-Depth Model of Parallelism
The work-depth model:
The simplest model is used
For algorithm design, independent of a machine
The work, W, is the total number of operations
The depth, D, is the longest chain of dependencies
The parallelism, P, is defined as W/D
Specific examples include:
circuit model, each input defines a graph with ops at nodes
vector model, each step is an operation on a vector of elements
language model, where set of operations defined by language
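A concrete instance of the W/D bookkeeping (function names mine): summing n values with a balanced reduction tree.

```python
# Work-depth accounting for a tree reduction of n values:
# work W = n - 1 additions, depth D = ceil(log2 n), parallelism P = W/D.
import math

def reduction_work(n):
    return n - 1

def reduction_depth(n):
    return math.ceil(math.log2(n))

def parallelism(n):
    return reduction_work(n) / reduction_depth(n)
```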
Slide 86





Latency Bandwidth Model
Network of fixed number P of processors
fully connected
each with local memory
Latency ()
accounts for varying performance with number of messages
gap (g) in logP model may be more accurate cost if messages are pipelined
Inverse bandwidth ()
accounts for performance varying with volume of data
Efficiency (in any model):
serial time / (p *  parallel time)
perfect (linear) speedup  efficiency = 1
Slide 87





Initial Step to Skew Matrices in Cannon
Initial blocked input
After skewing, before the initial block multiplies
Slide 88





Skewing Steps in Cannon
First step
Second
Third
Slide 89





Motivation  (1)
3 Basic Linear Algebra Problems
Linear Equations: Solve Ax=b for x
Least Squares: Find x that minimizes ||r||_2 ≡ sqrt(Σ ri^2) where r = Ax - b
Statistics: Fitting data with simple functions
3a. Eigenvalues: Find λ and x where Ax = λx
Vibration analysis, e.g., earthquakes, circuits
3b. Singular Value Decomposition: A^T*A*x = σ^2*x
Data fitting, Information retrieval
Lots of variations depending on structure of A
A symmetric, positive definite, banded, …
Slide 90





Motivation (2) 
Why dense A, as opposed to sparse A?
Many large matrices are sparse, but …
Dense algorithms easier to understand 
Some applications yield large dense matrices
LINPACK Benchmark (www.top500.org)
“How fast is your computer?” =                                            “How fast can you solve dense Ax=b?”
Large sparse matrix algorithms often yield smaller (but still large) dense problems
Do ParLab Apps most use small dense matrices?
Slide 91





Algorithms for 2D (3D) Poisson Equation (N = n^2 (n^3) vars)
Algorithm        Serial             PRAM                       Memory             #Procs
Dense LU         N^3                N                          N^2                N^2
Band LU          N^2 (N^(7/3))      N                          N^(3/2) (N^(5/3))  N (N^(4/3))
Jacobi           N^2 (N^(5/3))      N (N^(2/3))                N                  N
Explicit Inv.    N^2                log N                      N^2                N^2
Conj. Gradients  N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3)) * log N  N                  N
Red/Black SOR    N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3))          N                  N
Sparse LU        N^(3/2) (N^2)      N^(1/2)                    N*log N (N^(4/3))  N
FFT              N*log N            log N                      N                  N
Multigrid        N                  log^2 N                    N                  N
Lower bound      N                  log N                      N
PRAM is an idealized parallel model with zero cost communication
Reference:  James Demmel, Applied Numerical Linear Algebra, SIAM, 1997
(Note: corrected complexities for 3D case from last lecture!).
Slide 92





Lessons and Questions (1)
Structure of the problem matters
Cost of solution can vary dramatically (n^3 to n)
Many other examples
Some structure can be figured out automatically
“A\b” can figure out symmetry, some sparsity
Some structures known only to (smart) user
If performance not critical, user may be happy to settle for A\b
How much of this goes into the motifs?
How much should we try to help user choose?
Tuning, but at algorithmic choice level  (SALSA)
Motifs overlap
Dense, sparse, (un)structured grids, spectral
Slide 93





Organizing Linear Algebra (1)
By Operations
Low level (e.g., mat-mul: BLAS)
Standard level (e.g., solve Ax=b, Ax=λx: Sca/LAPACK)
Applications level (e.g., systems & control: SLICOT)
By Performance/accuracy tradeoffs
“Direct methods” with guarantees vs “iterative methods” that may work faster and accurately enough
By Structure
Storage
Dense 
columnwise, rowwise, 2D block cyclic, recursive space-filling curves
Banded, sparse (many flavors), black-box, …
Mathematical
Symmetries, positive definiteness, conditioning, …
As diverse as the world being modeled
Slide 94





Organizing Linear Algebra (2)
By Data Type
Real vs Complex
Floating point (fixed or varying length), other
By Target Platform
Serial, manycore, GPU, distributed memory,  out-of-DRAM, Grid, …
By programming interface
Language bindings
“A\b” versus access to details
Slide 95





For all linear algebra problems:
Ex: LAPACK Table of Contents
Linear Systems
Least Squares
Overdetermined, underdetermined
Unconstrained, constrained, weighted
Eigenvalues and vectors of Symmetric Matrices
Standard (Ax = λx), Generalized (Ax = λBx)
Eigenvalues and vectors of Unsymmetric matrices
Eigenvalues, Schur form, eigenvectors, invariant subspaces
Standard, Generalized
Singular Values and vectors (SVD)
Standard, Generalized
Level of detail
Simple Driver
Expert Drivers with error bounds, extra-precision, other options
Lower level routines (“apply certain kind of orthogonal transformation”)
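Each problem class in this table maps to a LAPACK driver family. A sketch using NumPy's high-level wrappers (exactly which LAPACK routine backs each call depends on the NumPy build):

```python
import numpy as np

# Linear system Ax = b (NumPy dispatches to a LAPACK gesv-style driver).
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 4.0])
x = np.linalg.solve(A, b)

# Overdetermined least squares: min ||M c - y||_2 (gelsd-style driver).
M = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.0])
c, residuals, rank, sv = np.linalg.lstsq(M, y, rcond=None)

# Symmetric eigenproblem A V = V diag(w) (syevd-style driver).
w, V = np.linalg.eigh(A)

# Singular value decomposition M = U diag(s) Vt (gesdd-style driver).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
```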

Slide 96





For all matrix/problem structures:
Ex: LAPACK Table of Contents
BD – bidiagonal
GB – general banded
GE – general
GG – general, pair
GT – tridiagonal
HB – Hermitian banded
HE – Hermitian
HG – upper Hessenberg, pair
HP – Hermitian, packed
HS – upper Hessenberg
OR – (real) orthogonal
OP – (real) orthogonal, packed
PB – positive definite, banded
PO – positive definite
PP – positive definite, packed
PT – positive definite, tridiagonal
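These two-letter structure codes combine with a precision letter and an operation suffix to form LAPACK routine names (e.g. dgesv = double precision, general matrix, solve). An illustrative decoder covering only a subset of the codes:

```python
# Illustrative decoder for LAPACK routine names (precision + structure + operation).
# The tables cover only a subset of the official naming scheme.
PRECISION = {"s": "single real", "d": "double real",
             "c": "single complex", "z": "double complex"}
STRUCTURE = {"bd": "bidiagonal", "gb": "general banded", "ge": "general",
             "gt": "tridiagonal", "he": "Hermitian", "hs": "upper Hessenberg",
             "or": "(real) orthogonal", "po": "positive definite",
             "pp": "positive definite, packed", "sy": "symmetric"}

def decode(name):
    """Split a name like 'dgesv' into (precision, structure, operation)."""
    p, st, op = name[0], name[1:3], name[3:]
    return PRECISION[p], STRUCTURE[st], op

# e.g. decode("dgesv") -> ("double real", "general", "sv")
```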

Slide 103





For all data types:
Ex: LAPACK Table of Contents
Real and complex
Single and double precision
Arbitrary precision in progress
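The four type/precision combinations appear as the first letter of every routine name. The same linear solve in all four variants, via SciPy's LAPACK wrappers:

```python
import numpy as np
from scipy.linalg import lapack

# s = single real, d = double real, c = single complex, z = double complex.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([[3.0], [4.0]])

solutions = {}
for name in ("sgesv", "dgesv", "cgesv", "zgesv"):
    routine = getattr(lapack, name)
    lu, piv, x, info = routine(A, b)  # info == 0 signals success
    solutions[name] = x
```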

Slide 104





Organizing Linear Algebra (3)

Slide 105





Review of the BLAS


