Presentation: Adaptive libraries for multicore architectures with explicitly-managed memory hierarchies

Category: Technology

The presentation contains 10 slides.

Slides and text of this presentation


Slide 1





Adaptive libraries 
for multicore architectures 
with explicitly-managed 
memory hierarchies
Konstantin Nedovodeev,
Research Engineer at
the Institute for High-Performance Computer and Network Technologies

Slide 2





Key architectural features

Embedded MPSoCs with an explicitly-managed memory hierarchy (EMMA) possess:
- three different types of cores, namely:
  - control core(s);
  - “number-crunching” cores;
  - transfer engines (TE);
- each computational core has its own small “private” local store (LS);
- a large main storage (RAM);
- all inter-memory transfers have to be managed by TEs (hence the term “explicitly-managed memory”).
Examples of such MPSoCs:
TI OMAP, TI DaVinci, IBM Cell, Atmel Diopsis, Broadcom mediaDSP, Elvees “Multicore” (Russia).
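To make the explicit-transfer model above concrete, here is a minimal C sketch of how a computational core works with its local store. The dma_get/dma_put/dma_wait names are hypothetical stand-ins for a TE driver API, stubbed with memcpy so the sketch compiles on a host; they are not part of any of the SDKs listed above.

#include <string.h>

#define NB 32                          /* tile dimension (illustrative) */

/* Hypothetical TE primitives, stubbed with memcpy so the sketch is runnable
 * on a host; on a real EMMA chip these would enqueue work for a transfer
 * engine and dma_wait() would block until the TE signals completion. */
static void dma_get(void *ls_dst, const void *ram_src, size_t n) { memcpy(ls_dst, ram_src, n); }
static void dma_put(void *ram_dst, const void *ls_src, size_t n) { memcpy(ram_dst, ls_src, n); }
static void dma_wait(void) { }

static float ls_tile[NB * NB];         /* buffer living in the core's local store */

/* The computational core never touches RAM directly: a TE stages the tile
 * into the LS, the core computes on the local copy, and a TE writes it back. */
void process_tile(float *ram_tile)
{
    dma_get(ls_tile, ram_tile, sizeof ls_tile);
    dma_wait();
    for (int i = 0; i < NB * NB; ++i)
        ls_tile[i] *= 2.0f;            /* compute only on local data */
    dma_put(ram_tile, ls_tile, sizeof ls_tile);
    dma_wait();
}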

Slide 3





Programming issues

- workload distribution among computational cores;
- distribution of data transfers among the different channels:
  - trying to reuse data in the local store (locality awareness);
  - trying to use LS <-> LS transfers (bypassing) as much as possible;
  - using multi-buffering to hide memory latency (see the sketch after this list);
- local memory allocation without fragmentation;
- managing synchronization of parallel processes;
- avoiding WaW and WaR dependencies by allocating temporary storage in common memory (results renaming).
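As an illustration of the multi-buffering item above, here is a hedged C sketch of double buffering. It uses hypothetical tagged TE primitives in the spirit of the earlier sketch (assumed API, not a real SDK), stubbed synchronously so the sketch compiles and runs on a host; on a real chip the gets and puts would be asynchronous TE requests.

#include <string.h>

#define NB 32                          /* tile dimension (illustrative) */

/* Hypothetical tagged TE primitives: the tag names one of two transfers that
 * may be in flight at once.  Stubbed with memcpy for host compilation. */
static void dma_get_tag(void *d, const void *s, size_t n, int tag) { (void)tag; memcpy(d, s, n); }
static void dma_put_tag(void *d, const void *s, size_t n, int tag) { (void)tag; memcpy(d, s, n); }
static void dma_wait_tag(int tag) { (void)tag; }

static float ls_buf[2][NB * NB];       /* two tile buffers in the local store */

/* Double buffering: while the core computes on ls_buf[cur], a TE prefetches
 * the next tile into ls_buf[cur ^ 1], hiding the RAM -> LS latency. */
void process_tiles(float *ram_tiles, int ntiles)
{
    int cur = 0;
    dma_get_tag(ls_buf[cur], &ram_tiles[0], sizeof ls_buf[0], cur);
    for (int t = 0; t < ntiles; ++t) {
        int nxt = cur ^ 1;
        if (t + 1 < ntiles) {
            dma_wait_tag(nxt);         /* earlier write-back of that buffer is done */
            dma_get_tag(ls_buf[nxt], &ram_tiles[(size_t)(t + 1) * NB * NB],
                        sizeof ls_buf[0], nxt);
        }
        dma_wait_tag(cur);             /* the current tile has arrived in the LS */
        for (int i = 0; i < NB * NB; ++i)
            ls_buf[cur][i] *= 2.0f;    /* compute only on local data */
        dma_put_tag(&ram_tiles[(size_t)t * NB * NB], ls_buf[cur],
                    sizeof ls_buf[0], cur);
        cur = nxt;
    }
    dma_wait_tag(0);                   /* drain the remaining write-backs */
    dma_wait_tag(1);
}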

Slide 4





Tiled algorithms

We concentrate on constructing high-performance tiled algorithms.
Such algorithms are used in the BLAS library, on which the LAPACK library is based.
An example of a task for tiled algorithm construction is the matrix-vector product y' = α A x + β y (BLAS):
 foreach i in (0..N')
   y_i = α A_i0 x_0 + β y_i
   foreach j in (1..N')
     y_i = α A_ij x_j + y_i
where A ∈ R^(N×N); x, y ∈ R^N; α, β ∈ R; NB is the blocking factor; N' = ceil(N / NB).
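For reference, the same computation in plain sequential C, with the tiling expressed through loop bounds only (no LS staging or transfer scheduling, which the library adds on top); A is assumed row-major:

#include <stddef.h>

/* Tiled y = alpha*A*x + beta*y over NB x NB blocks of a row-major N x N
 * matrix.  The loops mirror the slide: the first block column applies the
 * beta scaling, subsequent block columns accumulate into y. */
void gemv_tiled(int N, int NB, float alpha, const float *A,
                const float *x, float beta, float *y)
{
    int Nq = (N + NB - 1) / NB;                   /* N' = ceil(N / NB) */
    for (int bi = 0; bi < Nq; ++bi) {
        int i0 = bi * NB, i1 = (i0 + NB < N) ? i0 + NB : N;
        for (int bj = 0; bj < Nq; ++bj) {
            int j0 = bj * NB, j1 = (j0 + NB < N) ? j0 + NB : N;
            for (int i = i0; i < i1; ++i) {
                float acc = (bj == 0) ? beta * y[i] : y[i];
                for (int j = j0; j < j1; ++j)
                    acc += alpha * A[(size_t)i * N + j] * x[j];
                y[i] = acc;
            }
        }
    }
}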

Slide 5





Program as a coarse-grained dataflow graph

Each program can be represented as a macro-flow graph.
Bigger nodes represent microkernel calls performed by the computational cores, while smaller ones represent tile transfers between the main storage and the LS.
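One possible in-memory shape for such a graph, given as a hedged C sketch; the node kinds and field names here are illustrative assumptions, not the library's actual data structures:

/* Illustrative macro-flow graph node: either a microkernel call executed by a
 * computational core or a tile transfer carried out by a TE.  Edges are kept
 * as successor lists plus a count of unfinished predecessors, so a node can
 * be dispatched as soon as all of its inputs have been produced. */
enum mf_kind { MF_KERNEL, MF_TRANSFER };

struct mf_node {
    enum mf_kind     kind;
    int              target;        /* which computational core or TE channel  */
    void            *args;          /* tile addresses, sizes, scalar operands  */
    int              n_succ;
    struct mf_node **succ;          /* outgoing data-dependence edges          */
    int              preds_left;    /* decremented as predecessors complete    */
};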

Slide 6





Existing toolchains

Existing toolchains (Cilk, StarSs) make scheduling decisions at run time. The runtime manages tasks (tile processing), distributes the workload, tries to do its best at memory reuse, etc.
While flexible, this approach lacks unification for EMMA platforms and incurs a significant penalty for small- to medium-sized problems.

Slide 7





Proposed toolchain

Our approach moves “decision-making” to compile time, reducing the overhead. This becomes possible because the computational process does not depend on particular data values.
Standalone graph-generating scripts and a processor model description make the library both portable and extensible.
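Because the graph and its schedule are fixed at compile time, the generated sources can carry the schedule as plain data that a thin runtime merely walks. A hypothetical C sketch of what such a generated table might look like (the step layout and names are assumptions, not the library's actual output format):

/* Hypothetical fragment of a generated per-core schedule: each step names a
 * microkernel call or a TE transfer, the event it waits for, and the event it
 * signals, so no scheduling decisions remain to be made at run time. */
enum step_op { OP_DMA_GET, OP_KERNEL, OP_DMA_PUT };

struct sched_step {
    enum step_op op;
    int tile;            /* which tile of the operand matrices               */
    int ls_slot;         /* which local-store buffer to use                  */
    int wait_event;      /* event to wait for before starting, -1 for none   */
    int post_event;      /* event to signal on completion                    */
};

static const struct sched_step core0_schedule[] = {
    { OP_DMA_GET, 0, 0, -1, 0 },     /* stage tile 0 into LS slot 0          */
    { OP_KERNEL,  0, 0,  0, 1 },     /* run the microkernel once it arrives  */
    { OP_DMA_PUT, 0, 0,  1, 2 },     /* write the result back to RAM         */
};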

Slide 8





How does it feel?

User:
1. Wants to generate a parallel program.
Runs a single command, e.g.:
sampl_make_src.bat strsv 70 35 mc0226 2
and gets the source files.
Support engineer:
1. Wants to port the library.
Writes a new version of the runtime library (200-300 LOC).
2. Wants to add a new program.
Writes the Ruby script (400-500 boilerplate LOC) (DSL?).
Writes microkernels for the computational cores (~100 ASM LOC per microkernel).

Slide 9





How fast is it?

The library scales almost linearly up to 16 cores of a synthetic multicore processor (matrix size = N · NB).
SGEMM – matrix multiplication, STRSM – triangular solve with multiple right-hand sides.

Slide 10





How fast is it (continued)?

The heatmap of an STRSM program schedule (cc = 8).
The control core spends < 20% of the time creating TE tasks (this overlaps with computations) and < 2% on synchronization. The computational cores are busy almost all the time!


