NAMD-BluegeneL

Slide 1

Achieving Strong Scaling on Blue Gene/L: Case Study with NAMD
Sameer Kumar, Blue Gene Software Group, IBM T J Watson Research Center, Yorktown Heights, NY
sameerk@us.ibm.com

Slide 2

Outline
- Motivation
- NAMD and Charm++
- BGL Techniques
  - Problem mapping
  - Overlap of communication with computation
  - Grain size
  - Load-balancing
  - Communication optimizations
- Summary

Slide 3

Blue Gene/L

Slide 4

Slide 5

Application Scaling
- Weak
  - Problem size increases with processors
- Strong
  - Constant problem size
  - Linear to sub-linear decrease in computation time with processors
  - Cache performance
  - Communication overhead
  - Communication to computation ratio
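A minimal sketch of the difference between the two regimes, expressed as work per processor. The 100,000-atom system and the 100 atoms per processor are illustrative values, not figures from the talk.

```python
def strong_scaling(total_atoms, processor_counts):
    """Fixed problem size: atoms per processor shrink as processors grow."""
    return [(p, total_atoms / p) for p in processor_counts]


def weak_scaling(atoms_per_processor, processor_counts):
    """Fixed work per processor: the total problem grows with processors."""
    return [(p, atoms_per_processor * p) for p in processor_counts]


if __name__ == "__main__":
    for p, per_proc in strong_scaling(100_000, [1, 1024, 4096]):
        print(f"strong: {p:5d} processors, {per_proc:10.1f} atoms each")
    for p, total in weak_scaling(100, [1, 1024, 4096]):
        print(f"weak:   {p:5d} processors, {total:10d} atoms in total")
```

Under strong scaling each processor ends up with only a handful of atoms, which is why cache behavior and communication overhead dominate at scale.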

Slide 6

Scaling on Blue Gene/L
- Several applications have demonstrated weak scaling
- Strong scaling on a large number of benchmarks still needs to be achieved

Slide 7

NAMD and Charm++

Slide 8

NAMD: A Production MD program

Slide 9

Slide 10

Molecular Dynamics in NAMD
- Collection of [charged] atoms, with bonds
- Newtonian mechanics
- Thousands of atoms (10,000 - 500,000)
- At each time-step
  - Calculate forces on each atom
    - Bonds
    - Non-bonded: electrostatic and van der Waals
      - Short-distance: every timestep
      - Long-distance: using PME (3D FFT)
      - Multiple time stepping: PME every 4 timesteps
  - Calculate velocities and advance positions
- Challenge: femtosecond time-step, millions needed!
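The per-timestep schedule described on this slide can be sketched as follows. This is only a schematic of which force terms are evaluated on which step, assuming a 1 fs timestep; the function names are hypothetical and no forces are actually computed.

```python
def force_schedule(num_steps, pme_interval=4):
    """List which force terms are evaluated on each timestep."""
    schedule = []
    for step in range(num_steps):
        terms = ["bonded", "short-range non-bonded"]
        # Multiple time stepping: long-range PME only every few steps.
        if step % pme_interval == 0:
            terms.append("long-range PME")
        schedule.append((step, terms))
    return schedule


def steps_needed(simulated_ns, timestep_fs=1):
    """Timesteps required to cover a simulated time span (1 ns = 10^6 fs)."""
    return simulated_ns * 1_000_000 // timestep_fs


if __name__ == "__main__":
    for step, terms in force_schedule(8):
        print(step, ", ".join(terms))
    print("steps for 1 ns:", steps_needed(1))
```

With a femtosecond timestep, a single simulated nanosecond already takes a million steps, so the time per step is what has to scale.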

Slide 11

NAMD Benchmarks

Slide 12

Parallel MD: Easy or Hard?
- Easy
  - Tiny working data
  - Spatial locality
  - Uniform atom density
  - Persistent repetition
  - Multiple time-stepping

Slide 13

NAMD Computation
- Application data divided into data objects called patches
  - Sub-grids determined by cutoff
- Computation performed by migratable computes
  - 13 computes per patch pair and hence much more parallelism
  - Computes can be further split to increase parallelism
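One plausible reading of the figure 13, assuming patches at least one cutoff wide so that a patch interacts only with the 26 patches that touch it: each neighbor pair is shared by two patches, leaving 13 pairwise computes per patch. The sketch below just counts them; the function names are illustrative.

```python
from itertools import product


def neighbor_offsets():
    """Grid offsets to the 26 patches that touch a patch in 3D."""
    return [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]


def pair_computes_per_patch():
    # Every pair (A, B) is seen from both A and B, so only half of the
    # neighbor relations need their own compute object.
    return len(neighbor_offsets()) // 2


if __name__ == "__main__":
    print("neighbors per patch:", len(neighbor_offsets()))
    print("pair computes per patch:", pair_computes_per_patch())
```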

Slide 14

NAMD
- Scalable molecular dynamics simulation
- 2 types of objects: patches and computes, to expose more parallelism
- Requires more careful load balancing

Slide 15

Communication to Computation Ratio
- Scalable
  - Constant with number of processors
  - In practice grows at a very small rate

Slide 16

Charm++ and Converse
- Charm++: object-based asynchronous message-driven parallel programming paradigm
- Converse: communication layer for Charm++
  - Send, recv, progress, on node level
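To make "asynchronous message-driven" concrete, here is a toy scheduler in Python. It is not the Charm++ or Converse API; every class and method name is invented for illustration. Objects never block waiting for data: a send only enqueues a message, and a method runs when the scheduler delivers one.

```python
from collections import deque


class Scheduler:
    """Toy message-driven scheduler (illustrative, not Charm++)."""

    def __init__(self):
        self.queue = deque()

    def send(self, target, method, payload):
        # Asynchronous: enqueue and return immediately.
        self.queue.append((target, method, payload))

    def run(self):
        delivered = 0
        while self.queue:
            target, method, payload = self.queue.popleft()
            getattr(target, method)(payload)
            delivered += 1
        return delivered


class Patch:
    def __init__(self):
        self.total_force = 0.0

    def receive_force(self, force):
        self.total_force += force


class Compute:
    def __init__(self, scheduler, patch):
        self.scheduler = scheduler
        self.patch = patch

    def receive_positions(self, positions):
        # Placeholder "force": the work here triggers a reply message.
        self.scheduler.send(self.patch, "receive_force", sum(positions))


if __name__ == "__main__":
    sched = Scheduler()
    patch = Patch()
    compute = Compute(sched, patch)
    sched.send(compute, "receive_positions", [1.0, 2.0])
    print(sched.run(), patch.total_force)
```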

Slide 17

Optimizing NAMD on Blue Gene/L

Slide 18

Single Processor Performance
- Worked with IBM Toronto for 3 weeks
  - Inner loops slightly altered to enable software pipelining
  - Aliasing issues resolved through the use of #pragma disjoint (*ptr1, *ptr2)
  - 40% serial speedup
- Current best performance is with 440
- Continued efforts with Toronto to get good 440d performance

Slide 19

NAMD on BGL
- Advantages
  - Both application and hardware are 3D grids
  - Large 4MB L3 cache
    - On large number of processors NAMD will run from L3
  - Higher bandwidth for short messages
    - Midpoint of peak bandwidth achieved quickly
  - Six outgoing links from each node
  - No OS daemons

Slide 20

NAMD on BGL
- Disadvantages
  - Slow embedded CPU
  - Small memory per node
  - Low bisection bandwidth
    - Hard to scale full electrostatics
  - Limited support for overlap of computation and communication
  - No cache coherence

Slide 21

BGL Parallelization
- Topology driven problem mapping
- Load-balancing schemes
- Overlap of computation and communication
- Communication optimizations

Slide 22

Problem Mapping

Slide 23

Problem Mapping

Slide 24

Problem Mapping

Slide 25

Problem Mapping

Slide 26

Two Away Computation
- Each data object (patch) is split along a dimension
  - Patches now interact with neighbors of neighbors
- Makes application more fine grained
  - Improves load balancing
- Messages of smaller size sent to more processors
  - Improves torus bandwidth
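A rough count of what the split does to granularity, assuming each patch is halved along X so that the cutoff now spans two patches in that direction while Y and Z are unchanged. The patch and atom counts are hypothetical inputs.

```python
from itertools import product


def pair_computes_per_patch(reach):
    """Pairwise computes per patch when interactions extend reach[i]
    patches along axis i (each pair is shared by two patches)."""
    ranges = [range(-r, r + 1) for r in reach]
    neighbors = sum(1 for offset in product(*ranges) if any(offset))
    return neighbors // 2


def two_away_x(num_patches, atoms_per_patch):
    """Effect of halving every patch along X (illustrative model)."""
    return {
        "patches": num_patches * 2,
        "atoms_per_patch": atoms_per_patch / 2,
        "pair_computes_per_patch": pair_computes_per_patch((2, 1, 1)),
    }


if __name__ == "__main__":
    print("one-away:", pair_computes_per_patch((1, 1, 1)))
    print("two-away X:", two_away_x(num_patches=100, atoms_per_patch=400.0))
```

Under these assumptions there are twice as many patches, each holding half the atoms and interacting with 44 rather than 26 neighbors, which matches the slide's point about smaller messages going to more processors.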

Slide 27

Two Away X

Slide 28

Load Balancing Steps

Slide 29

Load-balancing Metrics
- Balancing load
- Minimizing communication hop-bytes
  - Place computes close to patches
  - Biased through placement of proxies on near neighbors
- Minimizing number of proxies
  - Affects connectivity of each data object
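The hop-bytes metric weights each message by the number of network links it crosses. A minimal sketch on a 3D torus with wraparound links follows; the 8x8x8 torus and the message sizes are made-up example values.

```python
def torus_hops(src, dst, dims):
    """Shortest hop count between two nodes of a 3D torus."""
    hops = 0
    for a, b, size in zip(src, dst, dims):
        direct = abs(a - b)
        hops += min(direct, size - direct)  # wraparound may be shorter
    return hops


def hop_bytes(messages, dims):
    """Sum of (hops * bytes) over (src, dst, nbytes) messages."""
    return sum(torus_hops(src, dst, dims) * nbytes
               for src, dst, nbytes in messages)


if __name__ == "__main__":
    dims = (8, 8, 8)
    near = [((0, 0, 0), (7, 0, 0), 1000)]  # one hop via wraparound
    far = [((0, 0, 0), (4, 4, 4), 1000)]   # farthest node on this torus
    print(hop_bytes(near, dims), hop_bytes(far, dims))
```

Placing a compute next to its patches keeps the hop count at one or two, so the same bytes occupy far fewer links.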

Slide 30

Overlap of Computation and Communication
- Each FIFO has 4 packet buffers
- Progress engine should be called every 4400 cycles
- Overhead of about 200 cycles
  - 5% increase in computation
- Remaining time can be used for computation
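The 5% figure follows from the two cycle counts on this slide: a 200-cycle progress call for every 4400 cycles of computation. A small check (the function name is illustrative):

```python
def progress_overhead(call_cycles, interval_cycles):
    """Extra time spent in progress calls, relative to computation."""
    return call_cycles / interval_cycles


if __name__ == "__main__":
    overhead = progress_overhead(200, 4400)
    print(f"{overhead:.1%}")  # roughly 4.5%, which the slide rounds to 5%
```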

Slide 31

Network Progress Calls
- NAMD makes progress engine calls from the compute loops
  - Typical interval is 10000 cycles, dynamically tunable
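A schematic of progress calls embedded in a compute loop. This is not NAMD source code: the cycle cost per work item is a hypothetical estimate supplied by the caller, and `advance_network` stands in for whatever progress function the message layer provides.

```python
def compute_with_progress(work_items, cycles_per_item,
                          progress_interval_cycles, advance_network):
    """Run work items, calling the progress engine at a tunable interval."""
    cycles_since_progress = 0
    progress_calls = 0
    for item in work_items:
        item()  # one unit of force computation
        cycles_since_progress += cycles_per_item
        if cycles_since_progress >= progress_interval_cycles:
            advance_network()  # drain and refill network FIFOs
            progress_calls += 1
            cycles_since_progress = 0
    return progress_calls


if __name__ == "__main__":
    work = [lambda: None] * 100
    calls = compute_with_progress(work, cycles_per_item=1000,
                                  progress_interval_cycles=10000,
                                  advance_network=lambda: None)
    print("progress calls:", calls)
```

Making the interval a parameter mirrors the slide's note that the frequency is dynamically tunable.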

Slide 32

MPI Scalability
- Charm++ MPI Driver
  - Iprobe based implementation
  - Higher progress overhead of MPI_Test
  - Statically pinned FIFOs for point to point communication

Slide 33

Charm++ Native Driver
- BGX Message Layer (developed by George Almasi)
  - Lower progress overhead
  - Active messages
    - Easily design complex communication protocols
  - Dynamic FIFO mapping
  - Low overhead remote memory access
  - Interrupts
- Charm++ BGX driver was developed by Chao Huang over this summer

Slide 34

BG/L Msglayer
(This slide is taken from G. Almási's talk on the "new" msglayer.)

Slide 35

Optimized Multicast

Slide 36

Communication Pattern in PME

Slide 37

PME
- Plane decomposition for 3D-FFT
- PME objects placed close to patch objects on the torus
- PME optimized through an asynchronous all-to-all with dynamic FIFO mapping
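A plane decomposition exposes at most one unit of FFT work per grid plane, which caps how many processors can take part. The sketch below distributes planes round-robin and counts processors left without one; the 128-plane grid is a hypothetical size, not one of the talk's benchmarks.

```python
def planes_per_processor(num_planes, processors):
    """Round-robin assignment of FFT planes; returns a count per processor."""
    counts = [0] * processors
    for plane in range(num_planes):
        counts[plane % processors] += 1
    return counts


def idle_processors(num_planes, processors):
    """Processors that receive no plane and so do no FFT work."""
    return sum(1 for c in planes_per_processor(num_planes, processors)
               if c == 0)


if __name__ == "__main__":
    for procs in (64, 128, 512):
        print(procs, "processors:", idle_processors(128, procs), "idle")
```

Once the processor count exceeds the number of planes, the extra processors cannot contribute to the FFT, which is one reason the transposes between FFT phases lean on an efficient all-to-all.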

Slide 38

Performance Results

Slide 39

BGX Message layer vs MPI
- Fully non-blocking version performed below par on MPI
  - Polling overhead high for a list of posted receives
- BGX message layer works well with asynchronous communication

Slide 40

Blocking vs Overlap

Slide 41

Effect of Network Progress
(Projections timeline of a 1024-node run without aggressive network progress)
- Network progress not aggressive enough: communication gaps eat up utilization

Slide 42

Effect of Network Progress (2)

Slide 43

Virtual Node Mode

Slide 44

Spring vs Now

Slide 45

Summary

Slide 46

Summary
- Demonstrated good scaling to 4k processors for ApoA1 with a speedup of 2100
  - Still working on 8k results
- ATPase scales well to 8k processors with a speedup of 4000+
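These speedups can be turned into parallel efficiency (speedup divided by processor count). The sketch assumes "4k" and "8k" mean 4096 and 8192 processors, and treats "4000+" as a lower bound.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


if __name__ == "__main__":
    print(f"ApoA1,  4096 processors: {parallel_efficiency(2100, 4096):.1%}")
    print(f"ATPase, 8192 processors: at least {parallel_efficiency(4000, 8192):.1%}")
```

Both runs land at roughly half of ideal linear speedup under these assumptions.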

Slide 47

Lessons Learnt
- Eager messages lead to contention
- Rendezvous messages don't perform well with mid size messages
- Topology optimizations are a big winner
- Overlap of computation and communication is possible
  - Overlap however makes compute load less predictable
- Lack of operating system daemons leads to massive scaling

Slide 48

Future Plans
- Experiment with new communication protocols
  - Remote memory access
  - Adaptive eager
  - Fast asynchronous collectives
- Improve load-balancing
  - Newer distributed strategies
  - Heavy processors dynamically unload to neighbors
- Pencil decomposition for PME
- Using the double hummer