Tutorial 1 (AM): PGAS and Hybrid MPI+PGAS Programming Models on Modern HPC Clusters with Accelerators
Multi-core processors, accelerators (GPGPUs), coprocessors (Xeon Phis) and high-performance interconnects (InfiniBand, 10 GigE/iWARP and RoCE) with RDMA support are shaping the architectures of next-generation clusters. Efficient programming models for designing applications on these clusters, as well as on future exascale systems, are still evolving. The new MPI-3 standard enhances the Remote Memory Access (RMA) model and introduces non-blocking collectives. Partitioned Global Address Space (PGAS) models provide an attractive alternative to the MPI model owing to their easy-to-use global shared-memory abstractions and light-weight one-sided communication. At the same time, hybrid MPI+PGAS programming models are gaining attention as a possible solution for programming exascale systems. These hybrid models help codes designed with MPI take advantage of PGAS models without paying the prohibitive cost of re-designing complete applications. They also enable hierarchical design of applications using the different models to suit modern architectures.
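To make the MPI-3 RMA model mentioned above concrete, the following is a minimal sketch (not part of the tutorial materials) of one-sided communication: each rank exposes a window of memory that peers can update with MPI_Put without involving the target's CPU. It requires an MPI-3 implementation (e.g., MVAPICH2) and a launcher such as mpirun.

```cpp
// Minimal MPI-3 RMA sketch: each rank puts its rank number into its
// right neighbor's exposed window using one-sided communication.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank;   // value this rank contributes
    int remote = -1;    // slot that a peer may write into
    MPI_Win win;
    MPI_Win_create(&remote, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // One-sided put into the right neighbor's window; the fences open
    // and close an access epoch and complete all pending RMA operations.
    int target = (rank + 1) % size;
    MPI_Win_fence(0, win);
    MPI_Put(&local, 1, MPI_INT, target, /*disp=*/0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("rank %d received %d from its left neighbor\n", rank, remote);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Unlike two-sided send/receive, the target rank posts no matching call; this is the light-weight one-sided style that PGAS models also provide.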
In this tutorial, we provide an overview of the research and development taking place along these programming models (MPI, PGAS and hybrid MPI+PGAS) and discuss the opportunities and challenges in designing the associated runtimes as we head toward exascale computing with accelerator-based systems. We start with an in-depth overview of modern system architectures with multi-core processors, GPU accelerators, Xeon Phi co-processors and high-performance interconnects. We present an overview of the new MPI-3 RMA model, as well as language-based (UPC and CAF) and library-based (OpenSHMEM) PGAS models. We introduce MPI+PGAS hybrid programming models and the associated unified-runtime concept. We examine and contrast the challenges in designing high-performance MPI-3-compliant, OpenSHMEM and hybrid MPI+OpenSHMEM runtimes for both host-based and accelerator-based (GPU and MIC) systems. We present case studies using application kernels to demonstrate how one can exploit hybrid MPI+PGAS programming models to achieve better performance without rewriting the complete code. Using the publicly available MVAPICH2-X, MVAPICH2-GDR and MVAPICH2-MIC libraries, we present the challenges and opportunities in designing efficient MPI, PGAS and hybrid MPI+PGAS runtimes for next-generation systems. We introduce the concept of 'CUDA-Aware MPI/PGAS' to combine high productivity and high performance. We show how to take advantage of GPU features such as Unified Virtual Addressing, CUDA IPC and GPUDirect RDMA to design efficient MPI, OpenSHMEM and hybrid MPI+OpenSHMEM runtimes. Similarly, using the MVAPICH2-MIC runtime, we present optimized data-movement schemes for different system configurations, including multiple MICs per node on the same and/or different sockets.
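As a rough illustration of the hybrid MPI+OpenSHMEM style described above (a sketch under assumed APIs, not taken from the tutorial), an existing MPI code can keep its collectives while a communication hot spot is incrementally converted to one-sided puts. A unified runtime such as MVAPICH2-X allows both models to share the same communication library and process ranks; exact initialization requirements may vary by implementation.

```cpp
// Hybrid MPI+OpenSHMEM sketch: a one-sided shmem put replaces a former
// send/receive pair, while the rest of the code still uses MPI.
#include <mpi.h>
#include <shmem.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    shmem_init();                 // unified runtime shares state with MPI
    int pe = shmem_my_pe();
    int npes = shmem_n_pes();

    // Symmetric-heap allocation: the same address is valid on every PE.
    int *slot = (int *)shmem_malloc(sizeof(int));
    *slot = -1;
    shmem_barrier_all();

    // One-sided put into the right neighbor, replacing a two-sided
    // MPI_Send/MPI_Recv exchange from the original MPI code.
    int val = pe;
    shmem_int_put(slot, &val, 1, (pe + 1) % npes);
    shmem_barrier_all();

    // Untouched parts of the application keep using MPI collectives.
    int sum = 0;
    MPI_Allreduce(slot, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("PE %d: neighbor value %d, global sum %d\n", pe, *slot, sum);

    shmem_free(slot);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```

This is the incremental-migration pattern the abstract refers to: only the performance-critical exchange is re-designed, not the complete application.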
- Dhabaleswar K. (DK) Panda, The Ohio State University
- Khaled Hamidouche, The Ohio State University
Dr. Dhabaleswar K. (DK) Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at The Ohio State University. His research interests include parallel computer architecture, high-performance networking, InfiniBand, exascale computing, programming models, GPUs and accelerators, high-performance file systems and storage, virtualization, cloud computing and Big Data. He has published over 350 papers in major journals and international conferences related to these research areas. Dr. Panda and his research group members have been doing extensive research on modern networking technologies including InfiniBand, High-Speed Ethernet and RDMA over Converged Enhanced Ethernet (RoCE). The MVAPICH2 (High Performance MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X software libraries, developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,425 organizations worldwide (in 75 countries). This software has enabled several InfiniBand clusters to enter the TOP500 ranking during the last decade. More than 280,000 downloads of this software have taken place from the project's website alone. This software package is also available with the software stacks of many network and server vendors, and Linux distributors. The new RDMA-enabled Apache Hadoop package and RDMA-enabled Memcached package are publicly available from http://hibd.cse.ohio-state.edu. Dr. Panda's research has been supported by funding from the US National Science Foundation, the US Department of Energy, and several industry partners including Intel, Cisco, Cray, SUN, Mellanox, QLogic, NVIDIA and NetApp. He is an IEEE Fellow and a member of ACM. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Khaled Hamidouche is a Senior Research Associate in the Department of Computer Science and Engineering at The Ohio State University. He is a member of the Network-Based Computing Laboratory led by Dr. D. K. Panda. His research interests include high-performance interconnects, parallel programming models, accelerator computing and high-end computing applications. His current focus is on designing high-performance unified MPI, PGAS and hybrid MPI+PGAS runtimes for InfiniBand clusters and their support for accelerators. Dr. Hamidouche is involved in the design and development of the popular MVAPICH2 library and its derivatives MVAPICH2-MIC, MVAPICH2-GDR and MVAPICH2-X. He has published over 30 papers in international journals and conferences related to these research areas. He has been actively involved in various professional activities in academic journals and conferences. He is a member of ACM. More details about Dr. Hamidouche are available at http://www.cse.ohio-state.edu/~hamidouc.
Tutorial 2 (AM): Productive Programming in Chapel: A Computation-Driven Introduction
Chapel (http://chapel.cray.com) is an emerging open-source language whose goal is to vastly improve the programmability of parallel systems while also enhancing generality and portability compared to conventional techniques. Considered by many to be the most promising of recent parallel languages, Chapel is seeing growing levels of interest not only among HPC users, but also in the data analytics, academic, and mainstream communities. Chapel's design and implementation are portable and open-source, supporting a wide spectrum of platforms from desktops (Mac, Linux, and Windows) to commodity clusters, the cloud, and large-scale systems developed by Cray and other vendors.
This tutorial will provide an in-depth introduction to Chapel’s features using a computation-driven approach. Time and interest permitting, a hands-on segment will let participants write, compile, and execute parallel Chapel programs, either directly on their laptops (gcc should be pre-installed) or by logging onto remote systems. We’ll end the tutorial by providing an overview of Chapel project status and activities, and by soliciting feedback from participants with the goal of improving Chapel’s utility for their parallel computing needs.
- Bradford Chamberlain, Cray Inc.
- Michael Ferguson, Cray Inc.
Bradford Chamberlain is a Principal Engineer at Cray Inc. where he works on parallel programming models, focusing primarily on the design and implementation of the Chapel language in his role as technical lead for that project and its development team. Brad received his Ph.D. in Computer Science & Engineering from the University of Washington in 2001 where his work focused on the design and implementation of the ZPL parallel array language, particularly on its concept of the region—a first-class index set supporting global-view data parallelism. At times he has also worked on languages for embedded reconfigurable processors, and on algorithms for accelerating the rendering of complex 3D scenes. Brad remains associated with the University of Washington as an affiliate faculty member, most recently teaching a parallel programming course for their professional masters program. He received his Bachelor's degree in Computer Science with honors from Stanford University in 1992.
Michael Ferguson is a software engineer within the Chapel team at Cray. He holds a BA in Math and Computer Science from Cornell. As a contributor to the Chapel project, Michael implemented a dramatically improved I/O system and has led efforts to create an LLVM-based back-end as well as several communication optimizations. Recently, he has focused on defining a memory consistency model for Chapel. He has taught several multi-day Chapel tutorials and has presented work on Chapel at conferences.
Tutorial 3 (PM): Developing Parallel C++ Applications with Modern PGAS Features in UPC++
UPC++ is a PGAS extension for C++ with three main objectives: 1) to provide an object-oriented PGAS programming model in the context of the popular C++ language; 2) to add useful parallel programming idioms unavailable in UPC, such as asynchronous remote function invocation and multidimensional arrays, to support complex scientific applications; and 3) to offer an easy on-ramp to PGAS programming through interoperability with other existing parallel programming systems (e.g., MPI, OpenMP, CUDA).
In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will walk through code examples to illustrate the best practice to write or port parallel programs to UPC++. In addition, we will share our experience in designing and implementing UPC++, which can help users achieve better performance in applications. The documentation and implementation of UPC++ are available at https://bitbucket.org/upcxx.
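To give a flavor of the idioms the abstract highlights, the sketch below uses the UPC++ v0.x interface of this era (names approximate; consult the documentation at the URL above for the exact API): asynchronous remote function invocation via async, and data placed in the global address space via global_ptr. It requires a UPC++ installation and its launcher.

```cpp
// Hypothetical UPC++ (v0.x-style) sketch: each rank asynchronously
// invokes a function on its right neighbor, then waits for completion.
#include <upcxx.h>
#include <cstdio>

// Function to be invoked remotely on another rank.
void say_hello(int from) {
    printf("rank %d got async hello from rank %d\n", MYTHREAD, from);
}

int main(int argc, char **argv) {
    upcxx::init(&argc, &argv);

    // Allocate an integer in the global address space on rank 0;
    // any rank could read or write it through the global_ptr.
    upcxx::global_ptr<int> counter;
    if (MYTHREAD == 0) counter = upcxx::allocate<int>(0, 1);

    // Asynchronous remote function invocation: run say_hello on the
    // right neighbor without blocking the caller.
    upcxx::async((MYTHREAD + 1) % THREADS)(say_hello, (int)MYTHREAD);
    upcxx::wait();   // complete all outstanding asyncs

    upcxx::finalize();
    return 0;
}
```

Compared to UPC, the remote invocation idiom moves computation to data rather than data to computation, which is the kind of best practice the tutorial walks through.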
- Kathy Yelick, Lawrence Berkeley National Laboratory and University of California at Berkeley
- Yili Zheng, Lawrence Berkeley National Laboratory
- Amir Kamil, Lawrence Berkeley National Laboratory
Katherine Yelick is a Professor of Electrical Engineering and Computer Sciences at the University of California at Berkeley and is also the Associate Laboratory Director for Computing Sciences at Lawrence Berkeley National Laboratory. She is the co-author of two books and more than 100 refereed technical papers on parallel languages, compilers, algorithms, libraries, architecture, and storage. She co-invented the UPC and Titanium languages and demonstrated their applicability across architectures through the use of novel runtime and compilation methods. She also co-developed techniques for self-tuning numerical libraries, including the first self-tuned library for sparse matrix kernels, which automatically adapts the code to properties of the matrix structure and machine. Her work includes performance analysis and modeling as well as optimization techniques for memory hierarchies, multicore processors, communication libraries, and processor accelerators. She has worked with interdisciplinary teams on application scaling, and her own applications work includes parallelization of a model for blood flow in the heart. She earned her Ph.D. in Electrical Engineering and Computer Science from MIT and has been a professor of Electrical Engineering and Computer Sciences at UC Berkeley since 1991, with a joint research appointment at Berkeley Lab since 1996. She has received multiple research and teaching awards and is a member of the California Council on Science and Technology and a member of the National Academies committee on Sustaining Growth in Computing Performance.
Yili Zheng has been a computer research scientist at Lawrence Berkeley National Laboratory since 2008, where he leads the design and implementation of UPC++. His work includes system software, algorithms and applications of high-performance parallel computing. With collaborators at NVIDIA, he co-developed the Phalanx programming system for hierarchical and heterogeneous computers during the DARPA UHPC ECHELON project. He is also a contributor to the popular Berkeley Unified Parallel C compiler and the GASNet communication library. Yili received his Ph.D. in Electrical and Computer Engineering from Purdue University in 2008. From 2006 to 2008, he was on leave at the Department of Biomedical Engineering at Cornell University. From 2003 to 2005, he contributed to the IBM BlueGene/L MPI library and the IBM UPC runtime when interning at IBM T. J. Watson Research Center.
Amir Kamil is a researcher in the Computer Languages & Systems Software (CLaSS) Group at Lawrence Berkeley National Laboratory. His work involves programming models, program analysis, languages, and compilers for parallel computing. He previously worked with Kathy Yelick in the Titanium group and Ivan Sutherland in the Fleet group. He earned his Ph.D. in Electrical Engineering and Computer Sciences from UC Berkeley. For additional details, please visit his website at http://www.cs.berkeley.edu/~kamil/.