Performance and Algorithms Research

Katherine Yelick

Katherine Yelick, Ph.D.
Faculty Scientist
*kayelick@lbl.gov

*For appointments contact Lisa Theobald, assistant  | Email: LATheobald@lbl.gov | Phone: +1 510.495.2922

Kathy Yelick holds a joint research appointment at Lawrence Berkeley National Laboratory and the University of California, Berkeley.  She has been a professor of Electrical Engineering and Computer Sciences at U.C. Berkeley since 1991 and has held a joint research appointment at Berkeley Lab since 1996. As an associate lab director, she led the Computing Sciences Area from 2010 through December 2019 when she stepped down to concentrate on research and teaching. She is also a strategic advisor on lab-wide initiatives to Berkeley Lab Director Mike Witherell. Prior to that, she was the director of the National Energy Research Scientific Computing Division (NERSC) from 2008 through 2012 and leader of the Future Technologies Group in the Lab's Computational Research Division from 2005 through 2007. Dr. Yelick earned her Ph.D. in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology and is an internationally recognized expert in high-performance computing. Her research interests include parallel programming languages, automatic performance tuning, performance analysis, parallel algorithms, and optimizing compilers.

Journal Articles

Muaaz G Awan, Jack Deslippe, Aydin Buluc, Oguz Selvitopi, Steven Hofmeyr, Leonid Oliker, Katherine Yelick, "ADEPT: a domain independent sequence alignment strategy for gpu architectures", BMC Bioinformatics, September 2020, 21, doi: 10.1186/s12859-020-03720-1

Steven Hofmeyr, Rob Egan, Evangelos Georganas, Alex C Copeland, Robert Riley, Alicia Clum, Emiley Eloe-Fadrosh, Simon Roux, Eugene Goltsman, Aydin Buluc, Daniel Rokhsar, Leonid Oliker, Katherine Yelick, "Terabase-scale metagenome coassembly with MetaHipMer", Scientific Reports, June 1, 2020, 10, doi: 10.1038/s41598-020-67416-5

Metagenome sequence datasets can contain terabytes of reads, too many to be coassembled together on a single shared-memory computer; consequently, they have only been assembled sample by sample (multiassembly) and combining the results is challenging. We can now perform coassembly of the largest datasets using MetaHipMer, a metagenome assembler designed to run on supercomputers and large clusters of compute nodes. We have reported on the implementation of MetaHipMer previously; in this paper we focus on analyzing the impact of very large coassembly. In particular, we show that coassembly recovers a larger genome fraction than multiassembly and enables the discovery of more complete genomes, with lower error rates, whereas multiassembly recovers more dominant strain variation. Being able to coassemble a large dataset does not preclude one from multiassembly; rather, having a fast, scalable metagenome assembler enables a user to more easily perform coassembly and multiassembly, and assemble both abundant, high strain variation genomes, and low-abundance, rare genomes. We present several assemblies of terabyte datasets that could never be coassembled before, demonstrating MetaHipMer’s scaling power. MetaHipMer is available for public use under an open source license and all datasets used in the paper are available for public download.

F. Alexander, A. Almgren, J. Bell, A. Bhattacharjee, J. Chen, P. Colella, D. Daniel, J. DeSlippe, L. Diachin, E. Draeger, A. Dubey, T. Dunning, T. Evans, I. Foster, M. Francois, T. Germann, M. Gordon, S. Habib, M. Halappanavar, S. Hamilton, W. Hart, Z. Huang, A. Hungerford, D. Kasen, P. Kent, T. Kolev, D. Kothe, A. Kronfeld, Y. Luo, P. Mackenzie, D. McCallen, B. Messer, S. Mniszewski, C. Oehmen, A. Perazzo, D. Perez, D. Richard, W. Rider, R. Rieben, K. Roche, A. Siegel, M. Sprague, C. Steefel, R. Stevens, M. Syamlal, M. Taylor, J. Turner, J.-L. Vay, A. Voter, T. Windus and K. Yelick, "Exascale applications: skin in the game", Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2020,

Katherine Yelick, Aydın Buluç, Muaaz Awan, Ariful Azad, Benjamin Brock, Rob Egan, Saliya Ekanayake, Marquita Ellis, Evangelos Georganas, Giulia Guidi, Steven Hofmeyr, Oguz Selvitopi, Cristina Teodoropol, Leonid Oliker, "The parallelism motifs of genomic data analysis", Philosophical Transactions of The Royal Society A: Mathematical, Physical and Engineering Sciences, 2020,

J. Chapman, M. Mascher, A. Buluç, K. Barry, E. Georganas, A. Session, V. Strnadova, J. Jenkins, S. Sehgal, L. Oliker, J Schmutz, K. Yelick, U. Scholz, R. Waugh, J. Poland, G. Muehlbauer, N. Stein, D. Rokhsar, "A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome", Genome biology, 2015,

K Madduri, J Su, S Williams, L Oliker, S Ethier, K Yelick, "Optimization of parallel particle-to-grid interpolation on leading multicore platforms", IEEE Transactions on Parallel and Distributed Systems, January 1, 2012, 23:1915--1922, doi: 10.1109/TPDS.2012.28

Rajesh Nishtala, Yili Zheng, Paul Hargrove, Katherine A. Yelick, "Tuning collective communication for Partitioned Global Address Space programming models", Parallel Computing, September 2011, 37(9):576--591, doi: 10.1016/j.parco.2011.05.006

Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared memory programming style combined with locality control necessary to run on large-scale distributed memory systems. Even within a PGAS language programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads work together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and implementation. In particular, PGAS collectives have semantic issues that are different than in send–receive style message passing programs, and different implementation approaches that take advantage of the one-sided communication style in these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared memory, distributed memory and hybrids. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and demonstrate that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with InfiniBand interconnect.

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms", Journal of Parallel and Distributed Computing, 2009, 69:762--777, doi: 10.1016/j.jpdc.2009.04.002

K Datta, S Kamil, S Williams, L Oliker, J Shalf, K Yelick, "Optimization and performance modeling of stencil computations on modern microprocessors", SIAM Review, 2009, 51:129--159, doi: 10.1137/070693199

S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. Yelick, D. Bailey, "PERI: Auto-tuning Memory Intensive Kernels for Multicore", SciDAC PI Meeting, Journal of Physics: Conference Series, 125 012038, July 2008, doi: 10.1088/1742-6596/125/1/012038

Katherine Yelick, Paul Hilfinger, Susan Graham, Dan Bonachea, Jimmy Su, Amir Kamil, Kaushik Datta, Phillip Colella, Tong Wen, "Parallel Languages and Compilers: Perspective from the Titanium Experience", International Journal of High Performance Computing Applications (IJHPCA), August 1, 2007, 21(3):266--290, doi: 10.1177/1094342007078449

We describe the rationale behind the design of key features of Titanium — an explicitly parallel dialect of Java™ for high-performance scientific programming — and our experiences in building applications with the language. Specifically, we address Titanium’s Partitioned Global Address Space model, SPMD parallelism support, multi-dimensional arrays and array-index calculus, memory management, immutable classes (class-like types that are value types rather than reference types), operator overloading, and generic programming. We provide an overview of the Titanium compiler implementation, covering various parallel analyses and optimizations, Titanium runtime technology and the GASNet network communication layer. We summarize results and lessons learned from implementing the NAS parallel benchmarks, elliptic and hyperbolic solvers using Adaptive Mesh Refinement, and several applications of the Immersed Boundary method.

S Williams, L Oliker, R Vuduc, J Shalf, K Yelick, J Demmel, "Optimization of sparse matrix-vector multiplication on emerging multicore platforms", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC 07, 2007, doi: 10.1145/1362622.1362674

S Williams, J Shalf, L Oliker, S Kamil, P Husbands, K Yelick, "Scientific computing kernels on the cell processor", International Journal of Parallel Programming, January 2007, 35:263--298, doi: 10.1007/s10766-007-0034-5

Kaushik Datta, Dan Bonachea, Katherine Yelick, "Titanium performance and potential: An NPB experimental study", Lecture Notes in Computer Science - Proceedings of Languages and Compilers for Parallel Computing (LCPC), December 2006, 4339:200--214, doi: 10.1007/978-3-540-69330-7_14

Titanium is an explicitly parallel dialect of Java™ designed for high-performance scientific programming. We present an overview of the language features and demonstrate their use in the context of the NAS Parallel Benchmarks, a standard suite of common scientific kernels. We argue that parallel languages like Titanium provide greater expressive power than conventional approaches, enabling much more concise and expressive code that minimizes time to solution. Moreover, we have found that the Titanium implementations of three of the NAS Parallel Benchmarks can match or even exceed the performance of the standard Fortran/MPI implementations at realistic problem sizes and processor scales, while still using far cleaner, shorter and more maintainable code.

C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick, "Hardware/Compiler Co-development for an Embedded Media Processor", Proceedings of the IEEE, 2001, doi: 10.1109/5.964446

Conference Papers

Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, Aydın Buluç, "BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper", SIAM Conference on Applied and Computational Discrete Algorithms (ACDA21), 2021, doi: 10.1101/464420

G Guidi, M Ellis, A Buluç, K Yelick, D Culler, "10 years later: Cloud computing is closing the performance gap", ICPE 2021 - Companion of the ACM/SPEC International Conference on Performance Engineering, January 1, 2021, 41--48, doi: 10.1145/3447545.3451183

O Selvitopi, B Brock, I Nisa, A Tripathy, K Yelick, A Buluç, "Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication", Proceedings of the International Conference on Supercomputing, January 2021, 431--442, doi: 10.1145/3447818.3461472

A Zeni, G Guidi, M Ellis, N Ding, MD Santambrogio, S Hofmeyr, A Buluc, L Oliker, K Yelick, "LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment", Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020, 2020, 462--471, doi: 10.1109/IPDPS47924.2020.00055

T Groves, B Brock, Y Chen, KZ Ibrahim, L Oliker, NJ Wright, S Williams, K Yelick, "Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches", Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis, January 2020, 126--137, doi: 10.1109/PMBS51919.2020.00016

G Guidi, O Selvitopi, M Ellis, L Oliker, K Yelick, A Buluc, "Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly", January 1, 2020,

Benjamin A. Brock, Yuxin Chen, Jiakun Yan, John Owens, Aydın Buluç, Katherine Yelick, "RDMA vs. RPC for implementing distributed data structures", 2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3), Denver, CO, USA, IEEE, November 18, 2019, 17--22, doi: 10.1109/IA349570.2019.00009

Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-offs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suffer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures.
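
As a rough illustration of the latency/overhead reasoning behind such a model, the sketch below parameterizes the cost of a single remote data-structure access under RDMA and under RPC. The parameter names, values, and formulas are illustrative assumptions for exposition only; they are not the model from the paper.

    #include <cstdio>

    // Illustrative cost model for a single remote data-structure access.
    // All parameters are hypothetical and in microseconds.
    struct NetworkParams {
        double alpha;         // one-way network latency
        double o_hw;          // per-operation overhead handled by the NIC (RDMA)
        double o_sw;          // per-operation software overhead of an RPC handler
        double attentiveness; // expected wait until the remote CPU services the RPC
    };

    // RDMA read: request and reply cross the network; no remote CPU involvement.
    double rdma_read_cost(const NetworkParams& p) {
        return 2.0 * p.alpha + p.o_hw;
    }

    // RPC access: the remote CPU must notice the request and run the handler.
    double rpc_access_cost(const NetworkParams& p) {
        return 2.0 * p.alpha + p.o_sw + p.attentiveness;
    }

    int main() {
        NetworkParams p{1.3, 0.1, 0.5, 2.0};   // made-up numbers
        std::printf("estimated RDMA read : %.2f us\n", rdma_read_cost(p));
        std::printf("estimated RPC access: %.2f us\n", rpc_access_cost(p));
    }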

Benjamin Brock, Aydın Buluç, Katherine Yelick, "BCL: A cross-platform distributed data structures library", Proceedings of the 48th International Conference on Parallel Processing (ICPP), August 2019, doi: 10.1145/3337821.3337912

One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.
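
To make the one-sided data-structure idea concrete, here is a minimal, self-contained C++ sketch of a hash-table insert layered on hypothetical remote_get/remote_put primitives (simulated with local memory so the example runs anywhere). It is not the BCL API; a real implementation would use a remote atomic compare-and-swap to make the probe-and-claim step race-free.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for one-sided primitives; here they just index into
    // local memory so the sketch is self-contained.
    struct Slot { std::uint64_t key_hash = 0; int value = 0; bool used = false; };
    std::vector<std::vector<Slot>> segments;   // segments[rank] = that rank's shared segment

    Slot remote_get(int rank, std::size_t slot) { return segments[rank][slot]; }
    void remote_put(int rank, std::size_t slot, const Slot& s) { segments[rank][slot] = s; }

    // Owner-computes placement: the key's hash picks the owning rank and the slot,
    // so an insert is a handful of one-sided gets/puts with no action by the remote CPU.
    bool dht_insert(const std::string& key, int value, int nranks, std::size_t slots_per_rank) {
        std::uint64_t h = std::hash<std::string>{}(key);
        int owner = static_cast<int>(h % nranks);
        for (std::size_t probe = 0; probe < slots_per_rank; ++probe) {  // linear probing
            std::size_t slot = (h / nranks + probe) % slots_per_rank;
            Slot s = remote_get(owner, slot);
            if (!s.used || s.key_hash == h) {
                // A real distributed table would claim the slot with a remote atomic here.
                remote_put(owner, slot, Slot{h, value, true});
                return true;
            }
        }
        return false;                                                   // table full
    }

    int main() {
        int nranks = 4; std::size_t slots = 8;
        segments.assign(nranks, std::vector<Slot>(slots));
        dht_insert("contig_42", 7, nranks, slots);
        std::cout << "inserted\n";
    }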

M Ellis, G Guidi, A Buluç, L Oliker, K Yelick, "DiBELLA: Distributed long read to long read alignment", ACM International Conference Proceeding Series, January 1, 2019, doi: 10.1145/3337821.3337919

P Koanantakool, A Ali, A Azad, A Buluç, D Morozov, L Oliker, KA Yelick, S-Y Oh, "Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation.", Proceedings of Machine Learning Research, PMLR, 2018, 84:1376--1386,

M Ellis, E Georganas, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "Performance characterization of de novo genome assembly on leading parallel systems", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017, LNCS 10417:79--91, doi: 10.1007/978-3-319-64203-1_6

E Georganas, M Ellis, R Egan, S Hofmeyr, A Buluç, B Cook, L Oliker, K Yelick, "MerBench: PGAS benchmarks for high performance genome assembly", Proceedings of PAW 2017: 2nd Annual PGAS Applications Workshop - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, 2017-Jan:1--4, doi: 10.1145/3144779.3169109

Mathias Jacquelin, Yili Zheng, Esmond Ng, Katherine Yelick, "An Asynchronous Task-based Fan-Both Sparse Cholesky Solver", August 23, 2016,

Systems of linear equations arise at the heart of many scientific and engineering applications. Many of these linear systems are sparse; i.e., most of the elements in the coefficient matrix are zero. Direct methods based on matrix factorizations are sometimes needed to ensure accurate solutions. For example, accurate solution of sparse linear systems is needed in shift-invert Lanczos to compute interior eigenvalues. The performance and resource usage of sparse matrix factorizations are critical to time-to-solution and maximum problem size solvable on a given platform. In many applications, the coefficient matrices are symmetric, and exploiting symmetry will reduce both the amount of work and storage cost required for factorization. When the factorization is performed on large-scale distributed memory platforms, communication cost is critical to the performance of the algorithm. At the same time, network topologies have become increasingly complex, so that modern platforms exhibit a high level of performance variability. This makes scheduling of computations an intricate and performance-critical task. In this paper, we investigate the use of an asynchronous task paradigm, one-sided communication and dynamic scheduling in implementing sparse Cholesky factorization (symPACK) on large-scale distributed memory platforms. Our solver symPACK relies on efficient and flexible communication primitives provided by the UPC++ library. Performance evaluation shows good scalability and that symPACK outperforms state-of-the-art parallel distributed memory factorization packages, validating our approach on practical cases.

D Ozog, A Kamil, Y Zheng, P Hargrove, JR Hammond, A Malony, WD Jong, K Yelick, "A Hartree-Fock Application Using UPC++ and the New DArray Library", 30th International Parallel and Distributed Processing Symposium (IPDPS), IEEE, May 23, 2016, 453--462, doi: 10.1109/IPDPS.2016.108

The Hartree-Fock (HF) method is the fundamental first step for incorporating quantum mechanics into many-electron simulations of atoms and molecules, and it is an important component of computational chemistry toolkits like NWChem. The GTFock code is an HF implementation that, while it does not have all the features in NWChem, represents crucial algorithmic advances that reduce communication and improve load balance by doing an up-front static partitioning of tasks, followed by work stealing whenever necessary. To enable innovations in algorithms and exploit next generation exascale systems, it is crucial to support quantum chemistry codes using expressive and convenient programming models and runtime systems that are also efficient and scalable. This paper presents an HF implementation similar to GTFock using UPC++, a partitioned global address space model that includes flexible communication, asynchronous remote computation, and a powerful multidimensional array library. UPC++ offers runtime features that are useful for HF such as active messages, a rich calculus for array operations, hardware-supported fetch-and-add, and functions for ensuring asynchronous runtime progress. We present a new distributed array abstraction, DArray, that is convenient for the kinds of random-access array updates and linear algebra operations on block-distributed arrays with irregular data ownership. We analyze the performance of atomic fetch-and-add operations (relevant for load balancing) and runtime attentiveness, then compare various techniques and optimizations for each. Our optimized implementation of HF using UPC++ and the DArrays library shows up to 20% improvement over GTFock with Global Arrays at scales up to 24,000 cores.
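
As a concrete, hedged illustration of the fetch-and-add load-balancing idiom discussed above, the sketch below has every rank claim tasks from a single global counter. It uses the API of the current UPC++ (v1.0) rather than the earlier UPC++ library the paper was built on, and the task count and do_task function are invented for the example.

    #include <upcxx/upcxx.hpp>
    #include <atomic>
    #include <cstdint>

    // Hypothetical workload for illustration only.
    constexpr std::int64_t NUM_TASKS = 1000;
    void do_task(std::int64_t t) { /* ... e.g., one block of integrals ... */ }

    int main() {
        upcxx::init();
        // Rank 0 owns a single shared counter; every rank learns its global address.
        upcxx::global_ptr<std::int64_t> counter;
        if (upcxx::rank_me() == 0) counter = upcxx::new_<std::int64_t>(0);
        counter = upcxx::broadcast(counter, 0).wait();

        upcxx::atomic_domain<std::int64_t> ad({upcxx::atomic_op::fetch_add});

        // Each rank repeatedly claims the next unprocessed task with a remote
        // fetch-and-add (hardware-supported on networks that offload atomics).
        while (true) {
            std::int64_t t = ad.fetch_add(counter, 1, std::memory_order_relaxed).wait();
            if (t >= NUM_TASKS) break;
            do_task(t);
        }

        ad.destroy();
        upcxx::barrier();
        if (upcxx::rank_me() == 0) upcxx::delete_(counter);
        upcxx::finalize();
    }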

P Koanantakool, A Azad, A Buluc, D Morozov, SY Oh, L Oliker, K Yelick, "Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication", Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, January 2016, 842--853, doi: 10.1109/IPDPS.2016.117

Hongzhang Shan, Samuel Williams, Yili Zheng, Amir Kamil, Katherine Yelick, "Implementing High-Performance Geometric Multigrid Solver with Naturally Grained Messages", 9th International Conference on Partitioned Global Address Space Programming Models (PGAS), September 2015, 38--46, doi: 10.1109/PGAS.2015.12

Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "MerAligner: A Fully Parallel Sequence Aligner", IEEE 29th International Parallel and Distributed Processing Symposium (IPDPS), May 2015, 561--570, doi: 10.1109/IPDPS.2015.96

Aligning a set of query sequences to a set of target sequences is an important task in bioinformatics. In this work we present merAligner, a highly parallel sequence aligner that implements a seed-and-extend algorithm and employs parallelism in all of its components. MerAligner relies on a high performance distributed hash table (seed index) and uses the one-sided communication capabilities of Unified Parallel C to facilitate fine-grained parallelism. We leverage communication optimizations at the construction of the distributed hash table and software caching schemes to reduce communication during the aligning phase. Additionally, merAligner preprocesses the target sequences to extract properties enabling exact sequence matching with minimal communication. Finally, we efficiently parallelize the I/O intensive phases and implement an effective load balancing scheme. Results show that merAligner exhibits efficient scaling up to thousands of cores on a Cray XC30 supercomputer using real human and wheat genome data while significantly outperforming existing parallel alignment tools.

Scott French, Yili Zheng, Barbara Romanowicz, Katherine Yelick, "Parallel Hessian Assembly for Seismic Waveform Inversion Using Global Updates", International Parallel and Distributed Processing Symposium (IPDPS), May 2015, 753--762, doi: 10.1109/IPDPS.2015.58

We present the design and evaluation of a distributed matrix-assembly abstraction for large-scale inverse problems in HPC environments: namely, physics-based Hessian estimation in full-waveform seismic inversion at the scale of the entire globe. Our solution to this data-assimilation problem relies on UPC++, a new PGAS extension to the C++ language, to implement one-sided asynchronous updates to distributed matrix elements, and allows us to tackle inverse problems well beyond our previous capabilities. Our evaluation includes scaling results for Hessian estimation on up to 12,288 cores, typical of current production scientific runs and next-generation inversions. We also present comparisons with an alternative implementation based on MPI-3 remote memory access (RMA) operations, focusing on performance and code complexity. Interoperability between UPC++ and other parallel programming tools (e.g. MPI, OpenMP) allowed for incremental adoption of the PGAS model where most beneficial. Further, we note that this model of asynchronous assembly can generalize to other data-assimilation applications that accumulate updates into shared global state.
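
For readers unfamiliar with the MPI-3 RMA alternative mentioned above, here is a minimal conceptual sketch of one-sided accumulation into a block-row-distributed matrix using passive-target synchronization. The layout, indices, and update values are illustrative assumptions, not the authors' implementation.

    #include <mpi.h>

    // Conceptual sketch: each rank owns a block of rows in an MPI window and any
    // rank can add a contribution to a remote element with MPI_Accumulate, with no
    // receive posted by the owner.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int rows_per_rank = 1024, ncols = 1024;   // assumed block-row layout
        double* local_block = nullptr;
        MPI_Win win;
        MPI_Win_allocate(sizeof(double) * rows_per_rank * ncols, sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &local_block, &win);
        for (int i = 0; i < rows_per_rank * ncols; ++i) local_block[i] = 0.0;
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Win_lock_all(0, win);                        // passive-target epoch
        // Contribute value to global element (g_row, g_col) owned by another rank.
        int g_row = (rank + 1) % nprocs * rows_per_rank; // some remote row (illustrative)
        int g_col = 7;
        double value = 1.0;
        int owner = g_row / rows_per_rank;
        MPI_Aint disp = (MPI_Aint)(g_row % rows_per_rank) * ncols + g_col;
        MPI_Accumulate(&value, 1, MPI_DOUBLE, owner, disp, 1, MPI_DOUBLE, MPI_SUM, win);
        MPI_Win_flush_all(win);                          // complete outstanding updates
        MPI_Win_unlock_all(win);

        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Win_free(&win);
        MPI_Finalize();
    }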

E Georganas, A Buluç, J Chapman, S Hofmeyr, C Aluru, R Egan, L Oliker, D Rokhsar, K Yelick, "HipMer: An extreme-scale de novo genome assembler", International Conference for High Performance Computing, Networking, Storage and Analysis, SC, January 1, 2015, doi: 10.1145/2807591.2807664

Evangelos Georganas, Aydin Buluç, Jarrod Chapman, Leonid Oliker, Daniel Rokhsar, Katherine Yelick, "Parallel de Bruijn Graph Construction and Traversal for de Novo Genome Assembly", International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 16, 2014, 437--448, doi: 10.1109/SC.2014.41

Hongzhang Shan, Amir Kamil, Samuel Williams, Yili Zheng, Katherine Yelick, "Evaluation of PGAS Communication Paradigms with Geometric Multigrid", Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS), October 2014, doi: 10.1145/2676870.2676874

Partitioned Global Address Space (PGAS) languages and one-sided communication enable application developers to select the communication paradigm that balances the performance needs of applications with the productivity desires of programmers. In this paper, we evaluate three different one-sided communication paradigms in the context of geometric multigrid using the miniGMG benchmark. Although miniGMG's static, regular, and predictable communication does not exploit the ultimate potential of PGAS models, multigrid solvers appear in many contemporary applications and represent one of the most important communication patterns. We use UPC++, a PGAS extension of C++, as the vehicle for our evaluation, though our work is applicable to any of the existing PGAS languages and models. We compare performance with the highly tuned MPI baseline, and the results indicate that the most promising approach towards achieving performance and ease of programming is to use high-level abstractions, such as the multidimensional arrays provided by UPC++, that hide data aggregation and messaging in the runtime library.

Amir Kamil, Yili Zheng, Katherine Yelick, "A Local-View Array Library for Partitioned Global Address Space C++ Programs", ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY), June 2014,

Multidimensional arrays are an important data structure in many scientific applications. Unfortunately, built-in support for such arrays is inadequate in C++, particularly in the distributed setting where bulk communication operations are required for good performance. In this paper, we present a multidimensional library for partitioned global address space (PGAS) programs, supporting the one-sided remote access and bulk operations of the PGAS model. The library is based on Titanium arrays, which have proven to provide good productivity and performance. These arrays provide a local view of data, where each rank constructs its own portion of a global data structure, matching the local view of execution common to PGAS programs and providing maximum flexibility in structuring global data. Unlike Titanium, which has its own compiler with array-specific analyses, optimizations, and code generation, we implement multidimensional arrays solely through a C++ library. The main goal of this effort is to provide a library-based implementation that can match the productivity and performance of a compiler-based approach. We implement the array library as an extension to UPC++, a C++ library for PGAS programs, and we extend Titanium arrays with specializations to improve performance. We evaluate the array library by porting four Titanium benchmarks to UPC++, demonstrating that it can achieve up to 25% better performance than Titanium without a significant increase in programmer effort.

Yili Zheng, Amir Kamil, Michael B. Driscoll, Hongzhang Shan, Katherine Yelick, "UPC++: A PGAS extension for C++", International Parallel and Distributed Processing Symposium (IPDPS), May 19, 2014, 1105--1114, doi: 10.1109/IPDPS.2014.115

Partitioned Global Address Space (PGAS) languages are convenient for expressing algorithms with large, random-access data, and they have proven to provide high performance and scalability through lightweight one-sided communication and locality control. While very convenient for moving data around the system, PGAS languages have taken different views on the model of computation, with the static Single Program Multiple Data (SPMD) model providing the best scalability. In this paper we present UPC++, a PGAS extension for C++ that has three main objectives: 1) to provide an object-oriented PGAS programming model in the context of the popular C++ language, 2) to add useful parallel programming idioms unavailable in UPC, such as asynchronous remote function invocation and multidimensional arrays, to support complex scientific applications, 3) to offer an easy on-ramp to PGAS programming through interoperability with other existing parallel programming systems (e.g., MPI, OpenMP, CUDA). We implement UPC++ with a "compiler-free" approach using C++ templates and runtime libraries. We borrow heavily from previous PGAS languages and describe the design decisions that led to this particular set of language features, providing significantly more expressiveness than UPC with very similar performance characteristics. We evaluate the programmability and performance of UPC++ using five benchmarks on two representative supercomputers, demonstrating that UPC++ can deliver excellent performance at large scale up to 32K cores while offering PGAS productivity features to C++ applications.
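
For flavor, here is a minimal sketch of two idioms highlighted above, a one-sided remote put through a global pointer and an asynchronous remote function invocation. It is written against the API of today's UPC++ (v1.0); the 2014 library described in the paper exposed these ideas through a somewhat different interface.

    #include <upcxx/upcxx.hpp>
    #include <iostream>

    int main() {
        upcxx::init();
        int me = upcxx::rank_me(), n = upcxx::rank_n();

        // Rank 0 allocates one shared integer and everyone learns its global address.
        upcxx::global_ptr<int> cell;
        if (me == 0) cell = upcxx::new_<int>(0);
        cell = upcxx::broadcast(cell, 0).wait();

        // One-sided put: the last rank writes to rank 0's memory with no matching receive.
        if (me == n - 1) upcxx::rput(42, cell).wait();
        upcxx::barrier();

        // Asynchronous remote function invocation: ask rank 0 to read and return the value.
        if (me == n - 1) {
            int v = upcxx::rpc(0, [](upcxx::global_ptr<int> p) {
                        return *p.local();               // runs on rank 0
                    }, cell).wait();
            std::cout << "value at rank 0: " << v << std::endl;
        }

        upcxx::barrier();
        if (me == 0) upcxx::delete_(cell);
        upcxx::finalize();
    }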

A Kamil, K Yelick, "Hierarchical computation in the SPMD programming model", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), January 2014, 8664:3--19, doi: 10.1007/978-3-319-09967-5_1

Large-scale parallel machines are programmed mainly with the single program, multiple data (SPMD) model of parallelism. While this model has advantages of scalability and simplicity, it does not fit well with divide-and-conquer parallelism or hierarchical machines that mix shared and distributed memory. In this paper, we define the recursive single program, multiple data model (RSPMD) that extends SPMD with a hierarchical team mechanism to support hierarchical algorithms and machines. We implement this model in the Titanium language and describe how to eliminate a class of deadlocks by ensuring alignment of collective operations. We present application case studies evaluating the RSPMD model, showing that it enables divide-and-conquer algorithms such as sorting to be elegantly expressed and that team collective operations increase performance of conjugate gradient by up to a factor of two. The model also facilitates optimizations for hierarchical machines, improving scalability of particle in cell by 8x and performance of sorting and a stencil code by up to 40% and 14%, respectively. 

Evangelos Georganas, Jorge González-Domínguez, Edgar Solomonik, Yili Zheng, Juan Touriño, Katherine Yelick, "Communication avoiding and overlapping for numerical linear algebra", International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 10, 2012, doi: 10.1109/SC.2012.32

To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing inter-processor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping with computational work. We study the interaction and compatibility of these two techniques for two matrix multiplication algorithms (Cannon and SUMMA), triangular solve, and Cholesky factorization. For each algorithm, we construct a detailed performance model that considers both critical path dependencies and idle time. We give novel implementations of 2.5D algorithms with overlap for each of these problems. Our software employs UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication. We show communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.
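
For context, the standard asymptotic costs of 2.5D matrix multiplication from the communication-avoiding literature (a well-known bound, not a figure quoted from this paper) quantify the trade described above: with P processors and c replicated copies of the matrices,

    W = O\!\left( \frac{n^2}{\sqrt{c\,P}} \right) \text{ words moved per processor}, \qquad
    S = O\!\left( \sqrt{\frac{P}{c^{3}}} \right) \text{ messages}, \qquad
    M = O\!\left( \frac{c\,n^2}{P} \right) \text{ words of memory per processor},

so bandwidth cost shrinks by a factor of sqrt(c) at the price of c times the memory.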

Hongzhang Shan, Brian Austin, Nicholas Wright, Erich Strohmaier, John Shalf, Katherine Yelick, "Accelerating Applications at Scale Using One-Sided Communication", Santa Barbara, CA, The 6th Conference on Partitioned Global Address Programming Models, October 10, 2012,

Seung-Jai Min, Costin Iancu and Katherine Yelick, "Hierarchical Work Stealing on Manycore Clusters", Fifth Conference on Partitioned Global Address Space Programming Models (PGAS11), 2011,

H Shan, NJ Wright, J Shalf, K Yelick, M Wagner, N Wichmann, "A preliminary evaluation of the hardware acceleration of the Cray Gemini interconnect for PGAS languages and comparison with MPI", PMBS 11 - Proceedings of the 2nd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, Co-located with SC 11, January 1, 2011, 13--14, doi: 10.1145/2088457.2088467

Filip Blagojevic, Paul Hargrove, Costin Iancu, Katherine Yelick, "Hybrid PGAS runtime support for multicore nodes", Fourth Conference on Partitioned Global Address Space Programming Model (PGAS), October 2010, doi: 10.1145/2020373.2020376

With multicore processors as the standard building block for high performance systems, parallel runtime systems need to provide excellent performance on shared memory, distributed memory, and hybrids. Conventional wisdom suggests that threads should be used as the runtime mechanism within shared memory, and two runtime versions for shared and distributed memory are often designed and implemented separately, retrofitting after the fact for hybrid systems. In this paper we consider the problem of implementing a runtime layer for Partitioned Global Address Space (PGAS) languages, which offer a uniform programming abstraction for hybrid machines. We present a new process-based shared memory runtime and compare it to our previous pthreads implementation. Both are integrated with the GASNet communication layer, and they can co-exist with one another. We evaluate the shared memory runtime approaches, showing that they interact in important and sometimes surprising ways with the communication layer. Using a set of microbenchmarks and application level benchmarks on an IBM BG/P, Cray XT, and InfiniBand cluster, we show that threads, processes and combinations of both are needed for maximum performance. Our new runtime shows speedups of over 60% for application benchmarks and 100% for collective communication benchmarks, when compared to the previous implementation. Our work primarily targets PGAS languages, but some of the lessons are relevant to other parallel runtime systems and libraries.

A Kamil, K Yelick, "Enforcing textual alignment of collectives using dynamic checks", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), January 2010, 5898 LNC:368--382, doi: 10.1007/978-3-642-13374-9_25

Many parallel programs are written in a single-program, multiple data (SPMD) style, in which synchronization is provided using collective operations that all threads execute simultaneously. If these operations are not properly aligned on all threads, deadlock can occur, and many compiler analyses and optimizations that depend on proper alignment fail. In this paper, we discuss the flaws in the Titanium language’s type system for enforcing textual alignment of collectives. We then present a system that uses runtime checks to ensure alignment for two definitions of textual alignment. The system instruments the code to keep track of alignment in each thread and then checks that alignment matches prior to performing a collective operation. We have implemented the system in the Titanium compiler, verifying that it catches alignment errors. We tested its performance on multiple application programs, demonstrating that the checks have no appreciable impact on execution time.
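
A minimal sketch of the dynamic-check idea, transplanted from Titanium into UPC++ terms purely for illustration: before a collective executes, every rank agrees on an identifier for the textual call site and fails loudly on a mismatch instead of deadlocking later. The macro, the use of __LINE__ as the site identifier, and the choice of rank 0 as the reference are assumptions of this sketch, not the paper's instrumentation.

    #include <upcxx/upcxx.hpp>
    #include <cstdio>
    #include <cstdlib>

    // Before the barrier runs, compare this rank's call-site identifier against
    // rank 0's; a mismatch means the collective is not textually aligned.
    #define CHECKED_BARRIER()                                                   \
        do {                                                                    \
            int my_site = __LINE__;                                             \
            int ref_site = upcxx::broadcast(my_site, 0).wait();                 \
            if (my_site != ref_site) {                                          \
                std::fprintf(stderr, "misaligned collective at line %d\n",      \
                             my_site);                                          \
                std::abort();                                                   \
            }                                                                   \
            upcxx::barrier();                                                   \
        } while (0)

    int main() {
        upcxx::init();
        CHECKED_BARRIER();   // aligned: every rank reaches this same textual site
        upcxx::finalize();
    }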

Dan Bonachea, Paul Hargrove, Mike Welcome, Katherine Yelick, "Porting GASNet to Portals: Partitioned Global Address Space (PGAS) Language Support for the Cray XT", Cray Users Group (CUG), May 2009, doi: 10.25344/S4RP46

Partitioned Global Address Space (PGAS) languages are an emerging alternative to MPI for HPC applications development. The GASNet library from Lawrence Berkeley National Lab and the University of California at Berkeley provides the network runtime for multiple implementations of four PGAS languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), Titanium and Chapel. GASNet provides a low-overhead one-sided communication layer that has enabled portability and high performance of PGAS languages. This paper describes our experiences porting GASNet to the Portals network API on the Cray XT series.

S. Williams, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Resource-Efficient, Hierarchical Auto-Tuning of a Hybrid Lattice Boltzmann Computation on the Cray XT4", Proceedings of the Cray User Group (CUG), Atlanta, GA, 2009,

Rajesh Nishtala, Paul Hargrove, Dan Bonachea, Katherine Yelick, "Scaling Communication-Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap", 23rd International Parallel & Distributed Processing Symposium (IPDPS), May 2009, doi: 10.1109/IPDPS.2009.5161076

In earlier work, we showed that the one-sided communication model found in PGAS languages (such as UPC) offers significant advantages in communication efficiency by decoupling data transfer from processor synchronization. We explore the use of the PGAS model on IBM Blue-Gene/P, an architecture that combines low-power, quad-core processors with extreme scalability. We demonstrate that the PGAS model, using a new port of the Berkeley UPC compiler and GASNet one-sided communication layer, outperforms two-sided (MPI) communication in both microbenchmarks and a case study of the communication-limited benchmark, NAS FT. We scale the benchmark up to 16,384 cores of the BlueGene/P and demonstrate that UPC consistently outperforms MPI by as much as 66% for some processor configurations and an average of 32%. In addition, the results demonstrate the scalability of the PGAS model and the Berkeley implementation of UPC, the viability of using it on machines with multicore nodes, and the effectiveness of the BG/P communication layer for supporting one-sided communication and PGAS languages.

K Madduri, S Williams, S Ethier, L Oliker, J Shalf, E Strohmaier, K Yelick, "Memory-efficient optimization of gyrokinetic particle-to-grid interpolation for multicore processors", Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC 09, January 2009, doi: 10.1145/1654059.1654108

K. Datta, S. Williams, V. Volkov, J. Carter, L. Oliker, J. Shalf, K. Yelick, "Auto-Tuning the 27-point Stencil for Multicore", Proceedings of Fourth International Workshop on Automatic Performance Tuning (iWAPT2009), January 2009,

J Gebis, L Oliker, J Shalf, S Williams, K Yelick, "Improving memory subsystem performance using ViVA: Virtual vector architecture", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009, LNCS 5455:146--158, doi: 10.1007/978-3-642-00454-4_16

K Datta, M Murphy, V Volkov, S Williams, J Carter, L Oliker, D Patterson, J Shalf, K Yelick, "Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures", 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2008, January 2008, doi: 10.1109/SC.2008.5222004

S Williams, J Carter, L Oliker, J Shalf, K Yelick, "Lattice Boltzmann simulation optimization on leading multicore platforms", IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM, 2008, doi: 10.1109/IPDPS.2008.4536295

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2007, doi: 10.1145/1362622.1362674

Shivali Agarwal, Rajkishore Barik, Dan Bonachea, Vivek Sarkar, Rudrapatna Shyamasundar, Katherine Yelick, "Deadlock-free scheduling of X10 computations with bounded resources", Annual ACM Symposium on Parallelism in Algorithms and Architectures, October 18, 2007, 229--240, doi: 10.1145/1248377.1248416

In this paper, we address the problem of guaranteeing the absence of physical deadlock in the execution of a parallel program using the async, finish, atomic, and place constructs from the X10 language. First, we extend previous work-stealing memory bound results for fully strict multi-threaded computations to terminally strict multithreaded computations in which one activity may wait for completion of a descendant activity (as in X10's async and finish constructs), not just an immediate child (as in Cilk's spawn and sync constructs). This result establishes physical deadlock freedom for SMP deployments. Second, we introduce a new class of X10 deployments for clusters, which builds on an underlying Active Message network and the new concept of Doppelgänger mode execution of X10 activities. Third, we use this new class of deployments to establish physical deadlock freedom for deployments on clusters of uniprocessors. Together these results give the user the ability to execute a rich set of programs written with async, finish, atomic, and place constructs without worrying about the possibility of physical deadlock due to computation, memory and communication resources. A major open topic for future work is to extend these results to deployments on clusters of SMPs.

Katherine Yelick, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, Tong Wen, "Productivity and Performance Using Partitioned Global Address Space Languages", Proceedings of the 2007 International Workshop on Parallel Symbolic Computation (PASCO), July 2007, 24--32, doi: 10.1145/1278177.1278183

Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java™ designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages to these languages.

Wei-Yu Chen, Dan Bonachea, Costin Iancu, Katherine A. Yelick, "Automatic nonblocking communication for partitioned global address space programs", Proceedings of the International Conference on Supercomputing (ICS), June 17, 2007, 158--167, doi: 10.1145/1274971.1274995

Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules the data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16% for some of the NAS Parallel Benchmarks, 48% for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers' productivity by freeing them from the details of communication management.
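
The sketch below shows the hand-written split-phase pattern that this framework automates: initiate the one-sided transfer, overlap it with independent local work, and synchronize only at the point of use. It is written with UPC++ futures for brevity (the paper itself targets UPC's language-level nonblocking routines), and the function names and workload are illustrative.

    #include <upcxx/upcxx.hpp>
    #include <vector>

    // Illustrative local work and consumer; the overlap pattern is the point here.
    void compute_local_part(std::vector<double>& v) { for (double& x : v) x *= 2.0; }
    void process_remote(const std::vector<double>& buf) { /* ... consume ... */ }

    void overlapped_step(upcxx::global_ptr<double> remote_src, std::size_t n,
                         std::vector<double>& local_work) {
        std::vector<double> buf(n);
        // 1. Initiate the one-sided transfer (nonblocking).
        upcxx::future<> f = upcxx::rget(remote_src, buf.data(), n);
        // 2. Overlap: unrelated local computation proceeds while data is in flight.
        compute_local_part(local_work);
        // 3. Synchronize only at the point of use.
        f.wait();
        process_remote(buf);
    }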

C Iancu, W Chen, K Yelick, "Performance portable optimizations for loops containing communication operations", Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, January 1, 2007, 411, doi: 10.1109/PACT.2007.4336239

A Kamil, K Yelick, "Hierarchical pointer analysis for distributed programs", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), January 2007, 4634 LNC:281--297, doi: 10.1007/978-3-540-74061-2_18

We present a new pointer analysis for use in shared memory programs running on hierarchical parallel machines. The analysis is motivated by the partitioned global address space languages, in which programmers have control over data layout and threads and can directly read and write to memory associated with other threads. Titanium, UPC, Co-Array Fortran, X10, Chapel, and Fortress are all examples of such languages. The novelty of our analysis comes from the hierarchical machine model used, which captures the increasingly hierarchical nature of modern parallel machines. For example, the analysis can distinguish between pointers that can reference values within a thread, within a shared memory multiprocessor, or within a network of processors. The analysis is presented with a formal type system and operational semantics, articulating the various ways in which pointers can be used within a hierarchical machine model. The hierarchical analysis has several applications, including race detection, sequential consistency enforcement, and software caching. We present results of an implementation of the analysis, applying it to data race detection, and show that the hierarchical analysis is very effective at reducing the number of false races detected.

S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil, K. Yelick, "The Potential of the Cell Processor for Scientific Computing", ACM International Conference on Computing Frontiers, 2006, doi: 10.1145/1128022.1128027

Christian Bell, Dan Bonachea, Rajesh Nishtala, Katherine Yelick, "Optimizing bandwidth limited problems using one-sided communication and overlap", 20th International Parallel and Distributed Processing Symposium (IPDPS), April 25, 2006, doi: 10.1109/IPDPS.2006.1639320

Partitioned Global Address Space languages like Unified Parallel C (UPC) are typically valued for their expressiveness, especially for computations with fine-grained random accesses. In this paper we show that the one-sided communication model used in these languages also has a significant performance advantage for bandwidth-limited applications. We demonstrate this benefit through communication microbenchmarks and a case-study that compares UPC and MPI implementations of the NAS Fourier Transform (FT) benchmark. Our optimizations rely on aggressively overlapping communication with computation by spreading communication events throughout the course of the local computation. This alleviates the potential communication bottleneck that occurs when the communication is packed into a single phase (e.g., the large all-to-all in a multidimensional FFT). Even though the new algorithms require more messages for the same total volume of data, the resulting overlap leads to speedups of over 1.75x and 1.9x for the two-sided and one-sided implementations, respectively, when compared to the default NAS Fortran/MPI release. Our best one-sided implementations show an average improvement of 15 percent over our best two-sided implementations. We attribute this difference to the lower software overhead of one-sided communication, which is partly fundamental to the semantic difference between one-sided and two-sided communication. Our UPC results use the Berkeley UPC compiler with the GASNet communication system, and demonstrate the portability and scalability of that language and implementation, with performance approaching 0.5 TFlop/s on the FT benchmark running on 512 processors.
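The overlap strategy can be sketched as follows (an illustration in UPC++, not the paper's UPC or MPI code; compute_slab, dest, and the sizes are hypothetical): rather than packing all communication into one all-to-all phase, each slab is pushed to its destination as soon as it is produced, and the completions are conjoined.

#include <upcxx/upcxx.hpp>
#include <cstddef>
#include <vector>

static void compute_slab(std::vector<double> &) { /* local FFT work (placeholder) */ }

void exchange(const std::vector<upcxx::global_ptr<double>> &dest,
              std::vector<std::vector<double>> &slab, std::size_t len) {
  upcxx::future<> done = upcxx::make_future();
  for (std::size_t s = 0; s < slab.size(); ++s) {
    compute_slab(slab[s]);                                          // produce the next slab
    upcxx::future<> f = upcxx::rput(slab[s].data(), dest[s], len);  // send it now
    done = upcxx::when_all(done, f);                                // conjoin completions
  }
  done.wait();        // all puts are complete; the sends overlapped the loop
  upcxx::barrier();   // peers now know their incoming slabs have arrived
}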

S Kamil, K Datta, S Williams, L Oliker, J Shalf, K Yelick, "Implicit and explicit optimizations for stencil computations", Proceedings of the 2006 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC 2006, 2006, 51--60, doi: 10.1145/1178597.1178605

H Shan, E Strohmaier, J Qiang, DH Bailey, K Yelick, "Performance modeling and optimization of a high energy colliding beam simulation code", Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC 06, January 2006, doi: 10.1145/1188455.1188557

A Kamil, K Yelick, "Concurrency analysis for parallel programs with textually aligned barriers", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), January 2006, 4339 LNC:185--199, doi: 10.1007/978-3-540-69330-7_13

A fundamental problem in the analysis of parallel programs is to determine when two statements in a program may run concurrently. This analysis is the parallel analog to control flow analysis on serial programs and is useful in detecting parallel programming errors and as a precursor to semantics-preserving code transformations. We consider the problem of analyzing parallel programs that access shared memory and use barrier synchronization, specifically those with textually aligned barriers and single-valued expressions. We present an intermediate graph representation for parallel programs and an efficient interprocedural analysis algorithm that conservatively computes the set of all concurrent statements. We improve the precision of this algorithm by using context-free language reachability to ignore infeasible program paths. We then apply the analysis to static race detection and show that race detection can benefit from the concurrency information provided.
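The notion of a textually aligned barrier is easy to see in code. In the sketch below (UPC++ syntax for concreteness; Titanium and UPC assume the same discipline), the first barrier is guarded by a single-valued condition that every rank evaluates identically, while the commented-out variant would be reached by only some ranks and therefore deadlock:

#include <upcxx/upcxx.hpp>

int main() {
  upcxx::init();

  // Single-valued condition: identical on every rank, so either all ranks
  // reach the barrier or none do.
  bool go = upcxx::broadcast(true, 0).wait();
  if (go) upcxx::barrier();

  // Not textually aligned: only even ranks would reach this barrier,
  // so the program would hang.
  // if (upcxx::rank_me() % 2 == 0) upcxx::barrier();

  upcxx::finalize();
}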

A Kamil, J Su, K Yelick, "Making sequential consistency practical in titanium", Proceedings of the International Conference on Supercomputing, January 2005, 2005-Nov, doi: 10.1109/SC.2005.43

The memory consistency model in shared memory parallel programming controls the order in which memory operations performed by one thread may be observed by another. The most natural model for programmers is to have memory accesses appear to take effect in the order specified in the original program. Language designers have been reluctant to adopt this strong semantics, called sequential consistency, due to concerns over the performance of memory fence instructions and related mechanisms that guarantee order. In this paper, we provide evidence for the practicality of sequential consistency by showing that advanced compiler analysis techniques are sufficient to eliminate the need for most memory fences and enable high-level optimizations. Our analyses eliminated over 97% of the memory fences that were needed by a naive implementation, accounting for 87 to 100% of the dynamically encountered fences in all but one benchmark. The impact of the memory model and analysis on runtime performance depends on the quality of the optimizations: more aggressive optimizations are more likely to be invalidated by strong memory consistency semantics. We consider two specific optimizations, pipelining of bulk memory copies and communication aggregation and scheduling for irregular accesses, and show that our most aggressive analysis is able to obtain the same performance as the relaxed model when applied to two linear algebra kernels. While additional work on parallel optimizations and analyses is needed, we believe these results provide important evidence on the viability of using a simple memory consistency model without sacrificing performance.
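The ordering requirement at issue shows up in patterns like the classic data-then-flag handoff. The sketch below uses ordinary C++ threads and atomics (not Titanium) simply to make it concrete: under sequential consistency the assertion always holds, while on weaker models it holds only because of the explicit acquire/release ordering, which plays the role of the memory fences the compiler analysis tries to eliminate.

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

int main() {
  std::thread producer([] {
    data = 42;                                        // ordinary write
    ready.store(true, std::memory_order_release);     // "fence" after the write
  });
  std::thread consumer([] {
    while (!ready.load(std::memory_order_acquire)) {} // "fence" before the read
    assert(data == 42);  // guaranteed by the ordering established above
  });
  producer.join();
  consumer.join();
  return 0;
}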

Christian Bell, Dan Bonachea, Wei-Yu Chen, Katherine Yelick, "Evaluating Support for Global Address Space Languages on the Cray X1", Proceedings of the International Conference on Supercomputing (ICS), November 22, 2004, 184--195, doi: 10.1145/1006209.1006236

The Cray X1 was recently introduced as the first in a new line of parallel systems to combine high-bandwidth vector processing with an MPP system architecture. Alongside capabilities such as automatic fine-grained data parallelism through the use of vector instructions, the X1 offers hardware support for a transparent global-address space (GAS), which makes it an interesting target for GAS languages. In this paper, we describe our experience with developing a portable, open-source, high-performance compiler for Unified Parallel C (UPC), a SPMD global-address space language extension of ISO C. As part of our implementation effort, we evaluate the X1's hardware support for GAS languages and provide empirical performance characterizations in the context of leveraging features such as vectorization and global pointers for the Berkeley UPC compiler. We discuss several difficulties encountered in the Cray C compiler which are likely to present challenges for many users, especially implementors of libraries and source-to-source translators. Finally, we analyze the performance of our compiler on some benchmark programs and show that, while there are some limitations of the current compilation approach, the Berkeley UPC compiler uses the X1 network more effectively than MPI or SHMEM and generates serial code whose vectorizability is comparable to that of the original C code.

G Griem, L Oliker, J Shalf, K Yelick, "Identifying performance bottlenecks on modern microarchitectures using an adaptable probe", Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM), 2004, 18:3505--3512,

Wei Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Katherine Yelick, "A Performance Analysis of the Berkeley UPC Compiler", Proceedings of the International Conference on Supercomputing (ICS), ACM, June 23, 2003, 63--73, doi: 10.1145/782814.782825

Unified Parallel C (UPC) is a parallel language that uses a Single Program Multiple Data (SPMD) model of parallelism within a global address space. The global address space is used to simplify programming, especially on applications with irregular data structures that lead to fine-grained sharing between threads. Recent results have shown that the performance of UPC using a commercial compiler is comparable to that of MPI [7]. In this paper we describe a portable open source compiler for UPC. Our goal is to achieve similar performance while enabling easy porting of the compiler and runtime, and also to provide a framework that allows for extensive optimizations. We identify some of the challenges in compiling UPC and use a combination of micro-benchmarks and application kernels to show that our compiler has low overhead for basic operations on shared data and is competitive with, and sometimes faster than, the commercial HP compiler. We also investigate several communication optimizations, and show significant benefits from hand-optimizing the generated code.

Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Michael L. Welcome, Katherine A. Yelick, "An Evaluation of Current High-Performance Networks", Proceedings of the International Parallel & Distributed Processing Symposium (IPDPS), April 22, 2003, doi: 10.1109/IPDPS.2003.1213106

High-end supercomputers are increasingly built out of commodity components, and lack tight integration between the processor and network. This often results in inefficiencies in the communication subsystem, such as high software overheads and/or message latencies. In this paper we use a set of microbenchmarks to quantify the cost of this commoditization, measuring software overhead, latency, and bandwidth on five contemporary supercomputing networks. We compare the performance of the ubiquitous MPI layer to that of lower-level communication layers, and quantify the advantages of the latter for small message performance. We also provide data on the potential for various communication-related optimizations, such as overlapping communication with computation or other communication. Finally, we determine the minimum size needed for a message to be considered 'large' (i.e., bandwidth-bound) on these platforms, and provide historical data on the software overheads of a number of supercomputers over the past decade.

Book Chapters

E. Georganas, S. Hofmeyr, L. Oliker, R. Egan, D. Rokhsar, A. Buluc, K. Yelick, "Extreme-scale de novo genome assembly", Exascale Scientific Applications: Scalability and Performance Portability, edited by T.P. Straatsma, K. B. Antypas, T. J. Williams, (November 13, 2017) doi: 10.1201/b21930

James Demmel, Samuel Williams, Katherine Yelick, "Automatic Performance Tuning (Autotuning)", The Berkeley Par Lab: Progress in the Parallel Computing Landscape, edited by David Patterson, Dennis Gannon, Michael Wrinn, (Microsoft Research: August 2013) Pages: 337-376

Katherine Yelick, Susan Graham, Paul Hilfinger, Dan Bonachea, Jimmy Su, Amir Kamil, Kaushik Datta, Phillip Colella, Tong Wen, "Titanium", Encyclopedia of Parallel Computing, edited by David Padua, (Springer US: 2011) Pages: 2049--2055 doi: 10.1007/978-0-387-09766-4_516

Titanium is a parallel programming language designed for high-performance scientific computing. It is based on Java and uses a Single Program Multiple Data (SPMD) parallelism model with a Partitioned Global Address Space (PGAS).

K Datta, S Williams, V Volkov, J Carter, L Oliker, J Shalf, K Yelick, "Auto-tuning stencil computations on multicore and accelerators", Scientific Computing with Multicore and Accelerators, (2010) Pages: 219--254 doi: 10.1201/b10376

S Williams, K Datta, L Oliker, J Carter, J Shalf, K Yelick, "Auto-Tuning Memory-Intensive Kernels for Multicore", Chapman \& Hall/CRC Computational Science, (CRC Press: 2010) Pages: 273--296 doi: 10.1201/b10509-14

Presentation/Talks

Katherine A. Yelick, Amir Kamil, Damian Rouson, Dan Bonachea, Paul H. Hargrove, UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications (SC21), Tutorial at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC21), November 15, 2021,

UPC++ is a C++ library supporting Partitioned Global Address Space (PGAS) programming. UPC++ offers low-overhead one-sided Remote Memory Access (RMA) and Remote Procedure Calls (RPC), along with future/promise-based asynchrony to express dependencies between computation and asynchronous data movement. UPC++ supports simple/regular data structures as well as more elaborate distributed applications where communication is fine-grained and/or irregular. UPC++ provides a uniform abstraction for one-sided RMA between host and GPU/accelerator memories anywhere in the system. UPC++'s support for aggressive asynchrony enables applications to effectively overlap communication and reduce latency stalls, while the underlying GASNet-EX communication library delivers efficient low-overhead RMA/RPC on HPC networks.

This tutorial introduces UPC++, covering the memory and execution models and basic algorithm implementations. Participants gain hands-on experience incorporating UPC++ features into application proxy examples. We examine a few UPC++ applications with irregular communication (metagenomic assembler and COVID-19 simulation) and describe how they utilize UPC++ to optimize communication performance.
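As a flavor of the material (a minimal sketch based on the public UPC++ API, not an excerpt from the tutorial), the remote procedure call and future mechanisms mentioned above compose as follows: rank 0 ships a small computation to each other rank and harvests the replies asynchronously.

#include <upcxx/upcxx.hpp>
#include <cstdio>
#include <vector>

int main() {
  upcxx::init();
  if (upcxx::rank_me() == 0) {
    std::vector<upcxx::future<int>> replies;
    for (int r = 1; r < upcxx::rank_n(); ++r)
      replies.push_back(upcxx::rpc(r, [](int x) { return x * x; }, r));
    for (upcxx::future<int> &f : replies)
      std::printf("reply: %d\n", f.wait());
  }
  upcxx::barrier();   // other ranks service incoming RPCs while waiting here
  upcxx::finalize();
}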

Katherine A. Yelick, Amir Kamil, Dan Bonachea, Paul H. Hargrove, UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications (SC20), Tutorial at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC20), November 10, 2020,

UPC++ is a C++ library supporting Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided Remote Memory Access (RMA) and Remote Procedure Calls (RPC), along with future/promise-based asynchrony to express dependencies between asynchronous computations and data movement. UPC++ supports simple, regular data structures as well as more elaborate distributed structures where communication is fine-grained, irregular, or both. UPC++'s support for aggressive asynchrony enables the application to overlap communication to reduce communication wait times, and the GASNet communication layer provides efficient low-overhead RMA/RPC on HPC networks.

This tutorial introduces basic concepts and advanced optimization techniques of UPC++. We discuss the UPC++ memory and execution models and examine basic algorithm implementations. Participants gain hands-on experience incorporating UPC++ features into several application examples. We also examine two irregular applications (metagenomic assembler and multifrontal sparse solver) and describe how they leverage UPC++ features to optimize communication performance.
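Irregular distributed structures of the kind mentioned above are commonly assembled from upcxx::dist_object, where each rank publishes a local piece and any rank can fetch another's asynchronously. A minimal sketch based on the public API (not taken from the tutorial or the applications named):

#include <upcxx/upcxx.hpp>
#include <cstdio>

int main() {
  upcxx::init();
  {
    // Each rank contributes its own piece of the distributed object.
    upcxx::dist_object<int> local_count(10 * upcxx::rank_me());

    // Fetch the piece owned by the next rank; completes asynchronously.
    int peer = (upcxx::rank_me() + 1) % upcxx::rank_n();
    int remote = local_count.fetch(peer).wait();
    std::printf("rank %d sees %d on rank %d\n", upcxx::rank_me(), remote, peer);

    upcxx::barrier();  // keep local_count alive until all fetches have completed
  }
  upcxx::finalize();
}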


Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications (ALCF'20), Argonne Leadership Computing Facility (ALCF) Webinar Series, May 27, 2020,

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided RMA communication and Remote Procedure Calls (RPC), along with futures and promises. These constructs enable the programmer to express dependencies between asynchronous computations and data movement. UPC++ supports the implementation of simple, regular data structures as well as more elaborate distributed data structures where communication is fine-grained, irregular, or both. The library’s support for asynchrony enables the application to aggressively overlap and schedule communication and computation to reduce wait times.

UPC++ is highly portable and runs on platforms from laptops to supercomputers, with native implementations for HPC interconnects. As a C++ library, it interoperates smoothly with existing numerical libraries and on-node programming models (e.g., OpenMP, CUDA).

In this webinar, hosted by DOE’s Exascale Computing Project and the ALCF, we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through basic algorithm implementations. We will also look at irregular applications and show how they can take advantage of UPC++ features to optimize their performance.
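The aggressive overlap and scheduling described above is usually expressed by chaining continuations onto communication futures, so that dependent computation fires as soon as its data arrives. A fragment based on the public UPC++ API (remote is a placeholder assumed to point at valid remote memory):

#include <upcxx/upcxx.hpp>

// Start a one-sided get and attach the dependent computation as a callback;
// the caller can keep working and wait on the returned future only when the
// scaled value is finally needed.
upcxx::future<double> scaled_fetch(upcxx::global_ptr<double> remote) {
  return upcxx::rget(remote)                     // future<double>
      .then([](double x) { return 2.0 * x; });   // runs when the get completes
}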

ALCF'20 Event page

ALCF'20 Video recording

Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, UPC++: A PGAS/RPC Library for Asynchronous Exascale Communication in C++ (ECP'20), Tutorial at Exascale Computing Project (ECP) Annual Meeting 2020, February 6, 2020,

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided RMA communication and Remote Procedure Calls (RPC), along with futures and promises. These constructs enable the programmer to express dependencies between asynchronous computations and data movement. UPC++ supports the implementation of simple, regular data structures as well as more elaborate distributed data structures where communication is fine-grained, irregular, or both. The library’s support for asynchrony enables the application to aggressively overlap and schedule communication and computation to reduce wait times.

UPC++ is highly portable and runs on platforms from laptops to supercomputers, with native implementations for HPC interconnects. As a C++ library, it interoperates smoothly with existing numerical libraries and on-node programming models (e.g., OpenMP, CUDA).

In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through basic algorithm implementations. We will also look at irregular applications and show how they can take advantage of UPC++ features to optimize their performance.

ECP'20 Event page

Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Kathy Yelick, UPC++ Tutorial (NERSC Dec 2019), National Energy Research Scientific Computing Center (NERSC), December 16, 2019,

This event was a repeat of the tutorial delivered on November 1, with the restoration of the hands-on component, which had been omitted due to uncertainty surrounding the power outage at NERSC.

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. UPC++ provides mechanisms for low-overhead one-sided communication, moving computation to data through remote-procedure calls, and expressing dependencies between asynchronous computations and data movement. It is particularly well-suited for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces are designed to be composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds.

In this tutorial we introduced basic concepts and advanced optimization techniques of UPC++. We discussed the UPC++ memory and execution models and walked through implementing basic algorithms in UPC++. We also discussed irregular applications and how to take advantage of UPC++ features to optimize their performance. The tutorial included hands-on exercises with basic UPC++ constructs. Registrants were given access to run their UPC++ exercises on NERSC’s Cori (at the time the #14 fastest computer in the world).

NERSC Dec 2019 Event page


Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick, UPC++ Tutorial (NERSC Nov 2019), National Energy Research Scientific Computing Center (NERSC), November 1, 2019,

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. UPC++ provides mechanisms for low-overhead one-sided communication, moving computation to data through remote-procedure calls, and expressing dependencies between asynchronous computations and data movement. It is particularly well-suited for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces are designed to be composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds.

In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through implementing basic algorithms in UPC++. We will also look at irregular applications and how to take advantage of UPC++ features to optimize their performance.

NERSC Nov 2019 Event Page


J. Demmel, K. Yelick, M. Anderson, G. Ballard, E. Carson, I. Dumitriu, L. Grigori, M. Hoemmen, O. Holtz, K. Keutzer, N. Knight, J. Langou, M. Mohiyuddin, O. Schwartz, E. Solomonik, S. Williams, Hua Xiang, Rethinking Algorithms for Future Architectures: Communication-Avoiding Algorithms, Hot Chips 23, 2011,

Yili Zheng, Filip Blagojevic, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Costin Iancu, Seung-Jai Min, Katherine Yelick, Getting Multicore Performance with UPC, SIAM Conference on Parallel Processing for Scientific Computing, February 2010,

Rajesh Nishtala, Yili Zheng, Paul H. Hargrove, Katherine Yelick, UPC at Scale, SIAM Conference on Parallel Processing for Scientific Computing, February 25, 2010,

Yili Zheng, Costin Iancu, Paul H. Hargrove, Seung-Jai Min, Katherine Yelick, Extending Unified Parallel C for GPU Computing, SIAM Conference on Parallel Processing for Scientific Computing, February 24, 2010,

Kamesh Madduri, Samuel Williams, Stephane Ethier, Leonid Oliker, John Shalf, Erich Strohmaier, Katherine A. Yelick, Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), 2009,

S. Williams, et al., PERI: Auto-tuning Memory Intensive Kernels for Multicore, SciDAC PI Meeting, 2008,

Dan Bonachea, Rajesh Nishtala, Paul Hargrove, Katherine Yelick, Efficient Point-to-point Synchronization in UPC, 2nd Conf. on Partitioned Global Address Space Programming Models (PGAS06), October 4, 2006,

C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, D. Patterson, K. Yelick, Vector IRAM: A media-oriented vector processor with embedded DRAM, Hot Chips 12, 2000,

Reports

UPC Consortium, "UPC Language and Library Specifications, Version 1.3", Lawrence Berkeley National Laboratory Technical Report, November 16, 2013, LBNL 6623E, doi: 10.2172/1134233

UPC is an explicitly parallel extension to the ISO C 99 Standard. UPC follows the partitioned global address space programming model. This document is the formal specification for the UPC language and library syntax and semantics, and supersedes prior specification version 1.2 (LBNL-59208).

K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley", EECS Technical Report, December 2006,

Dan Bonachea, Paul Hilfinger, Kaushik Datta, David Gay, Susan Graham, Amir Kamil, Ben Liblit, Geoff Pike, Jimmy Su, Katherine Yelick, "Titanium Language Reference Manual, Version 2.20", University of California, Berkeley Tech Report (UCB/EECS-2005-15.1), August 3, 2006, doi: 10.25344/S4H59R

The Titanium language is a Java dialect for high-performance parallel scientific computing. Titanium’s differences from Java include multi-dimensional arrays, an explicitly parallel SPMD model of computation with a global address space, a form of value class, and zone-based memory management. This reference manual describes the differences between Titanium and Java.

W. Kramer, J. Carter, D. Skinner, L. Oliker, P. Husbands, P. Hargrove, J. Shalf, O. Marques, E. Ng, A. Drummond, K. Yelick, "Software Roadmap to Plug and Play Petaflop/s", 2006,

Paul N. Hilfinger, Dan Bonachea, Kaushik Datta, David Gay, Susan L. Graham, Benjamin Robert Liblit, Geoffrey Pike, Jimmy Zhigang Su, Katherine A. Yelick, "Titanium Language Reference Manual, Version 2.19", University of California, Berkeley Tech Report (UCB/EECS-2005-15), November 17, 2005, doi: 10.25344/S47305

The Titanium language is a Java dialect for high-performance parallel scientific computing. Titanium’s differences from Java include multi-dimensional arrays, an explicitly parallel SPMD model of computation with a global address space, a form of value class, and zone-based memory management. This reference manual describes the differences between Titanium and Java.

S. Williams, J. Shalf, L. Oliker, P. Husbands, K. Yelick, "Dense and Sparse Matrix Operations on the Cell Processor", LBNL Technical Report, 2005,

UPC Consortium, "UPC Language Specifications, v1.2", Lawrence Berkeley National Laboratory Technical Report, May 31, 2005, LBNL 59208, doi: 10.2172/862127

UPC is an explicitly parallel extension to the ISO C 99 Standard. UPC follows the partitioned global address space programming model. This document is the formal specification for the UPC language syntax and semantics.

Katherine Yelick, Dan Bonachea, Charles Wallace, "A Proposal for a UPC Memory Consistency Model, v1.0", Lawrence Berkeley National Laboratory Technical Report, May 5, 2004, LBNL 54983, doi: 10.2172/823757

The memory consistency model in a language defines the order in which the results of write operations may be observed through read operations. The behavior of a UPC program may depend on the timing of accesses to shared variables, so a program defines a set of possible executions, rather than a single execution. The memory consistency model constrains the set of possible executions for a given program; the user may then rely on properties that are true of all of those executions. The memory consistency model is defined in terms of the read and write operations issued by each thread in a naive translation of the code, i.e., without any code transformations by the compiler, with each thread issuing operations as defined by the abstract machine specified in ISO C 5.1.2.3. A UPC compiler or runtime system may perform various code transformations to improve performance, so long as they are not visible to the programmer, i.e., provided the set of externally visible behaviors (the input/output dynamics and volatile behavior defined in ISO C 5.1.2.3) from any execution of the transformed program is identical to those of the original program executing on the abstract machine and adhering to the consistency model defined in this document.

Paul N. Hilfinger, Dan Bonachea, David Gay, Susan L. Graham, Benjamin Liblit, Geoffrey Pike, Katherine A. Yelick, "Titanium Language Reference Manual, Version 1.5", University of California, Berkeley, Technical Report No. UCB/CSD-01-1163, November 9, 2001, doi: 10.25344/S4388P

The Titanium language is a Java dialect for high-performance parallel scientific computing. Titanium’s differences from Java include multi-dimensional arrays, an explicitly parallel SPMD model of computation with a global address space, a form of value class, and zone-based memory management. This reference manual describes the differences between Titanium and Java.

Web Articles

"Accelerating Time-to-Solution for Computational Science and Engineering", J. Demmel, J. Dongarra, A. Fox, S. Williams, V. Volkov, K. Yelick, SciDAC Review, Number 15, December 2009,

"Code controls communication to boost computer performance", Katherine Yelick, Paul Hargrove, Lawrence Berkeley National Laboratory CS Area Communications, August 27, 2009,

The Berkeley UPC compiler and the underlying GASNet communication system are both developed at Berkeley Lab and UC Berkeley, and together they provide computational scientists with a portable HPC programming model focused on high-performance one-sided communication.

Posters

Dan Bonachea, Rajesh Nishtala, Paul Hargrove, Mike Welcome, Kathy Yelick, "Optimized Collectives for PGAS Languages with One-Sided Communication", ACM/IEEE Conference on Supercomputing (SC'06) Poster Session, November 2006, doi: 10.1145/1188455.1188604

Optimized collective operations are a crucial performance factor for many scientific applications. This work investigates the design and optimization of collectives in the context of Partitioned Global Address Space (PGAS) languages such as Unified Parallel C (UPC). Languages with one-sided communication permit a more flexible and expressive collective interface with application code, in turn enabling more aggressive optimization and more effective utilization of system resources. We investigate the design tradeoffs in a collectives implementation for UPC, ranging from resource management to synchronization mechanisms and target-dependent selection of optimal communication patterns. Our collectives are implemented in the Berkeley UPC compiler using the GASNet communication system, tuned across a wide variety of supercomputing platforms, and benchmarked against MPI collectives. Special emphasis is placed on the newly added Cray XT3 backend for UPC, whose characteristics are benchmarked in detail.
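For readers more familiar with the newer UPC++ interface (which runs over the same GASNet layer), the analogous one-sided-friendly collective usage looks like the fragment below; the poster itself concerns the UPC collectives library, whose calls differ.

#include <upcxx/upcxx.hpp>
#include <cstdio>

int main() {
  upcxx::init();
  double local = 1.5 * upcxx::rank_me();   // per-rank contribution

  // Asynchronous reduction to rank 0; other work could overlap it here.
  upcxx::future<double> f = upcxx::reduce_one(local, upcxx::op_fast_add, 0);

  double sum = f.wait();                   // value is meaningful only at the root
  if (upcxx::rank_me() == 0) std::printf("sum = %f\n", sum);

  upcxx::barrier();
  upcxx::finalize();
}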

Dan O Bonachea, Christian Bell, Rajesh Nishtala, Kaushik Datta, Parry Husbands, Paul Hargrove, Katherine Yelick, "The Performance and Productivity Benefits of Global Address Space Languages", ACM/IEEE Conference on Supercomputing (SC'05) Poster Session, November 2005,

Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome, Kathy Yelick, "GASNet 2 - An Alternative High-Performance Communication Interface", ACM/IEEE Conference on Supercomputing (SC'04) Poster Session, November 2004,

Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick, "GASNet: Project Overview (SC'03)", ACM/IEEE Conference on Supercomputing (SC'03) Poster Session, November 2003,

Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick, "GASNet: Project Overview (SC'02)", ACM/IEEE Conference on Supercomputing (SC'02) Poster Session, November 2002,

Others

Ed Younis, Koushik Sen, Katherine Yelick, Costin Iancu, QFAST: Quantum Synthesis Using a Hierarchical Continuous Circuit Space, Bulletin of the American Physical Society, March 2021,

We present QFAST, a quantum synthesis tool designed to produce short circuits and to scale well in practice. Our contributions are: 1) a novel representation of circuits able to encode placement and topology; 2) a hierarchical approach with an iterative refinement formulation that combines "coarse-grained" fast optimization during circuit structure search with a good, but slower, optimization stage only in the final circuit instantiation. When compared against state-of-the-art techniques, although not always optimal, QFAST can reduce circuits for "time-dependent evolution" algorithms, as used by domain scientists, by 60x in depth. On typical circuits, it provides 4x better depth reduction than the widely used Qiskit and UniversalQ compilers. We also show the composability and tunability of our formulation in terms of circuit depth and running time. For example, we show how to generate shorter circuits by plugging in the best available third party synthesis algorithm at a given hierarchy level. Composability enables portability across chip architectures, which is missing from similar approaches.
QFAST is integrated with Qiskit and available at github.com/bqskit.

Alfredo Buttari, Jack Dongarra, Parry Husbands, Jakub Kurzak, Katherine Yelick, Multithreading for synchronization tolerance in matrix factorization, Journal of Physics: Conference Series, 2007, doi: 10.1088/1742-6596/78/1/012028

Physical constraints such as power, leakage and pin bandwidth are currently driving the HPC industry to produce systems with unprecedented levels of concurrency. In these parallel systems, synchronization and memory operations are becoming considerably more expensive than before. In this work we study parallel matrix factorization codes and conclude that they need to be re-engineered to avoid unnecessary (and expensive) synchronization. We propose the use of multithreading combined with intelligent schedulers and implement representative algorithms in this style. Our results indicate that this strategy can significantly outperform traditional codes.
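The restructuring advocated here can be sketched with ordinary C++ threads. The toy fragment below is not the paper's code; panel() and update() are hypothetical stand-ins for LAPACK-style kernels. It runs the trailing updates of each step concurrently instead of in a lock-step sweep; a real dynamic scheduler goes further, using look-ahead to start the next panel before all updates of the current step have finished.

#include <future>
#include <vector>

static void panel(int k) { /* factor panel k (placeholder) */ }
static void update(int k, int j) { /* apply panel k to tile j (placeholder) */ }

void factor(int ntiles) {
  for (int k = 0; k < ntiles; ++k) {
    panel(k);                                         // critical-path task
    std::vector<std::future<void>> updates;
    for (int j = k + 1; j < ntiles; ++j)              // independent update tasks
      updates.push_back(std::async(std::launch::async, update, k, j));
    for (std::future<void> &u : updates) u.get();     // toy version still joins here;
                                                      // a scheduler would only wait
                                                      // for the tile the next panel needs
  }
}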