1. Performance scaling and energy efficiency for scientific and big-data computing


The past decades have witnessed a tremendous increase in supercomputing performance. With exascale computing under way, harnessing the complex resources of large-scale platforms has become increasingly difficult. Moreover, data movement has become a key performance bottleneck in modern applications, owing to their irregular memory access patterns over high-dimensional sparse data representations. Besides the loss of computational power, this also wastes an enormous amount of energy. The situation is further exacerbated by scientific applications and simulations that continue to create massive amounts of data, and by the widening gap between memory and I/O latency on one side and computational performance on the other. It is therefore critical to develop better solutions to meet the big-data challenge on the road to exascale computing.

We investigate performance and power issues as we scale scientific applications on emerging architectures. We study scheduling and energy reduction techniques for the complex resources on these architectures. We are interested in the management of data movement, as well as in profiling, scaling, and evaluating memory- and/or I/O-bound computations by consciously exploiting the memory hierarchy with memory/cache partitioning and NUMA-aware techniques. We seek scalable algorithms for the efficient processing of large graphs (e.g., partitioning, clustering) and sparse linear systems (e.g., SpMV, sparse triangular solvers).

Recent publications:

An embedded sectioning scheme for multiprocessor topology-aware mapping of irregular applications.

Shad Kirmani, JeongHyung Park, Padma Raghavan.

International Journal of High Performance Computing Applications, 31(1):91-103, 2017.

Co-Scheduling Algorithms for Cache-Partitioned Systems.

Guillaume Aupy, Anne Benoit, Loïc Pottier, Padma Raghavan, Yves Robert, Manu Shantharam.

19th Workshop on Advances in Parallel and Distributed Computational Models, 874-883, 2017.

Periodic I/O scheduling for super-computers.

Guillaume Aupy, Ana Gainaru, Valentin Le Fèvre.

8th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, 2017.

Co-scheduling algorithms for high-throughput workload execution.

Guillaume Aupy, Manu Shantharam, Anne Benoit, Yves Robert, Padma Raghavan.

Journal of Scheduling, 19(6):627-640, 2016.

Phase Detection with Hidden Markov Models for DVFS on Many-Core Processors.

Joshua Dennis Booth, Jagadish Kotra, Hui Zhao, Mahmut T. Kandemir, Padma Raghavan.

35th IEEE International Conference on Distributed Computing Systems (ICDCS’15), 185-195, 2015.

STS-k: a multilevel sparse triangular solution scheme for NUMA multicores.

Humayun Kabir, Joshua Dennis Booth, Guillaume Aupy, Anne Benoit, Yves Robert, Padma Raghavan.

International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15), 55:1-55:11, 2015.

Scalable parallel graph partitioning.

Shad Kirmani, Padma Raghavan.

International Conference for High Performance Computing, Networking, Storage and Analysis (SC’13), 51:1-51:10, 2013.

2. Fault tolerance and soft error resilience for HPC systems and applications


A large percentage of the computing capacity of today’s large high-performance computing systems is wasted due to failures and recoveries. Even if each node has an individual MTBF (Mean Time Between Failures) of one century, a machine with 100,000 such nodes will encounter a failure every 9 hours on average, which is shorter than the execution time of many HPC applications. In addition to classical fail-stop errors, which simply crash the application, silent errors must be taken into account. In contrast to fail-stop errors, silent errors are not detected immediately; they surface only after some arbitrary detection latency, and they can cause applications to run slower, crash, or produce an incorrect result. Moreover, not all failures impact applications in the same way. Some are transient, others are unrecoverable. Some cause a fatal interruption of the application as soon as they happen, while others may degrade the performance of different components or silently corrupt the data.
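
The 9-hour figure above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes that nodes fail independently and all have the same MTBF, so that the platform MTBF is the node MTBF divided by the number of nodes. The function name `platform_mtbf_hours` and the constant `HOURS_PER_YEAR` are hypothetical names chosen for this example.

```python
# Assumption: one year is counted as 365.25 days.
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours


def platform_mtbf_hours(node_mtbf_years: float, num_nodes: int) -> float:
    """Return the MTBF (in hours) of a machine made of identical nodes.

    Assumption: node failures are independent and every node has the same
    MTBF, so failure rates add up and the platform MTBF is the node MTBF
    divided by the number of nodes.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_years * HOURS_PER_YEAR / num_nodes


if __name__ == "__main__":
    # One century per node, 100,000 nodes, as in the text.
    mtbf = platform_mtbf_hours(node_mtbf_years=100, num_nodes=100_000)
    print(f"Platform MTBF: {mtbf:.2f} hours")  # prints "Platform MTBF: 8.77 hours"
```

Under these assumptions the machine fails roughly every 8.8 hours, which rounds to the 9 hours quoted above.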

We investigate performance variability issues and resiliency properties of scientific applications and HPC systems. We study and design new fault tolerance methods at every level of the software stack, ranging from preventive methods based on hardware counters to optimized checkpointing strategies and ABFT (algorithm-based fault tolerance) methods for different scientific applications. We are interested in analyzing the resilience of different applications to fail-stop and silent errors, as well as in adapting fault tolerance schemes to the failure characteristics of current HPC systems and to applications' memory, network, and computational patterns.

Recent publications:

Multi-Level Checkpointing and Silent Error Detection for Linear Workflows.

Anne Benoit, Aurélien Cavelan, Yves Robert, Hongyang Sun.

Journal of Computational Science (To appear), 2017.

Towards Optimal Multi-Level Checkpointing.

Anne Benoit, Aurélien Cavelan, Valentin Le Fèvre, Yves Robert, Hongyang Sun.

IEEE Transactions on Computers, 66(7):1212-1226, 2017.

Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale.

Anne Benoit, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, Hongyang Sun.

Fault Tolerance for HPC at eXtreme Scale, 31-38, 2017.

Reducing Waste in Large Scale Systems Through Introspective Analysis.

Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, Marc Snir.

International Parallel and Distributed Processing Symposium, 2016.

Coping with Recall and Precision of Soft Error Detectors.

Leonardo Bautista-Gomez, Anne Benoit, Aurélien Cavelan, Saurabh K. Raina, Yves Robert, Hongyang Sun.

Journal of Parallel and Distributed Computing, 98:8-24, 2016.

Failure prediction for HPC systems and applications: current situation and open issues.

Ana Gainaru, Franck Cappello, Marc Snir, William Kramer.

The International Journal of High Performance Computing, 27(3):272-281.

Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing.

Mohamed Slim Bouguerra, Ana Gainaru, Franck Cappello, Leonardo Bautista Gomez, Naoya Maruyama, Satoshi Matsuoka.

International Parallel and Distributed Processing Symposium, 2013.

3. Towards better understanding of human brains with computational neuroscience


Neuroscience has traditionally been hypothesis-driven and relatively small in scale. As with many scientific and engineering disciplines, its advancement is gradually becoming data-driven. As it enters the big-data era, the neuroscience community expects that the rich brain data available today, and still being created, will lead to fundamental breakthroughs in human brain research. However, this will not happen without novel tools that allow today's computing systems to process these massive amounts of data effectively and efficiently.

In this recent endeavor, undertaken by the group in partnership with the Department of Psychology at Vanderbilt University and the Vanderbilt Brain Institute, we plan to design new metrics, algorithms, and software tools for high-performance computing (HPC) systems to efficiently process large neuroscience datasets, e.g., functional magnetic resonance imaging (fMRI) data used to detect the evolution of human brain networks and connectivity patterns. The results of this research will help neuroscientists advance our understanding of the human brain. The lessons learned from developing these metrics, algorithms, and software tools will also benefit other application domains and the design of next-generation HPC systems.