TSUBAME2 System Architecture

Overview

TSUBAME2 is a production supercomputer operated by the Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology, in cooperation with our industrial partners, including NEC, HP, NVIDIA, Microsoft, and Voltaire, among others. Since Fall 2010 it has been one of the fastest and greenest supercomputers in the world, boasting 2.4 PFlops of peak performance through aggressive GPU acceleration, which allows scientists to enjoy significantly faster and larger-scale computing than ever. This is the second instantiation of our TSUBAME-series supercomputers, the first being, as you might guess, TSUBAME1. TSUBAME1 also employed various cutting-edge HPC acceleration technologies, such as ClearSpeed and NVIDIA GPUs, from which we learned many important technical lessons that eventually played a crucial role in designing and constructing our latest supercomputer. Compared to its predecessor, TSUBAME2 achieves a 30x performance boost while keeping its power consumption nearly the same as before, by inheriting and further enhancing these successful architectural designs.

Extended Usage of GPU Accelerators

TSUBAME2's 1408 compute nodes are all equipped with three NVIDIA Tesla M2050 GPU accelerators, each of which consists of 448 small, power-efficient processing cores with 3 GB of high-bandwidth GDDR5 memory. Most of TSUBAME2's 2.4 PFlops of performance comes from its 4224 GPUs: 512 GFlops per GPU and 2.2 PFlops in total. While exploiting the higher performance of GPUs has historically required significant effort, and still does to some extent, recent technical advancements as well as adoption by major HPC libraries and applications (e.g., Amber, ABAQUS, Mathematica) have made it much simpler and easier.
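
Programming these GPUs uses the standard CUDA toolchain rather than anything TSUBAME2-specific. The following is a minimal sketch, with a made-up file name and kernel chosen purely for illustration, that enumerates the GPUs visible on a node and runs a trivial kernel on each:

    // gpu_enum.cu -- a minimal, illustrative sketch (not TSUBAME2-specific
    // code): enumerate the GPUs visible on a node and run a trivial kernel
    // on each one.  Compile with: nvcc gpu_enum.cu -o gpu_enum
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;                  // one element per GPU thread
    }

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);             // a TSUBAME2 node should report 3
        for (int d = 0; d < ndev; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("GPU %d: %s, %d multiprocessors, %zu MB\n",
                   d, prop.name, prop.multiProcessorCount,
                   prop.totalGlobalMem >> 20);

            cudaSetDevice(d);                  // direct later calls to this GPU
            const int n = 1 << 20;
            float *x;
            cudaMalloc((void **)&x, n * sizeof(float));
            scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
            cudaDeviceSynchronize();
            cudaFree(x);
        }
        return 0;
    }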

Much Improved Intra- and Inter-Node Bandwidths

One of the limitations of the previous machine was its cost of data movement. This was especially highlighted by the adoption of the NVIDIA Tesla S1070 GPU systems, which were the latest architecture at the time, whereas the rest of the machine had been designed and constructed two years earlier, before the GPU adoption. In TSUBAME2, we devoted a significant amount of system resources to performance-critical data paths: the peak CPU memory bandwidth is now 32 GB/s per CPU and 64 GB/s per node; the GPU memory, while limited in capacity compared to CPU memory, features 150 GB/s of bandwidth; and CPUs and GPUs are connected by 8 GB/s PCI Express links. All compute nodes are interconnected by low-latency, high-bandwidth, full-bisection InfiniBand networks, providing each node with 10 GB/s of inter-node bandwidth.
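
Because so much of the engineering effort went into these data paths, a useful first experiment on any node is to measure the bandwidth a program actually obtains across the PCI Express link. The fragment below is an illustrative sketch only (the transfer size is arbitrary); it times a pinned host-to-device copy with CUDA events:

    // pcie_bw.cu -- illustrative sketch: estimate the host-to-device
    // bandwidth over PCI Express using pinned memory and CUDA events.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256UL << 20;              // 256 MB transfer
        float *h, *d;
        cudaMallocHost((void **)&h, bytes);            // pinned host buffer
        cudaMalloc((void **)&d, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);        // elapsed milliseconds
        printf("host->device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }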

Peta-Scale High-Bandwidth Shared Storage

TSUBAME2 also excels in its support for large-scale data-intensive computing for modern-day simulation sciences. All nodes share two file systems: NFS for user home directories, and Lustre for higher-capacity, higher-performance file I/O in large-scale parallel applications. The 7 PB storage system is also carefully designed so that no single point of failure exists, and is further complemented by a 4 PB tape storage system located offsite.
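
As a rough illustration of the kind of parallel file I/O the Lustre file system is intended for, the sketch below uses plain MPI-IO (a generic interface, not a TSUBAME2-specific one) to have every rank write its own block of a single shared file; the output path is a hypothetical placeholder, not an actual TSUBAME2 mount point:

    /* lustre_write sketch -- illustrative only.  Each MPI rank writes its
     * own contiguous block of one shared file; on a Lustre scratch area
     * (path below is a placeholder) the blocks are striped across many
     * storage servers, so aggregate bandwidth scales with the rank count. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                       /* 1M doubles per rank */
        double *buf = calloc(n, sizeof(double));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/path/to/lustre/scratch/out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Offset off = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at(fh, off, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }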

Ultra-Fast Local Storage (SSD)

In addition to the shared storage, each node has 120 GB of fast SSD-backed local storage (some nodes have 240 GB). This, along with the large-scale shared storage, makes it possible to perform large-scale data-intensive computing in a much more power- and performance-efficient way than on conventional systems.
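
One common pattern this enables is keeping frequently reread intermediate files on the node-local SSD instead of the shared file systems. The sketch below is hypothetical: it assumes the batch system exports a per-job local scratch directory via TMPDIR and falls back to /tmp otherwise; consult the user guides for the actual local storage paths:

    /* local_scratch sketch -- illustrative only.  Scratch files that are
     * written and reread many times are kept on the node-local SSD rather
     * than on the shared file systems. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const char *local = getenv("TMPDIR");      /* per-job local scratch, if set */
        char path[4096];
        snprintf(path, sizeof(path), "%s/intermediate.dat",
                 local ? local : "/tmp");          /* fall back to /tmp */

        FILE *f = fopen(path, "wb");
        if (!f) { perror("fopen"); return 1; }

        double buf[1024] = {0};
        for (int step = 0; step < 100; ++step)     /* repeated scratch writes */
            fwrite(buf, sizeof(double), 1024, f);
        fclose(f);
        return 0;
    }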

All of the above hardware improvements are organized so that the software interfaces are kept mostly unchanged from TSUBAME1. As such, the majority of applications written for TSUBAME1 should need few changes to run on TSUBAME2, and often just recompilation is enough. For example, the new GPUs can be programmed with CUDA as on TSUBAME1, so existing CUDA programs should simply run faster thanks to the 7x improvement in raw performance as well as various architectural improvements.

While the basic software interfaces remain unchanged, some system configurations on TSUBAME1 were not as flexible and user-friendly as they should have been. TSUBAME2 significantly improves usability through much more flexible ways of sharing resources, implemented with redesigned batch queue configurations, virtualization technologies, and newly added support for the Windows operating system.

Further Information

Further details can be found at the following pages.