SPNS2013 International Workshop on Software for Peta-Scale Numerical Simulation

Abstract

December 5 (Thu)
 09:20 - 09:30 Opening Talk

Akihiro Ida (ACCMS, Kyoto University)
ppOpen-HPC and Numerical Libraries in Post-Peta/Exa-Scale Era (Chair: Takahiro Katagiri)
 09:30 - 10:15 ppOpen-HPC and Beyond

Kengo Nakajima (ITC, The University of Tokyo)

Abstract:
Recently, high-end parallel computer systems have become larger and more complex. Yet it is very difficult for scientists and engineers to develop efficient application code that can exploit the potential performance of these systems. We propose an open-source infrastructure for the development and execution of optimized and reliable simulation code on large-scale parallel computers. We have named this infrastructure "ppOpen-HPC" (http://ppopenhpc.cc.u-tokyo.ac.jp/), where "pp" stands for "post-peta". The target system is the Post T2K system based on many-core architectures, which will be installed in FY 2015. ppOpen-HPC is part of a five-year project (FY 2011-2015) under the JST-CREST program "Development of System Software Technologies for Post-Peta Scale High Performance Computing". We released the first version of the developed software last year, and the second version will be available in November 2013. This talk overviews the achievements of the project since April 2011 and provides future prospects for ppOpen-HPC on exa-scale supercomputer systems.
 10:15 - 11:00 Migrating the Numerical Software Infrastructure towards Extreme-scale Computing

Osni Marques (LBNL)

Abstract:
The impact of the architectural features of upcoming extreme-scale supercomputers is expected to be felt across the whole software stack. The sheer levels of parallelism available on such computers, together with more levels of memory, more concurrency with less memory per core, heterogeneous cores, and an anticipated increase in the number of hard and soft failures, will require not only a redesign of the software stack but also new approaches to the interplay between applications and that software stack. This presentation will summarize (some) ideas and efforts that aim at producing numerical software that is responsive to extreme-scale computing, e.g. by hiding communication latency, incorporating communication-reducing techniques, and providing resilience mechanisms. In particular, I will summarize ideas for a hybrid linear solver, for broad classes of PDE systems, which combines algebraic multigrid (AMG) with a structured sparse factorization method based on low-rank structures stemming from hierarchically semi-separable (HSS) matrices. This approach offers opportunities for reducing the communication/flops ratio with overall increased performance. I will also comment on the mechanism that has been adopted for recovering from failures.
Large-Scale Simulations on Post-Peta/Exa-Scale Systems (I) (Chair: Hiroshi Okuda)
11:20 - 12:05  Large-Scale Parallel FDM Simulation of Strong Ground Motion and Tsunami for Recent and Future Earthquakes

Takashi Furumura (ERI, The University of Tokyo)

Abstract:
The strong impact of the great 2011 off Tohoku, Japan, earthquake (Mw 9.0) was reproduced by a large-scale FDM simulation. We developed a parallel FDM code for ground motion and tsunami suitable for massively parallel computing on the K computer, together with a high-resolution earthquake source rupture model and a subsurface structure model. The visualized seismic wavefield derived from the simulation illustrates the significance of this earthquake and the process by which large, long-lasting ground motion develops in populated cities and a destructive tsunami is amplified along the coast of Tohoku. The simulation is verified by comparison with actual observations from the nation-wide seismic network and the offshore ocean-bottom tsunami network, both recently deployed across Japan, demonstrating its effectiveness. Thanks to steadily growing computer power and parallel computing techniques, we believe that reliable disaster information, rather than unreliable earthquake prediction, will become increasingly effective for the mitigation of earthquake-related disasters. Toward this goal, the development of an integrated earthquake disaster prevention simulation is highly anticipated, linking a sequence of simulations such as earthquake nucleation, seismic wave and tsunami propagation, and building oscillation interacting with soil, with the aid of new HPCI technology.
12:05 - 12:50  Acceleration of Physics-based Seismic Hazard Analysis on Hybrid Many-core Architectures

Yifeng Cui (San Diego Supercomputer Center)

Abstract:
CyberShake is a computational platform developed by the Southern California Earthquake Center (SCEC) that explicitly incorporates earthquake rupture time histories and deterministic wave propagation effects into seismic hazard calculations through the use of 3D waveform simulations. The first physics-based probabilistic seismic hazard analysis (PSHA) models of the Los Angeles region were created in 2009 using this platform from suites of simulations comprising ~10^8 seismograms. The current models are, however, limited to low seismic frequencies (<= 0.5 Hz). Our goals are to raise this limit above 1 Hz and to provide a California-wide model that is more than 100 times larger in computational size than the current Los Angeles models. This talk presents the technical efforts in accelerating key strain tensor calculations critical to probabilistic seismic hazard analysis: algorithm-level communication reductions, data locality, scalable I/O framework development, and CPU-GPU co-scheduling for post-processing. These performance improvements allowed us to efficiently use hybrid many-core systems including OLCF Titan, NCSA Blue Waters, XSEDE Keeneland, and XSEDE Stampede, making a statewide hazard model a goal reachable with existing supercomputers. The presentation will conclude with a discussion of how the seismology community can prepare for the challenges of exascale computing.
Large-Scale Simulations on Post-Peta/Exa-Scale Systems (II) (Chair: Kengo Nakajima)
 14:00 - 14:45 A Super High-Resolution Global Atmospheric Simulation by the Nonhydrostatic Icosahedral Atmospheric Model using the K-computer

Masaki Satoh (AORI, The University of Tokyo)

Abstract:
We review recent numerical experiments with the Nonhydrostatic Icosahedral Atmospheric Model (NICAM) on the K computer. NICAM is known as the first global nonhydrostatic model aiming at explicitly simulating deep convective motions over the globe (Satoh et al. 2008). The advantage of a high-resolution global nonhydrostatic model is shown particularly clearly in the multi-scale structure of tropical cloud systems, by representing the hierarchy of meso-scale convective systems, cloud clusters, super-cloud clusters, and the MJO. Realistic simulations of tropical cloud systems, such as tropical cyclones that originate from cloud clusters in the tropics, are increasingly required both for numerical weather prediction and for future climate projection. As a recent result, the NICAM research members performed, for the first time, a sub-kilometer-mesh global simulation on the K computer (Miyamoto et al. 2013). In this simulation, the entire global domain is covered by a mesh whose interval is about 870 m. Through a series of grid-refinement resolution tests, we found that an essential change in the convection statistics occurs around a 2 km grid spacing. The convection structure, the number of convective cells, and the distance to the nearest convective cell change dramatically at this resolution. The convection core was resolved by multiple grid points in simulations with grid spacings of less than 2.0 km.
14:45 - 15:15  A Coupling Library ppOpen-MATH/MP, Its Feature and Application

Takeshi Arakawa (RIST)

Abstract:
Our group is developing a coupling tool, ppOpen-MATH/MP, for large-scale weak-coupling simulations that enables multi-component coupling and large-scale data transfer/interpolation. ppOpen-MATH/MP provides the following functions: 1) coupling of various grid systems, 2) 2D and 3D coupling, 3) parallel execution of a simulation program and a visualization program (ppOpen-VIS), and 4) optimization of process mapping. The coupling library Jcup is used as the basic layer of the ppOpen-MATH/MP software stack; ppOpen-MATH/MP is constructed by stacking upper layers with functions such as grid mapping and interpolation on top of this basic Jcup layer.
In this presentation, the features of ppOpen-MATH/MP will be explained first, followed by two application examples. The first example is climate model coupling. The target models in this study are the atmospheric general circulation model NICAM and the ocean general circulation model COCO. These models were selected because NICAM uses an icosahedral grid and COCO uses a tri-polar grid, so coupling such semi-structured (non latitude-longitude) grid models demonstrates the wide applicability of ppOpen-MATH/MP. In addition to the coupling of these two models, coupling of NICAM with an IO component will also be reported.
The second example is the coupling of a seismic model and a structure model. The seismic model is an FDM model employing a structured grid, while the structure model is an FVM model with an unstructured grid. This study is ongoing, so the current status and future plans will be briefly reported in the presentation.
 15:15 - 15:45 Application-Level Checkpoint/Restart Framework with Checkpoint Interval Optimization

Hideyuki Jitsumoto (ITC, The University of Tokyo)

Abstract:
An application-level checkpoint is frequently implemented within an application that has time stepping. However, the checkpoint interval tends to depend on the application programmer's ad-hoc decisions. Essentially, the checkpoint interval should be determined from execution-environment information such as the hardware failure rate and the checkpoint time. We propose a directive-based application-level checkpoint/restart framework that optimizes the checkpoint interval automatically. The subject of this study is an application that has time stepping and uses the SPMD model. The optimization and renewal of the checkpoint interval are done asynchronously. A prototype implementation that includes cooperation with the job submission system has been designed. On the T2K supercomputer system, we evaluated the invocation cost using FDM3D. The results indicate that the overhead of the framework is about 50 seconds with 7680 processes.
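The interval that balances checkpoint overhead against lost work can be estimated from the checkpoint cost and the system's mean time between failures. Below is a minimal sketch assuming the classic first-order formula of Young; this is only an illustrative model, not necessarily the exact optimization used by the framework, and the numbers are hypothetical.

```python
import math

def optimal_checkpoint_interval(checkpoint_time_s, mtbf_s):
    """First-order estimate (Young's formula): T_opt = sqrt(2 * C * MTBF).

    checkpoint_time_s : time to write one checkpoint (seconds)
    mtbf_s            : mean time between failures of the system (seconds)
    """
    return math.sqrt(2.0 * checkpoint_time_s * mtbf_s)

# Hypothetical numbers for illustration only:
# a 50 s checkpoint on a system with a 24-hour MTBF.
print(optimal_checkpoint_interval(50.0, 24 * 3600))  # ~2939 s, i.e. about 49 minutes
```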
H-Matrix Methods and Applications (Chair: Takeshi Iwashita)
 16:05 - 16:35 Development of the Parallel Hierarchical Matrices Library for Large-Scale SMP Cluster

Akihiro Ida (ACCMS, Kyoto University)

Abstract:
Hierarchical matrices (H-matrices) are known as a fast technique for boundary integral equations. In this presentation, we discuss a scheme for H-matrices with adaptive cross approximation (ACA) for large-scale simulation. An improved method of H-matrices with ACA and a set of parallel algorithms applicable to the method are proposed. We have been developing a parallel H-matrices library based on these algorithms, using both flat-MPI and hybrid MPI+OpenMP programming models. The performance of the library is evaluated through numerical experiments on practical simulations, such as electric field analyses and earthquake cycle simulations, computed on SMP cluster systems. Although the flat-MPI version gives better parallel scalability when constructing hierarchical matrices, its speed-up reaches a limit for the hierarchical matrix-vector multiplication (HMVM). We developed a hybrid MPI+OpenMP version to improve the parallel scalability; it shows better parallel speedup for HMVM in numerical tests using up to 256 cores.
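For reference, ACA builds a low-rank factorization of an admissible (far-field) block from a small number of its rows and columns. The sketch below is a fully pivoted variant on an explicitly stored block, purely for illustration; the library's partially pivoted ACA works on matrix entries generated on the fly and never forms the full block or residual.

```python
import numpy as np

def aca(block, tol=1e-8, max_rank=50):
    """Cross approximation with full pivoting: approximate `block` as U @ V."""
    m, n = block.shape
    R = block.astype(float)          # explicit residual (only in this sketch)
    U, V = [], []
    for _ in range(min(max_rank, m, n)):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)  # pivot entry
        pivot = R[i, j]
        if abs(pivot) < tol:
            break
        u = R[:, j].copy()
        v = R[i, :] / pivot
        U.append(u)
        V.append(v)
        R = R - np.outer(u, v)       # deflate the residual by the rank-1 term
    return np.column_stack(U), np.vstack(V)

# Usage: a block of a smooth (1/r) kernel between well-separated clusters
# is numerically low rank.
x = np.linspace(0.0, 1.0, 80)
y = np.linspace(5.0, 6.0, 60)
A = 1.0 / np.abs(x[:, None] - y[None, :])
U, V = aca(A)
print("rank", U.shape[1], "relative error",
      np.linalg.norm(A - U @ V) / np.linalg.norm(A))
```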
 16:35 - 17:05 Effect of the Earth's Surface Topography on the Earthquake Cycle

Makiko Otani (Kyoto University)

Abstract:
In quasi-dynamic earthquake simulations, we usually use the analytic slip response function, assuming a half-space. However, the actual seafloor topography at subduction zones is more complicated, and coseismic slip has sometimes reached the trench, as in the 2011 Tohoku earthquake. In such cases, the rupture areas on the faults come close to the Earth's surface, and the topography may affect the earthquake rupture and cycle.

Therefore, in this study, we developed a method for calculating slip response functions in a homogeneous elastic medium with an arbitrary topography of the Earth's free surface and examined its effect on the earthquake cycle simulated with the H-matrices method.
 17:05 - 17:50 Challenges in the Numerical Simulations of Physics of Dipolar Bose-Einstein Condensates

Aleksandra Maluckov (INS Vinca, University of Belgrade)

Abstract:
Investigations of ultracold quantum gases open the door to a wide interdisciplinary field of physics, ranging from nonlinear dynamics to strongly correlated quantum phases and quantum information processing. Nowadays, leading laboratories worldwide challenge theoretical teams with new experimental findings, and vice versa.

In our research, we model dipolar Bose-Einstein condensate (BEC) systems by equations of the generalized nonlinear Schrödinger type. Depending on the surroundings (geometry, dimensionality, complexity), these can be differential-difference or integro-differential equations (stochastic in a disordered environment). We mainly consider the ground-state and dynamical properties of different localized patterns in dipolar BECs in photonic lattices and chip environments, and in composite BECs in external electromagnetic fields, using the following numerical methods: the imaginary-time or Powell minimization method for the ground state, and the split-step spectral or Runge-Kutta method for the dynamics, respectively. The intention to include long-range interactions in multi-component BECs challenges computing efficiency, precision, and reliability. The possibility of using the software for large-scale simulations on parallel computer systems developed by colleagues at Kyoto University (Japan) will provide us with a powerful tool for studying complex ultra-cold matter systems. The first tests are in progress.
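As an illustration of one of the methods mentioned, below is a minimal split-step Fourier (spectral) sketch for the 1D cubic nonlinear Schrödinger equation. The dipolar (long-range) interaction, which adds a nonlocal convolution term usually also evaluated in Fourier space, is deliberately omitted here, so this is only a toy model of the production codes.

```python
import numpy as np

def split_step_nls(psi, x, dt, nsteps, g):
    """Strang split-step Fourier integrator for the 1D cubic NLS
    i psi_t = -(1/2) psi_xx + g |psi|^2 psi on a periodic domain."""
    dx = x[1] - x[0]
    k = 2.0 * np.pi * np.fft.fftfreq(x.size, d=dx)
    linear = np.exp(-0.5j * dt * k**2)                        # full kinetic step
    for _ in range(nsteps):
        psi = psi * np.exp(-0.5j * dt * g * np.abs(psi)**2)   # half nonlinear step
        psi = np.fft.ifft(linear * np.fft.fft(psi))           # full linear step
        psi = psi * np.exp(-0.5j * dt * g * np.abs(psi)**2)   # half nonlinear step
    return psi

# Bright soliton of the focusing equation (g = -1): |psi| should remain sech(x).
x = np.linspace(-20.0, 20.0, 512, endpoint=False)
psi0 = 1.0 / np.cosh(x)
psi = split_step_nls(psi0.astype(complex), x, dt=1e-3, nsteps=2000, g=-1.0)
print(np.max(np.abs(np.abs(psi) - np.abs(psi0))))   # small propagation error
```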

December 6 (Fri)
Manycore Systems and Applications (Chair: Takeshi Kitayama)
 09:30 - 10:15 StarPU: Leveraging Clusters of Heterogeneous Machines through Dynamic Task Scheduling

Samuel Thibault (University of Bordeaux)

Abstract:
Heterogeneous accelerator-based architectures are increasingly common in production HPC clusters, featuring multicore CPUs, GPU accelerators, and now even Xeon Phi accelerators, thus providing an unprecedented amount of processing power per node. Dealing with machines that expose such highly unbalanced computing power has become one of the biggest challenges in HPC. To fully tap the potential of these heterogeneous machines, pure offloading approaches, which consist in running an application on regular processors while offloading parts of the code to accelerators, are not sufficient.

This talk will present the latest advances in the StarPU project, which aims at providing portable, optimized performance to task-based applications on clusters of heterogeneous multicore+accelerator machines. The goal is to relieve the programmer of the technical aspects of data management and task scheduling, while applying theoretical task-scheduling algorithms to actual application executions to improve performance. StarPU also provides performance feedback through task profiling and trace analysis.
10:15 - 10:45  Fast SpMV Implementation on GPU/Manycore Architectures and Its Application

Satoshi Ohshima (ITC, The University of Tokyo)

Abstract:
Sparse matrix-vector multiplication (SpMV) is one of the most important computations in HPC because it dominates the run time of various applications. Meanwhile, current computer systems employ various parallel computing units such as multi-core CPUs, many-core processors, and GPUs. The aim of this study is to accelerate SpMV on GPU and MIC. To obtain high SpMV performance on this hardware, many computation kernels have been implemented and their performance evaluated. In this talk, the current status of these activities is shown. Moreover, as an example of the use of SpMV, the acceleration of an FEM application is also shown.
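For reference, the operation targeted by the talk is the standard CRS (CSR) SpMV kernel; the following is only an illustrative serial sketch, not the optimized GPU/MIC kernels discussed in the talk.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix stored in CRS/CSR format."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):   # nonzeros of row i
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example:  [[4, 0, 1],
#                [0, 3, 0],
#                [2, 0, 5]]
values  = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))  # [5. 3. 7.]
```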
 10:45 - 11:15 Basic Algorithm of Parallelized Particle Method on GPU and Cooperation with Continuous Model

Miki Yamamoto (JAMSTEC)

Abstract:
The computation of the motion of a large number of interacting particles, the particle method, is a fundamental approach to simulating the behavior of a system. In the particle method, particles carry the physical properties of the system being simulated, through the evolution of their trajectories and of the properties they carry. The particle method is quite effective for avoiding various difficulties associated with mesh-based methods. However, due to the mobility of the particles, it is not straightforward to develop a high-performance code for the method. We recently developed a novel algorithm for the particle method targeting GPU architectures. In this presentation, we describe our parallelized algorithm and show that it enables us to increase execution performance and save memory resources.
 11:15 - 11:45 Parallel Performance of Particle Method in Many-Core System

Mikito Furuichi (JAMSTEC)

Abstract:
We present the computational performance of smoothed particle hydrodynamics (SPH) simulations on three types of current shared-memory parallel computing devices: a many-integrated-core processor (MIC: Intel Xeon Phi), a graphics processing unit (GPU: NVIDIA GeForce GTX Titan), and multi-core central processing units (CPU: Intel Xeon E5-2680 and Fujitsu SPARC64 processors). We are especially interested in efficient shared-memory allocation methods with proper data-access patterns on each chipset. We first introduce several parallel implementation techniques for an SPH code on shared-memory systems. They are then examined on our target architectures to find the best algorithms for each processing unit. In addition, the computing and power efficiency, which are increasingly important for comparing multi-device computer systems, are also examined for the SPH calculation. In our benchmark tests, the GPU marks the best arithmetic performance as a standalone device and the most efficient power consumption. The multi-core CPU shows the best computing efficiency. On the other hand, the computational speed of the MIC (Xeon Phi) approaches that of two Xeon CPUs. This indicates that the MIC is an attractive choice for accelerating existing SPH codes parallelized with OpenMP on many-core processors.
ppOpen-HPC and Automatic Tuning (Chair: Hideyuki Jitsumoto)
 13:00 - 13:30 A Test of Performance on Many Cores and Many Integrated Core using an FDM Calculation Code

Futoshi Mori (ERI, The University of Tokyo)

Abstract:
Recently, the performance of CPUs has improved greatly, and the number of CPU cores has increased along with this improvement. However, the tuning methods for many-core and many-integrated-core processors differ from those for conventional architectures, and tuning adapted to these cores is necessary and indispensable. The FX10 at the University of Tokyo achieves high-speed computing, such as vector operations, with its many cores. Moreover, the Intel Xeon Phi coprocessor, a kind of accelerator different from general-purpose graphics processing units (GPGPUs), has recently been developed. By adding compiler options, the calculation can be run on the Intel Xeon Phi coprocessor without changing the code. We developed an elastic wave propagation FDM code parallelized with OpenMP/MPI. We present the performance of the developed FDM code on the FX10 and the Intel Xeon Phi coprocessor.
 13:30 - 14:00 Impact of Auto-tuning of Kernel Loop Transformation by using ppOpen-AT

Takahiro Katagiri (ITC, The University of Tokyo)

Abstract:
To maximize the performance of scientific codes on current multi-core and many-core CPUs, it is necessary to exploit high parallelism in kernel loops. The parallelism exceeds 200 on many-core CPUs. We need to modify kernel loops by applying loop transformations, such as loop fusion for outer loops, to maximize the parallelism. Modifying the kernels is time-consuming work.

In this presentation, we show a novel auto-tuning (AT) framework based on loop transformation to extract parallelism from the kernels. To decrease the cost of software development, we are developing an AT language, named ppOpen-AT, to specify the AT. The target kernels are selected from real application codes in ppOpen-APPL. Performance evaluation is performed on advanced CPU architectures, such as the Fujitsu FX10, the Intel Ivy Bridge, and the Xeon Phi. The results indicate that substantial speedups are obtained by applying our AT framework via ppOpen-AT.
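The core idea of such an AT framework is to generate several transformed variants of a kernel (e.g. split versus fused loops) and select the fastest by measurement at install or run time. Below is a minimal, generic sketch of that selection step in plain Python; ppOpen-AT itself works at the source level on Fortran/C kernel loops, which is not reproduced here, and the kernels shown are hypothetical stand-ins.

```python
import time
import numpy as np

def kernel_split(a, b, c):
    """Original form: two separate passes over the data."""
    tmp = a * b            # pass 1
    return tmp + c         # pass 2

def kernel_fused(a, b, c):
    """Fused form: a single expression, one fewer named temporary."""
    return a * b + c

def autotune(variants, args, trials=5):
    """Time each candidate variant and return the fastest one."""
    best, best_t = None, float("inf")
    for name, fn in variants.items():
        t0 = time.perf_counter()
        for _ in range(trials):
            fn(*args)
        t = (time.perf_counter() - t0) / trials
        if t < best_t:
            best, best_t = name, t
    return best, best_t

n = 1_000_000
a, b, c = (np.random.rand(n) for _ in range(3))
print(autotune({"split": kernel_split, "fused": kernel_fused}, (a, b, c)))
```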
 14:00 - 14:30 Development of an Adaptive Mesh Refinement Framework for ppOpen-APPL/FDM

Masaharu Matsumoto (ITC, The University of Tokyo)

Abstract:
We have developed an adaptive mesh refinement (AMR) framework for the ppOpen-APPL/FDM software. The demand for multi-scale and multi-physics simulations will grow with the advent of post-peta-scale supercomputer systems. To achieve such simulations at a reasonable cost in computer resources, the spatial and temporal resolutions have to be adjusted locally and dynamically, depending on the local scales of the physical phenomena. In the AMR framework, computational grids with different spacings are dynamically created in hierarchical layers according to the local conditions of the phenomena. Fine grids are applied only in the local regions that need high resolution, and other regions are simulated using grids of moderate size. In this talk, key features and evaluations of our AMR framework will be shown.
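A common way to decide where finer grids are needed is to flag cells whose local gradient (or another error indicator) exceeds a threshold; flagged cells are then covered by a finer child grid on the next level. The 1D sketch below assumes a simple gradient-based criterion purely for illustration; the actual criterion and data structures in ppOpen-APPL/FDM may differ.

```python
import numpy as np

def flag_cells_for_refinement(u, dx, threshold):
    """Return a boolean mask of cells whose |du/dx| exceeds `threshold`."""
    grad = np.abs(np.gradient(u, dx))
    return grad > threshold

# A steep front on a coarse 1D grid: only cells near the front are flagged
# and would be covered by a finer child grid on the next AMR level.
x = np.linspace(0.0, 1.0, 101)
u = np.tanh((x - 0.5) / 0.02)
flags = flag_cells_for_refinement(u, x[1] - x[0], threshold=5.0)
print("refine", flags.sum(), "of", flags.size, "cells, around x =",
      x[flags].min(), "-", x[flags].max())
```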
Linear Solvers (Chair: Futoshi Mori)
14:50 - 15:35  A Parallel Multifrontal Solver that Exploits Hierarchically Semi-Separable Representations

Francois-Henry Rouet (LBNL)

Abstract:
Low-rank approximation techniques are an increasingly popular way of speeding up sparse matrix algorithms. We focus on direct methods for the solution of sparse linear systems, and we show how to embed hierarchically semi-separable (HSS) representations into a multifrontal solver. These techniques allow one to decrease both the operation complexity and the memory footprint of the solver for many real-life applications. We focus on a new parallel geometric code that aims at solving discretized Helmholtz equations on regular meshes. We present experimental results on very large domains (200+ million unknowns for 3D problems) and up to 16,000 cores. We also present some new research directions, such as the use of randomized sampling.
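The key ingredient of HSS and related representations is that off-diagonal blocks of the frontal matrices are numerically low rank and can be compressed. Below is a minimal sketch of that compression step using a truncated SVD, purely for illustration; HSS solvers use nested bases and, increasingly, randomized sampling rather than a full SVD of each block.

```python
import numpy as np

def compress_block(B, tol=1e-8):
    """Return (U, V) with B ~= U @ V, keeping singular values above tol * s[0]."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s > tol * s[0]))            # numerical rank at tolerance `tol`
    return U[:, :k] * s[:k], Vt[:k, :]

# Off-diagonal interaction between two well-separated point sets (1/r kernel):
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(3.0, 4.0, 200)
B = 1.0 / np.abs(x[:, None] - y[None, :])
U, V = compress_block(B)
print("rank", U.shape[1], "relative error",
      np.linalg.norm(B - U @ V) / np.linalg.norm(B))
```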
15:35 - 16:05  ICCG(B): Strategy of the Fill-in Selection for Incomplete Factorization Preconditioning using SIMD Instructions

Takeshi Iwashita (ACCMS, Kyoto University)

Abstract:
Many recent processors are equipped with SIMD (single instruction, multiple data) instructions. We consider the efficient use of SIMD instructions in the ICCG method, which is one of the most popular iterative linear solvers for finite element analyses and other discretization methods for PDE problems. To use SIMD operations efficiently, we introduce a new fill-in strategy for incomplete Cholesky factorization. The method is based on the block CRS matrix data format, and fill-ins within each block are allowed in the factorization. The forward and backward substitutions in the preconditioning step are then performed block by block. The computation within a block is a dense matrix operation, in which SIMD instructions are used effectively. Moreover, the increase in the number of fill-ins is expected to improve the convergence rate compared with IC(0) preconditioning, in which no fill-ins are permitted. Numerical tests using the UF sparse matrix collection show that the proposed method achieves better solver performance than the IC(0)-CG method.
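The point of the blocked format is that the triangular solves in the preconditioner operate on small dense blocks, which map well onto SIMD units. Below is a minimal sketch of a block forward substitution L y = b with dense blocks stored in a block-CRS-like layout; this is an illustrative toy, not the authors' implementation, and the layout names are assumptions.

```python
import numpy as np

def block_forward_substitution(diag_blocks, off_blocks, off_cols, row_ptr, b, bs):
    """Solve L y = b where L is block lower triangular with block size bs.

    diag_blocks[i]       : dense lower-triangular (bs, bs) diagonal block of block row i
    off_blocks, off_cols : strictly-lower off-diagonal blocks of block row i stored in
                           positions row_ptr[i]:row_ptr[i+1], with their block-column indices
    """
    nb = len(diag_blocks)
    y = np.zeros_like(b)
    for i in range(nb):
        rhs = b[i * bs:(i + 1) * bs].copy()
        for k in range(row_ptr[i], row_ptr[i + 1]):          # subtract already-solved blocks
            j = off_cols[k]
            rhs -= off_blocks[k] @ y[j * bs:(j + 1) * bs]     # dense block kernel (SIMD-friendly)
        y[i * bs:(i + 1) * bs] = np.linalg.solve(diag_blocks[i], rhs)  # dense diagonal solve
    return y

# Tiny example: 2 block rows, block size 2.
bs = 2
D0 = np.array([[2.0, 0.0], [1.0, 3.0]])
D1 = np.array([[4.0, 0.0], [2.0, 5.0]])
B10 = np.eye(2)                                              # block (1, 0)
y = block_forward_substitution([D0, D1], [B10], [0], [0, 0, 1],
                               np.array([2.0, 4.0, 5.0, 9.0]), bs)
print(y)   # [1.  1.  1.  1.2]
```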
 16:05 - 16:35 Effects of Re-Ordering and Blocking in a Hybrid Parallel Solver of ppOpen-APPL/FEM

Takeshi Kitayama (GSFS, The University of Tokyo)

Abstract:
We have developed a hybrid parallel iterative linear solver for the ppOpen-APPL/FEM (ppohFEM) middleware. In iterative solvers used in the finite element method, the sparse matrix-vector product (SpMV) and the matrix preconditioning are the most time-consuming steps. In the ppohFEM middleware, the given mesh is partitioned into domains and computed by MPI parallel processes. Within each partitioned domain, OpenMP parallelization is applied to SpMV and the matrix preconditioning. The SpMV operation can be parallelized straightforwardly. Several loop-blocking strategies for the OpenMP parallelization are tested, such as block partitioning of rows, cyclic partitioning of rows, block-cyclic partitioning of rows, and block partitioning by the number of non-zero elements. For the matrix preconditioning, incomplete Cholesky factorization causes loop-carried dependencies, so its parallelization is not straightforward. To remove the dependencies, the node indices are reordered using a multicolor algorithm. Depending on the number of colors used in the reordering, the sparsity pattern of the matrix changes, which causes performance differences. In this talk, the effects of this blocking and reordering on the performance of the iterative solver will be shown.
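Multicolor reordering assigns each node a color such that no two coupled nodes share a color; all nodes of one color can then be processed in parallel inside the incomplete-factorization forward/backward sweeps. Below is a minimal greedy sketch on an adjacency-list graph, for illustration only; ppohFEM's actual coloring algorithm and data structures may differ.

```python
def greedy_multicolor(adjacency):
    """Greedy coloring: adjacency[i] lists the nodes coupled to node i.

    Returns (colors, ordering), where `ordering` lists nodes color by color,
    so each color group can be processed in parallel without dependencies.
    """
    n = len(adjacency)
    colors = [-1] * n
    for i in range(n):
        used = {colors[j] for j in adjacency[i] if colors[j] >= 0}
        c = 0
        while c in used:          # smallest color not used by any neighbor
            c += 1
        colors[i] = c
    ncolors = max(colors) + 1
    ordering = [i for c in range(ncolors) for i in range(n) if colors[i] == c]
    return colors, ordering

# 1D chain of 5 nodes: neighbors get alternating colors (red-black ordering).
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(greedy_multicolor(adj))   # ([0, 1, 0, 1, 0], [0, 2, 4, 1, 3])
```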
16:35 - 17:05  Considerations on Mixed Precision Computation in an Iterative Refinement Solver

Hiroshi Okuda (GSFS, The University of Tokyo)

Abstract:
Mixed-precision solvers enable faster computation by doing most of the work in lower precision. Together with hardware optimized for lower-precision computations, from CPUs and GPUs to FPGAs, important speedups can be achieved; these translate into lower computing time or fewer machines. We have proposed an algorithm for automatically stopping the outer solver and for choosing the residual target of the inner solver. The use of iterative refinement has shown great potential when combined with Krylov solvers.
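Iterative refinement does the expensive inner solve in low precision and corrects the solution with residuals computed in high precision. Below is a minimal sketch with a dense single-precision inner solve, purely for illustration; the talk concerns Krylov inner solvers and the stopping/target-selection strategy mentioned above, which are not reproduced here.

```python
import numpy as np

def mixed_precision_refinement(A, b, outer_tol=1e-12, max_iters=20):
    """Solve A x = b: inner solves in float32, residual and update in float64."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b, dtype=np.float64)
    for _ in range(max_iters):
        r = b - A @ x                                    # residual in double precision
        if np.linalg.norm(r) <= outer_tol * np.linalg.norm(b):
            break
        d = np.linalg.solve(A32, r.astype(np.float32))   # correction in single precision
        x += d.astype(np.float64)                        # update in double precision
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)) + 200 * np.eye(200)   # well-conditioned test matrix
x_true = rng.standard_normal(200)
b = A @ x_true
x = mixed_precision_refinement(A, b)
print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))  # ~ double-precision accuracy
```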
 17:10 - 17:20 Closing Talk

Satoshi Ohshima (ITC, The University of Tokyo)