December 5 (Th) |
09:20 - 09:30 |
Opening Talk
Akihiro Ida (ACCMS, Kyoto University) |
ppOpen-HPC and Numerical Libraries in Post-Peta/Exa-Scale Era (Chair: Takahiro Katagiri) |
09:30 - 10:15 |
ppOpen-HPC and Beyond
Kengo Nakajima (ITC, The University of Tokyo)
Abstract:
Recently, high-end parallel computer systems have become larger and more
complex. Yet it is very difficult for scientists and engineers to develop
efficient application code that can exploit the potential performance
of these systems. We propose an open-source infrastructure
for the development and execution of optimized and reliable simulation code
on large-scale parallel computers. We have named this infrastructure "ppOpen-HPC"
(http://ppopenhpc.cc.u-tokyo.ac.jp/), where "pp" stands for "post-peta". The target system
is the Post T2K System based on many-core architectures, which will be
installed in FY 2015. ppOpen-HPC is part of a five-year project
(FY 2011-2015) under the "Development of System Software Technologies for
Post-Peta Scale High Performance Computing" program funded by JST-CREST. We
released the first version of the developed software last year, and the
second version became available in November 2013. This talk overviews the
various achievements of the project since April 2011 and discusses the future
prospects of ppOpen-HPC on exascale supercomputer systems. |
10:15 - 11:00 |
Migrating the Numerical Software Infrastructure towards Extreme-scale Computing
Osni Marques (LBNL)
Abstract:
The impact of the architectural features of upcoming extreme-scale supercomputers
is expected to be felt across the whole software stack. The sheer levels
of parallelism that will be available on such computers, together with more
levels of memory, more concurrency with less memory per core, heterogeneous
cores, and an anticipated increase in the number of hard and soft failures,
will require not only a redesign of the software stack but also new approaches
for the interplay between applications and that software stack. This presentation
will summarize (some) ideas and efforts that aim at producing numerical
software that is responsive to extreme-scale computing levels, e.g. by
hiding communication latency, incorporating communication reducing techniques,
and providing resilience mechanisms. In particular, I will summarize ideas
for a hybrid linear solver, for broad classes of PDE systems, which combines
algebraic multigrid (AMG) with a structured sparse factorization method
based on low-rank structures stemming from hierarchically semi-separable
(HSS) matrices. This approach offers opportunities for reducing the
communication-to-flops ratio while increasing overall performance. I will
also comment on the mechanism that has been adopted for recovery in case
of failures. |
Large-Scale Simulations on Post-Peta/Exa-Scale Systems (I) (Chair: Hiroshi Okuda) |
11:20 - 12:05 |
Large-Scale Parallel FDM Simulation of Strong Ground Motion and Tsunami
for Recent and Future Earthquakes
Takashi Furumura (ERI, The University of Tokyo)
Abstract:
The strong impact caused by the great 2011 off-Tohoku, Japan (Mw 9.0)
earthquake was reproduced by a large-scale FDM simulation. We developed
a parallel FDM code for ground motion and tsunami suitable for massively
parallel computing on the K computer, together with a high-resolution
earthquake source rupture model and a subsurface structure model. The
visualized seismic wavefield derived from the simulation illustrates the
significance of this earthquake and the process by which large, long-lasting
ground motion developed in populated cities and a destructive tsunami was
amplified along the coast of Tohoku. The simulation is verified by comparison
with actual observations from the nationwide seismic network and the offshore
ocean-bottom tsunami network, both recently deployed across Japan,
demonstrating its effectiveness. Thanks to steadily growing computing power
and parallel computing techniques, we believe that reliable disaster
information will become more effective than unreliable earthquake prediction
for mitigating earthquake-related disasters. Toward this goal, the development
of an integrated earthquake disaster prevention simulation is highly
anticipated, linking a sequence of simulations, such as earthquake nucleation,
seismic wave and tsunami propagation, and building oscillation interacting
with soil, with the aid of new HPCI technology. |
12:05 - 12:50 |
Acceleration of Physics-based Seismic Hazard Analysis on Hybrid Many-core
Architectures
Yifeng Cui (San Diego Supercomputer Center)
Abstract:
CyberShake is a computational platform developed by the Southern California
Earthquake Center (SCEC) that explicitly incorporates earthquake rupture
time histories and deterministic wave propagation effects into seismic
hazard calculations through the use of 3D waveform simulations. The first
physics-based probabilistic seismic hazard analysis (PSHA) models of the
Los Angeles region were created in 2009 using this platform from suites
of simulations comprising ~10^8 seismograms. The current models are, however,
limited to low seismic frequencies (<= 0.5 Hz). Our goals are to increase
this limit to above 1 Hz and provide a California-wide model that is more
than 100 times larger in computational size than the current Los Angeles
models. This talk presents the technical efforts in accelerating key strain
tensor calculations critical to probabilistic seismic hazard analysis:
algorithm-level communication reductions, data locality, scalable IO framework
development, and CPU-GPU co-scheduling for post-processing. These performance
improvements allowed us to efficiently use hybrid many-core systems
including OLCF Titan, NCSA Blue Waters, XSEDE Keeneland, and XSEDE Stampede,
making a statewide hazard model a goal reachable with existing supercomputers.
The presentation will conclude with a discussion on how the seismology
community can prepare for the challenges of Exascale computing. |
Large-Scale Simulations on Post-Peta/Exa-Scale Systems (II) (Chair: Kengo Nakajima) |
14:00 - 14:45 |
A Super High-Resolution Global Atmospheric Simulation by the Nonhydrostatic
Icosahedral Atmospheric Model using the K-computer
Masaki Satoh (AORI, The University of Tokyo)
Abstract:
We overview recent numerical experiments with the Nonhydrostatic Icosahedral
Atmospheric Model (NICAM) using the K computer. NICAM is known as the first
global nonhydrostatic model aiming at explicitly simulating deep convective
motions over the globe (Satoh et al. 2008). The advantage of the high-resolution
global nonhydrostatic model is shown particularly clearly in the multi-scale
structure of tropical cloud systems, through its representation of the
hierarchy of meso-scale convective systems, cloud clusters, super-cloud
clusters, and the MJO. Realistic simulations of tropical cloud systems, such
as tropical cyclones that originate from cloud clusters in the tropics, are
increasingly required for both numerical weather prediction and future
climate projection.
As a recent result, the NICAM research members performed for the first time
a subkilometer-mesh global simulation using the K computer (Miyamoto et
al. 2013). In this simulation, the entire global domain is covered by a mesh
with an interval of about 870 m. Through a series of grid-refinement resolution
tests, we found that an essential change in convection statistics occurred
around a 2 km grid spacing. The convection structure, the number of convective
cells, and the distance to the nearest convective cell changed dramatically
at this resolution. The convection core was resolved with multiple grid points
in simulations with grid spacings of less than 2.0 km. |
14:45 - 15:15 |
A Coupling Library ppOpen-MATH/MP, Its Feature and Application
Takeshi Arakawa (RIST)
Abstract: Our group is developing a coupling tool, ppOpen-MATH/MP, for large-scale weak-coupling simulations that enables multi-component coupling and large-scale data transfer and interpolation. ppOpen-MATH/MP provides the following functions: 1) support for coupling various grid systems, 2) 2D and 3D coupling, 3) parallel execution of a simulation program and a visualization program (ppOpen-VIS), and 4) optimization of process mapping. The coupling library Jcup serves as the base layer of the ppOpen-MATH/MP software stack; ppOpen-MATH/MP is constructed by stacking upper layers with functions such as grid mapping and interpolation on top of this base Jcup layer.
In this presentation, the features of ppOpen-MATH/MP will be explained first, and then two application examples will be introduced. The first example is climate model coupling. The target models in this study are the atmospheric general circulation model NICAM and the ocean general circulation model COCO. These models were selected because NICAM uses an icosahedral grid and COCO a tri-polar grid, so coupling such semi-structured (non-latitude-longitude) grid models demonstrates the wide applicability of ppOpen-MATH/MP. In addition to the coupling of these two models, coupling of NICAM with an I/O component will be reported.
The second example is the coupling of a seismic model and a structure model. The seismic model is an FDM model employing a structured grid, while the structure model is an FVM model with an unstructured grid. This study is ongoing, so the presentation will briefly report its current status and future plans. |
15:15 - 15:45 |
Application-Level Checkpoint/Restart Framework with Checkpoint Interval
Optimization
Hideyuki Jitsumoto (ITC, The University of Tokyo)
Abstract:
Application-level checkpointing is frequently implemented within applications
that have time stepping. However, the checkpoint interval tends to depend
on the application programmer's ad-hoc decisions. Essentially, the checkpoint
interval should be determined from execution environment information such
as the hardware failure rate and the checkpointing time. We propose a
directive-based application-level checkpoint/restart framework that optimizes
the checkpoint interval automatically. The subject of this study is an
application that has time stepping and uses the SPMD model. The optimization
and renewal of the checkpoint interval are done asynchronously. A prototype
implementation that cooperates with the job submission system has been
designed. On the T2K supercomputer system, we evaluated the invocation cost
with FDM3D. The results indicated that the overhead of the framework
is about 50 seconds with 7680 processes. |
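As a concrete illustration of how an interval can be derived from such
environment information, the sketch below applies Young's classical
first-order approximation; this is a textbook formula used here for
illustration, not necessarily the one implemented in the framework.

    /* Young's approximation: T_opt = sqrt(2 * C * MTBF), where C is the
     * time to write one checkpoint and MTBF is the mean time between
     * failures. Numbers below are hypothetical. */
    #include <math.h>
    #include <stdio.h>

    static double optimal_checkpoint_interval(double checkpoint_cost_s,
                                              double mtbf_s)
    {
        return sqrt(2.0 * checkpoint_cost_s * mtbf_s);
    }

    int main(void)
    {
        double C = 50.0;             /* checkpoint write time [s]    */
        double mtbf = 24.0 * 3600.0; /* say, one failure per day [s] */
        printf("optimal interval = %.0f s\n",
               optimal_checkpoint_interval(C, mtbf));
        return 0;
    }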
H-Matrix Methods and Applications (Chair: Takeshi Iwashita) |
16:05 - 16:35 |
Development of the Parallel Hierarchical Matrices Library for Large-Scale
SMP Cluster
Akihiro Ida (ACCMS, Kyoto University)
Abstract:
Hierarchical matrices (H-matrices) are known as a fast technique for boundary
integral equations. In this presentation, we discuss a scheme for the H-matrices
with adaptive cross approximation (ACA) for large-scale simulation. An
improved method of H-matrices with ACA and a set of parallel algorithms
applicable to the method are proposed. We have been developing the parallel
H-matrices library based on the algorithms by using the flat-MPI and hybrid
MPI+OpenMP programming models. The performance of the library is evaluated
through numerical experiments on practical simulations, such as electric
field analyses and earthquake cycle simulations, computed on SMP cluster
systems. Although the flat-MPI version gives better parallel scalability
when constructing hierarchical matrices, its speed-up reaches a limit for
the hierarchical matrix-vector multiplication (HMVM). We develop a hybrid
MPI+OpenMP version to improve the parallel scalability. The hybrid version
shows better parallel speedup for the HMVM in numerical tests using up to
256 cores. |
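To make the compression step concrete, the following is a minimal,
self-contained sketch of partially pivoted adaptive cross approximation:
the matrix is available entry-wise through eval(i, j), and a rank-k
factorization A ~= U V^T is built one row/column cross at a time. This is
a textbook illustration with simplified stopping tests, not code from the
library described above.

    #include <math.h>
    #include <stdlib.h>

    typedef double (*entry_fn)(int i, int j);

    /* U: m x kmax, V: n x kmax, both stored as contiguous columns.
     * Returns the rank actually used. */
    int aca(entry_fn eval, int m, int n, double *U, double *V,
            int kmax, double tol)
    {
        int *used = calloc(m, sizeof(int)); /* rows already pivoted */
        int i_piv = 0, k;
        for (k = 0; k < kmax; k++) {
            double *u = U + (size_t)k * m, *v = V + (size_t)k * n;
            used[i_piv] = 1;
            /* residual row i_piv of A minus the rank-k part so far */
            for (int j = 0; j < n; j++) {
                v[j] = eval(i_piv, j);
                for (int l = 0; l < k; l++)
                    v[j] -= U[(size_t)l * m + i_piv] * V[(size_t)l * n + j];
            }
            int j_piv = 0;
            for (int j = 1; j < n; j++)
                if (fabs(v[j]) > fabs(v[j_piv])) j_piv = j;
            if (fabs(v[j_piv]) < 1e-30) break;  /* residual row ~ 0 */
            double piv = v[j_piv];
            for (int j = 0; j < n; j++) v[j] /= piv;
            /* residual column j_piv */
            for (int i = 0; i < m; i++) {
                u[i] = eval(i, j_piv);
                for (int l = 0; l < k; l++)
                    u[i] -= U[(size_t)l * m + i] * V[(size_t)l * n + j_piv];
            }
            /* absolute stopping test on the new cross */
            double nu = 0.0, nv = 0.0;
            for (int i = 0; i < m; i++) nu += u[i] * u[i];
            for (int j = 0; j < n; j++) nv += v[j] * v[j];
            if (sqrt(nu * nv) < tol) { k++; break; }
            /* next pivot row: largest entry of u among unused rows */
            i_piv = -1;
            for (int i = 0; i < m; i++)
                if (!used[i] && (i_piv < 0 || fabs(u[i]) > fabs(u[i_piv])))
                    i_piv = i;
            if (i_piv < 0) { k++; break; }
        }
        free(used);
        return k;
    }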
16:35 - 17:05 |
Effect of the Earth's Surface Topography on the Earthquake Cycle
Makiko Otani (Kyoto University)
Abstract:
In quasi-dynamic earthquake simulations, we usually use the analytic slip
response function, assuming a half-space. However, the actual seafloor
topography at subduction zones is more complicated, and coseismic slip has
sometimes reached the trench, as in the 2011 Tohoku earthquake. In such
cases, the rupture areas on the faults come close to the Earth's surface,
and the topography may affect the earthquake rupture and cycle.
Therefore, in this study, we developed a method for calculating slip response
functions in a homogeneous elastic medium with an arbitrary topography
of the Earth's free surface, and examined the effect on the earthquake cycle
simulated with the H-matrix method. |
17:05 - 17:50 |
Challenges in the Numerical Simulations of Physics of Dipolar Bose-Einstein
Condensates
Aleksandra Maluckov (INS Vinca, University of Belgrade)
Abstract:
Investigations of ultracold quantum gases open the door to a wide
interdisciplinary field of physics, ranging from nonlinear dynamics to
strongly correlated quantum phases and quantum information processing.
Nowadays, leading laboratories worldwide challenge theoretical teams with
new experimental findings, and vice versa.
In our research we model dipolar Bose-Einstein condensate (BEC) systems
by equations of the generalized nonlinear Schrödinger type. Depending
on the surroundings (geometry, dimensionality, complexity), these can be
differential-difference or integro-differential equations (stochastic
in a disordered environment). We mainly consider the ground-state and
dynamical properties of different localized patterns in dipolar BECs in
photonic lattices and chip environments, and in composite BECs in external
electromagnetic fields, implementing numerical methods such as the
imaginary-time or Powell minimization method for the ground state, and the
split-step spectral or Runge-Kutta method for the dynamics, respectively.
The intention to include long-range interactions in multi-component BECs
challenges computing efficiency, precision, and reliability. The possibility
of implementing the software for large-scale computational simulations on
parallel computer systems developed by colleagues at Kyoto University (Japan)
will provide us with a powerful tool to study complex ultracold matter
systems. The first tests are in progress. |
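As a small, self-contained illustration of the split-step spectral method
mentioned above, the sketch below integrates the 1D cubic nonlinear
Schrödinger equation i dpsi/dt = -(1/2) psi_xx + g |psi|^2 psi with FFTW;
the nonlocal dipolar term of the actual BEC models is omitted, and g, N,
L, dt are illustrative values only.

    #include <complex.h> /* before fftw3.h: fftw_complex = double complex */
    #include <fftw3.h>
    #include <math.h>

    #define N 1024

    int main(void)
    {
        const double L = 40.0, dx = L / N, dt = 1e-3, g = 1.0;
        fftw_complex *psi = fftw_malloc(sizeof(fftw_complex) * N);
        fftw_plan fwd = fftw_plan_dft_1d(N, psi, psi, FFTW_FORWARD,
                                         FFTW_ESTIMATE);
        fftw_plan bwd = fftw_plan_dft_1d(N, psi, psi, FFTW_BACKWARD,
                                         FFTW_ESTIMATE);
        for (int j = 0; j < N; j++) {         /* bright-soliton start */
            double x = -L / 2 + j * dx;
            psi[j] = 1.0 / cosh(x);
        }
        for (int step = 0; step < 10000; step++) {
            for (int j = 0; j < N; j++)       /* half nonlinear step  */
                psi[j] *= cexp(-I * g * (cabs(psi[j]) * cabs(psi[j]))
                               * dt / 2);
            fftw_execute(fwd);                /* linear step in k     */
            for (int j = 0; j < N; j++) {
                double k = 2 * M_PI * (j <= N / 2 ? j : j - N) / L;
                psi[j] *= cexp(-I * 0.5 * k * k * dt) / N; /* /N: norm */
            }
            fftw_execute(bwd);
            for (int j = 0; j < N; j++)       /* half nonlinear step  */
                psi[j] *= cexp(-I * g * (cabs(psi[j]) * cabs(psi[j]))
                               * dt / 2);
        }
        fftw_destroy_plan(fwd); fftw_destroy_plan(bwd); fftw_free(psi);
        return 0;
    }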
December 6 (Fri) |
Manycore Systems and Applications (Chair: Takeshi Kitayama) |
09:30 - 10:15 |
StarPU: Leveraging Clusters of Heterogeneous Machines through Dynamic Task
Scheduling
Samuel Thibault (University of Bordeaux)
Abstract:
Heterogeneous accelerator-based architectures are increasingly seen in
production HPC clusters, featuring multicore CPUs, GPU accelerators, and
now even Xeon Phi accelerators, thus providing an unprecedented amount
of processing power per node. It has thus become one of the biggest
challenges in HPC to deal with such machines, which expose highly unbalanced
computing power. To fully tap into the potential of these heterogeneous
machines, pure offloading approaches, which consist in running an application
on regular processors while offloading part of the code to accelerators,
are not sufficient.
This talk will present the latest advances in the StarPU project, which
aims at providing portable optimized performance on clusters of heterogeneous
multicore+accelerator machines to task-based applications. The goal is
to relieve the programmer from the technical aspects of data management
and task scheduling, while applying theoretical task scheduling algorithms
on actual application execution to improve performance. It also provides
performance feedback through task profiling and trace analysis. |
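For readers unfamiliar with the programming model, the following is a
minimal CPU-only sketch of StarPU's task API, adapted from the
vector-scaling example in StarPU's documentation; the runtime, not the
programmer, decides where each task runs and moves the data. Field names
and signatures may differ slightly between StarPU releases.

    #include <starpu.h>
    #include <stdint.h>

    static void scal_cpu(void *buffers[], void *cl_arg)
    {
        float factor = *(float *)cl_arg;
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        for (unsigned i = 0; i < n; i++) v[i] *= factor;
    }

    static struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu }, /* a .cuda_funcs entry would add a
                                      GPU implementation of the task */
        .nbuffers = 1,
        .modes = { STARPU_RW },
    };

    int main(void)
    {
        float v[1024], factor = 3.14f;
        for (int i = 0; i < 1024; i++) v[i] = 1.0f;

        starpu_init(NULL);
        starpu_data_handle_t h;
        starpu_vector_data_register(&h, 0 /* home node */, (uintptr_t)v,
                                    1024, sizeof(float));
        struct starpu_task *task = starpu_task_create();
        task->cl = &scal_cl;
        task->handles[0] = h;
        task->cl_arg = &factor;
        task->cl_arg_size = sizeof(factor);
        starpu_task_submit(task);     /* asynchronous submission  */
        starpu_task_wait_for_all();   /* wait for completion      */
        starpu_data_unregister(h);    /* data valid in main RAM   */
        starpu_shutdown();
        return 0;
    }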
10:15 - 10:45 |
Fast SpMV Implementation on GPU/Manycore Architectures and Its Application
Satoshi Ohshima (ITC, The University of Tokyo)
Abstract:
Sparse matrix-vector multiplication (SpMV) is one of the most important
computations in HPC because it dominates the run time of various
applications. At the same time, various parallel processing units, such as
multi-core CPUs, many-core processors, and GPUs, are used in current
computer systems. The aim of this study is to accelerate SpMV on GPUs and
MIC. In order to obtain high SpMV performance on this hardware, many
computation kernels have been implemented and their performance evaluated.
In this talk, the current status of these activities is shown. Moreover, as
an example of the use of SpMV, the acceleration of an FEM application is
also shown. |
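As a reference point for the optimized kernels discussed in the talk, the
following is the baseline CRS (compressed row storage) SpMV kernel with
OpenMP; variable names are illustrative.

    /* y = A*x for an n-row sparse matrix in CRS format:
     * row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i. */
    void spmv_crs(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }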
10:45 - 11:15 |
Basic Algorithm of Parallelized Particle Method on GPU and Cooperation
with
Continuous Model
Miki Yamamoto (JAMSTEC)
Abstract:
The computation of the motion of a large number of interacting particles, known as the particle method, is a fundamental approach to simulating the behavior of a system. In the particle method, particles carry the physical properties of the system being simulated, which evolves through the trajectories of the particles and the evolution of the properties they carry. The particle method is quite effective at avoiding various difficulties associated with mesh-based methods. However, due to the mobility of the particles, it is not straightforward to develop a high-performance implementation of the method. We recently devised a novel particle-method algorithm targeting GPU architectures. In this presentation, we present our parallelized algorithm and show that it enables us to increase execution performance and save memory resources. |
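To illustrate why particle mobility complicates high-performance
implementations, the sketch below shows the cell-list binning (a counting
sort of particles into uniform cells) that underlies most parallel particle
methods and must be rebuilt as particles move; it is an illustration, not
the algorithm presented in the talk.

    #include <stdlib.h>

    /* 2D domain of ncx * ncy uniform cells of width cell_size; particle
     * coordinates are assumed to lie inside the domain. On return,
     * order[] lists particle ids sorted by cell, and cell c owns the
     * range order[cell_start[c]] .. order[cell_start[c+1]-1]. */
    void build_cell_list(int np, const double *x, const double *y,
                         double cell_size, int ncx, int ncy,
                         int *cell_start /* ncx*ncy+1 */, int *order)
    {
        int ncells = ncx * ncy;
        int *count = calloc(ncells + 1, sizeof(int));
        int *cell_of = malloc(np * sizeof(int));
        for (int p = 0; p < np; p++) {    /* count particles per cell */
            int cx = (int)(x[p] / cell_size);
            int cy = (int)(y[p] / cell_size);
            cell_of[p] = cy * ncx + cx;
            count[cell_of[p] + 1]++;
        }
        for (int c = 0; c < ncells; c++)  /* prefix sum -> offsets    */
            count[c + 1] += count[c];
        for (int c = 0; c <= ncells; c++)
            cell_start[c] = count[c];
        for (int p = 0; p < np; p++)      /* counting-sort scatter    */
            order[count[cell_of[p]]++] = p;
        free(count); free(cell_of);
    }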
11:15 - 11:45 |
Parallel Performance of Particle Method in Many-Core System
Mikito Furuichi (JAMSTEC)
Abstract:
We present the computational performance of smoothed particle hydrodynamics
(SPH) simulations on three types of current shared-memory parallel computing
devices: a many-integrated-core (MIC: Intel Xeon Phi) processor, graphics
processing units (GPU: NVIDIA GeForce GTX Titan), and multi-core central
processing units (CPU: Intel Xeon E5-2680 and Fujitsu SPARC64 processors).
We are especially interested in efficient shared-memory allocation
methods with proper data access patterns on each chipset. We first introduce
several parallel implementation techniques of the SPH code for shared-memory
systems. They are then examined on our target architectures to find the
best algorithms for each processing unit. In addition, the computing and
power efficiency, which are increasingly important when comparing
multi-device computer systems, are also examined for the SPH calculation.
In our benchmark tests, the GPU marks the best arithmetic performance as
a standalone device and the most efficient power consumption. The multi-core
CPU shows the best computing efficiency. On the other hand, the computational
speed of the MIC (Xeon Phi) approached that of two Xeon CPUs. This
indicates that the MIC is an attractive choice for existing SPH codes
parallelized with OpenMP to gain computational acceleration from
many-core processors. |
ppOpen-HPC and Automatic Tuning (Chair: Hideyuki Jitsumoto) |
13:00 - 13:30 |
A Test of Performance on Many-Core and Many-Integrated-Core Architectures
using an FDM Calculation Code
Futoshi Mori (ERI, The University of Tokyo)
Abstract:
Recently, CPU performance has improved greatly, and the number of CPU cores
has increased along with it. However, the tuning methods for many-core and
many-integrated-core processors differ from those for conventional
architectures, and tuning adapted to these cores is indispensable. The FX10
at the University of Tokyo achieves high-speed computing, such as
vector-like operation, through its many cores. Moreover, the Intel Xeon Phi
coprocessor, a kind of accelerator different from general-purpose graphics
processing units (GPGPUs), was recently developed. By adding compiler
options, calculations can be run on the Intel Xeon Phi coprocessor without
changing the code. We developed an elastic wave propagation FDM code
parallelized with OpenMP/MPI. We present the performance of the developed
FDM code on the FX10 and the Intel Xeon Phi coprocessor. |
13:30 - 14:00 |
Impact of Auto-tuning of Kernel Loop Transformation by using ppOpen-AT
Takahiro Katagiri (ITC, The University of Tokyo)
Abstract:
To maximize the performance of scientific codes on current multi-core and
many-core CPUs, we need to exploit high parallelism in kernel loops. The
required parallelism exceeds 200 on many-core CPUs. We need to modify
kernel loops by applying loop transformations, such as fusion of outer
loops, to maximize the parallelism. Modifying the kernels is time-consuming
work. In this presentation, we show a novel auto-tuning (AT) framework based
on loop transformation to extract parallelism from the kernels. To decrease
the cost of software development, we are developing an AT language, named
ppOpen-AT, to specify the AT. The target kernels are selected from real
application codes in ppOpen-APPL. Performance evaluation is performed with
advanced CPU architectures, such as the Fujitsu FX10, the Intel Ivy Bridge,
and the Xeon Phi. The results indicate that crucial speedups are
obtained by applying our AT framework via ppOpen-AT. |
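An illustrative example (not taken from the ppOpen-APPL kernels) of the
kind of transformation the framework automates: fusing, or collapsing, two
outer loops so that the parallelism visible to OpenMP grows from nz to
nz*ny, enough to feed more than 200 hardware threads.

    /* Before: only the outermost loop (length nz) is parallel. */
    void kernel_orig(int nz, int ny, int nx, double a[nz][ny][nx])
    {
        #pragma omp parallel for
        for (int k = 0; k < nz; k++)
            for (int j = 0; j < ny; j++)
                for (int i = 0; i < nx; i++)
                    a[k][j][i] *= 2.0;
    }

    /* After: k and j are fused into one parallel loop of length nz*ny. */
    void kernel_fused(int nz, int ny, int nx, double a[nz][ny][nx])
    {
        #pragma omp parallel for
        for (int kj = 0; kj < nz * ny; kj++) {
            int k = kj / ny, j = kj % ny;
            for (int i = 0; i < nx; i++)
                a[k][j][i] *= 2.0;
        }
    }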
14:00 - 14:30 |
Development of an Adaptive Mesh Refinement Framework for ppOpen-APPL/FDM
Masaharu Matsumoto (ITC, The University of Tokyo)
Abstract:
We have developed an adaptive mesh refinement (AMR) framework for the
ppOpen-APPL/FDM software. The demands of multi-scale and multi-physics
simulations will be met with the advent of post-peta-scale supercomputer
systems. To achieve such simulations at a reasonable cost in computer
resources, the spatial and temporal resolutions have to be adjusted locally
and dynamically, depending on the local scales of the physical phenomena.
In the AMR framework, computational grids with different spacings are
dynamically created in hierarchical layers according to the local conditions
of the phenomena. Fine grids are applied only to the local domains that need
high resolution, while other regions are simulated using moderately sized
grids. In this talk, the key features and evaluations of our AMR framework
will be shown. |
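A minimal sketch of the flagging step at the heart of an AMR framework:
cells whose solution gradient exceeds a threshold are marked for refinement
into a finer hierarchical layer. The gradient criterion and all names are
illustrative, not those of ppOpen-APPL/FDM.

    #include <math.h>

    void flag_for_refinement(int n, const double u[], double threshold,
                             int flag[] /* 1 = refine cell i */)
    {
        for (int i = 1; i < n - 1; i++) {
            double grad = fabs(u[i + 1] - u[i - 1]) / 2.0; /* central */
            flag[i] = (grad > threshold);
        }
        flag[0] = flag[n - 1] = 0;  /* keep boundary cells coarse */
    }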
Linear Solvers (Chair: Futoshi Mori) |
14:50 - 15:35 |
A Parallel Multifrontal Solver that Exploits Hierarchically Semi-Separable
Representations
Francois-Henry Rouet (LBNL)
Abstract:
Low-rank approximation techniques are an increasingly popular way of speeding
up sparse matrix algorithms. We focus on direct methods for the solution of
sparse linear systems, and we show how to embed hierarchically semi-separable
(HSS) representations into a multifrontal solver. These techniques make it
possible to decrease both the operation complexity and the memory footprint
of the solver for many real-life applications. We focus on a new parallel
geometric code that aims at solving discretized Helmholtz equations on
regular meshes. We present experimental results on very large domains (200+
million unknowns for 3D problems) and up to 16,000 cores. We also present
some new research directions, such as the use of randomized sampling. |
15:35 - 16:05 |
ICCG(B): Strategy of Fill-in Selection for Incomplete Factorization
Preconditioning using SIMD Instructions
Takeshi Iwashita (ACCMS, Kyoto University)
Abstract:
Many recent processors are equipped with SIMD (Single Instruction Multiple
Data) instructions. We consider the efficient use of SIMD instructions in
the ICCG method, one of the most popular iterative linear solvers for
finite element analyses and other discretization methods for PDE problems.
To use SIMD operations efficiently, we introduce a new fill-in strategy
for incomplete Cholesky factorization. The method is based on the block
CRS matrix data format, and fill-ins within each block are allowed in the
factorization. The forward and backward substitutions in the preconditioning
step are then performed block by block. The computation within a block is
a dense matrix operation, in which SIMD instructions are used effectively.
Moreover, the increase in the number of fill-ins is expected to improve
the convergence rate compared with IC(0) preconditioning, in which no
fill-ins are permitted. Numerical tests using the UF sparse matrix
collection show that the proposed method achieves better solver performance
than the IC(0)-CG method. |
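A sketch of the idea: store the factor in a block CRS format with small
dense blocks (fill-in permitted inside each block), so that the triangular
solves become dense block operations that map well onto SIMD instructions.
Shown here is the forward substitution L y = r with 4 x 4 blocks; the
layout, the assumption that the diagonal block comes last in each block row
and has been inverted offline, and all names are illustrative, not the
paper's actual implementation.

    #define B 4  /* block size, matched to the SIMD width */

    /* brow_ptr/bcol_idx: block-CRS structure of the block lower factor;
     * each nonzero block is a dense B*B row-major array in blk, with
     * off-diagonal blocks sorted before the diagonal block. */
    void block_forward_subst(int n_brow, const int *brow_ptr,
                             const int *bcol_idx, const double *blk,
                             const double *r, double *y)
    {
        for (int ib = 0; ib < n_brow; ib++) {
            double t[B];
            for (int a = 0; a < B; a++) t[a] = r[ib * B + a];
            for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1] - 1; k++) {
                const double *Lb = blk + (long)k * B * B;
                const double *yj = y + bcol_idx[k] * B;
                for (int a = 0; a < B; a++)     /* dense B x B kernel, */
                    for (int b = 0; b < B; b++) /* vectorizable        */
                        t[a] -= Lb[a * B + b] * yj[b];
            }
            /* apply the pre-inverted diagonal block */
            const double *Db = blk + (long)(brow_ptr[ib + 1] - 1) * B * B;
            for (int a = 0; a < B; a++) {
                double s = 0.0;
                for (int b = 0; b < B; b++) s += Db[a * B + b] * t[b];
                y[ib * B + a] = s;
            }
        }
    }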
16:05 - 16:35 |
Effects of Re-Ordering and Blocking in a Hybrid Parallel Solver of ppOpen-APPL/FEM
Takeshi Kitayama (GSFS, The University of Tokyo)
Abstract:
We have developed a hybrid parallel iterative linear solver for the
ppOpen-APPL/FEM (ppohFEM) middleware. In the iterative solver used in the
finite element method, the sparse matrix-vector product (SpMV) and the
matrix preconditioning are the most time-consuming steps. In the ppohFEM
middleware, the given mesh is partitioned into domains that are computed
by parallel MPI processes. Within each partitioned domain, OpenMP
parallelization is applied to the SpMV and the matrix preconditioning. The
SpMV operation can be parallelized straightforwardly. Several strategies
for blocking the loops in the OpenMP parallelization are tested, such as
block partitioning of rows, cyclic partitioning of rows, block-cyclic
partitioning of rows, and partitioning balanced by the number of non-zero
elements. In the matrix preconditioning, incomplete Cholesky factorization
causes loop-carried dependencies, so parallelization is not straightforward.
To remove the dependencies, the node indices are reordered using a
multicolor algorithm. The pattern of the matrix changes with the number of
colors used in the reordering, which causes performance differences.
In this talk, the effects of this blocking and matrix reordering on the
performance of the iterative solver will be shown. |
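A sketch of the greedy multicoloring used to break such dependencies: each
node receives the smallest color not used by any already-colored neighbor,
after which nodes of the same color can be processed in parallel in the IC
forward/backward sweeps. Illustrative code, not the ppohFEM implementation.

    #include <stdlib.h>

    /* Adjacency in CRS form (row_ptr/col_idx). Returns the number of
     * colors; on return color[i] is in [0, ncolors). */
    int greedy_multicolor(int n, const int *row_ptr, const int *col_idx,
                          int *color)
    {
        int ncolors = 0;
        int *mark = malloc(n * sizeof(int)); /* color -> last marker */
        for (int c = 0; c < n; c++) mark[c] = -1;
        for (int i = 0; i < n; i++) color[i] = -1;
        for (int i = 0; i < n; i++) {
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                int j = col_idx[k];
                if (j != i && color[j] >= 0) mark[color[j]] = i;
            }
            int c = 0;
            while (mark[c] == i) c++;       /* smallest unused color */
            color[i] = c;
            if (c + 1 > ncolors) ncolors = c + 1;
        }
        free(mark);
        return ncolors;
    }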
16:35 - 17:05 |
Considerations on Mixed Precision Computation in an Iterative Refinement
Solver
Hiroshi Okuda (GSFS, The University of Tokyo)
Abstract:
Mixed-precision solvers enable faster computation by doing most of the work
in lower precision. Together with hardware optimized for lower-precision
computation, from CPUs and GPUs to FPGAs, important speedups can be
achieved, implying lower computing time or fewer machines. We have proposed
an algorithm for automatically stopping the outer solver and for choosing
the residual target for the inner solver. The use of iterative refinement
has shown great potential when combined with Krylov solvers.
|
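A minimal sketch of the mixed-precision iterative refinement pattern the
talk considers: residuals and the solution update are kept in double
precision while the inner solve runs in single precision. For brevity the
inner solver here is plain Jacobi with a fixed sweep count; in the talk's
setting it would be a Krylov solver with a chosen residual target, and the
outer stopping test would be the proposed automated one. All names and
tolerances are illustrative.

    #include <math.h>
    #include <string.h>

    #define N 64

    /* Inner solve: a few single-precision Jacobi sweeps on A d = r. */
    static void inner_solve_sp(const float A[N][N], const float r[N],
                               float d[N])
    {
        float d_new[N];
        memset(d, 0, sizeof(float) * N);
        for (int sweep = 0; sweep < 50; sweep++) {
            for (int i = 0; i < N; i++) {
                float s = r[i];
                for (int j = 0; j < N; j++)
                    if (j != i) s -= A[i][j] * d[j];
                d_new[i] = s / A[i][i];
            }
            memcpy(d, d_new, sizeof(float) * N);
        }
    }

    /* Outer loop in double precision: x <- x + d until the true
     * residual r = b - A x is small. */
    void refine(const double A[N][N], const double b[N], double x[N])
    {
        float A_sp[N][N], r_sp[N], d_sp[N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) A_sp[i][j] = (float)A[i][j];
        for (int it = 0; it < 100; it++) {
            double r[N], nrm = 0.0;
            for (int i = 0; i < N; i++) {     /* residual in double */
                r[i] = b[i];
                for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
                nrm += r[i] * r[i];
            }
            if (sqrt(nrm) < 1e-12) break;     /* outer stopping test */
            for (int i = 0; i < N; i++) r_sp[i] = (float)r[i];
            inner_solve_sp(A_sp, r_sp, d_sp); /* cheap inner solve  */
            for (int i = 0; i < N; i++) x[i] += (double)d_sp[i];
        }
    }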
17:10 - 17:20 |
Closing Talk
Satoshi Ohshima (ITC, The University of Tokyo) |
|