Provider: Barcelona Supercomputing Center
Supercomputing efficiency of a simulation code is a multiplicative combination of different aspects.
Efficiency is attained by addressing each of these aspects, notably by choosing the proper algorithms and implementing them correctly. Moreover, as Amdahl’s Law implies, eliminating one bottleneck exposes the next one, which may lie in a different “box”: after improving an aspect of the Parallel Efficiency, one may discover that the next problem is on the Computational Scalability side.
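The effect of removing one bottleneck only to expose the next can be illustrated with a small sketch. The phase names and timings below are hypothetical and are not measurements from Alya; the first function is the standard form of Amdahl's Law for a code in which only a fraction of the run scales with the number of workers.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: ideal speed-up when only `parallel_fraction` of the run scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def bottleneck(phase_times):
    """Return the name of the phase that currently takes the most time."""
    return max(phase_times, key=phase_times.get)


# Hypothetical time breakdown of one time step, in seconds (illustrative only).
before = {"assembly": 50.0, "solver": 30.0, "communication": 20.0}

# Speed up only the dominant phase by a factor of 5.
after = dict(before, assembly=before["assembly"] / 5.0)

# Overall gain is far below 5x because the other phases are untouched.
overall_speedup = sum(before.values()) / sum(after.values())

print("bottleneck before:", bottleneck(before))
print("bottleneck after: ", bottleneck(after))
print("overall speed-up:  %.2f" % overall_speedup)
print("Amdahl, 90%% parallel, 16 workers: %.2f" % amdahl_speedup(0.90, 16))
```

In this toy breakdown, a five-fold gain in one phase improves the whole step by far less than five-fold, and the dominant cost moves to a different phase.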
In the Alya Development Team, shared between the Barcelona Supercomputing Center and its spin-off company ELEM Biotech, we are very aware of this important aspect. We pay close attention not only to code accuracy but also to parallel efficiency. Our cardiac computational model involves fluid mechanics, electrophysiology and solid mechanics of both tissue and biomaterials, all tightly coupled. Coupling takes the form of fluid-solid interaction, immersed boundary methods, contact problems, or electromechanical coupling, supplemented by the solution of particle transport or species concentration. All these problems must be coupled without losing the parallel efficiency of the individual physics, which requires a specific effort when implementing the coupling schemes. In our case, coupling is based on a staggered multi-code scheme, in which different instances of Alya exchange data through point-to-point MPI communication, using communicators to group the different MPI tasks efficiently. Additionally, the individual instances can offload part of their work onto accelerators such as GPUs. In this way, a multi-physics simulation can run on heterogeneous systems made of CPUs hosting GPUs.
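As a rough illustration of how communicators can group MPI tasks by code instance, the sketch below emulates in plain Python the grouping that `MPI_Comm_split` performs (in mpi4py, the corresponding call is `comm.Split(color, key)`). It uses no MPI so that it runs anywhere. The instance names and the 6 + 2 task layout are hypothetical and do not describe Alya's actual coupling implementation.

```python
def split_by_color(world_size, color_of):
    """Group world ranks by colour, the way MPI_Comm_split builds sub-communicators.

    Returns {colour: [world ranks]}. A rank's position in its list is its rank
    inside the sub-communicator (equal keys, so ordering follows world rank).
    """
    groups = {}
    for world_rank in range(world_size):
        groups.setdefault(color_of(world_rank), []).append(world_rank)
    return groups


def color_of(world_rank):
    # Hypothetical layout: the first 6 tasks run the fluid instance,
    # the remaining tasks run the solid instance.
    return "fluid" if world_rank < 6 else "solid"


groups = split_by_color(8, color_of)

# Rank of every task inside its own sub-communicator.
local_rank = {w: i for members in groups.values() for i, w in enumerate(members)}

# World ranks of the sub-communicator roots; in a staggered scheme these
# (or the boundary-owning tasks) would exchange coupling data point-to-point.
roots = {color: members[0] for color, members in groups.items()}

print(groups)
print(roots)
```

Each instance then performs its collective operations only within its own group, while coupling data travels between groups through point-to-point messages.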
BSC is involved in the development of a RISC-V accelerator that leverages a wide vector processing unit, in collaboration with the European Processor Initiative (EPI), a European project that conducts research to advance High-Performance Computing (HPC) through the development of European technology. EPI aims to develop a general-purpose processor and a RISC-V-based accelerator. Efficient exploitation of the vector architecture is crucial for harnessing the computational power provided by the new design.
In this co-design collaboration, we used Alya as the challenge application to run efficiently on the EPI vector accelerator. We considered a computationally intensive part of Alya (the solver) and demonstrated that, after careful study and improvement, it can fully leverage a vector architecture while maintaining its portability.
This project can be summarized through the following items:
The scalar execution information is extracted using the FPGA-SDV performance analysis tools. In particular, the hardware counters are read using the PAPI library.
The metrics used for scalar analysis together with the value limits are the following:
The information about vector instructions is extracted using the Vehave emulator and later converted to a trace format. When running on the FPGA, the timings of these vector instructions are extracted using hardware counters.
The metrics used for vector analysis are the following:
To gain a deeper understanding of the instructions executed, we defined an instruction hierarchy that classifies each instruction into a type for later analysis. The instruction hierarchy tree is shown in the figure below. The “Vector” box aggregates the instructions executed on the vector processing unit. Vector configuration instructions are those that configure the vector length and element width of the vector processing unit at run-time. Control lane vector instructions include all vector instructions that neither compute arithmetic values nor communicate with memory, such as moves, shifts or sign extensions.
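A classification of this kind can be sketched as a small function that maps an instruction mnemonic to a type. The sketch below is illustrative only. It assumes RISC-V Vector extension 1.0 mnemonics, covers just a handful of them, and infers the "memory" and "arithmetic" categories from the description above rather than from the actual hierarchy tree; the function name and category labels are hypothetical.

```python
import re

# Instructions that set vector length and element width at run-time.
CONFIG = {"vsetvl", "vsetvli", "vsetivli"}

# Unit-stride, strided and indexed vector loads/stores, e.g. vle64.v, vsse32.v.
MEMORY = re.compile(r"^v[ls](s|ux|ox)?ei?\d+\.v$")

# Moves and sign/zero extensions (a small subset, chosen for illustration).
CONTROL_PREFIXES = ("vmv", "vfmv", "vsext", "vzext")

# A few arithmetic mnemonics (again, only a subset).
ARITH_PREFIXES = ("vadd", "vsub", "vmul", "vfadd", "vfsub",
                  "vfmul", "vfdiv", "vfmacc", "vfmadd")


def classify(mnemonic):
    """Map one instruction mnemonic to a node of the (assumed) hierarchy."""
    if not mnemonic.startswith("v"):
        return "scalar"
    if mnemonic in CONFIG:
        return "vector configuration"
    if MEMORY.match(mnemonic):
        return "vector memory"
    if mnemonic.startswith(CONTROL_PREFIXES):
        return "vector control lane"
    if mnemonic.startswith(ARITH_PREFIXES):
        return "vector arithmetic"
    return "vector (unclassified)"


# Hypothetical instruction stream, as it might appear in a trace.
trace = ["vsetvli", "vle64.v", "vfmacc.vv", "vmv.v.v", "vse64.v", "fadd.d"]
counts = {}
for instruction in trace:
    kind = classify(instruction)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)
```

Counting instructions per type in this way gives the instruction mix of a kernel, which is the kind of information a trace-based vector analysis relies on.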
The overall conclusion from our general performance evaluation of this particular Alya kernel is that the VECTOR_SIZE240 configuration is the fastest when running on the Vector Processing Unit (VPU), achieving a speed-up of 7.6× over the scalar execution. This trend suggests that a RISC-V processor would provide a substantial speed-up. We will extend this report in the final conclusions of the project.
BSC’s start-up company, ELEM Biotech, is porting some specific kernels of Alya to GPUs. ELEM was created in 2018 to co-develop with BSC, and to commercialize, modelling and simulation R&D for the biomedical domain. In particular, Alya is the core simulation software used by ELEM’s Virtual Human’s platform.
The specific GPU porting done by ELEM in collaboration with BSC is for Alya’s electrophysiology module, focusing on three kernels, which represent the heaviest part of a typical electrophysiology run:
We have extracted the kernels and converted them into mini-apps, porting them with OpenACC directives in Fortran subroutines compiled with the NVIDIA Fortran compiler. We have tested the three mini-apps and measured their speed-up, which is positive in all cases. We are currently merging the ported mini-apps back into the production Alya code and starting to test production runs.