The ExaNeSt Prototype: Evaluation of Efficient HPC Communication Hardware in an ARM-based Multi-FPGA Rack

M Ploumidis and F Chaix and N Chrysos and M Assiminakis and N Kallimanis and N Kossifidis and M Nikoloudakis and N Dimou and M Gianioudis and G Ieronymakis and A Ioannou and G Kalokerinos and P Xirouchakis and A Damianakis and M Ligerakis and T Vavouris and M Katevenis and V Papaefstathiou and M Marazakis and I Mavroidis, ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, 18, 24 (2025).

DOI: 10.1145/3715152

We present and evaluate the ExaNeSt prototype, which compactly packages 128 Xilinx ZU9EG MPSoCs, two TBytes of DRAM, and eight TBytes of SSD into a liquid-cooled rack, using custom interconnect hardware based on 10 Gb/s links. We developed this testbed in 2016-2019 in order to leverage the flexibility of FPGAs for experimenting with efficient hardware support for HPC communication among tens of thousands of processors and accelerators in the quest toward Exascale systems and beyond. In the years since then, we have carefully studied this system, and we present our key design choices and the insights resulting from our measurements and analysis. We developed this testbed, from the architecture to the PCBs and the run-time software, within the ExaNeSt project. It is fully operational in configurations with up to 8 x 4 x 4 MPSoC nodes. It achieves high density through tight board design, while also leveraging state-of-the-art liquid-cooling technology. In this article, we present a thorough architectural analysis, along with important aspects of our infrastructure development. Our custom interconnect includes a low-cost, low-latency network interface, offering user-level, zero-copy RDMA, which we coupled with the ARMv8 processors in the MPSoCs. We further developed the corresponding runtimes that allow us to run real MPI applications on the large-scale testbed. We evaluated our platform through MPI microbenchmarks, a mini-application, and full MPI applications. Single-hop, one-way latency is 1.3 μs; approximately 0.47 μs of this is attributed to the network interface and the user-space library that exposes its functionality to the runtime. Latency over longer paths increases as expected, reaching 2.55 μs for a five-hop path. Bandwidth tests show that, for a single hop, link utilization reaches 82% of the theoretical capacity. Microbenchmarks based on MPI collectives reveal that broadcast latency scales as expected when the number of participating ranks increases. We also implemented a custom MPI_Allreduce accelerator in the network interface, which reduces the latency of such collectives by up to 88%. We assess performance scaling through weak and strong scaling tests for HPCG, LAMMPS, and the miniFE mini-application; for all of these tests, parallelization efficiency is at least 69%.
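As an illustration of the kind of MPI point-to-point microbenchmark behind latency figures such as those quoted above, the sketch below estimates one-way latency between two ranks as half the averaged round-trip time of a 1-byte ping-pong. This is not the authors' benchmark code; the iteration count, warm-up loop, and message size are assumptions chosen only for illustration.

/* Illustrative MPI ping-pong latency microbenchmark (sketch, not the
 * authors' code). One-way latency is estimated as half the average
 * round-trip time of a 1-byte message between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;   /* assumed iteration count */
    char byte = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm up the path before timing. */
    for (int i = 0; i < 100; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / iters;

    if (rank == 0)
        printf("one-way latency estimate: %.2f us\n", rtt / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}

Running the two ranks on directly connected nodes measures a single-hop path; placing them farther apart in the 8 x 4 x 4 topology would correspondingly exercise multi-hop latency.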
