Host-Independent PCIe Compute: Where We're Going, We Don't Need Nodes
by Ian Cutress on December 21, 2015 2:00 PM EST
The typical view of a cluster or supercomputer that uses a GPU, an FPGA or a Xeon Phi type device is that each node in the system requires one host or CPU to communicate through the PCIe root complex to 1-4 coprocessors. In some circumstances, the CPU/host model adds complexity, when all you really need is more coprocessors. This is where host-independent compute comes in.
The CPU handles the networking transfer and, when combined with the south bridge, manages the IO and other features. Some configurations allow the coprocessors to talk directly with each other, and the CPU part allows large datasets to be held in local host DRAM. However, for some compute workloads, all you need is more coprocessor cards. Storage and memory might be decentralized, and adding in hosts creates cost and complexity - a host that seamlessly has access to 20 coprocessors is easier to handle than 20 hosts with one each. This is the goal of EXTOLL as part of the DEEP (Dynamical Exascale Entry Platform) Project.
At SuperComputing 15, one of the academic posters on display, from Sarah Neuwirth and her team from the University of Heidelberg, covered developing the hardware and software stacks to allow for host-independent PCIe coprocessors through a custom fabric. This in theory would allow the compute nodes in a cluster to be split specifically into CPU and PCIe compute nodes, depending on the need of the simulation, but also allows for failover or multiple user access. All of this is developed through their EXTOLL network interface chip, which has subsequently been spun out into a commercial entity.
A side note - in academia, it is common enough that the best ideas, if they're not locked down by funding terms and conditions, are spun out into commercial enterprises. With enough university or venture capital in exchange for a percentage of ownership, an academic team can hire external experts to turn their ideas into a commercial product. These ideas either work or fail, or sometimes the intellectual property is sold up the chain to a tech industry giant.
The concept of EXTOLL is to act as a mini-host that initializes the coprocessor and also handles the routing and memory address translation, such that it is transparent to all parties involved. A coprocessor equipped with EXTOLL can be connected into a fabric of other compute, storage and host nodes and yet be accessible to all. Multiple hosts can connect into the fabric, and coprocessors in the fabric can communicate directly with each other without the need to move out to a host. This is all controlled via MPI command extensions for which the interface is optimised.
The top level representation of EXTOLL gives seven external ports, supporting cluster architectures up to a 3D Torus plus one extra. The internal switch manages which network port is in use, derived from the translation layer provided by the IP blocks: VELO is the Virtualized Engine for Low Overhead that deals with MPI and in particular small messages, RMA is the Remote Memory Access unit that implements put/get with one-or-zero-copy operations and zero CPU interaction, and SMFU is the Shared Memory Function Unit for exporting segments of local memory to remote nodes. This all communicates to the PCIe coprocessor via the host interface, which supports either PCIe 3.0 or HyperTransport 3.0.
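As a rough illustration of why a 3D torus accounts for six of the seven ports, the sketch below lists the neighbours of one node in a torus with wraparound links. It is not taken from the EXTOLL material: the function name `torus_neighbors`, the coordinate scheme and the 4x4x4 size are assumptions chosen for the example.

```python
def torus_neighbors(coord, dims):
    """Return the six neighbours of `coord` in a 3D torus of shape `dims`.

    Hypothetical helper for illustration only. Each axis contributes two
    links (one step back, one step forward), and the modulo wraps the
    edges around so every node has the same number of neighbours.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


if __name__ == "__main__":
    dims = (4, 4, 4)  # assumed torus size, 64 nodes in total
    for neighbor in torus_neighbors((0, 0, 3), dims):
        print(neighbor)
```

Two links per axis across three axes use six ports, which leaves the seventh free for the "plus one extra" connection. Note that the six neighbours are only distinct when each dimension has at least three nodes.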
From a topology point of view, EXTOLL does not act as a replacement for a regular network fabric; instead it adds a separate fabric layer. In the diagram above, the exploded view gives compute and host nodes (CN) offering standard fabric options, booster interface nodes (BI) that have both the standard fabric and the EXTOLL fabric, then booster nodes (BN) which are just the PCIe coprocessor and an EXTOLL NIC. With this there can be a one-to-many or a many-to-many arrangement depending on what is needed, or in most cases the BI and BN can be combined into a single unit. From the end user's perspective, this should all be seamless.
I discussed this and was told that several users could allocate themselves a certain number of coprocessors or the admin can set the limits depending on login or other workloads queued.
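To make that allocation model concrete, here is a minimal sketch of a shared coprocessor pool with an admin-set per-user limit. Everything in it is hypothetical: the `CoprocessorPool` class, its method names and the numbers are assumptions for illustration, not part of EXTOLL's software.

```python
class CoprocessorPool:
    """Hypothetical pool of fabric-attached coprocessors shared by users."""

    def __init__(self, total, per_user_limit):
        self.free = set(range(total))         # IDs of unallocated coprocessors
        self.per_user_limit = per_user_limit  # cap set by the admin
        self.owned = {}                       # user -> set of coprocessor IDs

    def allocate(self, user, count):
        """Grant `count` coprocessors to `user`, or raise if not allowed."""
        held = self.owned.setdefault(user, set())
        if len(held) + count > self.per_user_limit:
            raise ValueError(f"{user} would exceed the per-user limit")
        if count > len(self.free):
            raise RuntimeError("not enough free coprocessors")
        granted = sorted(self.free)[:count]   # lowest free IDs first
        self.free.difference_update(granted)
        held.update(granted)
        return granted

    def release(self, user):
        """Return all of a user's coprocessors to the free pool."""
        self.free |= self.owned.pop(user, set())


if __name__ == "__main__":
    pool = CoprocessorPool(total=20, per_user_limit=8)
    print(pool.allocate("alice", 8))
    print(pool.allocate("bob", 4))
    pool.release("alice")
    print(len(pool.free))
```

A real deployment would presumably tie this to the job scheduler and login policy rather than a single in-memory object, but the bookkeeping is the same idea.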
On the software side, EXTOLL sits between the coprocessor driver and the EXTOLL driver as a virtual PCI layer. This layer communicates with the hardware through the EXTOLL driver, telling the hardware to perform the required address translation, MPI messaging and so on. The driver provides the tools to do the necessary translation of PCI commands across its own API.
The goal of something like EXTOLL is to be part of the PCIe coprocessor itself, similar to how Omni-Path will be on Knights Landing, either as a custom IC on the package or internal to the die. That way the EXTOLL connected devices can be developed into devices in a different physical format to the standard PCIe coprocessor cards, perhaps with integrated power and cooling to make design more efficient. The first generation of this was built on an FPGA and used as an add-in to a power and data only PCIe interface. The second generation is similar, but this time has moved out into a 65nm TSMC based ASIC, reducing power and increasing performance. The latest version is the Tourmalet card, using upgraded IP blocks and featuring 100 GB/s per direction and 1.75 TB/s switching capacity.
Early hardware in the DEEP Project, of which EXTOLL is a key part
Current tests with the 2nd generation, the Galibier, and a dual node design gave a speedup of 57% in LAMMPS (a molecular dynamics simulation package).
The concept of host-less PCIe coprocessors is one of the next steps towards exascale computing, and EXTOLL are now straddling the line between commercial products and presenting their research as part of academic endeavours, even though there is the goal of external investment, similar to a startup. I am told they already have interest and proof of concept deployment with two partners, but this sort of technology needs to be integrated into the coprocessor itself - having something the size of a motherboard with several coprocessors talking via EXTOLL without external cables should be part of the endgame here, as long as power and cooling can be controlled. The other factor is ease of integration with software. If it fits easily into current MPI based codes and libraries, in C++ and FORTRAN, and it can be supported as new hardware is developed with new use cases, then it is a positive step. Arguably EXTOLL thus needs to be brought into one of the large tech firms, most likely as an IP purchase, or others will develop something similar depending on patents. Arguably the company best placed for that is Intel with its Omni-Path, but consider that FPGA vendors have close ties to Infiniband, so there could be potential there.
Relevant Paper: Scalable Communication Architecture for Network-Attached Accelerators
8 Comments
YaleZhang - Monday, December 21, 2015 - link
Can you achieve the same thing with PCI bridges? You can have up to 255 PCI busses, so you can host about 128 GPUs on the leaf nodes. It seems the main limitation from using PCI bridges would be that the topology will have to be a tree, which will be a bottleneck if there's no data locality.
Loki726 - Monday, December 21, 2015 - link
I think that part of the problem is GPU driver performance and correct functionality with that many devices, but you can get quite far with bridges.
SleepyFE - Tuesday, December 22, 2015 - link
The GPU is more power hungry and less flexible. With an FPGA you program an ASIC onto it making it suit your needs better.
SaberKOG91 - Tuesday, December 22, 2015 - link
Application Specific Processor, not ASIC. ASICs are full-custom chip designs. Technically an FPGA is an ASIC which can be programmed to perform specific functionality required for an application, but never at the speed or power of an ASIC.
Vatharian - Monday, December 21, 2015 - link
If the main host dies, you have to shut down whole tree. In distributed host-less environment you're mainly taking offline single nodes for very short time. Also in high concurrency PCIe tends to congest, and DRAM throughput is big limitation, especially if data from one channel connected to one CPU needs data from other DRAM/NUMA node. That can introduce high randomness into latency and further decrease performance.
ddriver - Monday, December 21, 2015 - link
Yay, finally, I've been wondering with PCIe switch chips being so affordable, how long before someone figures out you can make crazy fast complex topology interconnect on the budget. I've been looking into using PCIe switches to build supercomputers out of affordable single socket motherboards - snap in a decent CPU and 2 GPUs for compute, use the other PCIe slots to connect to other nodes and there you have it - no need for crazy expensive proprietary interconnect, no need for crazy expensive components.
agentd - Saturday, December 26, 2015 - link
Have you seen the Avago (formerly PLX) PEX9700 series switches?
extide - Monday, December 21, 2015 - link
In there you mention you could combine the BI and the BN but that wouldn't make sense .. I mean then you have a server with a cpu and an accelerator .. right where we started. However it makes sense to combine the CN and the BI -- that way you don't need the extra layer of BI nodes.
Am I misunderstanding this or was that a typo? or?