Using PCI Express as a fabric for interconnect clustering

by Miguel Rodriguez, TechOnline India - March 16, 2011


With today's demanding backplane requirements, the era of Gigabit Ethernet (GbE) as the de facto backplane interconnect is coming to a close. As such, a number of interconnect technologies are vying to replace GbE, with the top contenders being 10 Gigabit Ethernet (10GbE), InfiniBand (IB) and PCI Express (PCIe). Though a clear winner has not yet emerged, PCIe, with its advanced capabilities, makes a strong case for becoming the ideal backplane interconnect solution.

Over the last decade, PCIe has evolved from its parallel-bus origins, where it served merely as the transport between a single host and the set of IO devices that host manages, into a point-to-point, high-speed serial interconnect with advanced features capable of taking on challenging backplane demands.

Today, PCIe can easily support an efficient host-to-host communication model as well as other configurations that include IO resource sharing across multiple hosts. Such features lead to a significant reduction in systems’ cost and complexity. 

In addition, mainstream processor companies, such as Intel, have been integrating PCIe -- not just in their chipsets, but as an integral part of the core silicon. With such inherent advantages, PCIe can indeed take on the role of the ideal backplane interconnect.

A fundamental backplane requirement (Figure 1 below) is obviously the need for a powerful fabric delivering high throughput (>10Gbps) and low latency (<5µs). This fabric must also support backplane distances, not only for deploying bladed environments (e.g. blade servers) but also for cabling across multiple blade chassis or potentially supporting rack-mounted servers.

Figure 1. Traditional backplane for supporting IPC, LAN and SAN connectivity.

From a functional point of view, the backplane must support inter-processor communication (IPC) as well as access to an external local area network (LAN) and a storage area network (SAN).  Today, traditional approaches use three different IO interfaces on each server node to accomplish this.  Consequently, three different backplane interconnects are required for supporting the IPC, LAN and SAN communication model in the backplane. 

Figure 1 above shows a traditional backplane where a server uses a GbE interface for LAN connectivity, a Fibre Channel (FC) card for SAN connectivity and a 10GbE- or IB-based card for IPC connectivity.
Clearly, this is neither the optimal nor the preferred model, from either a cost or a complexity standpoint. The need for a unified backplane -- one that supports all three types of traffic, wherein each server connected to the backplane uses a single IO interface instead of three -- is both obvious and necessary.

The interconnect technologies in discussion here -- PCIe, 10GbE and IB -- can all lay claim to the unified backplane, each providing a feature set that supports the application. PCIe, however, delivers the combination of features best suited to a unified backplane.

High Throughput

Unlike 10GbE, PCIe is a lossless fabric. The PCIe specification defines a robust, credit-based flow-control mechanism that prevents packets from being dropped, and every PCIe packet is acknowledged at every hop, ensuring successful transmission. In the case of a transmission error, the packet is replayed.
This happens in hardware, without any involvement of the upper layers. In contrast, 10GbE has an intrinsic tendency to drop packets in the face of congestion and relies on upper-layer protocols, such as TCP/IP, to retransmit the dropped packets, with all the overhead that retransmission entails. PCIe thus provides more reliable communication than Ethernet. Data loss and corruption in storage systems are simply not an option.
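To make the contrast concrete, the following minimal C sketch illustrates the two behaviors just described: a sender that throttles rather than drops when the receiver has granted no buffer credits, and a replay buffer that retransmits outstanding packets when an error is signalled. It is a conceptual illustration only, not the PCIe data-link machinery itself, and all names and sizes are made up for the example.

/* Conceptual sketch of credit-based flow control plus ACK/NAK replay.
 * Not the actual PCIe protocol; names and sizes are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define REPLAY_DEPTH 8

struct link_state {
    int credits;               /* buffer credits granted by the receiver  */
    int packets[REPLAY_DEPTH]; /* copies of packets not yet acknowledged  */
    int pending;               /* number of packets awaiting an ACK       */
};

/* Transmit only when a credit is available; keep a copy for replay. */
static bool send_packet(struct link_state *l, int payload)
{
    if (l->credits == 0 || l->pending == REPLAY_DEPTH)
        return false;          /* throttle instead of dropping the packet */
    l->credits--;
    l->packets[l->pending++] = payload;
    printf("TX packet %d\n", payload);
    return true;
}

/* The receiver acknowledged the oldest packet and returned a credit. */
static void on_ack(struct link_state *l)
{
    for (int i = 1; i < l->pending; i++)
        l->packets[i - 1] = l->packets[i];
    l->pending--;
    l->credits++;
}

/* The receiver signalled an error: replay everything still outstanding,
 * with no upper-layer protocol involved. */
static void on_nak(struct link_state *l)
{
    for (int i = 0; i < l->pending; i++)
        printf("replay packet %d\n", l->packets[i]);
}

int main(void)
{
    struct link_state l = { .credits = 4, .pending = 0 };
    for (int i = 0; i < 6; i++)
        if (!send_packet(&l, i))
            printf("packet %d held back: no credits\n", i);
    on_nak(&l);                /* transmission error -> hardware replay    */
    on_ack(&l);                /* successful delivery frees a credit       */
    return 0;
}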

Providing low latency and high throughput at the hardware level is a good foundation for a high-performance system.  However, equally important is the interconnect’s capability of providing the applications with an efficient interface to maximize use of the underlying hardware. 

PCIe has extremely low end-to-end latency (<1µs). The new PCIe 3.0 standard also raises throughput to 8Gbps per lane, which results in an aggregate bandwidth of 128Gbps on a 16-lane (x16) PCIe interface. Furthermore, dedicated DMA controllers inside PCIe switches, such as those from PLX, provide an efficient, high-performance data mover that can be programmed to push or pull large amounts of data without involving the CPU.
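As a quick sanity check of those figures, the short C program below computes the raw per-direction bandwidth of an x16 link for each PCIe generation, along with the slightly lower figure left after line encoding (8b/10b for Gen1/2, 128b/130b for Gen3). The program is illustrative arithmetic only; the 128Gbps quoted above for a Gen3 x16 link is the raw, one-direction aggregate.

/* Back-of-the-envelope PCIe bandwidth arithmetic for an x16 link. */
#include <stdio.h>

int main(void)
{
    const struct { const char *gen; double gbps_per_lane; double encoding; } rates[] = {
        { "PCIe 1.0", 2.5, 8.0 / 10.0 },    /* 8b/10b encoding    */
        { "PCIe 2.0", 5.0, 8.0 / 10.0 },    /* 8b/10b encoding    */
        { "PCIe 3.0", 8.0, 128.0 / 130.0 }, /* 128b/130b encoding */
    };
    const int lanes = 16;

    for (int i = 0; i < 3; i++) {
        double raw = rates[i].gbps_per_lane * lanes;  /* Gbps, raw line rate  */
        double usable = raw * rates[i].encoding;      /* Gbps, after encoding */
        printf("%s x%d: %.0f Gbps raw, %.1f Gbps after encoding\n",
               rates[i].gen, lanes, raw, usable);
    }
    return 0;
}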

Ethernet offers speeds of 1GbE and 10GbE today, with 40GbE and 100GbE on the roadmap. Throughput, however, is not the only performance metric designers take into consideration, and Ethernet falls short in two important areas -- latency and jitter.

Its inherently lossy nature, which allows packets to be dropped in the face of congestion, results in higher and less predictable latencies. Although protocol enhancements, in the form of Converged Enhanced Ethernet (Figure 2 below), are in the works, it is still unclear whether the improved latencies can rival the low latencies currently offered by both PCIe and IB.

Figure 2. Supporting IPC, LAN and SAN with Converged Enhanced Ethernet (CEE).

IB supports per-lane throughput of up to 14Gbps with IB-FDR and 26Gbps with IB-EDR, while also offering low latencies, so it is commonly deployed in high-performance computing (HPC). However, disadvantages arise when LAN or SAN connectivity to an IB fabric is needed.
For LAN connectivity, servers must use the IP-over-InfiniBand protocol (IPoIB), and traffic must pass through an IPoIB gateway -- one that serves as a bridge between IB and the LAN. Deployments of such components have been minimal at best.
From the perspective of the two end points involved in the communication channel, IB and Ethernet adapters serve as bridges to PCIe, and the communication from these adapters to the server CPU/memory subsystem is itself carried over PCIe. Thus, PCIe holds the key to the bandwidth performance of both Ethernet and IB.
So, rather than terminating PCIe inside the system and using a different protocol (IB or Ethernet) for communication, it is advantageous to extend PCIe outside the system to realize its full latency and bandwidth potential, including direct reads and writes of remote memory -- something only PCIe provides.

Non-Transparency Enables Clustering and More

Non-transparent bridging (NTB) is a feature supported by major PCIe-switch vendors, such as PLX (Figure 3 below). At a high level, it provides a mechanism by which two PCIe subsystems can communicate with each other while keeping their corresponding IO devices isolated from one another. Memory windows can be enabled and programmed to give one host access into the remote host's memory domain for large data transfers.

Mailbox, scratchpad and doorbell registers are also supported, providing a lean yet efficient message-passing mechanism between the two hosts. The main uses of NT are clustering and fail-over. Figure 3 below shows a typical system with and without NT.

Figure 3. Non-Transparent Bridging.

A typical cluster has a number of servers interconnected and working collectively on a given task, distributing a large workload into smaller tasks, or even providing different services to an end user. The interconnect between them is referred to as the backplane, whether it is implemented as a PCB, as in blade servers, or over cables, as in rack-mount servers.

PCIe as a Cluster Interconnect

IPC in a server cluster with PCIe as the backplane can be achieved by using advanced hardware features such as NTB and/or DMA.  As previously mentioned, the need for efficient message passing as well as high throughput data passing is key for IPC.  Messages, small in size, tend to be latency-sensitive while moving large blocks of data benefits from the high-performance interface PCIe provides. 

Using NTB, a system can connect a group of servers into a cluster. The NTB is configured to open memory windows across the bridge into the remote address space. Once configured, when the local CPU generates a read or write operation targeting the NTB's local BAR, the NTB consumes the incoming transaction and automatically generates an outgoing transaction on the remote server.

This transaction contains an address that falls within the remote server’s address space.  From a sender’s perspective, the operation targets its own address space, while from a receiver’s perspective, the operation originates from within its own address space. The implication is that the CPU can directly write data from its first-level cache to a remote memory location. 
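A minimal sketch of this direct-addressing flow on a Linux host is shown below. It assumes the NT endpoint's translated memory window and its register BAR (holding the doorbell and scratchpad registers) are exposed as mappable PCI resources; the device path, window size and doorbell offset are hypothetical placeholders, not any vendor's actual interface.

/* Hypothetical sketch: write into a remote server's memory through an NTB
 * memory window, then ring a doorbell.  Paths, offsets and register layout
 * are placeholders for illustration only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_SIZE  (1 << 20)  /* 1MB translated window (assumed size)     */
#define DOORBELL_OFF 0x0        /* doorbell register offset (assumed value) */

int main(void)
{
    /* Assumed sysfs resources for the NT endpoint's BARs. */
    int win_fd = open("/sys/bus/pci/devices/0000:04:00.1/resource2", O_RDWR);
    int reg_fd = open("/sys/bus/pci/devices/0000:04:00.1/resource0", O_RDWR);
    if (win_fd < 0 || reg_fd < 0) { perror("open"); return 1; }

    /* The memory window: CPU stores here are translated by the NTB into
     * writes that land in the remote host's address space. */
    uint8_t *window = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, win_fd, 0);
    /* The register BAR, holding doorbell/scratchpad/mailbox registers. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, reg_fd, 0);
    if (window == MAP_FAILED || regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* A plain memcpy becomes posted PCIe writes; no network stack involved. */
    const char msg[] = "hello from the local node";
    memcpy(window, msg, sizeof(msg));

    /* Ring the remote host's doorbell so it knows a message has arrived. */
    regs[DOORBELL_OFF / sizeof(uint32_t)] = 1u;

    munmap(window, WINDOW_SIZE);
    munmap((void *)regs, 4096);
    close(win_fd);
    close(reg_fd);
    return 0;
}

In practice this setup is handled by the switch vendor's driver or, on Linux, by the kernel's NTB framework, which programs the window translations and exposes the doorbell and scratchpad registers to software.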

In the context of IPC across a backplane, where most of the traffic involves small and medium-sized messages, this direct-addressing mode, as opposed to using RDMA, for example, is the more efficient approach. For small to medium-sized clusters (fewer than 200 nodes), PCIe is a superior choice for the backplane.

For larger clusters, PCIe can be used to complement 10GbE. Multiple small to medium-sized clusters can be interconnected over a 10GBase-T gateway to increase the number of computing nodes, with the individual clusters sited well beyond PCIe backplane distances. The user then benefits both from the high-performance capabilities of PCIe within each cluster and from the longer reach of 10GBase-T Ethernet.

IO Virtualization

PCIe offers a simplified solution by allowing all IO adapters (whether 10GbE or FC) to be moved entirely outside the server. With a PCIe switch fabric providing virtualization support, each adapter can be shared across multiple servers while still presenting each server with its own logical adapter.
The servers (or the VMs on each server) continue to have direct access to their own set of hardware resources on the shared adapter. The resulting virtualization allows for better scalability, wherein the IO and the servers can be scaled independently of each other.

IO virtualization avoids over-provisioning the servers or the IO resources, leading to cost and power reductions. For example, from a high-availability point of view, each server would normally need two adapters. By disaggregating the IO adapters, one could instead use a single additional adapter as a redundant spare for an entire group of servers.

So, for a group of N servers, this leads to a reduction from 2N adapters to N+1 adapters. PCIe-based IO virtualization will continue to support CEE or other IO adapters while enabling efficient use of expensive adapters by removing the one-to-one association between the server and the IO resource.
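The savings are easy to tabulate; the short C snippet below is nothing more than that formula applied to a few illustrative cluster sizes.

/* Dedicated (2N) versus shared (N+1) adapter counts for a group of servers. */
#include <stdio.h>

int main(void)
{
    const int sizes[] = { 4, 8, 16, 32 };
    for (int i = 0; i < 4; i++) {
        int n = sizes[i];
        int dedicated = 2 * n;   /* two adapters per server for redundancy  */
        int shared = n + 1;      /* one shared adapter each, plus one spare */
        printf("%2d servers: %2d dedicated adapters vs %2d shared (%2d saved)\n",
               n, dedicated, shared, dedicated - shared);
    }
    return 0;
}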

 

                          

Figure 4. Supporting IPC, LAN and SAN with PCI Express.

Mainstream Technology

Yet another important factor determining the long-run cost of deployment is the degree to which processor vendors integrate the technology. Mainstream processor vendors like Intel have long made PCIe part of their chipsets.

The level of integration has now reached the point where PCIe is an integral part of the processor die, residing alongside the memory controller and the processing cores. Such integration eliminates the need for discrete components and offers even wider flexibility in deploying PCIe.

Ethernet isn't as tightly integrated as PCIe, but it has nonetheless been offered by processor vendors as part of their chipsets. Though mostly available as a dedicated PCIe adapter card, 10GbE is on its way to being integrated into the chipset. At the same time, it's unlikely that Ethernet will become part of the processor core in the near future.

As for IB, most of its deployments are in the form of PCIe-based adapters that plug into servers' PCIe slots. IB has not been integrated into any chipset, and because it remains far removed from the processing logic, discrete components are still needed for IB to interface with the rest of the system.
Also, with effectively a single company as the sole supplier of IB-based silicon, the costs associated with deploying it will continue to be higher than those of Ethernet and PCIe.

Conclusion

Unlike other interconnect technologies, PCIe provides a unique cost/performance benefit, offering an affordable solution for low-end deployments while also delivering exceptional performance for high-end applications.

PCIe delivers high performance, in terms of bandwidth and latency; support for an efficient IPC paradigm, through use of non-transparent shared memory; IO virtualization in a multi-host environment, allowing IO resources to be used more efficiently; and integration in mainstream chipsets.

 

About the Author:

Miguel Rodriguez is a field applications engineer at PLX Technology and has been working closely with PCI Express since its inception.  He has authored several PCI Express-based technical articles in various major technical publications and presented at leading trade events.  Rodriguez holds a BSEE from The University of Texas-Pan American.  He can be reached at mrodriguez@plxtech.com.
