Tips and Trends: Processor offload in the Switch

by Gary Lee , TechOnline India - February 02, 2011

Today's communication and networking equipment use a variety of NPU and CPU subsystems. Typically, NPUs are used for Layer 2-4 processing and CPUs are used for Layer 4-7 processing, although more powerful multi-core CPUs are starting to support functions previously reserved for NPUs. In applications such as security appliances, NPUs may be used to identify packet types and/or IP addresses, and then forward the packet to the appropriate CPU for higher-layer processing.

NPU architectures range from a series of unique processing engines, each dedicated to a special function, to a sea of general-purpose cores. Most devices on the market today fall somewhere between these extremes. In any case, the NPU must perform frame processing at the incoming line rate. For example, a 100G NPU must be able to process up to roughly 150 million 64-byte frames every second. In practice, this boils down to a fixed budget of NPU instructions per packet. Complex frame processing requirements, such as those seen in wireless access systems, may exceed the budget the NPU can support, requiring that some of the frame processing be offloaded to external devices such as FPGAs. This adds further cost and complexity to the system.
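The line-rate arithmetic above is easy to check. A minimal sketch (the core count and clock rate below are illustrative assumptions, not figures from the article):

```python
# Back-of-the-envelope line-rate math for a 100G NPU (illustrative only).
LINE_RATE_BPS = 100e9          # 100 Gb/s
FRAME_BYTES = 64               # minimum Ethernet frame size
OVERHEAD_BYTES = 8 + 12        # preamble/SFD + inter-frame gap on the wire

wire_bits = (FRAME_BYTES + OVERHEAD_BYTES) * 8
frames_per_sec = LINE_RATE_BPS / wire_bits
print(f"{frames_per_sec / 1e6:.1f} Mpps")  # ~148.8 Mpps, i.e. "up to 150 million"

# Per-packet instruction budget for a hypothetical 16-core, 1 GHz NPU:
CORES, CLOCK_HZ = 16, 1e9      # assumed figures for illustration
budget = CORES * CLOCK_HZ / frames_per_sec
print(f"~{budget:.0f} instructions per packet")
```

The per-packet budget shrinks quickly as the line rate rises, which is why complex parsing can overflow it and force offload.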
In many cases, a system will be designed such that the packet will travel through an Ethernet switch before it reaches the NPU. The switch may exist on the line card in front of the NPU, or in the backplane (for example, an ATCA hub card) in which case the packet may travel through this switch before reaching an NPU card. Because of this, it would be ideal if the switch could perform some pre-processing for the NPU, eliminating the need for additional devices such as FPGAs.
Historically, Ethernet switches have not had the capability to do the packet pre-processing required by an NPU. They contain simple forwarding engines that look at pre-determined packet header locations and forward the packet with little if any frame modification. These switches have grown more sophisticated in recent years due to the wide range of tunneling and other protocols that have been developed by the networking industry. This has led to more flexible frame parsing and header modification engines built into the switch chips, but supporting advanced NPU pre-processing requires even more flexibility.
One example of this new generation of pre-processing switches is Fulcrum's FocalPoint® FM6000 10GbE switch, based on its third-generation "Alta" architecture. Alta includes the FlexPipe frame processing pipeline, along with total port bandwidth of up to 720Gbps. FlexPipe allows many key blocks, such as the frame parser and the egress frame modification unit, to have their functionality changed through new microcode images. Unlike a built-in processing core, this microcode capability can maintain performance levels of over 1 billion packets per second and less than 300 ns latency under all corner conditions, making it ideal for NPU offload.

Based on the NPU and target market, the microcode can be changed as needed. For example, certain telecom access applications may require that the parser microcode place certain header fields into certain internal processing channels. This header information can be used for frame classification in the frame forwarding unit (FFU). Once Alta classifies the frame, microcode in the egress modify block can be configured to place classification results in various frame header locations based on the requirements of the NPU. Additional information such as header byte offsets can also be presented to the NPU.
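One way to picture the egress-modify step is the switch prepending its classification results as a small metadata tag that the NPU reads instead of re-parsing the frame. The sketch below is a software model only; the tag layout and field names are assumptions, not the FM6000 format:

```python
# Illustrative model of switch pre-processing for an NPU: classify a frame,
# then prepend the result as a small metadata tag so the NPU can skip parsing.
# Tag layout (4 bytes) is a hypothetical example, not a real wire format.
import struct

def preclassify(frame: bytes, traffic_class: int, l4_offset: int) -> bytes:
    # 1-byte class, 1-byte L4 header byte offset, 2 bytes reserved
    tag = struct.pack("!BBH", traffic_class, l4_offset, 0)
    return tag + frame

frame = b"\x00" * 64                      # dummy 64-byte frame
tagged = preclassify(frame, traffic_class=3, l4_offset=34)
print(len(tagged), tagged[0], tagged[1])  # 68 3 34
```

The NPU then reads the class and header offset directly from the tag, saving the per-packet instruction budget that parsing would otherwise consume.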
The FFU classification block contains a 24K by 36-bit cascadable TCAM, which can be used to match header fields and produce results from an action RAM. The action RAM results are used for various purposes later in the frame-processing pipeline, such as frame header modification. The FFU also contains a 64K by 32-bit binary search tree, which can also be configured with up to 16K 128-bit entries. This can be used for longest prefix matching, producing the same action results as the TCAM.
In applications such as security appliances, Alta can be used as an NPU pre-classification engine, supporting up to eight 100G NPUs with its 1-billion-packet-per-second frame processing pipeline. In addition, advanced load distribution features are supported, where various header fields can be used as keys into several hash functions, providing uniform flow-based workloads to these NPUs. Finally, with a low 300 ns latency, the switch acts as a true bump-in-the-wire with very little impact on network performance.
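The load-distribution idea can be sketched simply: hash the fields that identify a flow so every packet of that flow lands on the same NPU. The hash choice (CRC32) and field set below are illustrative assumptions:

```python
# Sketch of flow-based load distribution across NPUs: hash the 5-tuple so
# all packets of a flow map to the same NPU (hash function is illustrative).
import zlib

NUM_NPUS = 8  # the article's example of eight 100G NPUs

def pick_npu(src_ip: str, dst_ip: str, proto: int,
             src_port: int, dst_port: int) -> int:
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % NUM_NPUS

# Packets of the same flow always hit the same NPU index.
print(pick_npu("10.0.0.1", "192.168.1.5", 6, 1234, 80))
```

Keeping a flow pinned to one NPU preserves packet ordering within the flow while still spreading aggregate load across all eight devices.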


About the Author:

Gary Lee is director of product marketing for Fulcrum Microsystems and has been working in the semiconductor industry for over 29 years. For the last 14 years he has been involved in the development of switch fabrics for the telecommunications, data communications and storage industries. While at Vitesse Semiconductor, he was a key member of the team that developed the CrossStream, GigaStream and TeraStream switch fabric families and holds patents in this area. He has also worked on ASI, PCI-Express, SAS and Ethernet switch fabrics. Gary holds BSEE and MSEE degrees from the University of Minnesota.
