TechOnline India
    Maximize multicore performance with content aware routing
    When moving from a single-processor, single-core system to a multi-processor, multi-core system, are you finding that I/O performance does not scale at 10Gb/s? This article explains why, and how content-aware routing can solve the problem.
    CommsDesign
    Currently, all traffic arriving at a server or appliance from its I/O ports (e.g., a network port or storage port) passes through the platform hub (chipset) and is sent to the server processor(s)/core(s). The processor that receives the data is responsible for classifying the data frames and directing them to their final destination. This destination can be another I/O agent: a network port or local/remote storage (e.g., a disk).

    In most cases, the server processor only needs to inspect and process a small portion of the data frame (the header or certain header fields), or the decision can even be made automatically (e.g., sending compressed packets to a decompression engine).

    Because of the way I/O traffic is currently handled, all frame data is sent to the processor(s), resulting in the following:

    • Increased traffic over the platform hub bus interface (PCIe or HT)
    • Increased traffic towards the processor's host memory (in a NUMA architecture) or towards the shared host memory (in a non-NUMA architecture)
    • Increased cache pollution caused by redundant data loaded into the processor cache
    • Increased processing latency, as more data is written to the host memory, which in turn becomes a bottleneck
    • Increased power consumption due to more host memory accesses, more traffic on the local bus, more processing on the processors, etc.
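To give a rough sense of how much of a frame is redundant when only the headers are needed, the sketch below computes the header share of a full-size frame. It assumes an untagged Ethernet frame carrying IPv4 and TCP with no options; these sizes and the function name `header_fraction` are illustrative choices, not figures from the article.

```python
# Assumed header sizes: untagged Ethernet, IPv4 without options, TCP without options.
ETH_HEADER = 14
IPV4_HEADER = 20
TCP_HEADER = 20
MTU = 1500  # standard Ethernet MTU: IP header + TCP header + data


def header_fraction(frame_len: int) -> float:
    """Fraction of a frame (FCS excluded) occupied by Ethernet/IPv4/TCP headers."""
    headers = ETH_HEADER + IPV4_HEADER + TCP_HEADER
    if frame_len < headers:
        raise ValueError("frame is shorter than its headers")
    return headers / frame_len


full_frame = ETH_HEADER + MTU  # 1514 bytes on a standard Ethernet link
print(f"headers are {header_fraction(full_frame):.1%} of a full-size frame")
```

Under these assumptions the headers make up about 3.6% of a full-size frame, so a processor that only inspects headers has the remaining bytes pushed through the bus, memory, and cache for nothing.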

    In recent years, these problems have worsened with the introduction of multi-core/multi-processor environments and the increase in I/O bandwidth from networks and storage: network ports currently run at 10Gb/s, with 40Gb/s and 100Gb/s to follow.
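The line rates above translate into a per-packet time budget. The following back-of-the-envelope sketch assumes the worst case of minimum-size Ethernet frames (64 bytes including FCS) plus the standard 8-byte preamble/SFD and 12-byte inter-frame gap; the function names are illustrative.

```python
PREAMBLE_AND_SFD = 8  # bytes sent on the wire before every frame
INTERFRAME_GAP = 12   # minimum idle time between frames, in byte times
MIN_FRAME = 64        # minimum Ethernet frame size, FCS included


def packets_per_second(link_bps: float, frame_bytes: int = MIN_FRAME) -> float:
    """Maximum frame rate of a link for a given frame size."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD + INTERFRAME_GAP) * 8
    return link_bps / wire_bits


def ns_per_packet(link_bps: float, frame_bytes: int = MIN_FRAME) -> float:
    """Time available to handle one frame at line rate, in nanoseconds."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


for gbps in (10, 40, 100):
    rate = gbps * 1e9
    print(f"{gbps:>3} Gb/s: {packets_per_second(rate) / 1e6:.2f} Mpps, "
          f"{ns_per_packet(rate):.2f} ns per packet")
```

At 10Gb/s this works out to roughly 14.88 million packets per second, or about 67 ns per packet, which is comparable to the cost of a single access to host memory.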

    The overall effect is that most users have found that their I/O performance did not scale well after moving from a single-processor, single-core system to a multi-processor, multi-core system. Moreover, they have found it difficult to scale their systems to handle traffic in the range of 10Gb/s and higher.

    The solution
    To solve the problems described above, we recommend an improved platform architecture that can serve as the base for future chipsets. The improved architecture is based on content-aware routing of incoming traffic. Before presenting the solution, we first review the previous and current generations of platform architecture.

    Previous generation chipsets
    In the past, chipsets were aimed at supporting three main functions:

    • Interface to external memory (SDRAM, DDR)
    • Interface to the graphics card
    • Bridge to the local bus (PCI, PCI Express) used to connect to various peripherals (e.g., networking and storage)

    Figure 1 below describes the initial platform architecture: a processor connected to a chipset, generally based on two devices, the Northbridge (Memory Controller Hub, MCH) and the Southbridge (I/O Controller Hub, ICH). The chipset contains an integrated memory controller, a graphics interface, and a bridge to the platform's local bus and peripheral interfaces.


    In recent years, we have witnessed a migration towards NUMA (Non-Uniform Memory Access) architecture, in which the memory controller becomes part of the processor. This addresses the shared-memory bottleneck that arises as the number of cores increases.

    Additional trends include:

    • Integrating the graphics core into the processor
    • Allowing direct connection to the processor (e.g., using HyperTransport in AMD processors)

    As a result of these trends, only the bridge functionality remains in the chipset.


    In parallel, the amount of traffic handled by some interfaces (LAN or SAN) has increased dramatically (from 1Gb/s to 10Gb/s), and, as noted above, the current trend is towards multi-processor, multi-core environments.

    These trends have created the need for a new chipset architecture that handles not only data transfer but also more complex tasks, such as load balancing, higher-level traffic classification, persistency, bandwidth management, and other tasks that enhance performance and improve I/O scaling.
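One common way to combine load balancing with persistency is to hash each flow's identifying fields and use the result to pick a core, so that every packet of a flow lands on the same core. The sketch below is a software illustration under that assumption; the `Flow` type and `core_for_flow` function are hypothetical, and CRC32 stands in for whatever hash a real chipset would implement.

```python
import zlib
from collections import Counter
from typing import NamedTuple


class Flow(NamedTuple):
    """Hypothetical 5-tuple identifying a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def core_for_flow(flow: Flow, num_cores: int) -> int:
    """Map a flow to a core; the same flow always maps to the same core."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_cores  # CRC32 is a stand-in hash, an assumption


# 1000 synthetic flows that differ only in source port, spread over 4 cores.
flows = [Flow("10.0.0.1", "10.0.0.2", 1024 + i, 80, 6) for i in range(1000)]
load = Counter(core_for_flow(f, 4) for f in flows)
print(sorted(load.items()))
```

Because the mapping depends only on the flow's fields, no per-flow state is needed to keep a flow on one core, although changing the number of cores remaps existing flows.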

    New Chipset Architecture
    The new chipset architecture defined below effectively handles the challenges described above.

    The new chipset is detailed in Figure 3.


    The new chipset includes the following:

    1. Switch fabric--Previously, incoming traffic was first sent to the host memory, processed by the host processor(s), and then (if required) sent back to other I/O ports. We recommend a different approach: move the "decision" elements into the chipset (via programming) so that it can forward directly to the relevant port any traffic that requires no host processing, or that needs external processing (decryption or decompression) before host processing. For this reason, the chipset requires a switch fabric that allows such forwarding between all ports.
    2. Classification engines and action engines--"Smart" classification gives a fine-grained understanding of the traffic's nature (e.g., on a per-packet or per-flow basis), which allows more accurate traffic management and decisions. It also yields a richer set of decisions: not only filter/direct decisions, but also the option to send traffic to another port, finer granularity in choosing the destination core, and control over which portion of the data is sent (whole packets, headers, or descriptors). Action engines carry out the classification decisions, including filtering traffic, forwarding traffic to its destination, and modifying packet fields (adding, removing, or changing fields). In addition, action engines are used to offload tasks for outgoing traffic (e.g., checksum calculation and time stamping).
    3. DMA engine--The DMA engine handles traffic sent to the various processors and cores. Physical functions or DMA channels per core, each with its own programmable parameters (e.g., priority, discard policy, interrupt policy), provide flexibility and QoS. Descriptor-based DMA allows information extracted from incoming traffic during classification to be sent as part of the descriptor, which reduces processing time (if only the descriptor is sent to the host memory, I/O and memory traffic are saved as well).
    4. Memory interface--Adding physical memory to the chipset to store traffic that does not require further processing by the host processor saves I/O and host memory bandwidth. The processor should receive only the traffic parameters (e.g., from the descriptor or header) and send routing decisions back to the chipset. This memory addresses a new problem that arises with the NUMA architecture, in which all traffic, whether or not it requires processing, is sent to the host memory.
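The interplay of the four blocks above can be sketched in software as a classify-then-act pipeline. This is only an illustrative model, not the chipset's actual interface: the `Packet` flags, the `Action` names, the `Descriptor` fields, and `DECOMPRESS_PORT` are all hypothetical, and the flags stand in for results that a real classification engine would derive from the packet contents.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    DROP = auto()                 # filtered by the action engine
    FORWARD_TO_PORT = auto()      # switched to another I/O port, bypassing the host
    HEADER_TO_CORE = auto()       # header goes to the host; payload stays in chipset memory
    FULL_PACKET_TO_CORE = auto()  # whole packet is DMA'd to the destination core


@dataclass
class Packet:
    header: bytes
    payload: bytes
    blocked: bool = False             # hypothetical filter match
    compressed: bool = False          # hypothetical "needs decompression" result
    needs_host_payload: bool = False  # hypothetical "host must see the payload" result


@dataclass
class Descriptor:
    action: Action
    target: int         # port number or core index; -1 when dropped
    bytes_to_host: int  # packet bytes written to host memory (descriptor size not counted)


DECOMPRESS_PORT = 7  # hypothetical port of an external decompression engine


def classify(pkt: Packet, core: int) -> Descriptor:
    """Turn classification results into a descriptor for the action and DMA engines."""
    if pkt.blocked:
        return Descriptor(Action.DROP, -1, 0)
    if pkt.compressed:
        return Descriptor(Action.FORWARD_TO_PORT, DECOMPRESS_PORT, 0)
    if pkt.needs_host_payload:
        size = len(pkt.header) + len(pkt.payload)
        return Descriptor(Action.FULL_PACKET_TO_CORE, core, size)
    return Descriptor(Action.HEADER_TO_CORE, core, len(pkt.header))


packets = [
    Packet(b"h" * 54, b"p" * 1460),
    Packet(b"h" * 54, b"p" * 1460, compressed=True),
    Packet(b"h" * 54, b"p" * 1460, needs_host_payload=True),
    Packet(b"h" * 54, b"p" * 1460, blocked=True),
]
descriptors = [classify(p, core=0) for p in packets]
host_bytes = sum(d.bytes_to_host for d in descriptors)
total_bytes = sum(len(p.header) + len(p.payload) for p in packets)
print(f"{host_bytes} of {total_bytes} bytes reach host memory")
```

In this toy run only the packet that genuinely needs host processing is copied in full; the others are dropped, switched to another port, or represented to the host by their header alone.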