Mixed abstraction furthers HLS advantage

by Thomas Bollaert and Mike Fingeroff, Mentor Graphics, TechOnline India - March 01, 2011

Mixing high-level models of varying degrees of abstraction allows high-level synthesis tools to produce full chip designs in a highly productive and efficient way. Dual-language HLS flows allow designers to express complex interface protocols using a timed SystemC source while keeping the rest of the design functionality in pure untimed ANSI C++. And they can express structure and hierarchy either by using SystemC modules or by inferring them from natural C++ boundaries.
 
It is important to understand where and how to use these modeling options in order to increase productivity and QoR. In brief, when modeling the algorithmic aspects of a design, use untimed models; when timing and concurrency are involved, add these details using SystemC constructs. The point is to keep the models as simple as possible for as long as possible. Simplicity is bounded by what constitutes sufficient detail: models must have enough detail to be meaningful, yet be no more detailed than accuracy requires.
 
Perhaps this can be made clearer if we think of a design as consisting of four modeling domains: processing, control, interfaces, and hierarchy. Let’s look at each of these independently to see how to best utilize a mixed abstraction HLS flow.

Timing-Independent Behavior

There are many portions of a system where processing functionality does not depend on specific timing properties. Such blocks are algorithmic in nature and can be expressed as a transfer function or data path where time is not a parameter. All that matters is getting data in, crunching it, and producing the results.
 
Time will be an artifact of the implementation, but it’s not an attribute of the functionality. Because time is not a part of the behavior, there is no need to add timing detail. Thus, purely untimed models are more appropriate for algorithmic applications, following our primary directive not to add detail unnecessarily.
 
The implementation will almost always require some parallelism, but HLS tools are well able to extract parallelism from sequential source code. Describing the parallelism in the source therefore adds nothing, and it creates more work and more overhead: more coding, slower simulations, and harder debugging. Moreover, the algorithmic model will often be written in pure C/C++ to begin with, so you might as well keep it simple and untimed at this point and use the language it was written in.
 
Below is a synthesizable C++ implementation of an 8-tap, constant-coefficient FIR filter.
 
#include <ac_int.h>

typedef ac_int<16,true> i_typ;   // input sample type
typedef ac_int<16,true> c_typ;   // coefficient type
typedef ac_int<16,true> o_typ;   // output sample type

const int N = 8;
const c_typ coeffs[N] = {1,2,3,4,5,6,7,8};

#pragma hls_design
void fir (i_typ &in, o_typ &out) {
    static i_typ sreg[N];   // delay line, sized by the tap count

    // Shift the delay line and insert the new sample
    SHIFT_LOOP: for (int i=N-1;i>0;i--)
        sreg[i] = sreg[i-1];
    sreg[0] = in;

    // Multiply-accumulate across all taps
    o_typ acc = 0;
    MAC_LOOP: for (int j=0;j<N;j++)
        acc += sreg[j] * coeffs[j];
    out = acc;
}
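
To try the untimed behavior without any HLS headers, the same filter can be sketched in plain C++. This is only an illustration under stated assumptions: standard integer types stand in for ac_int, the state lives in a struct instead of a static array, and the names FirModel and check_impulse_response are hypothetical. The check feeds a unit impulse, which should replay the coefficients in order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kTaps = 8;
const std::int16_t kCoeffs[kTaps] = {1, 2, 3, 4, 5, 6, 7, 8};

// Plain C++ stand-in for the filter (assumption: std::int16_t and
// std::int32_t replace the ac_int types of the synthesizable source).
struct FirModel {
    std::array<std::int16_t, kTaps> sreg{};   // delay line, starts at zero

    std::int32_t step(std::int16_t in) {
        // Shift the delay line by one sample; the oldest sample falls off.
        for (int i = kTaps - 1; i > 0; --i) sreg[i] = sreg[i - 1];
        sreg[0] = in;
        // Multiply-accumulate; no clock or cycle appears anywhere.
        std::int32_t acc = 0;
        for (int j = 0; j < kTaps; ++j) acc += sreg[j] * kCoeffs[j];
        return acc;
    }
};

// Self-checking usage: a unit impulse should return each coefficient once.
void check_impulse_response() {
    FirModel f;
    for (int n = 0; n < kTaps; ++n)
        assert(f.step(n == 0 ? 1 : 0) == kCoeffs[n]);
}
```

Nothing in the model says how many cycles a call takes or whether the multiplications run in parallel; those choices are left to synthesis.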
 
Time-Dependent Logic

All this changes when you get to the control logic. Control logic is the opposite of algorithmic processing: it is time dependent. The results, the output of the block, depend on when the inputs arrive.
 
Typically control logic needs to respond to something. It involves tight interactions between blocks in the design, as in, for example, a request-response type of interaction. When a control block is triggered to do something and needs to come back with a response almost immediately, concurrency is required.
 
Now that time is part of the functional specification, it is part of what defines correctness, so it needs to be in the source code, and you need to model it. Therefore, cycle-accurate SystemC is a more natural choice for reactive control because it allows explicit modeling of timing and concurrency.
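
A minimal sketch of why time becomes part of correctness, written in plain C++ and not in SystemC: each call to tick() stands for one clock edge, and the block answers a request a fixed number of cycles later while ignoring requests that arrive while it is busy. The names Responder, kLatency, and simulate, and the latency of two cycles, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kLatency = 2;   // assumed response latency in cycles

// Hypothetical request/response block, advanced one clock edge per tick().
struct Responder {
    int busy = 0;   // remaining busy cycles, 0 means idle

    // Returns true on the cycle in which the response is driven.
    bool tick(bool req) {
        if (busy > 0) {
            --busy;               // a request seen while busy is ignored
            return busy == 0;     // respond when the countdown expires
        }
        if (req) busy = kLatency; // accept the request and start counting
        return false;
    }
};

// Apply a per-cycle request trace and return the per-cycle response trace.
std::vector<bool> simulate(const std::vector<bool>& req) {
    Responder r;
    std::vector<bool> done;
    for (bool q : req) done.push_back(r.tick(q));
    return done;
}

void check_timing_dependence() {
    // Two requests spaced three cycles apart: both are served.
    const std::vector<bool> spaced =
        simulate({true, false, false, true, false, false});
    // Two requests on consecutive cycles: the second hits a busy block.
    const std::vector<bool> back_to_back =
        simulate({true, true, false, false, false, false});
    assert(std::count(spaced.begin(), spaced.end(), true) == 2);
    assert(std::count(back_to_back.begin(), back_to_back.end(), true) == 1);
}
```

Both traces contain the same two requests; only their arrival times differ, and so do the outputs. An untimed model has no way to express that difference.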
 
It does not take a complex example to illustrate this. A useful feature of SystemC here is that the testbench executes as a thread in parallel with the DUT, which lets the testbench react to any control requests coming from the DUT. The example shown below is a simple averaging filter that averages two values of x and writes the output to y. The average module, the DUT, provides a static base address and offset to the testbench, which uses them to look up the values of x from some location in memory. Each read of x blocks until the testbench provides the data using the address written to addr. In SystemC this is not an issue, since the testbench runs concurrently with the DUT. When the DUT stalls, control is transferred back to the testbench, allowing it to react to the address supplied by the DUT.
 
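#define BASE_ADDR 1024
#define OFFSET 128
SC_MODULE (average){
    sc_in< bool > clk;
    sc_in< bool > rst;
    wait_in< ac_fixed<8,1> >  x;      // modular IO input
    wait_out< ac_fixed<19,4> >  y;    // modular IO output
    sc_out<int > addr;                // address handed to the testbench
    SC_CTOR(average):
        x("x"),
        y("y")
    {
        SC_CTHREAD(exec,clk.pos());
        reset_signal_is(rst,true);
    }
    void exec (){
        wait();                                 // end of the reset section
        while(1){
            ac_fixed<19,4> temp = 0;
            addr.write(BASE_ADDR + OFFSET);     // tell the testbench where to read
            wait();
            for(int i=0;i<128;i++){
                temp += x.get();                // blocks until the testbench supplies x
                if(i&1){                        // every second sample
                    y.put(temp/2);              // write the average of the pair
                    temp = 0;
                }
            }
        }
    }
};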
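
The hand-off between the two sides can be approximated in plain C++ without a simulation kernel. The sketch below is single-threaded, so it does not reproduce SystemC concurrency; the testbench drives the model one sample at a time, and the model makes no progress between calls, which stands in for the blocking read. AverageDut, supply_x, run_testbench, and the memory contents are all hypothetical, and double replaces the ac_fixed types.

```cpp
#include <cassert>
#include <map>
#include <vector>

constexpr int kBaseAddr = 1024;
constexpr int kOffset   = 128;

// Hypothetical DUT model: it publishes an address, then consumes one x per
// call. Between calls it does nothing, like a thread stalled in x.get().
struct AverageDut {
    int addr = kBaseAddr + kOffset;   // address offered to the testbench
    double temp = 0.0;                // running sum of the current pair
    int count = 0;                    // samples consumed so far
    std::vector<double> y;            // averages written so far

    void supply_x(double x) {
        temp += x;
        if (count & 1) {              // every second sample
            y.push_back(temp / 2);
            temp = 0.0;
        }
        ++count;
    }
};

// Hypothetical testbench: looks up the samples stored at the address the
// DUT asked for and feeds them in one at a time.
std::vector<double> run_testbench(
        const std::map<int, std::vector<double>>& memory) {
    AverageDut dut;
    for (double x : memory.at(dut.addr)) dut.supply_x(x);
    return dut.y;
}

void check_average() {
    std::map<int, std::vector<double>> memory;
    memory[kBaseAddr + kOffset] = {0.25, 0.75, -0.5, 0.5};
    const std::vector<double> y = run_testbench(memory);
    assert(y.size() == 2);
    assert(y[0] == 0.5 && y[1] == 0.0);
}
```

In the SystemC version the kernel performs this switching automatically, which is why the DUT source can read as one straight-line loop.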
 

Complex and Simple Interfaces

An interface implies a protocol, and a protocol means time, so interfaces generally fall under the category of control. Interfaces can be complex or simple, and they can be custom or standard.
 
When timing is part of the interface behavior, as with bus protocols, cycle accuracy is the better choice. With protocols, you usually want to test how something is done: is it compliant? Because time is a property of protocols, you need to simulate and test the impact of the interface on timing in the source description, which means using timed models and concurrency. Because the protocol is hard-coded in the source, you can test the consequences of a specific protocol there, and the HLS tool will not change it; it will build it the way you tell it to.
 
Yet for simpler interfaces, such as the point-to-point connections used to connect the blocks found in processing pipelines, we recommend abstract, untimed transactions, because the goal is to make sure that the transfer of data is “safe,” regardless of how it is implemented. In other words, all you care about is what it does.
 
If you care only about the outcome, the what, you don’t want to redesign the interface in your own model. You simply pull the interface implementation that fits your needs from a library definition. Because the source contains no protocol model, its behavior stays correct whichever protocol is chosen. Thus, instead of committing to the interface in the source, you commit to it as you go through synthesis. A purely sequential description preserves the ability to change from one interface protocol to another without modifying the source, resulting in less chance of errors and more time for exploring design space options.
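
As a loose software analogy, the sketch below keeps one function body and swaps the channel implementation underneath it. The template argument plays the role that a synthesis constraint plays in an HLS flow; that mapping is an assumption made for illustration, and WireChannel, FifoChannel, scale_by_three, and run_once are hypothetical names, not library classes.

```cpp
#include <cassert>
#include <queue>

// Two hypothetical channel implementations with the same get()/put() shape.
template <typename T>
struct WireChannel {             // holds a single value, like a plain wire
    T value{};
    void put(const T& v) { value = v; }
    T get() { return value; }
};

template <typename T>
struct FifoChannel {             // buffers values in arrival order
    std::queue<T> q;
    void put(const T& v) { q.push(v); }
    T get() {
        assert(!q.empty());      // reading an empty FIFO is a usage error
        T v = q.front();
        q.pop();
        return v;
    }
};

// The body states what is transferred, not how the transfer is implemented.
template <typename In, typename Out>
void scale_by_three(In& in, Out& out) {
    out.put(in.get() * 3);
}

// Run the same source once over whichever channel type is selected.
template <template <typename> class Channel>
int run_once(int x) {
    Channel<int> in, out;
    in.put(x);
    scale_by_three(in, out);
    return out.get();
}
```

Calling run_once<WireChannel>(7) and run_once<FifoChannel>(7) exercises identical source for scale_by_three; only the selected implementation changes.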
 
An example system with a bus-based communication architecture is shown in Figure 1. The system contains a master and two slaves communicating through an AMBA 3.0 AHB-Lite bus. This example shows how to combine untimed functional descriptions with cycle-accurate interface protocols wrapped in Modular IO classes. In this example, the master is assumed to be the target for HLS. The rest of the components in the system are considered external from a synthesis standpoint. However, these external components exist in the testbench for verification.
 
The master and slaves communicate using Modular IO interface classes that abstract the protocol used for communication. The reason for this is to improve ease of use and design reuse. The master consists of three two-way handshake interfaces and an AHB-Lite bus interface (bus I/F). The two-way handshake interfaces are the command interface (cmd in I/F), the data in interface (data in I/F), and the data out interface (data out I/F). Depending on the command received from cmd in I/F, the master does one of the following: 1) reads samples from data in I/F and writes them to Slave 1 memory using bus I/F; 2) reads samples one at a time from Slave 1 memory using bus I/F, filters each with the FIR filter, and writes the result to Slave 2 memory using bus I/F; or 3) reads samples from Slave 2 memory using bus I/F and writes them out on data out I/F.

 

Figure 1: Bus-based communication architecture.

 

The master’s interface and port definitions are provided below:
 
SC_MODULE (bus_master)
{
public:

    sc_in<bool> hclk;
    sc_in<bool> hreset;

    wait_in<uint2> cmd_in;                  // command interface
    wait_in<i_typ> in;                      // data in interface
    wait_out<o_typ> out;                    // data out interface

    bus_master_if<data_t,addr_t> bus_if;    // AHB-Lite bus interface
 
The master uses the wait_in interface class to receive commands and data. It uses the wait_out interface class to send data out.
 
The SC_CTHREAD process inside the master is provided below:
 
    void proc()
    {
        data_t rdata, wdata;
        i_typ fir_in, din;
        o_typ fir_out, dout;
        uint2 cmd;

        // Reset section: put every interface in a known state
        bus_if.reset();
        cmd_in.reset();
        in.reset();
        out.reset();

        while(true)
        {
            wait();
            cmd = cmd_in.get();

            if (cmd == READ_INPUT){
                // Copy 64 input samples into Slave 1 memory
                for (int i=0;i<64;i++){
                    din = in.get();
                    wdata = din;
                    bus_if.mem_write(MEM1_BASE_ADRS+i,wdata);
                }
            }
            else if (cmd == FILTER){
                // Filter Slave 1 memory into Slave 2 memory, one sample at a time
                for (int i=0;i<64;i++){
                    rdata = bus_if.mem_read(MEM1_BASE_ADRS+i);
                    fir_in = rdata;
                    fir<0>(fir_in,coeffs,fir_out);   // untimed C++ filter function
                    wdata=fir_out;
                    bus_if.mem_write(MEM2_BASE_ADRS+i, wdata);
                }
            }
            else if (cmd == WRITE_OUTPUT){
                // Stream Slave 2 memory to the data out interface
                for (int i=0;i<64;i++){
                    rdata = bus_if.mem_read(MEM2_BASE_ADRS+i);
                    dout = rdata;
                    out.put(dout);
                }
            }
        }
    }
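
For checking results, it can help to have an untimed reference model of the three commands with the bus and handshakes removed. The sketch below is one way to write such a model in plain C++. Everything in it is an assumption made for illustration: plain arrays stand in for the two slave memories, int replaces the article's data types, and the Command values, MasterModel, and the embedded FIR helper are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <vector>

enum Command { READ_INPUT, FILTER, WRITE_OUTPUT };   // assumed encoding
constexpr int kBlock = 64;                            // samples per command
constexpr int kTaps = 8;
const int kCoeffs[kTaps] = {1, 2, 3, 4, 5, 6, 7, 8};

struct MasterModel {
    std::array<int, kBlock> mem1{};   // stands in for Slave 1 memory
    std::array<int, kBlock> mem2{};   // stands in for Slave 2 memory
    std::array<int, kTaps> sreg{};    // FIR delay line

    int fir(int in) {
        for (int i = kTaps - 1; i > 0; --i) sreg[i] = sreg[i - 1];
        sreg[0] = in;
        int acc = 0;
        for (int j = 0; j < kTaps; ++j) acc += sreg[j] * kCoeffs[j];
        return acc;
    }

    // in: samples consumed by READ_INPUT; out: samples appended by WRITE_OUTPUT.
    void execute(Command cmd, const std::vector<int>& in, std::vector<int>& out) {
        if (cmd == READ_INPUT) {
            for (int i = 0; i < kBlock; ++i) mem1[i] = in.at(i);
        } else if (cmd == FILTER) {
            for (int i = 0; i < kBlock; ++i) mem2[i] = fir(mem1[i]);
        } else if (cmd == WRITE_OUTPUT) {
            for (int i = 0; i < kBlock; ++i) out.push_back(mem2[i]);
        }
    }
};

// Self-checking usage: an impulse block should come back as the coefficients.
void check_master_model() {
    MasterModel m;
    std::vector<int> in(kBlock, 0), out;
    in[0] = 1;                                   // unit impulse
    m.execute(READ_INPUT, in, out);
    m.execute(FILTER, in, out);
    m.execute(WRITE_OUTPUT, in, out);
    assert(static_cast<int>(out.size()) == kBlock);
    for (int i = 0; i < kBlock; ++i)
        assert(out[i] == (i < kTaps ? kCoeffs[i] : 0));
}
```

A model like this says nothing about AHB-Lite timing; it only fixes the expected data so the cycle-accurate version can be compared against it.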
 

Figure 2 shows the advantages of using abstract untimed interfaces when only simple point-to-point protocols are required. The top-level C++ source contains only C++ variables with no concept of interface protocol. The process of interface synthesis allows designers to specify simple, timed protocols as constraints during the synthesis process.
 

 

Figure 2: Using interface synthesis to add a timed protocol to C++ interfaces.

 

Partitioning the Design

The term hierarchy can mean several things when talking about hardware design. In the context of HLS, it is used to denote the partitioning of a section of the design so that it can run concurrently with respect to the rest of the system.
 
Structuring the design in SystemC is straightforward as the hierarchy is explicitly hardcoded in the source. The downside is that the HLS tool has no ability to explore or optimize the design across those explicit boundaries. Hierarchical decisions have to be made early in the design flow, and if they turn out to be wrong, you have to go back to the source code and restructure the design.
 
A more powerful approach involves synthesizing hierarchical partitions from a more abstract and uncommitted source description. With this approach, the resulting architecture is defined during the synthesis process as opposed to being locked in the source. Advanced HLS tools can infer RTL hierarchy from arbitrary design boundaries present in the abstract model. The HLS tool leverages user constraints to determine if C++ scopes, loops, or functions should be mapped on separate blocks or on the same one. This approach offers greater flexibility than having to hardcode structure with explicit SystemC constructs. This results in greater exploration potential for the user and higher reusability of the models. All of which are key advantages when optimizing for performance and area in order to improve QoR.
 
The following example shows an algorithmic implementation of a Discrete Cosine Transform (DCT). There are two main processing sets of loops, mult1 and mult2, that exchange data via a two-dimensional array, temp. In C++ synthesis, the user has the flexibility of using design constraints to push the nested loops into hierarchy, allowing them to run concurrently with each other. The temp array used to exchange data between the loops is inferred as a communications channel that can be constrained to be anything from a ping-pong memory, for highest performance, to a shared memory. Trying to implement this type of functionality in SystemC would require at least three SC_MODULEs: two to implement the nested loop processing and a third to implement the memory architecture of the shared communications channel. Because the memory architecture is hard-coded, re-coding would be required to meet a different set of performance constraints.
 
#pragma design top
void hier_dct(ac_int<9> input[XYSIZE][XYSIZE], ac_int<11> output[XYSIZE][XYSIZE]) {
    ac_int<21> temp[XYSIZE][XYSIZE];
    ac_int<21> tmp;
    ac_int<31> dct_value;
    #pragma hls_design
    mult1:for (int i=0; i < XYSIZE; ++i )
        middle1:for (int j=0; j < XYSIZE; ++j ) {
            tmp = 0;
            inner1:for (int k=0; k < XYSIZE; ++k )
                tmp = tmp + input[i][k] * coeff[j][k];
            temp[j][i] = tmp;
        }
    #pragma hls_design
    mult2:for (int i=0 ; i < XYSIZE; ++i )
        middle2:for (int j=0; j < XYSIZE; ++j ) {
            dct_value = 0;
            inner2:for (int k=0 ; k < XYSIZE ; ++k )
                dct_value = dct_value + coeff[i][k] * temp[j][k];
            output[i][j] = dct_value >> 20;
        }
}
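
To make the ping-pong option concrete, here is a small software sketch of two stages exchanging blocks through a double-banked buffer. It is an illustration only: in an HLS flow the tool builds the memory from a constraint, while here PingPong, run_pipeline, and the two toy stages (doubling, then summing) are hypothetical stand-ins for mult1, mult2, and temp.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical double-banked buffer: the producer owns one bank while the
// consumer reads the other, then the roles swap.
template <typename T, int N>
struct PingPong {
    std::array<T, N> bank[2] = {};
    int write_sel = 0;                                   // producer's bank
    std::array<T, N>& write_bank() { return bank[write_sel]; }
    const std::array<T, N>& read_bank() const { return bank[1 - write_sel]; }
    void swap() { write_sel = 1 - write_sel; }           // hand the bank over
};

// Stage 1 doubles each element of a block; stage 2 sums the previous block.
std::vector<int> run_pipeline(const std::vector<std::array<int, 4>>& blocks) {
    PingPong<int, 4> pp;
    std::vector<int> sums;
    const int n = static_cast<int>(blocks.size());
    for (int s = 0; s <= n; ++s) {
        if (s > 0) {                    // stage 2 consumes block s-1
            int sum = 0;
            for (int v : pp.read_bank()) sum += v;
            sums.push_back(sum);
        }
        if (s < n) {                    // stage 1 produces block s
            for (int i = 0; i < 4; ++i) pp.write_bank()[i] = 2 * blocks[s][i];
        }
        pp.swap();
    }
    return sums;
}

void check_pipeline() {
    std::vector<std::array<int, 4>> blocks;
    std::array<int, 4> a = {{1, 2, 3, 4}};
    std::array<int, 4> b = {{5, 6, 7, 8}};
    blocks.push_back(a);
    blocks.push_back(b);
    const std::vector<int> sums = run_pipeline(blocks);
    assert(sums.size() == 2);
    assert(sums[0] == 20 && sums[1] == 52);
}
```

Within one step the two stages touch different banks, so in hardware they could run at the same time; a single shared memory would force them to take turns.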
 

Using SystemC Wrappers

To summarize, we recommend doing algorithmic processing and simple point-to-point interfaces with abstract and untimed models, and doing control and complex interfaces with cycle-accurate ones. Thus, the HLS flow will include both descriptions.
 
Fortunately, it is easy to create SystemC wrappers to bridge untimed and cycle-accurate models. This preserves the advantages of working with simpler, more abstract models as much as possible. This structure allows you to put all the blocks together and synthesize the full chip.

 

 

Figure 3: Full chip synthesis using a mixed-abstraction system model with C++ processing units (P0, P1, P2) wrapped in SystemC.

The example below uses the filter design example discussed earlier to show how untimed C++ can be easily incorporated into a SystemC top-level wrapper with cycle-accurate interfaces. The flow leverages the Modular IO modeling style, which enforces a clean separation between functionality, timing, and communication. Other than the minor code overhead required for the module declaration, the entire untimed C++ filter design becomes the body of an SC_CTHREAD, with the Modular IO get and put member functions handling the reading and writing of data.
 
SC_MODULE(fir) {
public:
    sc_in<bool> clk;
    sc_in<bool> rst;
    wait_in<i_typ > in; //modular IO input
    wait_out<o_typ > out; //modular IO output
    i_typ sreg[N]; //delay line, sized by the tap count
    SC_CTOR(fir):
        in("in"),
        out("out")
        {
            SC_CTHREAD(exec,clk.pos());
            reset_signal_is(rst,true);
        }
    void exec (){
      //reset section: clear the delay line
      for(int i=0;i<N;i++)
        sreg[i] = 0;
      wait();
      while(1){
        wait();

        //same shift and multiply-accumulate as the untimed C++ filter
        SHIFT_LOOP:for (int i=N-1;i>0;i--)
          sreg[i] = sreg[i-1];
        sreg[0] = in.get();
        o_typ acc = 0;
        MAC_LOOP: for (int j=0;j<N;j++)
          acc += sreg[j] * coeffs[j];
        out.put(acc);
      }
    }
};
 

Conclusion

In order for HLS tools, like Catapult® C Synthesis, to deliver full-chip synthesis, both untimed and cycle-accurate abstractions are required to maximize efficiency and productivity. For this reason, pure C++ and SystemC complement each other in a mixed-language HLS flow by serving different design needs. Remember to keep things simple. When modeling algorithmic aspects of a design, use purely untimed C++ models. When timing and concurrency are involved, use SystemC classes. Then put these two descriptions together in a top-level hierarchy that preserves the advantages of both.

About the Authors:

Thomas Bollaert is product marketing manager for high-level synthesis at Mentor Graphics, with 14 years in digital signal processing and system-level design practices. Prior to his current position, Thomas spent five years developing the Mentor Graphics high-level synthesis product line in Europe. He earned his electronic engineering degree from ESIEE Paris. Reach him at thomas_bollaert@mentor.com

Mike Fingeroff has worked as a technical marketing engineer for the Catapult product line at Mentor Graphics since 2002. Prior to working for Mentor Graphics, he worked as a hardware design engineer developing real-time broadband video systems. Mike Fingeroff received both his bachelor’s and master’s degrees in electrical engineering from Temple University, in 1990 and 1995 respectively. Reach him at mike_fingeroff@mentor.com
