Improving performance using SPI-DDR NOR flash memory

by Qamrul Hasan and Cliff Zitlaw, TechOnline India - October 11, 2011

Developers now have the option of using NOR Flash memory based on the Serial Peripheral Interface (SPI) to meet the needs of mobile and embedded applications.

Excellent read bandwidth and low access latency have made NOR Flash the technology of choice for real-time code execution from non-volatile memory.  Parallel NOR devices continue as the memory of choice for many applications, but low pin count serial devices are becoming increasingly attractive for many mobile and embedded systems.  The availability of new NOR Flash memory based upon the serial peripheral interface (SPI) provides developers with performance approaching parallel NOR while greatly reducing the device pin count.

With the rising complexity of today’s mobile devices and embedded systems, developers must address the increasingly challenging task of designing efficient memory subsystems that maximize system performance.  Specifically, these systems often have megabytes of program code stored in non-volatile memory.  For systems where performance is essential, code can be moved from non-volatile memory to fast RAM to speed execution.  For systems where device size and cost are key, program code can be executed directly from non-volatile memory using an approach known as Execute-in-Place (XiP).

With either approach, the memory subsystem has a significant impact on overall system performance and the user experience.  In general, the greater the memory bus bandwidth, the better the overall user experience.  High read bandwidth and low latency enables instructions to be copied or fetched more quickly.  Several other factors are also important for developers to consider when choosing a non-volatile memory technology.

Parallel NOR Flash

Parallel NOR Flash devices have existed for nearly 25 years, and the bus has evolved over time to provide increasing levels of performance.  NOR devices are available that are compatible with one or more of the following three bus protocols:

1. Async Mode: Each 16-bit (2-byte) read requires a unique array access.

2. Page Mode: A contiguous range of addresses (typically 32 bytes) is read from the array during a single access.  The target word (2 bytes) is output during the initial access and then as long as subsequent accesses are within the 32 byte region, a shorter “intra-page” access time is possible (i.e., when only changing the low order address bits).  Any access outside of the 32 byte page address range will incur the longer inter-page initial access time.

3. Burst Mode: Like page mode, burst mode reads a contiguous range of addresses (typically 32 bytes) from the array during a single initial access.  After the required initial access time has elapsed, data is clocked out in a predetermined manner.  Burst mode requires a few additional pins (typically 3 pins) but the rate at which data can be clocked out provides a significantly higher throughput than either page mode or async mode.
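As a back-of-the-envelope comparison, the three protocols can be modeled in a few lines of Python.  The timing constants below are illustrative placeholders, not datasheet values:

```python
# Illustrative model of the three parallel NOR read protocols.
# All timing values are hypothetical, chosen only for comparison.

ASYNC_ACCESS_NS = 100   # every 2-byte word pays the full array access time
INITIAL_NS      = 100   # page/burst initial (inter-page) access time
INTRA_PAGE_NS   = 25    # shorter intra-page access time
BURST_CLK_NS    = 10    # one word per clock once the burst is running
PAGE_BYTES      = 32

def async_read_ns(num_bytes):
    # Async mode: a full array access per 16-bit word.
    return (num_bytes // 2) * ASYNC_ACCESS_NS

def page_read_ns(num_bytes):
    # Page mode: one long access per 32-byte page, then fast intra-page reads.
    pages = -(-num_bytes // PAGE_BYTES)       # ceiling division
    words = num_bytes // 2
    return pages * INITIAL_NS + (words - pages) * INTRA_PAGE_NS

def burst_read_ns(num_bytes):
    # Burst mode: one initial access, then one word per clock.
    words = num_bytes // 2
    return INITIAL_NS + (words - 1) * BURST_CLK_NS

for fn in (async_read_ns, page_read_ns, burst_read_ns):
    print(fn.__name__, fn(4096), "ns")
```

Even with these rough numbers the ordering is clear: burst mode transfers a block fastest, page mode sits in the middle, and async mode pays the full access penalty on every word.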



Bus pin count imposes a critical design constraint for mobile and embedded applications.  I/O pins are not free: each I/O pin adds manufacturing and PCB layout cost as well as adds to device size.  Pin count has become one of the deciding factors when choosing between NOR-based non-volatile memory or NAND-based options.

Systems using NOR-based memory historically employ a parallel bus between the host SoC and an external memory device.  Parallel memory bus architectures provide high read bandwidth and low latency but they come at the expense of higher pin count compared to serial bus architectures. Serial bus architectures keep pin count down but at the cost of lower bandwidth and greater latency.  The challenge for developers has been that they either get high bandwidth or low pin count but not both.



NOR Flash memory based on the Serial Peripheral Interface (SPI) now gives developers a new way to meet the needs of mobile and embedded applications.  SPI is a flexible interface that balances pin count and bandwidth to maximize overall system performance at a lower cost.  SPI is a well-established standard that has served the electronics industry for over 25 years.  There is already a wide variety of chipsets and peripheral devices available that natively support SPI.  The SPI standard has also been extremely stable over the years.  While operating voltages have dropped and clock rates increased to improve bandwidth, the core command protocol has remained unchanged.

SPI continues to evolve to meet shifting market needs.  To offer increased throughput and support for multi-input/output (MIO) functionality, the interface has been extended to include 2-bit I/O and 4-bit I/O configurations.  To further improve throughput, devices have recently been introduced that use a double data rate (DDR) protocol.  The combination of all of these improvements – higher clock frequency, 4-bit I/O and now DDR – has resulted in SPI-DDR NOR.  The initial SPI-DDR offerings use the legacy 3V operating voltages, which allow bus operating frequencies of 66MHz up to perhaps 100MHz.  Future offerings will include the SPI-DDR functionality on 1.8V devices that are expected to achieve bus operating frequencies between 100MHz and 133MHz.  With enhanced read bandwidth, lower access latency, and a compact 6-pin bus interface, SPI-DDR NOR provides an attractive alternative for low-cost mobile and embedded devices that have historically used parallel NOR for its higher performance.
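The raw bus rate of a 4-bit DDR interface is easy to estimate: four I/O lines clocking on both edges move one byte per clock.  The sketch below computes that peak rate, ignoring command and address overhead:

```python
def spi_ddr_peak_mbps(clock_mhz, io_lines=4):
    # DDR clocks data on both edges: io_lines bits x 2 edges = bits per clock.
    bits_per_clock = io_lines * 2
    return clock_mhz * bits_per_clock / 8   # megabytes per second

# Peak data rates at the frequencies mentioned in the text:
for mhz in (66, 100, 133):
    print(mhz, "MHz ->", spi_ddr_peak_mbps(mhz), "MB/s")
```

At 100MHz this works out to 100 MB/s of raw data movement over just four data pins; real throughput is lower once command/address serialization and initial access latency are accounted for.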


Executing Code from RAM

The impact of a memory subsystem on system performance depends a great deal upon how the memory is being used.  Determining the optimal balance of bandwidth and pin count requires an analysis of how the various types of NOR – SPI-DDR NOR and parallel NOR (Async, Page or Burst) – perform under different operating conditions.

For applications where performance is critical, execution speed can be improved by copying program code from non-volatile memory into higher throughput RAM.  The memory subsystem affects system performance both when initially copying program code from NOR Flash to RAM during boot-up and when paging new segments of code during run-time.

During boot-up, code shadowing performance is determined by the ability of the NOR memory bus to transfer large blocks of data continuously.  The ability to sustain burst rates is the key factor here given that the amount of data transferred could be many megabytes.  Performance in this case determines how long the system will take to start up and directly impacts the user experience.  Once the program begins executing, system performance can be impacted whenever a new block of program code needs to be loaded (i.e., the overall program is larger than the available RAM and must be loaded in blocks as each section of code is needed).  Sustainable throughput is again important since the amount of memory to load could be a few megabytes.

For the continuous read operations required during boot up or for demand paging, memory is accessed a page at a time.  Figure 2 shows the performance of various NOR Flash memories using a typical 4 KB page size.  Because SPI-DDR NOR performs each page read with a single command, memory read time for multiple pages increases in a linear fashion, providing a consistent sustained rate. 
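Because each page is fetched with a single command, total read time scales linearly with page count.  A simple model of this behavior (the 120ns command overhead per page is an assumed figure for illustration, and the data phase is approximated as one byte per clock on a 4-bit DDR bus):

```python
def spi_ddr_page_read_us(pages, page_bytes=4096, clock_mhz=100,
                         cmd_overhead_ns=120):
    # One read command per page; data then streams at ~1 byte per clock
    # (4 I/O lines x 2 edges = 8 bits per clock).
    byte_ns = 1000 / clock_mhz               # ns per byte at 1 byte/clock
    per_page_ns = cmd_overhead_ns + page_bytes * byte_ns
    return pages * per_page_ns / 1000        # microseconds

print([round(spi_ddr_page_read_us(n), 2) for n in (1, 2, 4, 8)])
```

Doubling the number of pages doubles the read time, which is the linear, sustained-rate behavior described above.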




As can be seen in Figure 2, SPI-DDR NOR and the higher performing burst NOR provide comparable performance for both system boot and demand paging for a given frequency, making SPI-DDR NOR an attractive choice because of its low pin count.  Applications that copy program data to RAM using Asynchronous or Page NOR can achieve a significant improvement in both performance and pin count by migrating to SPI-DDR NOR.


Executing Code from Non-Volatile Memory

Systems using an Execute-in-Place (XiP) approach must consider that because the non-volatile memory subsystem is constantly being accessed to retrieve program code, it can potentially introduce memory bottlenecks into the primary execution path.  Analyzing the efficiency of an XiP-based memory subsystem is not a simple calculation like it is for systems that execute code from RAM.  Depending how a system is architected, there are many factors that contribute to memory performance.

System performance is often measured in terms of the number of instructions per cycle (IPC) that can be achieved by the system.  Consider a CPU that takes 4 cycles to execute an instruction.  For this CPU, an IPC of 0.25 would be ideal.  There are many factors that influence the IPC, for example, a cache miss will stall the system as an instruction is fetched from memory, resulting in a lower IPC.
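The relationship between miss rate, stall length, and IPC can be captured in a single expression: the average cycles per instruction grows by the expected stall per instruction, and IPC is its reciprocal.  The 30-cycle stall below is a hypothetical value used only to show the shape of the effect:

```python
def effective_ipc(base_cpi, miss_rate, stall_cycles):
    # Expected stall per instruction = miss_rate * stall_cycles;
    # add it to the base CPI and invert to get instructions per cycle.
    return 1 / (base_cpi + miss_rate * stall_cycles)

print(effective_ipc(4, 0.00, 30))   # no misses: the ideal 0.25 IPC
print(effective_ipc(4, 0.01, 30))   # 1% miss rate with a 30-cycle stall
```

Even a 1% miss rate with a modest stall visibly drags the IPC below its ideal value, which is why the memory subsystem's fill latency matters so much for XiP.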

For XiP-based applications where the program will be executed directly out of non-volatile memory, system performance is affected by the ability of the memory subsystem to fill the cache whenever there is a cache miss.  Given the tendency of code to execute within a locality of reference, systems with level 1 and level 2 caches can achieve hit rates over 99%.  The memory subsystem needs to be able to fill the entire cache line as quickly as possible to maintain system performance when a cache miss does occur.  There are many factors that determine how quickly this can be accomplished:

Read Bandwidth: A high bandwidth bus is needed to minimize the overall read latency even though only a single cache line of memory is being read (typically 32 bytes).  In addition, the nature of application code requires the ability to make small, fast memory accesses throughout the entire code region with minimum latency.

Read bandwidth performance varies across bus interfaces and operating frequencies and must be balanced against pin count.  Figure 3 compares the performance of the different NOR bus interfaces.  Consider the performance of SPI-DDR NOR with an initial access time of 120ns.  SPI-DDR significantly outperforms both Page Mode and especially Async NOR.  Burst Mode NOR has the highest bandwidth but this advantage over SPI-DDR is minimized in a cache based system.




Controller Latency: Initiating a read command incurs controller latency from address and protocol overhead.  A common way to measure controller latency is from the time the command is sent to the controller to when the controller returns the first byte of data.  Controller latency is higher for SPI-DDR NOR, especially at low operating frequencies, because command/address and data are transferred serially.  Figure 4 shows that SPI-DDR has a somewhat longer controller latency than the parallel NOR offerings.  The lower performance is primarily due to the serialization of the command and address information required at the beginning of an SPI transaction.  Note that the gap in performance closes significantly as the memory bus frequency is increased.  In many mobile and embedded systems a sub-200ns controller latency would provide adequate performance and allow SPI-DDR to be considered a viable alternative to Parallel NOR.




Figure 4: Controller latency


Instant CPU Stall Time: When the next instruction to execute is not available in the cache, it must be loaded from memory.  Figure 5 shows the impact of a cache miss when using a 100 MHz memory bus.  The delay when using Burst NOR, Page NOR, and SPI-DDR NOR ranges from 160 to 210 ns.  The instant delay is worst for Async NOR.  As can be seen from the graph, the instant delay comes in over 330 ns, which could be tolerable depending upon the frequency of cache misses.  However, as can be seen from Figure 5, all subsequent Async NOR instruction fetches experience the 330 ns delay as well.  For a cache line containing eight instructions, the actual instant Async NOR delay incurred is 2.6 µs, which may adversely impact the user experience.  SPI-DDR thus compares favorably to both Async and Page Mode products in both performance and pin count.  When SPI-DDR is compared to Burst Mode devices, a system developer will need to consider whether the additional pins (30+) required for the higher performance Burst Mode interface are justified by the application's requirements.
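The Async NOR penalty follows directly from the per-word delay: because every word in the cache line pays the full access time, the line-fill cost is the per-fetch delay multiplied by the number of instructions in the line.

```python
ASYNC_FETCH_NS = 330   # per-word async delay, the figure cited in the text
INSTR_PER_LINE = 8     # a cache line holding eight instructions

# Async NOR repeats the full delay for every instruction in the line:
async_fill_ns = ASYNC_FETCH_NS * INSTR_PER_LINE
print(async_fill_ns / 1000, "us")   # ~2.6 us, matching the article
```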



Average CPU Stall Time: The impact on system responsiveness from instant delay depends upon how often the cache misses; if the miss rate is very low, the system can tolerate a relatively higher instant delay.  Table 1 shows the average CPU stall time measured in CPU clock cycles as calculated for a 2% cache miss rate (i.e., 4 cache misses over 200 instructions).  The impact of stall time on system performance depends upon the CPU clock frequency.  As can be seen from the table, Burst NOR provides minimal stalling of the CPU, in the range of 1 or 2 clock cycles.  For CPU operating frequencies from 100 MHz to 166 MHz, SPI-DDR also provides an acceptable stall response when compared with both Burst and Page NOR.
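The same calculation can be expressed as expected stall cycles per instruction.  The stall-time values below are hypothetical, chosen only to bracket the kinds of cases discussed:

```python
def avg_stall_cycles(miss_rate, stall_ns, cpu_mhz):
    # Expected stall per instruction, converted into CPU clock cycles.
    cycle_ns = 1000 / cpu_mhz
    return miss_rate * stall_ns / cycle_ns

# 2% miss rate (4 misses per 200 instructions), 166 MHz CPU,
# with illustrative per-miss stall times:
for stall_ns in (20, 200, 330):
    print(stall_ns, "ns ->",
          round(avg_stall_cycles(0.02, stall_ns, 166), 2), "cycles")
```

A faster CPU clock makes the same absolute stall time cost more cycles, which is why the table's figures depend on the CPU operating frequency.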




Figure 6 shows the overall effect these factors have on a system’s IPC using a system with a CPU operating at 166 MHz and a 100 MHz memory bus. To put these figures in perspective, a typical mobile or embedded system has a cache miss rate of less than 1%.  In general, SPI-DDR performance compares favorably to both Async and Page Mode NOR products.  For systems with a cache miss rate of 0.5%, both Burst NOR and SPI-DDR NOR have a minimal impact on IPC of 1 to 2%.  For systems with a higher cache miss rate of 1%, Burst NOR provides an advantage by impacting the IPC by 6% compared to 12% for SPI-DDR NOR.  In systems that require the highest performance, Burst NOR will continue to be the preferred solution, but if slightly lower performance can be tolerated, SPI-DDR provides a competitive, low pin count alternative.





Designing an efficient memory subsystem for mobile and embedded systems requires developers to consider many system factors beyond memory bus read bandwidth (see Table 2).  For applications which copy program code into RAM for execution, sustained read performance determines system responsiveness, and systems currently based on Parallel NOR might consider SPI-DDR to achieve pin count reductions while improving both code shadowing during boot and demand paging during normal operation. 

For XiP-based applications, where memory performance and cache miss rate influence the IPC, factors such as read bandwidth, controller latency, instant and average stall time for cache misses determine the overall efficiency of the implementation.  For example, 166 MHz systems can often migrate from Async/Page NOR with the associated high pin counts to SPI-DDR NOR without significantly impacting bandwidth, latency, or overall system performance.  When considering the replacement of Burst NOR a system developer must consider whether the additional pins required for the burst interface are an acceptable price to pay for the improved performance. 

It is also important to note the flexibility of SPI as a technology that can adapt to changing application needs and that the slightly longer initial access time of SPI-DDR NOR is not generally a limiting factor.  Broad chipset support and lower operating voltages will lead to support for higher clock rates and greater bandwidth for SPI-DDR-based NOR products, ensuring that developers will be able to achieve small end-product form factors, lower power consumption, and reduced system cost.



About the Authors:

Qamrul Hasan is a Principal Member of Technical Staff, System Solution Engineering Division, Spansion Inc. He works as a system architect with a special focus on performance modeling of hardware components and next-generation memory systems for embedded and mobile applications. Qamrul has collaborated with JEDEC standardization working groups, providing performance simulation results that helped drive the protocol specifications of LPDDR2-NVM and Unified Flash Storage (UFS). He holds an MSEE from Oklahoma State University, Stillwater, Oklahoma.

Cliff Zitlaw has 28 years of experience in the non-volatile memory industry.  He has authored several articles and is the inventor or co-inventor of more than 20 patents related to memory architectures.  He has previously served as the JEDEC Chair of JC42.2 covering low power PSRAM devices and is currently Spansion’s representative on JEDEC’s Board of Directors.  Cliff has been with Spansion for four years and is currently a Spansion Fellow; prior to joining Spansion he held technical positions at Xicor, Tunitas Microsystems and Micron.


