

# Multimedia Powerhouse

KARL M. GUTTAG

More and more, computers and applications are incorporating real-world data types, such as video and voice. To some people, dealing with these data types is a headache; we at Texas Instruments see it as a business opportunity. We designed a groundbreaking DSP (digital signal processor), the MVP, to bring parallel-processing power to bear upon the problems of multimedia.

The MVP integrates onto a single die five very powerful, fully programmable processors, a sophisticated DMA controller with an external memory interface, 50 KB of SRAM (static RAM), and video timing control (see the figure "The MVP" on page 58). Of the 50 KB of SRAM, 32 KB can be shared among all five processors to support many different parallel-processing approaches. The MVP chip is targeted at solving the problems that are inherent in multimedia and other applications that require a large amount of processing.

#### Driven by Design

The MVP did not spring fully formed from the memories of TI's CAD workstations. Three basic algorithmic areas drove the MVP's design definition: image processing and recognition, video and still-image compression, and high-performance computer graphics.

The design of the MVP's signal-processing components was driven by the needs of image-processing, image-recognition, and image-compression algorithms. The last category includes convolution and frequency-domain transforms that are multiplication-intensive. For example, the JPEG and MPEG standards require DCT (discrete cosine transform) frequency transforms, so TI paid particular attention to DCT performance and precision.

EARL KELENY © 1994

#### TI's new MVP chip brings parallel-processing power to multimedia applications

While the algorithms drove the design of the signal-processing components, the sheer volume of signal processing that is required by these algorithms prompted the decision to include multiple DSPs on the IC. The design team also discerned that, in general, the primary multimedia algorithms required fixed-point multiplies of 16 bits or less with 32-bit accumulates. Higher precision was not required in the signal-processing components.
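To make the precision requirement concrete, here is a minimal Python sketch of a 16-bit fixed-point multiply with a 32-bit accumulate. The function names are hypothetical, and the wraparound behavior on overflow is an assumption; the article does not say whether the hardware wraps or saturates.

```python
def to_int16(x):
    """Wrap an integer to a signed 16-bit two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def to_int32(x):
    """Wrap an integer to a signed 32-bit two's-complement value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def mac16(samples, coeffs):
    """Sum of 16-bit products kept in a 32-bit accumulator.

    Assumption: the accumulator wraps on overflow (not stated in the article).
    """
    acc = 0
    for s, c in zip(samples, coeffs):
        product = to_int16(s) * to_int16(c)  # a 16 x 16 product always fits in 32 bits
        acc = to_int32(acc + product)
    return acc
```

A call such as `mac16([1, 2, 3], [4, 5, 6])` computes the dot product that sits at the heart of convolution and DCT kernels.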

Historically, DSPs have not been very good at processing the bit-field manipulations used in some compression algorithms or at manipulating multiple-pixel quantities, such as those encountered in graphics block moves. The MVP's signal processors differ from traditional DSPs most markedly in their ability to manipulate bit fields and process multiple pixels in parallel through their data paths. To reflect these differences, we call these components *advanced DSPs*, or ADSPs. The MVP contains four of them.

One important point about ADSPs is that, although they are optimized for certain *types* of algorithms, they don't dedicate hardware to any specific algorithm. The goal of the MVP is to support elemental operations that can be used to implement any algorithm. This approach pays dividends when vendors develop new algorithms for current problems and when they use the power and programmability of the MVP to develop completely new applications.

**Figure: The MVP.** Each ADSP has two independent 32-bit data ports (G and L in the figure) and a 64-bit instruction cache input (I). The MP has a single 64-bit data port (C/D) and a 32-bit instruction port. The TC (Transfer Controller) has a 64-bit internal port and a 64-bit external port. The ports of the various processors and TC are connected to 25 2-KB RAMs via a crossbar-switch network. The crossbar supports approximately 2.4 GBps of data, plus 1.8 GBps of instructions.

#### Inside the Advanced DSPs

The MVP's four ADSPs provide most of the chip's raw performance. Each can perform in excess of 10 RISC-like operations per cycle (see the figure "The Advanced DSPs").

To specify the multiple parallel operations that they are able to perform, the ADSPs employ a wide instruction word of 64 bits. This instruction word has fields that independently control the data unit, along with its multiplier and data path, and the two address units. All instructions nominally execute in a single cycle.

Each ADSP has a register file of 44 programmer-visible registers. Any register can be a source for, or a destination of, the ALU data path. This includes the PC (program counter), the address registers, and the loop-control registers. Conditional PC-relative jumps, for example, are performed by conditionally writing to the PC. The register set is broken into files based on register functions. Most of the registers support more than one access per cycle, with the register file in the data unit supporting over 10 accesses in a single cycle.

An ADSP data unit consists of three major elements: the data-unit user registers, the multiplier, and the ALU data path. The instruction set supports independent multiplier and ALU data path operations. The multiplier can perform one 16- by 16-bit multiply or two 8- by 8-bit multiplies in a single cycle. The multiplier also has a rounding option, a direct result of the need to maintain the accuracy specified by the video-compression standards. Whereas the ALU data path can operate on any of the registers, the multiplier is restricted to operating on eight data-file registers.

The ALU data path includes a barrel rotator, a mask generator, a 1-to-$n$ bit expander (which is used for binary-to-color transforms, among other things), and a three-input ALU that can combine the mask or expander output with register data to create over 2000 different processing options. The ALU has a 32-bit data path that performs logical and arithmetic functions, and it can combine these to support masking or merging in a single pass. The ALU can be split into smaller sections to perform multiple 8- or 16-bit operations in parallel.
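The expander, the three-input merge, and the split ALU can be illustrated with a short Python sketch. It models the ideas only: the function names are hypothetical, the packing of four 8-bit pixels per 32-bit word is an assumption, and nothing here reflects the actual ADSP instruction encoding.

```python
WORD_MASK = 0xFFFFFFFF  # the ALU data path is 32 bits wide


def expand_bits(bits, count, width):
    """1-to-n expansion: replicate each of the low `count` bits into a `width`-bit field."""
    field = (1 << width) - 1
    mask = 0
    for i in range(count):
        if (bits >> i) & 1:
            mask |= field << (i * width)
    return mask


def merge(mask, a, b):
    """Three-input combine: take bits of `a` where the mask is 1 and bits of `b` elsewhere."""
    return ((a & mask) | (b & ~mask)) & WORD_MASK


def split_add8(x, y):
    """Four independent 8-bit adds in one 32-bit word; carries never cross a lane."""
    low7 = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)  # add the low 7 bits of every lane
    top = (x ^ y) & 0x80808080                  # top bit of each lane, added without carry-out
    return (low7 ^ top) & WORD_MASK
```

For a binary-to-color transform, `merge(expand_bits(0b0101, 4, 8), foreground, background)` selects the foreground color for pixels whose source bit is 1 and the background color for the rest, all in one pass over a packed word.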

Normally, ALU operations set four status bits: carry, negative, zero, and overflow. Any or all of these bits can be protected from being modified by the current instruction. The instruction set supports both conditional source selection between a pair of registers and writing of the result based on status.

The two address units are nearly identical, and together they can perform two memory operations per cycle. Each memory operation is a load or a store that can be totally independent of the data-unit operation. The address units add an immediate or register index to an address register to form the address. The result of the address computation can optionally modify the address register to facilitate stepping through a memory array.
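The address computation described above can be sketched in a few lines of Python. The class name, the number of address registers, and the 32-bit address width are assumptions made for illustration; only the add-then-optionally-update behavior comes from the text.

```python
class AddressUnit:
    """Toy model of one address unit (register count and width are assumptions)."""

    def __init__(self, num_regs=4):
        self.regs = [0] * num_regs

    def access(self, reg, index, update=False):
        """Form an address as address register + index.

        If `update` is true, the computed address is written back to the
        address register, which steps the pointer through memory.
        """
        address = (self.regs[reg] + index) & 0xFFFFFFFF
        if update:
            self.regs[reg] = address
        return address


def walk(unit, reg, stride, count):
    """Generate `count` successive addresses, updating the register each time."""
    return [unit.access(reg, stride, update=True) for _ in range(count)]
```

With the register preset to `0x1000`, `walk(unit, 0, 4, 3)` yields three addresses four bytes apart and leaves the register pointing at the last one.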

Like the ALU data path, the two address units support conditional operations. The source for a store can be chosen from a pair of registers, and the decision whether or not to load a register can be based on status. The source or destination of a store or a load can be any of the 44 registers. A conditional load of the PC performs a conditional jump, which can free up the ALU data path to perform other operations.

**Figure: The Advanced DSPs.** Like other DSPs, those integrated with the MVP are built to support multiple data accesses in a single cycle and to optimize the performance of the multiply-accumulate operations that characterize signal-processing algorithms. In addition, the ADSPs support bit-field and pixel operations, making them powerful imaging and graphics processors as well.

Either or both address units can be used to perform a data operation in place of a memory transfer. In such a case, the result of the address data path is written to the destination register instead of data being fetched from memory. This capability, along with conditional loads of the PC, speeds up functions that are computationally bound or jump-bound rather than memory-access-limited.

Three zero-overhead loop controllers are included in each ADSP. Because each ADSP instruction can do so much in parallel, key loops often require very few instructions. Having three loop controllers even allows for nested loops to have zero loop-control overhead.

Each loop controller has a set of registers that specifies the starting address, ending address, current loop count, and the initial count (for nested looping). Once the loop-control registers are initialized, loop counting and branching are performed with zero overhead in terms of execution time. The loop controllers can be used to perform zero-overhead branches to a run-time patch in code segments. Because the loop-control registers sit in the register file, you can write computational results to a loop-count register to specify whether or not a branch is taken based on a zero result.
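A small Python simulation shows how the four loop-control registers can drive nested loops without any branch instructions in the loop body. This is a sketch under stated assumptions: the class and function names are hypothetical, the count is treated as the number of passes, instructions occupy consecutive integer addresses, and each loop is assumed to have a distinct ending address.

```python
class LoopController:
    """Toy model of one loop controller with the four registers named in the text."""

    def __init__(self, start, end, count):
        self.start = start    # address of the first instruction in the loop
        self.end = end        # address of the last instruction in the loop
        self.count = count    # current loop count (remaining passes)
        self.reload = count   # initial count, restored when the loop completes

    def next_pc(self, pc):
        """Called when the instruction at the ending address has executed."""
        self.count -= 1
        if self.count > 0:
            return self.start         # branch back with no extra instruction
        self.count = self.reload      # re-arm so an enclosing loop can re-enter
        return pc + 1                 # fall through


def run(controllers, first_pc, last_pc):
    """Return the sequence of instruction addresses executed."""
    trace, pc = [], first_pc
    while pc <= last_pc:
        trace.append(pc)
        hit = next((lc for lc in controllers if lc.end == pc), None)
        pc = hit.next_pc(pc) if hit else pc + 1
    return trace
```

Running `run([LoopController(1, 2, 2), LoopController(0, 3, 2)], 0, 4)` executes an inner two-instruction loop twice inside an outer loop that also runs twice; the reload register is what lets the inner loop start afresh on the second outer pass.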

Instruction prefetch and the instruction cache are controlled from within each ADSP. Instructions are executed in a three-cycle pipeline, with a new instruction starting every cycle, assuming that no stalling condition has occurred. The ADSPs' instruction controllers support interrupts and emulation control. If a cache miss occurs, the cache controller will make a packet request to the TC (Transfer Controller; described later) to get the new cache sub-block transferred.

#### Beyond DSP

In addition to signal processing and bit-level manipulations, multimedia processing requires many other types of operations, such as 3-D graphics and audio processing. These applications often require high-precision floating-point computations. Because a single FPU was all that could fit on the MVP's die, floating-point capability was not incorporated into the ADSPs but built into a separate processor called the Master Processor, or MP (see the figure "The Master Processor"). The FPU contains a special set of instructions to support 3-D graphics transforms and DSP-like floating-point operations.

**Figure: The Master Processor.** The MP consists of both an integer unit and an FPU. Control of the instruction and data caches is integrated into the processor, although the cache memory resides in the SRAM array.

The MP is a general-purpose RISC processor that is generally programmed in high-level languages. It performs operations requiring a higher level of precision than is available from the ADSPs.

The MP integer unit has a 32-bit instruction word that performs integer register-to-register or load/store instructions nominally in one cycle. The basic load or store operation adds an index to a register containing the base address to form the memory address. To step an address pointer through memory, the instruction can optionally update the register that's used as the base-address register with the result of the add.

The IEEE-754 FPU is pipelined and runs in parallel with the integer unit. In normal operation, a new floating-point add or multiply can be initiated every cycle. A special set of parallel floating-point operations can initiate a multiply, an addition or subtraction, and a 64-bit load or store with automatic increment addressing every cycle.

The register file contains thirty-one 32-bit registers that are common to both the integer unit and the FPU. The registers are scoreboarded for floating-point results and memory-load operations. The scoreboard allows the MP to continue execution; the MP will stall only if an instruction tries to use a register before the prior operation has loaded its result. As with some other RISC architectures, R0 is a dummy register that is always read as zero.

Instruction flow and cache management are controlled within the MP. A three-stage pipeline starts a new instruction every cycle, assuming no stalling conditions have occurred. The instruction controller also deals with interrupts and emulation support. The MP has hardware for managing the 4-KB data and 4-KB instruction caches. When a cache miss occurs, the MP's cache controller automatically makes a packet request to the TC to get the necessary data transferred.

| CPU | DSP |
|---|---|
| Generalized functions | Signal processing |
| Single data bus | Multiple data buses |
| Hardware cache control | Programmer-accessible caches |
| Generalized addressing | Loop-optimized addressing |
| Hardware manages microarchitecture | Programmer manages microarchitecture |

#### Communications Matters

The final important consideration in designing the MVP was the need for high data bandwidth for off-chip communications and interprocessor communication. This requirement is common to signal processing, floating-point processing, and graphics processing. Much of the early architecture definition focused on achieving high bandwidth, making sure that the processors wouldn't have to wait on data, and ensuring that interprocessor communication would not be a bottleneck.

To address internal communications issues, we incorporated 25 small, 64-bit-wide RAMs on the MVP chip. These are accessed by the processors through a crossbar interconnection. To handle external communications, we incorporated on-chip an intelligent DMA controller for handling block data movement: the TC (Transfer Controller), mentioned earlier.

The 50 KB of on-chip memory is physically separated into 25 2-KB RAMs. Of this memory, 18 KB (nine 2-KB blocks) is dedicated to specific functions. Every ADSP uses one 2-KB block as a hardware-managed instruction cache that is loaded by the TC in the event of a cache miss. The MP uses two 2-KB blocks as an instruction cache and two more as a data cache. Finally, one 2-KB block is reserved as fast RAM and is accessible only by the MP and TC.

The remaining 32 KB of RAM is shared and can be accessed in chunks of 8, 16, 32, or 64 bits at a time; the large number of individual RAMs supports many parallel accesses. A crossbar-switch network lets the following accesses to shared RAM occur simultaneously: two 32-bit accesses by each ADSP, a 64-bit access by the MP, and a 64-bit access by the TC.
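The simultaneous accesses listed above add up to the data bandwidth quoted for the crossbar. The following Python arithmetic is a sketch that assumes the 50-MHz clock rate mentioned later in the article and decimal gigabytes; the variable and function names are hypothetical.

```python
CLOCK_HZ = 50_000_000  # assumption: the 50-MHz part described later in the article

# Data-port widths in bits, per processor, as listed in the paragraph above.
DATA_PORT_BITS = {
    "ADSP0": [32, 32],
    "ADSP1": [32, 32],
    "ADSP2": [32, 32],
    "ADSP3": [32, 32],
    "MP": [64],
    "TC": [64],
}


def data_bytes_per_cycle(ports):
    """Total bytes that all data ports can move in one cycle."""
    return sum(sum(widths) for widths in ports.values()) // 8


def data_bandwidth(ports, clock_hz):
    """Peak data bandwidth in bytes per second."""
    return data_bytes_per_cycle(ports) * clock_hz
```

Under these assumptions the ports move 48 bytes per cycle, and `data_bandwidth(DATA_PORT_BITS, CLOCK_HZ)` comes to 2.4 billion bytes per second, in line with the approximately 2.4 GBps cited for the crossbar.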

Crossbar-switch connections are determined by the most significant bits of each address on a cycle-by-cycle basis. If more than one access is requested of the same RAM block in a cycle, round-robin prioritization hardware determines which processor is allowed access and which processor is stalled until the next cycle.
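The bank selection and conflict handling can be modeled roughly as follows. This Python sketch makes several assumptions that the article does not confirm: that the bits above an 11-bit offset select the 2-KB block, that each RAM block keeps its own round-robin pointer, and that each processor issues one request per cycle (the real ADSPs have two data ports). All names are hypothetical.

```python
BLOCK_BITS = 11  # a 2-KB block needs 11 bits of byte offset; higher bits select the block


def bank_of(address):
    """Select the RAM block from the most significant address bits."""
    return address >> BLOCK_BITS


class Crossbar:
    """Toy per-cycle arbiter with round-robin priority for each RAM block."""

    def __init__(self, processors):
        self.processors = list(processors)
        self.last_winner = {}  # bank -> index of the processor granted most recently

    def arbitrate(self, requests):
        """requests maps processor name -> address. Returns (granted, stalled)."""
        by_bank = {}
        for proc, addr in requests.items():
            by_bank.setdefault(bank_of(addr), []).append(proc)
        granted, stalled = {}, []
        n = len(self.processors)
        for bank, procs in by_bank.items():
            start = self.last_winner.get(bank, -1) + 1
            order = [self.processors[(start + i) % n] for i in range(n)]
            winner = next(p for p in order if p in procs)
            self.last_winner[bank] = self.processors.index(winner)
            granted[winner] = bank
            stalled.extend(p for p in procs if p != winner)
        return granted, sorted(stalled)
```

When two processors target the same block in consecutive cycles, the grant alternates between them, while a third processor addressing a different block proceeds without any stall.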

All the shared RAMs and the one MP/TC 2-KB RAM block reside at fixed addresses and are managed by software. Generally, the processors send packet-request commands to the TC to load data before it is needed for processing and to store results after processing. Because of the number of individual RAMs available, these packet transfers can be set up so that they do not conflict with other accesses and therefore work fully in parallel with other processing.

Crossbar-shared memory is the most generally flexible multiprocessor memory architecture because it puts the fewest restrictions on how data must be organized. While the crossbar involves nearly 1000 data and address lines that must be connected between the processors and memory, it becomes practical to use because everything is integrated on one chip. The crossbar's flexibility translates into better efficiency, in terms of both execution speed and ease of programming.

#### The Transfer Controller

The transfer controller is a very intelligent DMA controller that can autonomously transfer packets of data between the MVP and external memory (see the figure "The Transfer Controller"). The TC can address memory as either a linear or a multidimensional array of data or even as a complex shape, such as a polygon. The TC is byte-addressable and will automatically handle byte misalignment between the source and the destination. Requests for packet transfers can be made by any of the processors under program control, as well as by the cache controllers and the video controller for display refresh. Transfers can also be initiated by external requests.

The TC processes the source and destination addressing with independent controllers. The burst FIFO (first-in/first-out) supports DRAM page and burst modes and buffers byte-misaligned accesses to more efficiently move data. A separate cache-access controller can break into the middle of program-controlled packet transfers to service cache misses. The request-prioritization/control logic prioritizes the many potentially active requests and starts transfers. The TC will automatically suspend and later resume lower-priority requests when a higher-priority request occurs.
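The suspend-and-resume behavior can be pictured with a simple priority queue. The Python sketch below is only a model: the request format, the priority numbering (lower means more urgent), and the one-unit-per-time-step transfer granularity are all assumptions, not details of the TC hardware.

```python
import heapq


def run_transfers(requests):
    """Simulate prioritized transfers, one unit of data per time step.

    requests: list of (arrival_time, priority, name, length) tuples, where a
    lower priority number is more urgent (an assumption for this sketch).
    Returns the name of the request served at each busy time step.
    """
    pending = sorted(requests)  # ordered by arrival time
    ready = []                  # heap of (priority, sequence, name, remaining)
    log, time, seq = [], 0, 0
    while pending or ready:
        while pending and pending[0][0] <= time:
            _, prio, name, length = pending.pop(0)
            heapq.heappush(ready, (prio, seq, name, length))
            seq += 1
        if ready:
            prio, s, name, remaining = heapq.heappop(ready)
            log.append(name)
            if remaining > 1:
                # An unfinished request goes back in the queue, so a more
                # urgent arrival can suspend it and it resumes afterward.
                heapq.heappush(ready, (prio, s, name, remaining - 1))
        time += 1
    return log
```

If a four-unit packet transfer starts at time 0 and a two-unit cache-miss request with higher urgency arrives at time 1, the log shows the packet transfer pausing after one unit, the cache miss being serviced, and the packet transfer then running to completion.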

The external memory interface provides support for ROMs, SRAMs, DRAMs, and VRAMs (video RAMs). The support for DRAMs, including timing control and address multiplexing, is relatively new in DSPs. The combination of fast on-chip SRAM and an external DRAM interface supports high performance while also reducing system costs.

The TC is capable of transferring data between sources and destinations that have different dimensions. In graphics and imaging, for example, it is common for the TC to fetch data from an image region as a 2-D array and bring it on-chip for processing as a linear array. After processing, the results stored in a linear array can then be stored off-chip as an x,y array. The ability of the TC to make these transformations autonomously greatly improves the efficiency of processing by the ADSPs and the MP.
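In software terms, the dimensional transformation amounts to a strided gather followed by a strided scatter. The Python sketch below illustrates the addressing only; the function names and the row-major layout with a `pitch` (elements per image row) are assumptions, and the real TC does this in hardware without processor involvement.

```python
def fetch_region(image, pitch, x, y, width, height):
    """Gather a width x height region of a row-major image into a linear list."""
    linear = []
    for row in range(height):
        start = (y + row) * pitch + x
        linear.extend(image[start:start + width])
    return linear


def store_region(image, pitch, x, y, width, height, linear):
    """Scatter a linear list back into a width x height region of the image."""
    for row in range(height):
        start = (y + row) * pitch + x
        image[start:start + width] = linear[row * width:(row + 1) * width]
```

A processing loop can then walk the linear list with a simple incrementing pointer, which is the access pattern the ADSP address units handle most efficiently.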

The MVP chip has two sets of video-timing counters and registers. The video controller keeps track of horizontal and vertical synchronization and blanking timing, as well as supporting a 2-D border region. Each counter has its own asynchronous clock input and has a set of synchronization, blanking, and border signals. The synchronization signals can be individually set up as outputs (for display) or inputs (for video capture). An SRT (shift-register transfer controller) has comparators that cause shift-register transfer cycles for VRAMs or cause packet transfers for DRAM-based display memory.

#### Support Issues

Although perhaps not as exciting as the microarchitecture, testability and software debugging were important concerns in the MVP design, and roughly 10 percent of the chip's nonmemory transistors are dedicated to these functions. All storage nodes can be scanned in or out to support boundary-scanned testing. Other features in the scan path support emulation loading and the dumping of the internal state of the MVP. Address comparators were also added to support real breakpoints.

A complete suite of software support has been developed for the MVP chip. Assemblers and C compilers have been developed for the MP and the ADSPs. A C-like algebraic assembler for the ADSPs supports the many different operations that they can perform. Software-simulation and hardware-emulation tools that use the same graphical interface are available. An imaging and graphics software library is also currently being developed. An MP-resident executive supports multitasking and intertask synchronization and communications. Under the executive, tasks running on the MP issue commands that are carried out by the ADSPs.

**Figure: The Transfer Controller.** In addition to controlling the movement of data and instructions on- and off-chip, the TC performs transformations where data is processed in an order different from that in which it is stored. The TC also contains a DRAM and VRAM controller.

#### Putting It All Together

Through the use of parallel processing, the MVP puts a new level of programmability and performance on a single IC. Not only does the MVP integrate five processors on a single chip, but each processor can execute many operations in parallel. The MVP is implemented on a 342-mm<sup>2</sup> die, using a 0.6-micron, three-metal-layer process. It uses a 3.3-V power supply and will initially run at 40 MHz, with 50-MHz parts due next year. The MVP is packaged in a ceramic pin-grid array, but it will eventually move to a composite metal-plastic package. It draws 7.5 W at 50 MHz.

The MVP is capable of performing the equivalent of over 2 billion RISC-like operations per second. In specific applications, a single MVP can do the job of over 10 of the most powerful DSPs or general-purpose processors previously available. The MVP can move 2.4 GB of data and 1.8 GB of instructions within the chip—plus shuffle 400 MB of data to off-chip memory—per second.
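These headline figures can be reproduced from numbers given elsewhere in the article. The Python arithmetic below is a sketch: it assumes the 50-MHz clock, takes "in excess of 10" operations per ADSP cycle as exactly 10 (so the result is a lower bound that ignores the MP), and uses the instruction-port and external-port widths from the figure "The MVP". The constant names are hypothetical.

```python
CLOCK_HZ = 50_000_000        # assumption: the 50-MHz part
ADSP_COUNT = 4
OPS_PER_ADSP_CYCLE = 10      # "in excess of 10", so treat as a lower bound

ADSP_INSTRUCTION_BITS = 64   # instruction cache input per ADSP
MP_INSTRUCTION_BITS = 32     # MP instruction port
TC_EXTERNAL_BITS = 64        # TC external port

# Operations per second contributed by the four ADSPs alone.
adsp_ops_per_second = ADSP_COUNT * OPS_PER_ADSP_CYCLE * CLOCK_HZ

# Instruction bytes fetched per cycle across all five processors.
instruction_bytes_per_cycle = (ADSP_COUNT * ADSP_INSTRUCTION_BITS + MP_INSTRUCTION_BITS) // 8
instruction_bandwidth = instruction_bytes_per_cycle * CLOCK_HZ

# Bytes per second across the external memory interface.
external_bandwidth = (TC_EXTERNAL_BITS // 8) * CLOCK_HZ
```

Under these assumptions the ADSPs alone account for 2 billion operations per second, instruction fetch for 1.8 billion bytes per second, and the external port for 400 million bytes per second, which agrees with the figures quoted above.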

Some of the obvious uses for the MVP chip will be multimedia applications, such as videoconferencing; document-image processing, from digital copiers to real-time OCR; 2-D and 3-D graphics; audio enhancement and compression; telecommunications; and virtual reality. But the real virtue of the MVP is that its combination of programmability and performance will undoubtedly lead to applications that are as yet unimaginable.

#### ACKNOWLEDGMENTS

*I wish to thank all the people who made the MVP a reality, especially co-architects Bob Gove, Nick Ing-Simmons, Keith Balmer, and program manager Walt Bonneau. The development of the MVP was a worldwide TI project, involving TI employees in Houston; Dallas; Bedford, England; and Bangalore, India.*

*Karl M. Guttag is a TI Fellow and chief architect of the MVP chip. You can contact him on the Internet at [karl@video.sc.ti.com](mailto:karl@video.sc.ti.com) or on BIX c/o "editors."*