A 200 MHz 13 mm² 2-D DCT Macrocell Using Sense-Amplifying Pipeline Flip-Flop Scheme

Masataka Matsui, Member, IEEE, Hiroyuki Hara, Yoshiharu Uetani, Lee-Sup Kim, Member, IEEE, Tetsu Nagamatsu, Member, IEEE, Yoshinori Watanabe, Akihiko Chiba, Kouji Matsuda, and Takayasu Sakurai, Member, IEEE

Abstract—The two-dimensional discrete cosine transform (2-D DCT) has been widely recognized as a key processing unit for image data compression/decompression. In this paper, the implementation of a 200 MHz 13.3 mm² 8 × 8 2-D DCT macrocell capable of HDTV rates, based on a direct realization of the DCT, and using distributed arithmetic is presented. The macrocell, fabricated using 0.8 μm base-rule CMOS technology and 0.5 μm MOSFET’s, performs the DCT processing with 1 sample-(pixel)-per-clock throughput. The fast speed and small area are achieved by a novel sense-amplifying pipeline flip-flop (SA-F/F) circuit technique in combination with CMOS differential logic. The SA-F/F, a class of delay flip-flops, can be used as a differential synchronous sense-amplifier, and can amplify dual-rail inputs with swings lower than 100 mV. A 1.6 ns 20 bit carry skip adder used in the DCT macrocell, which was designed by the same scheme, is also described. The adder is 50% faster and 30% smaller than a conventional CMOS carry look ahead adder, which reduces the macrocell size by 15% compared to a conventional CMOS implementation.

I. INTRODUCTION

A two-dimensional (2-D) discrete cosine transform (DCT) macrocell is key to image and video de/compression LSI’s because various standards including MPEG1/2 (Moving Picture Experts Group) [1],[2], CCITT H.261 [3], and JPEG (Joint Photographic Experts Group) [4] have adopted DCT-based coding. In particular among them, the MPEG2 standard covers HDTV-rate video signals which require DCT processing of more than 100 M samples (pixels) per second. A 21 mm² 2-D DCT macrocell was reported [5] which can operate at 100 MHz. However, the macrocell was still slow and large for the final goal of “a single-chip HDTV video codec” in cost sensitive consumer products.

This DCT macrocell consists of a set of iterative multiplier-accumulators (MAC’s) and buffer memories [6] as do most dedicated DSP’s. To speed up the clock rate in the MAC’s, deep pipelining and fast addition techniques like carry look ahead (CLA) and/or carry select adders [7] are usually used, but they unfortunately consume much additional area. The approach taken here, on the other hand, emphasizes a fast circuit technique and a simple adder algorithm with shallow pipeline stages to achieve a fast and small chip.

This paper describes a 13.3 mm² dedicated macrocell which can execute 8 × 8 2-D DCT’s at 200 MHz with one pixel-per-clock throughput [8]. A new circuit technique, named SA-F/F (sense-amplifying pipeline flip-flop) is implemented, in which a special flip-flop used as a pipeline latch also acts as a sense-amplifier to regenerate low-swing differential inputs. Applying the scheme to a simple carry skip adder in the DCT MAC’s drastically shortens propagation time and also reduces the macrocell size.

The next section discusses the concept of the SA-F/F scheme, explaining in some detail why it is useful. The basic architecture and implementation of the DCT macrocell are given in Section III. The fabrication and results of the macrocell are presented in Section IV followed by the conclusion in the final section.

II. SA-F/F SCHEME

A. Concept

Sense-amplifying techniques are widely used in memory LSI’s in which complementary inputs with swings lower than 100 mV are differentially detected and regenerated to full rail-to-rail swings by a sense-amplifier. This technique significantly speeds up signal propagation when it is applied to heavily loaded and slow dual-rail signals like a bitline pair in a static RAM.

In contrast, these techniques have not been utilized for logic LSI’s except for an on-chip memory macrocell. One obvious reason is that most logic gates are single rail. However, recently dual-rail logic [10],[11],[16] is becoming popular in data-path design to achieve higher speed than conventional CMOS single-rail logic. Another reason is that it is difficult to generate a timing signal to activate a sense-amplifier. The signal would be optimum if it were activated at the moment when the difference between levels on the dual rails passes the input-offset voltage of the sense-amplifier. Unfortunately, the offset voltage is affected by process variations, noise and so on, and hence unpredictable. In memory LSI’s the timing signal is generally generated from delay lines using self-timing and they must be carefully tuned and optimized with timing margins large enough to tolerate process variations. However, this kind of tuning among racing signals is usually avoided in the design of logic LSI's because there is a risk of a fatal malfunction which cannot be corrected by lowering the system clock frequency. Therefore, a simple solution must be found for the sense-amplifying mechanism to easily migrate into synchronous design.

The basic concept of the sense-amplifying pipeline flip-flop (SA-F/F) scheme proposed in this paper is shown in Fig. 1(a). In this scheme, a sense-amplifier is merged into a latch which is a synchronization element to a system clock. The SA-F/F amplifies low-swing differential inputs \((D, \bar{D})\) and latches data in the same way as a conventional static delay flip-flop (D-F/F), synchronously to a single clock (CLK) in Fig. 1(b). \(Q, \bar{Q}\) are the full-swing outputs of the SA-F/F. It is not necessary to consider the latch timing optimization of the sense-amp as it is with ordinary reduced voltage swing circuits which use self-timing, because the SA-F/F utilizes the system clock itself as the signal to activate the sense-amp. As a result, the latch timing varies as the system clock frequency changes and the optimized timing can be measured as the maximum clock frequency if the path including the SA-F/F is critical. In other words, the timing margin is always optimized and there is no need to generate a critical timing signal which is constant independent of the system clock frequency. Therefore, this scheme can naturally bring the sense-amplifying mechanism into a conventional single phase clocking system widely used in recent VLSI design.

Fig. 2(a) shows circuit schematics of the SA-F/F. The SA-F/F consists of a current-controlled latch sense-amplifier [9] as the master and a NOR-type latch as the slave. While the clock is high, the master sense-amplifier outputs are predischarged and the slave holds the previous value. When the clock transitions from 1 to 0, the master sense-amplifier is activated and captures the differential value \((\Delta V_{in})\) between the \(D\) and \(\bar{D}\) inputs at the time of the clock transition, passing the stored master value to the complementary outputs of the slave latch \((Q, \bar{Q})\). The sense-amplifier is basically a RAM sense amplifier and hence can easily detect a mere 100 mV of input voltage difference, \(\Delta V_{in}\). However, as in memories, the two signal lines of the dual-rail pair must be routed adjacent to each other, in order to decrease differential-mode noise.
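The master/slave behavior described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit of Fig. 2: the class name `SenseAmpFlipFlop`, the parameter `offset_v`, and the choice to hold the old value when the input difference is below the offset are all hypothetical (in a real sense-amplifier the outcome in that case is unpredictable).

```python
class SenseAmpFlipFlop:
    """Behavioral sketch of a falling-edge sense-amplifying flip-flop.

    Voltages are in volts. `offset_v` is an assumed input-offset voltage;
    input differences smaller than it are treated as unresolvable.
    """

    def __init__(self, offset_v=0.1):
        self.offset_v = offset_v
        self.clk = 1      # clock starts high: master is predischarged
        self.q = 0        # slave latch state
        self.q_bar = 1

    def step(self, clk, d, d_bar):
        """Apply one clock level with inputs (d, d_bar); return (Q, Q-bar)."""
        falling_edge = self.clk == 1 and clk == 0
        if falling_edge:
            delta_v = d - d_bar
            if abs(delta_v) >= self.offset_v:
                # Master resolves the low-swing pair to a full-swing value,
                # which is passed to the slave latch outputs.
                self.q = 1 if delta_v > 0 else 0
                self.q_bar = 1 - self.q
            # Assumption: below the offset this model keeps the old value.
        # At any other time the slave simply holds its previous value.
        self.clk = clk
        return self.q, self.q_bar


if __name__ == "__main__":
    ff = SenseAmpFlipFlop()
    print(ff.step(1, 0.12, 0.00))  # clock high: output holds
    print(ff.step(0, 0.12, 0.00))  # falling edge: 120 mV difference captured
```

Because the only timing input is the clock itself, the model needs no separate sense-amp activation signal, which mirrors the point made in the text.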

Another SA-F/F circuit is shown in Fig. 2(b). In this type, \(D\) and \(\bar{D}\) are connected to nMOS gates. One of the two circuits should be chosen according to the common mode level of the \(D\) and \(\bar{D}\) inputs. The estimated performance of the SA-F/F using 0.5 \(\mu\)m device parameters at a supply voltage of 3.3 V is shown in Fig. 3. A standard CMOS D flip-flop [17] is also shown in Fig. 3 for comparison. In terms of area and power, the SA-F/F is comparable to the conventional D flip-flop. Moreover, the SA-F/F can operate in a true single-phase clock and hence requires no additional inverter to generate a local clock with the opposite polarity. However, the delay from clock (CLK) to output \((Q)\) is twice that of the conventional D flip-flop. This is because the sense-amplifier requires additional time to amplify the low-swing differential inputs.

B. Application to nMOS Differential Logic

nMOS differential logic is one of the most promising applications for the SA-F/F scheme, since dual-rail outputs of the nMOS differential logic can be directly connected to the SA-F/F inputs. Several differential nMOS logic families such as differential pass transistor logic (DPTL) [10] (Fig. 4(a)) and complementary pass transistor logic (CPL) [11] (Fig. 4(b)) have been proposed to improve CMOS circuit speed. Both utilize a differential nMOS pass-transistor combinational network not only as pull-down elements but also as pull-ups, by passing variables into drain inputs of pass-transistors. The speed advantage of DPTL and CPL is due to the large conductance of nMOS, small input capacitance by eliminating pMOS, and inherent efficiency of nMOS differential switching networks compared to conventional CMOS gates. However, eliminating pMOS results in a circuit that does not pass a logic ONE efficiently. Thus, the differential signal must be restored to normal logic levels by using either a static or a clocked differential buffer. CPL simply utilizes an inverter for an amplifier (Fig. 4(b)), which is a very practical solution but suffers from difficulty in achieving low voltage operation. Dynamic DPTL uses a latch-type sense-amp synchronous to a two-phase clock (Fig. 4(a)) [12], which can reduce the voltage swing of the differential signals. Nevertheless, the swing cannot be reduced to less than the threshold voltage of nMOS (about 700 mV) on the condition that all the drain/source nodes in the nMOS pass transistor network (\(B\), \(\bar{B}\), \(D\), and \(\bar{D}\) in Fig. 4(a)) are predischarged to ground in the precharge phase. The reason for this is that the input differential signals to the nMOS transistors are insensitive below their threshold voltage.

Fig. 4. XOR gate using nMOS differential logic. (a) Dynamic DPTL. (b) CPL.

The SA-F/F scheme in combination with nMOS dynamic differential logic is shown in Fig. 5(a). The differential inputs are generated from an nMOS differential logic network controlled by a \( \Phi_p \) pulse. The timing diagram is shown in Fig. 5(b). All the differential drain/source nodes in the pass-transistor network (including \(D\) and \(\bar{D}\) in Fig. 4(a)) are predischarged to ground while \( \Phi_p \) is active. After the differential outputs appear on \(D\) and \(\bar{D}\), they are sense-amplified and latched by the SA-F/F. Fig. 5(c) describes a circuit implementation of the SA-F/F scheme. A simple exclusive OR gate is shown. A clocked source-follower pull-up driver is used to drive pass-variable inputs. By predischarging the source/drain nodes in the network to ground, all pass-transistors operate in their linear regions and hence have a large conductance during the initial evaluation stage. The SA-F/F with pMOS gate inputs (used in Fig. 2(a)) makes it possible to detect less than 100 mV differential inputs whose common mode value is close to ground.

It is a significant limitation of this scheme that it can only be applied to the last block of a pipeline stage, because outputs of the nMOS network are directly connected to latches (i.e., SA-F/F's), and its inputs must be full-swing. Moreover, the scheme requires the generation of the precharge pulse \( \Phi_p \). Usually, the clock (CLK) is utilized for \( \Phi_p \), in which case only the latter half of a clock cycle can be used for evaluation of the network. Another option for generating \( \Phi_p \) is to use self-timing. That option does have the racing signal hazard between \( \Phi_p \) and the inputs of the nMOS network. However, the \( \Phi_p \) pulse is much easier to generate than the sense-amp activation signal because all the related signals are generated by conventional full-swing CMOS gates insensitive to the input-offset voltage, and hence much more predictable. For these reasons, the scheme is always accompanied by gates with other primary logic styles like conventional static CMOS, DPTL, and so on. Gates using the SA-F/F scheme would be clearly slower than those with the other logic styles like DPTL and CPL if the scheme were applied to a simple gate like Fig. 5(c). As stated earlier, the sense-amplifying mechanism is efficient only when it is applied to heavily loaded and slow dual-rail signals. The SA-F/F scheme makes it possible to construct large nMOS differential logic networks with deep logic depths, which are too slow to be realized by the conventional differential nMOS logic families. An example is shown in the next section.

C. Carry Skip Adder

The nMOS differential logic style in combination with the SA-F/F is applied to a carry skip (bypass) adder. Fig. 6(a) shows a 4 bit carry skip adder. Each rectangle in the figure represents a single digit and full circuit schematics are shown only in the third digit. It uses a dual-rail complementary Manchester carry chain, and is predischarged by the $\Phi_p$ pulse. The sum is produced by an exclusive-OR of the carry and propagate signals in each digit using nMOS differential logic and is sense-amplified and latched by the SA-F/F. This implementation also uses a "conflict-free" bypass circuit [13], which improves speed by isolating the node capacitance on the local chain. Conventional static CMOS gates are used to produce propagate ($P_0$–$P_3$), generate ($G_0$–$G_3$), kill ($K_0$–$K_3$), and carry skip signals ($P_0P_1P_2P_3$ and $\overline{P_0P_1P_2P_3}$) and are not shown in Fig. 6(a). Wider than 16 bit adders can easily be constructed by serially connecting the 4 bit adders, without additional area-consuming speed-up circuits such as carry look-ahead (CLA). The block diagram of the 20 bit adder is shown in Fig. 6(b). The critical path of the 20 bit adder includes the carry chain and the delay of the carry chain is largest when a carry is generated from digit 0 and propagated to digit 18. Consequently, in the critical path of the carry chain, 10 pass-transistors are connected serially. The speed of the carry propagation is determined by the transmission-line $RC$ delay of the chain whose time constant is derived from the equivalent resistance and capacitance of the chain. In the adder, the SA-F/F can detect a mere 100 mV input voltage difference ($\Delta V_{in}$) of the dual-rail carry chains. In contrast, the inverter used as a detector in the conventional Manchester carry chain adder with a single-rail pass-transistor carry chain requires a 1.5 V input voltage swing, which is the logic threshold of the inverter. Therefore, the carry propagation of the new adder is roughly 15 times faster than that of the conventional one. It should be noted that the amplifying time of the SA-F/F (on the order of 1 ns) is not included in the addition time but counted in the clock-to-data-out delay of the pipeline register. This time is of course not usually in the critical path.
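The logical function of a serially connected carry skip adder can be sketched in software. The following is a bit-level functional model only, assuming unsigned operands; it says nothing about the dual-rail circuit, the predischarge pulse, or timing, and the function name `carry_skip_add` and its parameters are hypothetical.

```python
def carry_skip_add(a, b, width=20, block=4, carry_in=0):
    """Return (sum, carry_out) of two unsigned `width`-bit integers.

    The adder is built from `block`-bit groups connected serially. Inside
    a group the carry ripples bit by bit; when every bit of the group
    propagates, the group's carry-in is forwarded directly (the bypass).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    total = 0
    carry = carry_in
    for base in range(0, width, block):
        block_carry_in = carry
        all_propagate = True
        for i in range(base, min(base + block, width)):
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            p = ai ^ bi                    # propagate
            g = ai & bi                    # generate (p = g = 0 means kill)
            total |= (p ^ carry) << i      # sum bit = propagate XOR carry
            carry = g | (p & carry)        # carry chain within the group
            all_propagate = all_propagate and p == 1
        if all_propagate:
            # Bypass path: logically the same value as the rippled carry,
            # but in hardware it skips the group's internal chain.
            carry = block_carry_in
    return total, carry


if __name__ == "__main__":
    print(carry_skip_add(0xFFFFF, 1))    # all-propagate case with carry-out
    print(carry_skip_add(12345, 54321))
```

Note that the bypass never changes the result; it only shortens the path a carry must travel, which is why the skip circuit matters for speed rather than for correctness.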

Since the differential input voltage of the SA-F/F is about 100 mV and the low level of the inputs is ground, the threshold voltage drop by nMOS pass-transistors and pull-up transistors does not hinder the function of the SA-F/F, even in low-voltage operation. The area penalty of the nMOS differential logic network compared to the ordinary CMOS gates is small because only nMOS transistors are used. Thus, in the case of a 20 bit adder, the resulting circuit with no additional CLA will have about a 30% area advantage as well as a 50% speed advantage over a conventional CMOS implementation with CLA. Since neither the current-controlled latch sense-amp employed in the SA-F/F nor the conventional delay flip-flop consumes dc power, and the voltage swing in the carry chain is reduced in high-speed operation, the new circuit is comparable to conventional CMOS circuits in terms of power consumption. Therefore, in terms of speed, area and power, the resulting adder is superior or equal to a conventional CMOS design using CLA.

The simulated performance of the carry skip adder with various bit lengths using 0.5 $\mu$m CMOS transistors is shown in Fig. 7. The addition times were estimated using an input offset voltage of 100 mV. It is assumed that the adder is constructed simply by connecting the 4 bit carry skip adders shown in Fig. 6 serially. Only the transistor width used for carry bypass was optimized. The 20 bit addition time is estimated to be 1.6 ns and the 64 bit time is 3.5 ns, which is faster than or competitive with adders using asymptotically faster but area-consuming algorithms such as CLA or carry select. In the case of adders with bit lengths greater than 64 bits, it is necessary to use a multiple carry skip technique [13] to remain competitive.

III. DCT IMPLEMENTATION

A. Architecture

The 2-D DCT processor macrocell which executes a two-dimensional 8 × 8 DCT and inverse DCT (IDCT) is implemented using the row-column decomposition method based on Chen [14]. The macrocell also has a regularized parallel architecture based on distributed arithmetic by Sun [16], which delivers high-throughput DCT/IDCT processing of one sample (pixel) per clock.
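Row-column decomposition means that a 2-D transform is computed as a 1-D transform over every row, a transposition, and a second 1-D transform. The sketch below checks this against a direct double sum in floating point. It assumes the textbook unnormalized DCT-II cosine kernel, which may differ in scaling from the coefficients used in the macrocell, and all function names are hypothetical.

```python
import math

N = 8


def dct_1d(x):
    """Unnormalized 8-point DCT-II (assumed kernel, no scaling factors)."""
    return [
        sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def transpose(m):
    """Swap rows and columns, as a transposition buffer would."""
    return [list(col) for col in zip(*m)]


def dct_2d_row_column(block):
    """2-D DCT as two 1-D passes separated by a transposition."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)


def dct_2d_direct(block):
    """Reference: direct evaluation of the separable 2-D double sum."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            out[u][v] = sum(
                block[r][c]
                * math.cos((2 * r + 1) * u * math.pi / (2 * N))
                * math.cos((2 * c + 1) * v * math.pi / (2 * N))
                for r in range(N)
                for c in range(N)
            )
    return out


if __name__ == "__main__":
    block = [[(r * 8 + c) % 17 for c in range(N)] for r in range(N)]
    a = dct_2d_row_column(block)
    b = dct_2d_direct(block)
    print(max(abs(a[u][v] - b[u][v]) for u in range(N) for v in range(N)))
```

The direct form needs 64 products for each of the 64 outputs, whereas the two 1-D passes need 8 products for each of 2 × 64 intermediate or final values, before any further factorization.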

There are two 1-D IDCT/DCT processors; one for row DCT/IDCT and another for column DCT/IDCT. A transposition RAM is used as a buffer between them as shown in Fig. 8.

An 8-point (unnormalized) 1-D IDCT operation is defined by

\[ Y_l = \sum_{k=0}^{7} C_{k,l} X_k \tag{1} \]

where \( C_{k,l} (k, l = 0, 1, \ldots, 7) \) are IDCT coefficients, \( X_k \) is transform-domain input data, and \( Y_l \) is time-domain output data. According to Chen's method, the expression in (1) can be decomposed into two groups of linear transformations by:

\[
Y_l =
\begin{cases}
\dfrac{1}{2} \left( \sum_{i=0}^{3} C_{(2i+1),l} X_{(2i+1)} + \sum_{i=0}^{3} C_{(2i),l} X_{(2i)} \right), & l = 0, 1, 2, 3 \\[2ex]
\dfrac{1}{2} \left( \sum_{i=0}^{3} C_{(2i+1),l} X_{(2i+1)} - \sum_{i=0}^{3} C_{(2i),l} X_{(2i)} \right), & l = 4, 5, 6, 7
\end{cases}
\tag{2}
\]

which reduces the total number of multiplications from 64 to 32. Therefore, 8 MAC units are needed to calculate 8 sets of

\[ Y = \sum_{k=0}^{3} C_k X_k \tag{3} \]

at one-pixel-per-clock throughput.
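The saving from 64 to 32 multiplications comes from the symmetry of the cosine coefficients: outputs \(l\) and \(7 - l\) share the same four-term even and odd partial sums. The sketch below demonstrates this with the textbook kernel \(\cos((2l+1)k\pi/16)\), which is an assumption; the scaling factor and the exact sign convention of (2) depend on how the coefficients \(C_{k,l}\) are defined and are not reproduced here. The function names are hypothetical.

```python
import math


def coeff(k, l):
    """Assumed unnormalized IDCT coefficient C[k][l]."""
    return math.cos((2 * l + 1) * k * math.pi / 16)


def idct_direct(X):
    """Direct 8-point evaluation: 8 products per output, 64 in total."""
    return [sum(coeff(k, l) * X[k] for k in range(8)) for l in range(8)]


def idct_even_odd(X):
    """Even/odd decomposition; returns (outputs, multiplication count)."""
    Y = [0.0] * 8
    mults = 0
    for l in range(4):
        even = sum(coeff(2 * i, l) * X[2 * i] for i in range(4))
        odd = sum(coeff(2 * i + 1, l) * X[2 * i + 1] for i in range(4))
        mults += 8                # two four-term sums of products
        Y[l] = even + odd         # butterfly: sum gives output l
        Y[7 - l] = even - odd     # difference gives the mirrored output
    return Y, mults


if __name__ == "__main__":
    X = [10.0, -3.0, 4.5, 0.0, 7.0, -1.0, 2.0, 8.0]
    Y, mults = idct_even_odd(X)
    ref = idct_direct(X)
    print(mults, max(abs(u - v) for u, v in zip(Y, ref)))
```

Each of the eight four-term sums has the form of (3), which is why eight MAC units suffice for one 1-D transform, with the final additions and subtractions left to a butterfly stage.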

In the distributed arithmetic method, MAC operation is done on a bit-by-bit level. Hardware multipliers are not used. A 1-D linear transformation of the form

\[ Y = \sum_{k=0}^{3} C_k X_k \quad (C_k \text{: DCT coefficient}), \]

\[ X_k = \sum_{n=0}^{15} x_{kn} 2^{-n} \quad (\text{input data}), \]

\[
x_{kn} \in
\begin{cases}
\{0, 1\} & (n \neq 0) \\
\{0, -1\} & (n = 0 \text{: sign bit of 2's complement})
\end{cases}
\tag{4}
\]

can be calculated by a MAC operation in the following iterative way:

\[ Y_i = \left[ \sum_{k=0}^{3} C_k x_{k(2i)} \right] + 2^{-1} \left[ \sum_{k=0}^{3} C_k x_{k(2i+1)} \right] + 2^{-2} Y_{i+1} \quad (i = 7, 6, \ldots, 0) \tag{5} \]

In (5), two adjacent digits are calculated in parallel since the transform must be completed in 8 cycles. This means the input data will be shifted in at a rate of 2 bits per clock cycle. Partial products of the form \( \sum C_k x_{kn} \) are derived from two table lookup ROM's whose capacities are 16 words by 16 bits.
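The recursion in (5) can be followed step by step in a short floating-point model. This is a sketch under assumptions: \(Y_8 = 0\) is taken as the initial value and \(Y_0\) as the result, a single 16-word table is read twice per cycle instead of using two physical ROMs, the negative weight of the sign bit is applied by negating the table output for \(n = 0\), and all names are hypothetical.

```python
def build_rom(coeffs):
    """16-word table: bit k of the address selects whether coeffs[k] is summed."""
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(16)
    ]


def bit(word, n):
    """Bit n of a 16-bit word, where n = 0 is the sign bit (MSB)."""
    return (word >> (15 - n)) & 1


def partial_product(rom, words, n):
    """Table lookup of sum_k C_k * x_kn for bit position n of four inputs."""
    addr = sum(bit(w, n) << k for k, w in enumerate(words))
    value = rom[addr]
    return -value if n == 0 else value   # sign bit carries weight -1


def da_mac(coeffs, words):
    """Evaluate Y = sum_k C_k * X_k with recursion (5), two bits per cycle."""
    rom = build_rom(coeffs)
    y = 0.0                              # assumed initial value Y_8 = 0
    for i in range(7, -1, -1):           # 8 cycles, least significant bits first
        y = (
            partial_product(rom, words, 2 * i)
            + 0.5 * partial_product(rom, words, 2 * i + 1)
            + 0.25 * y
        )
    return y


def to_value(word):
    """Interpret a 16-bit two's complement word as a fraction in [-1, 1)."""
    signed = word - 0x10000 if word & 0x8000 else word
    return signed / 2 ** 15


if __name__ == "__main__":
    coeffs = [0.5, -0.25, 0.75, 0.125]           # arbitrary example values
    words = [0x4000, 0xC000, 0x7FFF, 0x8001]     # four 16-bit inputs
    print(da_mac(coeffs, words))
    print(sum(c * to_value(w) for c, w in zip(coeffs, words)))
```

No multiplier appears anywhere in `da_mac`: the products with 0.5 and 0.25 are fixed shifts, and every coefficient product is replaced by a table lookup followed by an addition.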

Fig. 9 shows a block diagram of the 8 point 1-D DCT unit in the DCT/IDCT macro which implements the above equation. The data sequence \( x_0, x_1, \ldots, x_7 \) is stored sequentially into an input buffer memory with bit-parallel structure. With a latency of 8 cycles, the contents in the buffers are read out concurrently in bit serial structure with the least significant bit first. The buffer memory is a special purpose memory for parallel-to-serial transposition, which has 16 word \( \times \) 16 bit capacity. The bit-serial data are loaded into the 8 MAC units concurrently and calculated iteratively. The resultant sums from the MAC units are sent to the butterfly stage.

B. MAC Implementation

The MAC unit which realizes the expression in (5) is implemented with ROM’s, accumulators and shifters. Fig. 10 shows the block diagram of the MAC unit and its circuit implementation. Two partial products from two different ROM’s are added in parallel first and then accumulated, as shown in the block diagram. The output has 20 bit accuracy. In the circuit implementation, two bits from the ROM’s and one bit from the accumulation register are first added by a full adder, and then the full adder outputs are loaded into a 20 bit carry propagation adder. This carry save addition technique eliminates the need for another carry propagation adder.
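The carry save step can be illustrated on whole words. In the sketch below, one full adder per bit position reduces three operands to a sum word and a carry word without any carry rippling, and a single carry propagation addition then produces the result. The operand alignment (the shifts by \(2^{-1}\) and \(2^{-2}\) in (5)) is deliberately left out, and the names `carry_save` and `mac_step` are hypothetical.

```python
WIDTH = 20
MASK = (1 << WIDTH) - 1


def carry_save(a, b, c):
    """3:2 compression: an independent full adder at every bit position."""
    sum_word = a ^ b ^ c                             # full adder sum outputs
    carry_word = ((a & b) | (b & c) | (a & c)) << 1  # carry outputs, weight x2
    return sum_word & MASK, carry_word & MASK


def mac_step(rom_even, rom_odd, acc):
    """Add two ROM outputs and the accumulator using one carry-propagate add."""
    sum_word, carry_word = carry_save(rom_even, rom_odd, acc)
    return (sum_word + carry_word) & MASK            # the only rippling addition


if __name__ == "__main__":
    print(mac_step(0x12345, 0x0ABCD, 0x00F0F))
    print((0x12345 + 0x0ABCD + 0x00F0F) & MASK)
```

Adding three operands with two cascaded carry propagation adders would put two carry chains in series; the 3:2 compression costs only one full adder delay before the single final adder.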

A 20 bit differential carry skip adder with the SA-F/F scheme is employed as a final adder. Owing to the high-speed nature of the SA-F/F scheme, no pipeline latch is required in the entire MAC stage, which means that compared to the previous work [5] shown in Fig. 11, two pipeline latches were eliminated. This is crucial in area reduction. The DCT macrocell requires 16 MAC units, which occupy 60% of the total macro area. Because the 20 bit adders with the SA-F/F have a smaller area, the overall macro size is reduced by 15% compared to a conventional CMOS implementation.

IV. FABRICATION AND RESULTS

The DCT test chip was fabricated using 0.8 μm base-rule double-metal CMOS technology. 0.5 μm nMOSFET’s and 0.6 μm pMOSFET’s are used for 3.3 V operation. Features of the macrocell are summarized in Table I. A die microphotograph is shown in Fig. 12. The macrocell is designed using fully customized cells and measures 3.85 mm × 3.45 mm. It is primarily made up of a row 1-D DCT unit, a column 1-D DCT unit, 2 K bit one-port SRAM for row-column decomposition, and a controller. The two 1-D DCT units include 16 MAC units, two 256 bit two-port SRAM’s to transpose input data, two preprocessing units, and two post-processing units. The macro also has boundary scan registers to make testing easier. The SRAM used for row-column transposition can be tested directly, utilizing serial scan techniques. It should be noted that the SA-F/F is used only in the 20 bit adder of the MAC unit where its contribution is greatest.

The macrocell is designed to operate at 200 MHz at 3.3 V and at room temperature. Fig. 13 shows the simulated waveforms of the critical path in the macro at a 200 MHz clock rate. The critical path lies within the MAC unit. It can be seen that the 20 bit adder speed from the full adder output to the clock transition from 1 to 0 is 1.6 ns. Fig. 14 shows measured speeds of the MAC unit. The evaluation is done by changing phase of a special clock which controls the output latch of the MAC unit relative to the master clock. Typically 200 MHz operation is observed at 3.3 V, and 100 MHz operation is attainable at 2 V. Power consumption of the macro is 0.35 W at 40 MHz under 3.3 V operation and 0.15 W at 40 MHz under 2 V operation.

The computational accuracy evaluation results of the IDCT operation are shown in Table II. The bit length of the ROM was chosen to be 16 bits, and internal accuracy of the MAC unit is 20 bits. Results in the table fully comply with the IDCT accuracy specification of IEEE 1180–1990 which H.261 and MPEG use.

The macrocell was implemented in a single-chip HDTV MPEG2 decoder [15] which can decode baseband HDTV signals at a 70 MHz clock rate.

V. CONCLUSION

A 200 MHz 13.3 mm² 2-D DCT macrocell with 1 sample-(pixel)-per-clock throughput was described. The macrocell can execute both DCT and IDCT processing which is electrically switchable and fully satisfies IEEE 1180–1990. The fast speed and small area are achieved using the novel SA-F/F scheme. In the scheme, a special flip-flop, the SA-F/F, was used in combination with nMOS differential logic. The SA-F/F can be used as a differential sense-amplifier synchronous to the system clock and can amplify dual-rail inputs with swings lower than 100 mV.

A 1.6 ns 20 bit carry skip adder was designed by the same scheme and used in the DCT macrocell. The adder is 50% faster and 30% smaller than a conventional CMOS carry look ahead adder, which reduces the macrocell size by 15% compared to a conventional CMOS implementation.

The SA-F/F scheme can be used for other high-speed and small-area circuit implementations [8],[16]. The macro has been fabricated as a test chip and has been implemented in a single-chip HDTV MPEG2 decoder.

ACKNOWLEDGMENT

The authors would like to thank N. Kai, T. Odaka, T. Oto, A. Parameswar, B. Baas, G. Yeh, K. Maeguchi, K. Kanzaki, S. Suzuki, S. Sasaki, S. Kohyama, H. Nakatsuka, and Y. Unno for valuable discussions and encouragement. We also wish to acknowledge the contribution of T. Shimazawa, S. Mita, F. Sano, and K. Seta for the implementation of the chip. We would also like to extend our gratitude to the editor and two anonymous reviewers for valuable comments and suggestions.

REFERENCES

Masataka Matsui (S'83-M'85) was born in Tokyo, Japan, on August 30, 1960. He received the B.S. and M.S. degrees in electronic engineering from the University of Tokyo, Tokyo, Japan, in 1983 and 1985, respectively.

In 1985 he joined the Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki, Japan, where he has been engaged in the research and development of static memories including 1 Mbit CMOS SRAM and 1 Mbit BiCMOS SRAM, BiCMOS ASIC’s, and video compression/decompression LSI’s. Since 1993, he has been a visiting scholar at Stanford University, where he is working on low-power LSI design.

Mr. Matsui is a member of the Institute of Electronics, Information and Communication Engineers of Japan.

Hiroyuki Hara was born on November 19, 1960 in Tokyo, Japan. He received the B.S. degree in electronic engineering from Shibaura Institute of Technology, Tokyo, Japan in 1983.

In 1983, he joined Toshiba Corporation, Kawasaki, Japan, where he was engaged in bipolar and BiCMOS LSI development and design. He is now in Toshiba’s Semiconductor Device Engineering Laboratory where he has been engaged in the research and development of BiCMOS macrocells for high performance ASIC’s. He has recently developed a DCT macrocell for video compression/decompression LSI’s.

Lee-Sup Kim (S'86-M'89) received the B.S. degree in electronics engineering from Seoul National University, Korea, in 1982 and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 1986 and 1990, respectively.

He was a post-Doctoral Fellow at the Toshiba Corporation, Kawasaki, Japan, during 1990-1993, where he was involved in the design of the high performance DSP and single chip MPEG2 decoder. Since March 1993, he has been at KAIST as an Assistant Professor. His research interests are ASIC design for communication, LCD driver IC design, and SAR (Synthetic Aperture Radar) hardware implementations.

Tetsu Nagamatsu (M’86) received the B.S. degree in applied physics from Waseda University, Tokyo, Japan, in 1984, and the M.S. degree in energy science from Tokyo Institute of Technology, Tokyo, Japan, in 1986.

He joined Semiconductor Device Engineering Laboratory in 1986. He has been engaged in the research and development of BiCMOS ASIC’s, BiCMOS memory macrocells, and a CMOS DCT macrocell.

Mr. Nagamatsu is a member of the Institute of Electronics, Information and Communication Engineers of Japan.

Yoshinori Watanabe was born in Mie, Japan, on May 11, 1961. He received the B.S. degree in electronic control engineering from Tokai University, Japan, in 1984.

He joined Toshiba Microelectronics Corporation, Kawasaki, Japan, and moved to Toshiba’s Semiconductor Device Engineering Laboratory in 1986. He has been engaged in the research and development of BiCMOS ASIC and BiCMOS memory macrocells. He has currently developed a DCT macrocell for video compression/decompression LSI’s.

Akihiko Chiba was born in Hokkaido, Japan, on May 19, 1967. He received the B.S. degree in mechanics from Hokkaido Institute of Technology, Japan, in 1990.

He joined Toshiba Microelectronics Corporation, Kawasaki, Japan, and then moved to Toshiba’s Semiconductor Device Engineering Laboratory, where he has been engaged in the research and development of BiCMOS macrocells and LSI testing research. He has currently developed a DCT macrocell for video compression/decompression LSI’s.

Kouji Matsuda was born in Tokyo, Japan, on December 20, 1959. He received the B.S. degree in computer science from Shonan Institute of Technology, Kanagawa, Japan, in 1982. In 1982 he joined the Toshiba Microelectronics Corporation, Kawasaki, Japan, where he has been engaged in VLSI testing research and the development of BiCMOS ASIC’s and an MPEG2 decoder LSI.

Yoshiharu Uetani received the B.S.E.E. degree in 1982 from Himeji Institute of Technology, Hyogo, Japan.

In 1982, he joined the Research and Development Center, Toshiba Corporation, Kawasaki, Japan, where he has been engaged in the development of video compression systems.

Mr. Uetani is a member of the Institute of Television Engineers of Japan.

Takayasu Sakurai (S’77–M’78) was born in Tokyo, Japan, on January 10, 1954. He received the B.S., M.S., and Ph.D. degrees in electronic engineering from the University of Tokyo, Tokyo, Japan, in 1976, 1978, and 1981, respectively. His Ph.D. work was on electronic structures of a Si-SiO$_2$ interface.

In 1981 he joined Semiconductor Device Engineering Laboratory, Toshiba Corporation, Japan, where he was engaged in the research and development of CMOS dynamic RAM and 64 Kbit, 256 Kbit SRAM, 1 Mbit virtual SRAM, cache memories, and BiCMOS ASIC’s. During the development, he also worked on the modeling of interconnect capacitance and delay, new memory architecture, hot-carrier resistant circuits, arbiter optimization, gate-level delay modeling, alpha $\nu$-th power MOS model, and transistor network synthesis. From 1988 through 1990, he was a visiting scholar at University of California, Berkeley, doing research in the field of VLSI CAD. He is currently back at Toshiba, managing multimedia LSI development. His present interests include low-power designs, DSP’s, FPGA’s and video compression/decompression LSI’s. He is also a visiting lecturer at the University of Tokyo.

Dr. Sakurai serves as a program committee member for the Symposium on VLSI Circuits, the CICC, the International Conference on VLSI and CAD, and the ACM FPGA Workshop. He is a member of the Institute of Electronics, Information and Communication Engineers of Japan and the Japan Society of Applied Physics.