# 16.482 / 16.561: Computer Architecture and Design 

Fall 2014<br>Midterm Exam Solution

1. (8 points) Evaluating instructions

Assume the following initial state prior to executing the instructions below. Note that the result of each instruction may depend on prior instructions.

- \$t4 = 0x0000000A, \$t5 = 0x00000004, \$t6 = 0x00101000
- Contents of memory (all values are in hexadecimal)

| Address | Lo |  | Hi |  |
| :---: | :---: | :---: | :---: | :---: |
| 0x00101600 | AA | 14 | 92 | 44 |
| 0x00101604 | 08 | 22 | 11 | 13 |

For each instruction below, list the changed register or memory location(s) and its new value.
add \$s0, \$t4, \$t5
$\$ \mathrm{~s} 0=\$ \mathrm{t} 4+\$ \mathrm{t} 5=0 x 0000000 \mathrm{~A}+0 x 00000004=\underline{0 x 0000000 \mathrm{E}}$
ori \$s0, \$s0, 0xCC11
\$s0 = \$s0 OR 0xCC11 = 0x0000000E OR 0x0000CC11 = 0x0000CC1F
sh \$s0, 0x606(\$t6)
Address $=0 \times 606+\$ t 6=0 x 00000606+0 x 00101000=0 x 00101606$
mem[0x00101606] = lowest halfword in \$s0 = 0xCC1F $\rightarrow$ mem[0x00101604] $=0 x 0822 C C 1 F$ (changed bytes underlined)

1b \$s1, 0x600(\$t6)
Address $=0 x 600+\$ t 6=0 x 00000600+0 x 00101000=0 x 00101600$
\$s1 = sign-extended byte from mem[0x00101600]
$\operatorname{mem}[0 x 00101600]=0 x A A \rightarrow \$$ si $=0 x F F F F F F A A$
slt \$s2, \$t4, \$t5
\$s2 = 1 if (\$t4 < \$t5), \$s2 = 0 otherwise
$\$ \mathrm{t} 4=0 \mathrm{x} 0000000 \mathrm{~A}$ and $\$ \mathrm{t} 5=0 x 00000004 \rightarrow \$ \mathrm{t} 4>\$ \mathrm{t} 5 \rightarrow \$ \mathrm{~s} 2=0$
bne \$s2, \$zero, L
Branch to L if (\$s2 $\neq$ \$zero)
Since $\$ \mathrm{~s} 2=\$$ zero $=0$, branch is not taken
sra \$s1, \$s1, 8
\$s1 = \$s1 >> 8 (keep sign) = 0xFFFFFFAA >> 8 = 0xFFFFFFFF
L: addi \$s1, \$s1, -1
\$s1 = \$s1 - 1 = 0xFFFFFFFF - 1 = 0xFFFFFFFE
2. (16 points) Binary multiplication
a. (6 points) The two multipliers discussed in class are shown below:


Describe one major advantage of the optimized multiplier (on the left), and one major advantage of the tree multiplier (on the right).

Solution: The descriptions below assume the multiplication of two n-bit values:
The optimized multiplier uses less hardware than the tree multiplier-it requires only one adder, an n-bit register for the multiplicand, a 2n-bit shift register for the product/multiplier, and control logic. The tree multiplier also uses a $2 n$-bit register (to store the final result), the initial stage consists of $n 2$-bit AND gates, and the multiplier tree contains $n-1$ adders.
The tree multiplier is faster than the optimized multiplier-it takes $\mathrm{O}\left(\log _{2}(\mathrm{n})\right)$ cycles, as opposed to $\mathrm{O}(\mathrm{n})$ cycles for the optimized multiplier.
b. (10 points) You are given $A=-6$ and $B=3$. Assume each operand uses four bits. Show how the binary multiplication of $A * B$ would proceed using the optimized multiplier.

Solution: First of all, please note that this question was poorly formulated-I essentially asked you to do signed multiplication in a multiplier that can't properly handle signed multiplication. If you simply plug in $-6=1010_{2}$ and $3=0011_{2}$, you end up with 30 as your result if you follow the appropriate steps. Sign-extending your product/multiplier at each step doesn't help, either-your result will be 14 in that case.

The only way to therefore get a correct answer is to treat both values as positive and correct the signs after the fact. I've shown that solution below. However, because of the issues inherent in the problem, the grading was handled very, very leniently.

Initially, product/multiplier $=00000011$ (underlined bit determines next step)
Step 1: LSB of product/multiplier $=1 \rightarrow$ add multiplicand into left half of register $\&$ shift right 00000011

| $+\quad 0110$ |
| :--- |

01100011
$00110001 \leqslant$ Product/multiplier after shift
Step 2: LSB = $1 \rightarrow$ add multiplicand into left half of register \& shift right 00110001

| $+\quad 0110$ |
| :--- |

10010001
$01001000 \leqslant$ Product/multiplier after shift
Step 3: LSB $=0 \rightarrow$ shift right
0010010 $\quad \leftarrow$ Product/multiplier after shift
Step 3: LSB $=0 \rightarrow$ shift right
$00010010 \leqslant$ Final result $=18$
After negation, $-\left(00010010_{2}\right)=11101101_{2}+1=11101110_{2}=-18$

## 3. (20 points) IEEE floating-point format

Multiply the two IEEE single-precision floating-point values 0x40900000 and 0xc1480000. For full credit, you must show all work, including:

- Convert the two values into binary
- Perform the multiplication in binary, not decimal
- Re-encode the result in IEEE single-precision format

Solution: We break each given value into the three fields of a single-precision floating-point value: sign (1 bit), biased exponent (8 bits), and fraction (23 bits), then convert each value:

```
0x40900000 = 0100 0000 10010000 0000 0000 0000 00002
    Sign = 0 (positive value)
    Biased exponent = 100000012 = 129
        Actual exponent = [Biased exponent] - bias = 129-127 = 2
    Fraction = 0010000000000000000 00002 = 0012
0x40900000 = 1.0012 }\times\mp@subsup{2}{}{2}=4.51
0xC1480000 = 1100 00010100 1000 0000 0000 0000 00002
    Sign = 1 (negative value)
    Biased exponent = 100000102 = 130
        Actual exponent = [Biased exponent] - bias = 130-127=3
    Fraction = 100 100000000000 0000 00002 = 10012
0xC0980000 =-1.10012 }\times\mp@subsup{2}{}{3}=-12.51
```

To multiply these numbers:

- Add exponents
o $2+3=5$
- Multiply significands:
o $1.001_{2} * 1.1001_{2}=1.1100001_{2}$
- Renormalize if necessary (not necessary in this case)
- Determine sign
o Positive * negative = negative
- To convert back to single-precision format:
o Sign = 1 (negative value)
o Biased exponent = actual exponent + bias $=5+127=132=10000100_{2}$
o Fraction $=1100001 \ldots 000_{2}$
Therefore, the final result is $\mathbf{- 1 . 1 1 0 0 0 0 1} \times \mathbf{2}^{5}=0 \times 2610000$ in single-precision format $=$ $-56.25_{10}$


## 4. (16 points) Datapaths and pipelining

## For all parts of this problem, show all work for full credit.

a. (4 points) You are given two simple processors-one using a single-cycle design, the other a basic 5-stage pipeline-in which the instruction stages take the following amount of time:

- Fetch (IF): 20 ns
- Decode (ID): 25 ns
- Execute (EX): 30 ns
- Memory access (MEM): 40 ns
- Write back (WB): 25 ns

How much time does each processor take to execute a single instruction, from start to finish?
Solution: In the single-cycle design, the cycle time-which is the time required to execute a single instruction-is the sum of all component stages: $20+25+30+40+25=\mathbf{1 4 0} \mathbf{~ n s}$

In the pipelined design, the cycle time is determined by the longest stage ( 40 ns , for MEM), and each instruction takes 5 cycles, so the time for a single instruction is $5 * 40=\mathbf{2 0 0} \mathbf{n s}$.
b. (4 points) Now, assume each processor is redesigned with a faster ALU that reduces the time for the EX stage to 20 ns . How much time does a single instruction take on the redesigned processors? Explain your answer.

Solution: Changing a single stage affects the time for the single-cycle design-since the EX stage is now 10 ns faster, the time per instruction is also 10 ns faster, making it $\mathbf{1 3 0} \mathbf{n s}$.

Changing a single stage only affects the pipelined design if that stage is the longest cycle. Since the longest cycle in this processor is the MEM stage, the change to the EX stage does not affect the overall time for a single instruction-it's still $200 \mathbf{n s}$.

4 (continued)
c. (8 points) Consider the following code sequence:

```
lw $t0, 0($s1)
add $t2, $t0, $s0
lw $t1, 8($s1)
add $t3, $t1, $s0
xor $t4, $t2, $t3
sw $t4, 16($s1)
and $t5, $t2, $t3
sw $t5, 16($s1)
```

Determine the time required to execute this sequence on the pipelined processor from part (b), assuming that processor uses a 5-stage pipeline with forwarding. Express your answer in ns, not cycles.

Solution: Recall that forwarding in a 5-stage pipeline resolves all potential data hazards except those due to a load instruction followed by a dependent add, which requires a single stall cycle between those instructions. This code sequence has two such pairs of instructions (the lw instructions followed by dependent adds), which essentially adds two no-op instructions.

To determine the number of cycles, you could draw a pipeline diagram, or remember that a program with N instructions running on an M -stage pipeline takes $\mathrm{M}+(\mathrm{N}-1)$ cycles. In this case, $M=5$ and $N=10$ ( 8 original instructions +2 no-ops), giving a total of $5+(10-1)=14$ cycles.

Since each cycle takes 40 ns on this processor, the total execution time is $14 * 40=\mathbf{5 6 0} \mathbf{n s}$.

## 5. (21 points) Dynamic branch prediction

a. (14 points) Say you are executing a program that contains two branches, as shown below. You are given the addresses of each branch in both decimal and hexadecimal. Note that these branches are inside a loop, but neither one controls the number of loop iterations.


Your processor contains a sixteen-entry, 2-bit branch history table. Initially, entries 0-7 (the first eight lines of the table) all have the state 10, and entries 8-15 (the last eight lines of the table) all have the state 01.

Complete the table below to show which BHT entry is used to predict each branch, what predictions are made based on that entry, and how the state of each BHT entry changes throughout the program. You are given the actual outcome for each branch.

Solution: Note that, to determine the BHT entry number in the 16-entry table, you must use the 4 lowest-order address bits that actually change-the lowest two bits of every instruction address are always 0 . Therefore, the BEQ at address $148=10 \underline{010100_{2}}$ accesses entry 5, and the BNE at address $236=11101100_{2}$ accesses entry 11 .

Students were responsible for completing the table entries with underlined, bold-faced font.

| Loop <br> Iteration | Branch | BHT <br> Entry <br> $\#$ | BHT <br> Entry <br> State | Pred. | Actual <br> Outcome | New BHT <br> Entry <br> State |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | BEQ | $\underline{\mathbf{5}}$ | $\underline{\mathbf{1 0}}$ | $\underline{\mathbf{T}}$ | T | $\underline{\mathbf{1 1}}$ |
| 1 | BNE | $\underline{\mathbf{1 1}}$ | $\underline{\mathbf{0 1}}$ | $\underline{\mathrm{NT}}$ | NT | $\underline{\mathbf{0 0}}$ |
| 2 | BEQ | $\underline{\mathbf{5}}$ | $\underline{\mathbf{1 1}}$ | $\underline{\mathbf{T}}$ | NT | $\underline{\mathbf{1 0}}$ |
| 2 | BNE | $\underline{\mathbf{1 1}}$ | $\underline{\mathbf{0 0}}$ | $\underline{\mathrm{NT}}$ | T | $\underline{\mathbf{0 1}}$ |
| 3 | BEQ | $\underline{\mathbf{5}}$ | $\underline{\mathbf{1 0}}$ | $\underline{\mathbf{T}}$ | NT | $\underline{\mathbf{0 0}}$ |
| 3 | BNE | $\underline{\mathbf{1 1}}$ | $\underline{\mathbf{0 1}}$ | $\underline{\mathrm{NT}}$ | NT | $\underline{\mathbf{0 0}}$ |
| 4 | BEQ | $\underline{\mathbf{5}}$ | $\underline{\mathbf{0 0}}$ | $\underline{\mathrm{NT}}$ | T | $\underline{\mathbf{0 1}}$ |
| 4 | BNE | $\underline{\mathbf{1 1}}$ | $\underline{\mathbf{0 0}}$ | $\underline{\mathrm{NT}}$ | T | $\underline{\mathbf{0 1}}$ |

## 5 (continued)

b. (4 points) How many entries are in an $(8,2)$ correlating branch predictor if the predictor uses 6 bits from each branch's address to choose the appropriate row within the predictor? Show all work to justify your answer.

Solution: A correlating predictor is essentially a two-dimensional array, where the row is chosen by the appropriate bits from the instruction address and the column is chosen by the global history. The number of instruction and global history bits therefore allows you to determine the size of the predictor:

- Since 6 bits are used to determine the row within the predictor, there are $2^{6}=64$ rows.
- The $(8,2)$ notation tells you that there are 8 global history bits and each entry in the predictor contains 2 bits. Given 8 global history bits, there are $2^{8}=256$ columns.

Therefore, the total number of entries in the predictor is the product of the number of rows and columns:

$$
64 * 256=2^{6} * 2^{8}=2^{14}=16384 \text { entries }
$$

c. (3 points) Explain the purpose of a branch target buffer (BTB).

Solution: A BTB stores the target addresses of recently taken branches. When a branch instruction is fetched, the BTB is checked to see if that branch's target has been previously calculated. If the branch is predicted as taken, the target address stored in the BTB is used to determine the address of the next instruction to be fetched.

## 6. (19 points) Dynamic scheduling

a. (3 points) What potential problems with dynamic scheduling are prevented using register renaming?

Solution: Register renaming avoids issues that arise due to name dependences. Instructions that have no dataflow between them but use the same register names create potential hazards when those instructions are reordered. Specifically, anti-dependences, in which one instruction reads a register and a later instruction writes it, create potential write after read (WAR) hazards, while output dependences, in which two instructions write the same destination, create potential write after write (WAW) hazards. Both of these hazard types are prevented using register renaming.
b. (3 points) Explain why instructions in dynamically scheduled processors do not have to execute the exact same stages for each instruction (for example, only loads and stores access memory, branches do not have a write back stage, etc.).

Solution: In a basic 5-stage pipeline, all instructions use the exact same functional units as they progress through the five stages. However, in a dynamically scheduled processor, each instruction type essentially has its own execution pipeline. All instructions go through the same fetch (IF) and issue (IS) stages, but they are then issued to reservation stations, buffers associated with each type of functional unit that hold instructions waiting for their operands. Since the execution pipelines are separate, each one can be tailored to the specific types of operations that use that pipeline.
c. (3 points) In a dynamically scheduled processor with speculation, why must instructions commit in order, even though they are allowed to complete out of order?

Solution: Instructions are allowed to complete out of order to maintain the benefits of dynamic scheduling. However, each instruction is forced to commit (update the permanent state of the processor) in order so that the processor can easily recover from branch mispredictions. That process involves squashing all instructions that were incorrectly fetched after the mispredicted branch, which could not be done correctly if instructions were allowed to commit as soon as they finished computing their results.
a. (10 points) Complete the pipeline diagram below to show how the given code is executed on a dynamically scheduled processor without speculation. Assume the following latencies, which refer to the number of execution cycles unless otherwise noted:

- 2 cycles (1 EX, 1 MEM) for L.D and S.D
- 3 cycles for ADD.D and SUB.D
- 5 cycles for MUL.D
- 2 cycles for DADDUI
- 1 cycle for all other instructions

Also assume that the processor only contains one common data bus. Note that your solution may not use all 20 cycles shown below, but it should not use more than 20 cycles.

Solution: The full pipeline diagram is below. Note that a potential structural hazard forces the SUB.D to stall its WB stage for one cycle, thus delaying the ADD.D and first S.D as well. Both stores can compute their addresses (in EX) immediately, but must stall their memory stages until the values to be stored are available.

| Inst. | $\mathbf{1}$ | $\mathbf{2}$ | $\mathbf{3}$ | $\mathbf{4}$ | $\mathbf{5}$ | $\mathbf{6}$ | $\mathbf{7}$ | $\mathbf{8}$ | $\mathbf{9}$ | $\mathbf{1 0}$ | $\mathbf{1 1}$ | $\mathbf{1 2}$ | $\mathbf{1 3}$ | $\mathbf{1 4}$ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| L.D <br> F2,0(R1) | IF | IS | EX | M | WB |  |  |  |  |  |  |  |  |  |
| MUL.D <br> F6,F2,F0 |  | IF | IS | S | E1 | E2 | E3 | E4 | E5 | WB |  |  |  |  |
| L.D <br> F4,8(R1) |  |  | IF | IS | EX | M | WB |  |  |  |  |  |  |  |
| SUB.D <br> F8,F4,F0 |  |  |  | IF | IS | S | E1 | E2 | E3 | S | WB |  |  |  |
| ADD.D <br> F10,F6,F8 |  |  |  |  | IF | IS | S | S | S | S | E1 | E2 | E3 | WB |
| S.D <br> F10, $-8(R 1) ~$ |  |  |  |  |  | IF | IS | EX | S | S | S | S | S | M |
| S.D <br> F8,16(R1) |  |  |  |  |  |  | IF | IS | EX | S | M |  |  |  |
| DADDUI <br> R1,R1,32 |  |  |  |  |  |  |  | IF | IS | E1 | E2 | WB |  |  |
| SLTI <br> R2,R1,640 |  |  |  |  |  |  |  |  | IF | IS | S | EX | WB |  |
| BNE <br> R2,R0,Loop |  |  |  |  |  |  |  |  |  | IF | IS | S | EX |  |

