-
True/False. The single cycle datapath has a CPI of one, so it has better
performance than the MIPS pipelined datapath which has a CPI of five.
Explain your answer.
False. While it is true that a five-stage pipeline takes
five cycles to execute a single instruction, it is executing five instructions at
the same time, so the average CPI approaches one. Also, the single-cycle
datapath requires a very long cycle time, whereas the cycle time of the
pipelined version is much smaller.
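The tradeoff can be made concrete with a back-of-the-envelope calculation; the 800 ps and 200 ps cycle times below are assumed for illustration only, not taken from the question:

```python
# Rough comparison (assumed numbers): a single-cycle datapath's clock must
# cover its whole critical path, while a 5-stage pipeline's clock only
# needs to cover the slowest stage.
N = 1_000_000                  # instructions executed

single_cycle_time_ps = 800     # assumed critical-path delay
pipeline_cycle_time_ps = 200   # assumed slowest-stage delay

t_single = N * single_cycle_time_ps          # CPI = 1, but a long cycle
t_pipe = (N + 4) * pipeline_cycle_time_ps    # 4 extra cycles to fill the pipeline

print(t_single / t_pipe)       # roughly 4x faster despite the "CPI of five"
```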
-
Assuming current technology, give an approximate and reasonable
estimate of the access time for cache? main memory? disk?
First-level cache should ideally run at the same speed as the
processor clock, i.e., one cycle.
In reality it is probably more like 2-3 clock cycles, which might
be 2-4 ns. Main memory is tens of nanoseconds, and disk is tens
of milliseconds.
-
What is a write-through cache? What is a write-back cache?
We didn't go over this in class; see the discussion on pages 466-468 in the book
and the margin notes on page 467.
-
Explain the concept of temporal locality.
How does a cache exploit it? Give an example of a program
that would have good data temporal locality.
Temporal locality is the property that a memory address
accessed at time $t$ is likely to be accessed again soon after.
A cache exploits it by holding the values of recently
used addresses. An example program that exhibits
data temporal locality would be
a program where a variable stored in memory is reused in
a loop:
y = 0
while cond:
    y = y + x**2   # <-- y and x are used over and over
-
Explain the concept of spatial locality.
How does a cache exploit it?
Give an example of a program that would have good data
spatial locality.
Spatial locality is the property that memory addresses
near one recently accessed are likely to be accessed in the near future.
When a processor accesses a value at address $x$, it pre-fetches values at
nearby addresses such as $x+4$ and $x+8$. An example program
that exhibits data spatial locality would be one that runs
through an array. For example,
vec = [8, 5, 1, 6, 8]
sum = 0
for x in vec:
    sum = sum + x
-
What is a comparator and where are they used in cache implementations?
A comparator is a combinational circuit that compares two
n-bit values, outputting 1 if they
are the same and 0 otherwise.
They are used in a cache to compare tags.
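As a behavioral sketch (not the gate-level circuit), an n-bit comparator can be modeled in Python; the tag values below are made up for illustration:

```python
def comparator(a, b, n):
    """Model an n-bit equality comparator: 1 if the low n bits match, else 0."""
    mask = (1 << n) - 1
    return int((a & mask) == (b & mask))

# e.g. checking the tag field of an address against a stored 17-bit tag
print(comparator(0x1E1F, 0x1E1F, 17))  # 1 -> tags match (hit)
print(comparator(0x1E1F, 0x1E20, 17))  # 0 -> tags differ (miss)
```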
-
Why is a fully associative cache more expensive to implement than a direct
mapped cache?
A fully associative cache requires a comparator on every cache block.
There also needs to be extra hardware to implement
the LRU replacement strategy.
-
Explain the difference between direct-mapped,
set-associative, and fully
associative cache designs.
In a direct-mapped cache a memory address maps to only one cache block.
In a fully-associative cache a memory address maps to any cache block.
In a set-associative cache a memory address maps to a particular
set of cache blocks.
-
If a cache has one set it is the same as a
fully associative
cache.
-
If a cache with $n$ blocks has $n$ sets it is
the same as a direct mapped cache.
-
Express the floating-point number -7.3125 in single precision IEEE floating-point
format. Write your answer in hexadecimal.
answer: c0ea0000
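The conversion steps, and a quick check of the answer using Python's `struct` module:

```python
import struct

# -7.3125 = -111.0101_2 = -1.110101_2 x 2^2
# sign = 1, biased exponent = 2 + 127 = 129 = 10000001_2
# fraction field = 1101 0100 0000 0000 0000 000 (23 bits)
print(struct.pack('>f', -7.3125).hex())  # c0ea0000
```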
-
If a multiplexor has $n$ inputs how many selector wires does it require?
answer: $log_2(n)$
-
If n bits of an address are used to specify a
block in a direct
mapped cache how many blocks does the cache have?
answer: $2^n$
-
If a set associative cache has n sets how many bits are
needed in
the address to specify the set the address maps to?
answer: $log_2(n)$
-
Why does the MIPS need exactly five bits to specify the
destination register
in an R-type instruction?
There are 32 registers and we need 5 bits to distinguish
between 32 different objects: $log_2(32) = 5$ or $2^5 = 32$.
-
If a decoder has n inputs, how many outputs does it have?
answer: $2^n$
-
What is a page table?
A page table maps logical pages to physical pages.
-
Assume we have 32 bit addresses and a page size of 8K bytes.
How many entries can the page table contain?
answer: $2^{32}/2^{13} = 2^{19} = 512K$ page table entries at most.
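The arithmetic above, sketched in Python:

```python
address_bits = 32
page_offset_bits = 13            # 8 KB = 2^13 byte pages
entries = 2 ** (address_bits - page_offset_bits)
print(entries)  # 524288 = 512K page table entries
```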
-
What floating-point number does 0xC2078000
represent when interpreted as a single
precision IEEE floating-point number?
Express your answer in decimal. You must show work to
receive credit.
answer: $-33.875$
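The work, with a check using Python's `struct` module:

```python
import struct

# 0xC2078000 = 1 10000100 00001111000000000000000
# sign = 1, exponent = 10000100_2 = 132, so 132 - 127 = 5
# significand = 1.00001111_2, so the value is -100001.111_2 = -33.875
print(struct.unpack('>f', bytes.fromhex('C2078000'))[0])  # -33.875
```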
-
What is a page fault?
See margin note on page 493.
-
Is the cycle time of the pipelined MIPS more or less than the
cycle time of the single cycle
MIPS? Explain your answer in one sentence.
The cycle time of the pipelined MIPS is much less than that of the
single-cycle MIPS, because the single-cycle machine's clock period
must be long enough to cover the critical path
through the entire single-cycle datapath.
-
Name four problems virtual memory solves.
answer (any four of these):
1) code sharing between programs;
2) fragmentation;
3) the programmer/assembler/compiler use logical addresses
rather than physical addresses, giving each program the illusion that
it has the entire address space to itself;
4) a program can have more logical memory than there is physical memory;
5) memory is saved because unused portions of a program are swapped out or
not loaded until needed.
-
Are memory accesses in a system with virtual memory
faster or slower than in a system
without virtual memory? Briefly explain your answer.
Describe one technique used to address
performance problems in virtual memory.
Slower, because every memory access requires
an additional memory access to the page table. The TLB
caches recently used page table entries,
thereby reducing memory accesses to the page table.
-
Consider the single cycle MIPS datapath on page
322 of the book. What instruction(s) would not work
if the
MemToReg
control line between
the main control unit
and the Mux was stuck-at-0?
The lw
instruction would be broken because
we could never get data out of memory.
-
Suppose a program has a miss rate of 6% for data accesses and
4% for instruction accesses; a
miss penalty of 40 cycles; the frequency of lw/sw instructions is
35%; a CPI of 2 if the cache's hit
ratio was 100%. What is the speedup for this program with a system
that has a cache over a
system that has no caches at all?
- Speedup = $CPI_{nocache}/CPI_{withcache}$
-
$CPI_{nocache} = 2 + MissCPI_{nocache}$, where the miss rate is 100%
- $MissCPI_{nocache} = (DataMissCycles_{nocache} + InstMissCycles_{nocache})/IC$
- $DataMissCycles_{nocache} = (IC)(0.35)(40)(1.00)$
- $InstMissCycles_{nocache} = (IC)(40)(1.00)$
- $MissCPI_{nocache} = 54$
- $CPI_{nocache} = 2 + 54 = 56$
- $CPI_{withcache} = 2 + MissCPI_{withcache}$, where the miss rates are
from the question description
- $MissCPI_{withcache} = (DataMissCycles_{withcache} + InstMissCycles_{withcache})/IC$
- $DataMissCycles_{withcache} = (IC)(0.35)(40)(0.06)$
- $InstMissCycles_{withcache} = (IC)(40)(0.04)$
- $MissCPI_{withcache} = 2.44$
- $CPI_{withcache} = 2 + 2.44 = 4.44$
- Speedup = $56/4.44 = 12.61$
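As a sanity check, the arithmetic can be reproduced in a few lines of Python:

```python
cpi_base = 2.0     # CPI with a perfect (100% hit rate) cache
penalty = 40       # miss penalty in cycles
lw_sw_freq = 0.35  # fraction of instructions that are lw/sw

# no cache: every instruction fetch and every data access "misses"
cpi_nocache = cpi_base + 1.00 * penalty + lw_sw_freq * penalty * 1.00
# with cache: 4% instruction miss rate, 6% data miss rate
cpi_cache = cpi_base + 0.04 * penalty + lw_sw_freq * penalty * 0.06

print(round(cpi_nocache, 2))              # 56.0
print(round(cpi_cache, 2))                # 4.44
print(round(cpi_nocache / cpi_cache, 2))  # 12.61
```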
-
Consider the following MIPS code fragment
lw  $t3, 0($t0)      # instruction 1
add $t3, $t3, $t3    # instruction 2
sw  $t3, 0($t0)      # instruction 3
-
List all of the dependencies above by describing where
the dependencies occur, on which
registers, and the kind of dependency.
answer: 2 depends on 1, 3 depends on 1, and
3 depends on 2; all are read-after-write (RAW) dependencies on register $t3.
-
Which dependencies result in a stall in the five stage
MIPS pipeline that has EXE forwarding?
answer: Only the dependency between 1 and 2 requires
a stall, because 1 is a load whose result is not available until the MEM
stage (a load-use hazard). The others are taken care of by forwarding.
-
Consider a direct mapped cache with 2K blocks,
four words per block, and 32-bit addresses. Also
assume that addresses must be on a word boundary and
that a word is four bytes.
-
Show how the bits in an address determine which word
in the cache the address maps to.
- bits 0-1 in the address are always zero
- bits 2-3 in the address specify the offset
in the cache block
- bits 4-14 specify the block in the cache
- bits 15-31 specify the tag
-
Which block and word in the block does the address
0x0F0FB004 map to?
- Writing out the 32 bits we have
00001111000011111 01100000000 01 00
- So the word in the block is 1
- And the block is 0b01100000000 = 768
- What is the size of the cache in the total number of bits?
- Each block in the cache contains a valid bit (1 bit),
a tag (17 bits),
and four words (each word is 32 bits) for a total
of 1+17+32(4)=146 bits per block
- There are $2^{11}$ blocks so there are
$2^{11}(146) = 299{,}008$ bits in the cache
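The field extraction and size calculation can be sketched in Python; the shifts and masks below follow the bit layout worked out above:

```python
addr = 0x0F0FB004

byte_offset = addr & 0b11          # bits 0-1, always zero for word addresses
word_offset = (addr >> 2) & 0b11   # bits 2-3: word within the block
block_index = (addr >> 4) & 0x7FF  # bits 4-14: which of the 2^11 blocks
tag = addr >> 15                   # bits 15-31: the 17-bit tag

print(word_offset, block_index, hex(tag))

# total storage: (valid bit + tag + four 32-bit words) per block
bits_per_block = 1 + 17 + 4 * 32   # 146
print(2**11 * bits_per_block)      # 299008
```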
-
Explain the purpose of a TLB and why all modern processors have them.
A TLB is a fully-associative cache specifically designed
to cache page table entries. Without one, every memory reference
would require an extra memory access to read the page table, so all
modern processors include a TLB to keep address translation fast.
-
Do problems 6.3.1 and 6.3.3.
- 6.3.1a:
seek + rotate + transfer = 10ms +
(60/7500/2)1000 + (1024/90e6)1000 = 14.01ms
- 6.3.1b:
seek + rotate + transfer = 7ms +
(60/10000/2)1000 + (1024/40e6)1000 = 10.025ms
-
6.3.3 parts a and b: The dominant factor is
the seek time, although
RPM makes a significant contribution as well.
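The access-time formula used in 6.3.1 (seek + average rotational latency + transfer) can be checked in Python:

```python
def disk_access_ms(seek_ms, rpm, nbytes, rate_bytes_per_s):
    rotational_ms = (60 / rpm / 2) * 1000     # average wait: half a rotation
    transfer_ms = (nbytes / rate_bytes_per_s) * 1000
    return seek_ms + rotational_ms + transfer_ms

print(round(disk_access_ms(10, 7500, 1024, 90e6), 2))   # 14.01 (6.3.1a)
print(round(disk_access_ms(7, 10000, 1024, 40e6), 2))   # 10.03 (6.3.1b)
```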
-
Do problems 6.14.1, 6.15.1, 6.15.2, and 6.15.4.
-
6.14.1 a and b: Striping allows parallel I/O to
occur on multiple disks. However, these applications
are unlikely to require a lot of disk I/O, so
striping will probably not improve their performance by much.
For an online video service, network bandwidth
(a fast internet connection) is probably
more important. Striping would improve the performance
of a file server (such as csfile2) where many users
are accessing the disks simultaneously.
-
For questions 6.15.1 and 6.15.2 you need to look
at figure 6.13 on page 603. In each line below, the expression to the
left of the equal sign is the Python expression for the bitwise
exclusive or (the ^ operator), and the value to the right is its result.
-
6.15.1a:
hex(0x7453 ^ 0xaabb ^ 0x0098 ^ 0x549c) = 0x8aec
-
6.15.1b:
hex(0xf245 ^ 0xdd25 ^ 0xaabb ^ 0xfefe) = 0x7b25
-
6.15.2a:
hex(0x7453 ^ 0xAB9C ^ 0x2FFF) = 0xf030
-
6.15.2b:
hex(0xf245 ^ 0x7453 ^ 0xfeff) = 0x78e9
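The reason XOR parity works for RAID is that any one lost block can be rebuilt by XOR-ing the surviving blocks with the parity block; a quick sketch using the values from 6.15.1a:

```python
from functools import reduce
from operator import xor

data = [0x7453, 0xaabb, 0x0098, 0x549c]   # the four blocks from 6.15.1a
parity = reduce(xor, data)                # 0x8aec, matching the answer above

# pretend the second disk fails: XOR the survivors with the parity block
survivors = data[:1] + data[2:]
rebuilt = reduce(xor, survivors, parity)
print(hex(rebuilt))  # 0xaabb -- the lost block
```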
-
6.15.4: RAID 5 distributes parity blocks throughout the
disk array rather than on a single disk. This eliminates
the parity disk as a bottleneck during disk access.
For applications with high numbers of concurrent reads
and writes, RAID 5 will be more efficient.
See the margin note and text on page 588.