-
True/False. The single cycle datapath has a CPI of one, so it has better
performance than the MIPS pipelined datapath which has a CPI of five.
Explain your answer.
False. While it is true that a five-stage pipeline takes
five cycles to execute a single instruction, it is executing five instructions at
the same time, so the average CPI approaches one. Also, the single-cycle
datapath requires a very long cycle time, whereas the cycle time of the
pipelined version is much smaller.
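The tradeoff can be made concrete with a back-of-the-envelope calculation; the 800 ps and 200 ps cycle times below are assumed for illustration only, not taken from the question:

```python
# Rough comparison (assumed numbers): a single-cycle datapath's clock must
# cover its whole critical path, while a 5-stage pipeline's clock only
# needs to cover the slowest stage.
N = 1_000_000                  # instructions executed

single_cycle_time_ps = 800     # assumed critical-path delay
pipeline_cycle_time_ps = 200   # assumed slowest-stage delay

t_single = N * single_cycle_time_ps          # CPI = 1, but a long cycle
t_pipe = (N + 4) * pipeline_cycle_time_ps    # 4 extra cycles to fill the pipeline

print(t_single / t_pipe)       # roughly 4x faster despite the "CPI of five"
```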
-
Assuming current technology, give an approximate and reasonable
estimate of the access time for cache? main memory? disk?
First-level cache should ideally run at the same speed as the
processor clock, i.e., one cycle.
In reality it is probably more like 2-3 clock cycles, which might
be 2-4 ns. Main memory is tens of nanoseconds, and disk is tens
of milliseconds.
-
What is a write-through cache? What is a write-back cache?
We didn't go over this in class; see the discussion on pages 466-468 in the book
and the margin notes on page 467.
-
Explain the concept of temporal locality.
How does a cache exploit it? Give an example of a program
that would have good data temporal locality.
Temporal locality is the property that a memory address
accessed at time $t$ is likely to be accessed again soon after.
A cache exploits it by holding the values of recently
used addresses. An example program that exhibits
data temporal locality would be
a program where a variable stored in memory is reused in
a loop:
y = 0
while cond:
    y = y + x**2   # <-- y and x are used over and over
-
Explain the concept of spatial locality.
How does a cache exploit it?
Give an example of a program that would have good data
spatial locality.
Spatial locality is the property that memory addresses
near one recently accessed are likely to be accessed in the near future.
When a processor accesses a value at address $x$, it pre-fetches values at
nearby addresses such as $x+4$ and $x+8$. An example program
that exhibits data spatial locality would be one that runs
through an array. For example,
vec = [8, 5, 1, 6, 8]
sum = 0
for x in vec:
    sum = sum + x
-
What is a comparator and where are they used in cache implementations?
A comparator is a combinational circuit that compares two
n-bit values, outputting 1 if they
are the same and 0 otherwise.
They are used in a cache to compare tags.
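As a behavioral sketch (not the gate-level circuit), an n-bit comparator can be modeled in Python; the tag values below are made up for illustration:

```python
def comparator(a, b, n):
    """Model an n-bit equality comparator: 1 if the low n bits match, else 0."""
    mask = (1 << n) - 1
    return int((a & mask) == (b & mask))

# e.g. checking the tag field of an address against a stored 17-bit tag
print(comparator(0x1E1F, 0x1E1F, 17))  # 1 -> tags match (hit)
print(comparator(0x1E1F, 0x1E20, 17))  # 0 -> tags differ (miss)
```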
-
Why is a fully associative cache more expensive to implement than a direct
mapped cache?
A fully associative cache requires a comparator on every cache block.
There also needs to be extra hardware to implement
the LRU replacement strategy.
-
Explain the difference between direct-mapped,
set-associative, and fully
associative cache designs.
In a direct-mapped cache a memory address maps to only one cache block.
In a fully-associative cache a memory address maps to any cache block.
In a set-associative cache a memory address maps to a particular
set of cache blocks.
-
If a cache has one set it is the same as a
fully associative
cache.
-
If a cache with $n$ blocks has $n$ sets it is
the same as a direct mapped cache.
-
Express the floating-point number -7.3125 in single precision IEEE floating-point
format. Write your answer in hexadecimal.
answer: c0ea0000
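The conversion steps, and a quick check of the answer using Python's `struct` module:

```python
import struct

# -7.3125 = -111.0101_2 = -1.110101_2 x 2^2
# sign = 1, biased exponent = 2 + 127 = 129 = 10000001_2
# fraction field = 1101 0100 0000 0000 0000 000 (23 bits)
print(struct.pack('>f', -7.3125).hex())  # c0ea0000
```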
-
If a multiplexor has $n$ inputs how many selector wires does it require?
answer: $log_2(n)$
-
If n bits of an address are used to specify a
block in a direct
mapped cache how many blocks does the cache have?
answer: $2^n$
-
If a set associative cache has n sets how many bits are
needed in
the address to specify the set the address maps to?
answer: $log_2(n)$
-
Why does the MIPS need exactly five bits to specify the
destination register
in an R-type instruction?
There are 32 registers and we need 5 bits to distinguish
between 32 different objects: $log_2(32) = 5$ or $2^5 = 32$.
-
If a decoder has n inputs, how many outputs does it have?
answer: $2^n$
-
What is a page table?
A page table maps logical pages to physical pages.
-
Assume we have 32 bit addresses and a page size of 8K bytes.
How many entries can the page table contain?
answer: $2^{32}/2^{13} = 2^{19} = 512K$ page table entries at most.
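The arithmetic above, sketched in Python:

```python
address_bits = 32
page_offset_bits = 13            # 8 KB = 2^13 byte pages
entries = 2 ** (address_bits - page_offset_bits)
print(entries)  # 524288 = 512K page table entries
```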
-
What floating-point number does 0xC2078000
represent when interpreted as a single
precision IEEE floating-point number?
Express your answer in decimal. You must show work to
receive credit.
answer: $-33.875$
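The work, with a check using Python's `struct` module:

```python
import struct

# 0xC2078000 = 1 10000100 00001111000000000000000
# sign = 1, exponent = 10000100_2 = 132, so 132 - 127 = 5
# significand = 1.00001111_2, so the value is -100001.111_2 = -33.875
print(struct.unpack('>f', bytes.fromhex('C2078000'))[0])  # -33.875
```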
-
What is a page fault?
See margin note on page 493.
-
Is the cycle time of the pipelined MIPS more or less than the
cycle time of the single cycle
MIPS? Explain your answer in one sentence.
The cycle time of the pipelined MIPS is much less than that of the
single-cycle MIPS, because the single-cycle machine's clock period
must be long enough to cover the critical path
through the entire single-cycle datapath.
-
Name four problems virtual memory solves.
answer (any four of these):
1) code sharing between programs;
2) fragmentation;
3) the programmer/assembler/compiler use logical addresses
rather than physical addresses, giving each program the illusion that
it has the entire address space to itself;
4) a program can have more logical memory than there is physical memory;
5) memory is saved because unused portions of a program are swapped out or
not loaded until needed.
-
Are memory accesses in a system with virtual memory
faster or slower than in a system
without virtual memory? Briefly explain your answer.
Describe one technique used to address
performance problems in virtual memory.
Slower, because every memory access requires
an additional memory access to the page table. The TLB
caches recently used page table entries,
thereby reducing memory accesses to the page table.
-
Consider the single cycle MIPS datapath on page
322 of the book. What instruction(s) would not work
if the
MemToReg
control line between
the main control unit
and the Mux was stuck-at-0?
The lw
instruction would be broken because
we could never get data out of memory.
-
Suppose a program has a miss rate of 6% for data accesses and
4% for instruction accesses; a
miss penalty of 40 cycles; the frequency of lw/sw instructions is
35%; a CPI of 2 if the cache's hit
ratio was 100%. What is the speedup for this program with a system
that has a cache over a
system that has no caches at all?
- Speedup = $CPI_{nocache}/CPI_{withcache}$
-
$CPI_{nocache} = 2 + MissCPI_{nocache}$, where the miss rate is 100%
- $MissCPI_{nocache} = (DataMissCycles_{nocache} + InstMissCycles_{nocache})/IC$
- $DataMissCycles_{nocache} = (IC)(0.35)(40)(1.00)$
- $InstMissCycles_{nocache} = (IC)(40)(1.00)$
- $MissCPI_{nocache} = 54$
- $CPI_{nocache} = 2 + 54 = 56$
- $CPI_{withcache} = 2 + MissCPI_{withcache}$, where the miss rates are
from the question description
- $MissCPI_{withcache} = (DataMissCycles_{withcache} + InstMissCycles_{withcache})/IC$
- $DataMissCycles_{withcache} = (IC)(0.35)(40)(0.06)$
- $InstMissCycles_{withcache} = (IC)(40)(0.04)$
- $MissCPI_{withcache} = 2.44$
- $CPI_{withcache} = 2 + 2.44 = 4.44$
- Speedup = $56/4.44 = 12.61$
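As a sanity check, the arithmetic can be reproduced in a few lines of Python:

```python
cpi_base = 2.0     # CPI with a perfect (100% hit rate) cache
penalty = 40       # miss penalty in cycles
lw_sw_freq = 0.35  # fraction of instructions that are lw/sw

# no cache: every instruction fetch and every data access "misses"
cpi_nocache = cpi_base + 1.00 * penalty + lw_sw_freq * penalty * 1.00
# with cache: 4% instruction miss rate, 6% data miss rate
cpi_cache = cpi_base + 0.04 * penalty + lw_sw_freq * penalty * 0.06

print(round(cpi_nocache, 2))              # 56.0
print(round(cpi_cache, 2))                # 4.44
print(round(cpi_nocache / cpi_cache, 2))  # 12.61
```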
-
Consider the following MIPS code fragment
lw  $t3, 0($t0)      # instruction 1
add $t3, $t3, $t3    # instruction 2
sw  $t3, 0($t0)      # instruction 3
-
List all of the dependencies above by describing where
the dependencies occur, on which
registers, and the kind of dependency.
answer: 2 depends on 1, 3 depends on 1, and
3 depends on 2; all are read-after-write (RAW) dependencies on register $t3.
-
Which dependencies result in a stall in the five stage
MIPS pipeline that has EXE forwarding?
answer: Only the dependency between 1 and 2 requires
a stall, because 1 is a load whose result is not available until the MEM
stage (a load-use hazard). The others are taken care of by forwarding.
-
Consider a direct mapped cache with 2K blocks,
four words per block, and 32-bit addresses. Also
assume that addresses must be on a word boundary and
that a word is four bytes.
-
Show how the bits in an address determine which word
in the cache the address maps to.
- bits 0-1 in the address are always zero
- bits 2-3 in the address specify the offset
in the cache block
- bits 4-14 specify the block in the cache
- bits 15-31 specify the tag
-
Which block and word in the block does the address
0x0F0FB004 map to?
- Writing out the 32 bits we have
00001111000011111 01100000000 01 00
- So the word in the block is 1
- And the block is 0b01100000000 = 768
- What is the size of the cache in the total number of bits?
- Each block in the cache contains a valid bit (1 bit),
a tag (17 bits),
and four words (each word is 32 bits) for a total
of 1+17+32(4)=146 bits per block
- There are $2^{11}$ blocks so there are
$2^{11}(146) = 299{,}008$ bits in the cache
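The field extraction and size calculation can be sketched in Python; the shifts and masks below follow the bit layout worked out above:

```python
addr = 0x0F0FB004

byte_offset = addr & 0b11          # bits 0-1, always zero for word addresses
word_offset = (addr >> 2) & 0b11   # bits 2-3: word within the block
block_index = (addr >> 4) & 0x7FF  # bits 4-14: which of the 2^11 blocks
tag = addr >> 15                   # bits 15-31: the 17-bit tag

print(word_offset, block_index, hex(tag))

# total storage: (valid bit + tag + four 32-bit words) per block
bits_per_block = 1 + 17 + 4 * 32   # 146
print(2**11 * bits_per_block)      # 299008
```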
-
Explain the purpose of a TLB and why all modern processors have them.
A TLB is a fully-associative cache specifically designed
to cache page table entries. Without one, every memory reference
would require an extra memory access to read the page table, so all
modern processors include a TLB to keep address translation fast.
-
Do problems 6.3.1 and 6.3.3.
- 6.3.1a:
seek + rotate + transfer = 10ms +
(60/7500/2)1000 + (1024/90e6)1000 = 14.01ms
- 6.3.1b:
seek + rotate + transfer = 7ms +
(60/10000/2)1000 + (1024/40e6)1000 = 10.025ms
-
6.3.3 parts a and b: The dominant factor is
the seek time, although
RPM makes a significant contribution as well.
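The access-time formula used in 6.3.1 (seek + average rotational latency + transfer) can be checked in Python:

```python
def disk_access_ms(seek_ms, rpm, nbytes, rate_bytes_per_s):
    rotational_ms = (60 / rpm / 2) * 1000     # average wait: half a rotation
    transfer_ms = (nbytes / rate_bytes_per_s) * 1000
    return seek_ms + rotational_ms + transfer_ms

print(round(disk_access_ms(10, 7500, 1024, 90e6), 2))   # 14.01 (6.3.1a)
print(round(disk_access_ms(7, 10000, 1024, 40e6), 2))   # 10.03 (6.3.1b)
```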
-
Do problems 6.14.1, 6.15.1, 6.15.2, and 6.15.4.
-
6.14.1 a and b: Striping allows parallel I/O to
occur on multiple disks. However, these applications
are unlikely to require a lot of disk I/O, so
striping will probably not improve their performance by much.
For an online video service, network bandwidth
(a fast internet connection) is probably
more important. Striping would improve the performance
of a file server (such as csfile2) where many users
are accessing the disks simultaneously.
-
For questions 6.15.1 and 6.15.2 you need to look
at figure 6.13 on page 603. In each line below, the expression to the
left of the equal sign is the Python expression for the bitwise
exclusive or (the ^ operator), and the value to the right is its result.
-
6.15.1a:
hex(0x7453 ^ 0xaabb ^ 0x0098 ^ 0x549c) = 0x8aec
-
6.15.1b:
hex(0xf245 ^ 0xdd25 ^ 0xaabb ^ 0xfefe) = 0x7b25
-
6.15.2a:
hex(0x7453 ^ 0xAB9C ^ 0x2FFF) = 0xf030
-
6.15.2b:
hex(0xf245 ^ 0x7453 ^ 0xfeff) = 0x78e9
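The reason XOR parity works for RAID is that any one lost block can be rebuilt by XOR-ing the surviving blocks with the parity block; a quick sketch using the values from 6.15.1a:

```python
from functools import reduce
from operator import xor

data = [0x7453, 0xaabb, 0x0098, 0x549c]   # the four blocks from 6.15.1a
parity = reduce(xor, data)                # 0x8aec, matching the answer above

# pretend the second disk fails: XOR the survivors with the parity block
survivors = data[:1] + data[2:]
rebuilt = reduce(xor, survivors, parity)
print(hex(rebuilt))  # 0xaabb -- the lost block
```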
-
6.15.4: RAID 5 distributes parity blocks throughout the
disk array rather than on a single disk. This eliminates
the parity disk as a bottleneck during disk access.
For applications with high numbers of concurrent reads
and writes, RAID 5 will be more efficient.
See the margin note and text on page 588.