1. Binary Abstraction
    • the 1’s and 0’s
  2. Combinational circuits
    • operations on the binary data
  3. Sequential circuits
    • storage of data
  4. Instruction set architecture (ISA)
    • basically the interface between software and hardware
    • a rudimentary programming language that hardware chips can execute
  5. Processors
    • basically implementations of the ISA, using circuits
  6. High Performance Processors
    • efficient, faster processors
    • pipelining and caching
  • Reflections and Practice problems (20% and 40%)
    • reflections are usually due Mondays, can include what you have learnt and what you don’t understand
    • practice problems are for self-assessment. Grades will not be recorded, but completeness will be.
  • Quizzes (40%)
    • each quiz covers the previous week’s practice sheet
    • 6 quizzes, one for each module
    • live during class time
  • Office Hours
    • needs to be scheduled


Week 1 - Binary Integers

Binary Number System

Our familiar decimal digits look like:



  • every time we move up a digit, we multiply by 10

Binary system looks like:



  • every time we move up a digit, we multiply by 2
  • a binary digit, $0$ or $1$, is also called a bit
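Since each step up in digits multiplies the place value by 2, a bit string can be decoded by accumulating place values; a minimal Python sketch:

```python
# Interpret a bit string by summing positional weights:
# each step up multiplies the place value by 2.
def bits_to_int(bits: str) -> int:
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)  # shift one place up, add the new bit
    return value

print(bits_to_int("1011"))  # 8 + 0 + 2 + 1 = 11
```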


  • Question:

    Why do computers use binary?

  • Answer:

    This relates to how data are physically transferred. Since computers run on electricity, we detect voltage levels and convert the analogue signal into a digital one (continuous to discrete) to send and receive data.



  • Most significant bit and least significant bit:


  • a byte is 8 bits
    • e.g.
      • $10101010$
  • a nibble is 4 bits, i.e. half a byte (e.g. $1111$)
  • a word is the natural data unit size for a machine

    • e.g. 64 bits

Adding Binary Numbers

For example:



  • notice the carry happens at $2$, instead of at $10$

  • in this case, we also get an overflow, since we expected a 6-bit result
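The fixed-width addition above (carry at 2, overflow past the top bit) can be sketched in Python; the bit-string helper below is illustrative, not part of the notes:

```python
# Add two fixed-width binary numbers bit by bit, tracking the carry;
# a carry out of the top bit signals overflow for unsigned operands.
def add_bits(a: str, b: str):
    width = len(a)
    carry = 0
    result = []
    for i in range(width - 1, -1, -1):       # from LSB to MSB
        total = int(a[i]) + int(b[i]) + carry
        result.append(str(total % 2))        # the carry happens at 2, not 10
        carry = total // 2
    return "".join(reversed(result)), bool(carry)

print(add_bits("1101", "0110"))  # ('0011', True): 13 + 6 = 19 overflows 4 bits
```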

Hexadecimal and Octal

  • Octal basically means base 8

    • this means it can compress binary numbers:


  • Hexadecimal means base 16:

    • again, also compresses the data, and is preferable since it is easier to read
    • one hex digit is exactly half a byte (which has 8 bits)


Hexadecimal is usually preferred since it is easier to parse for humans



  • 0x is a prefix for hexadecimal
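The base prefixes can be tried out directly, since Python uses the same notation:

```python
# Python understands the 0x / 0o / 0b prefixes directly,
# so the three bases can be checked against each other.
n = 0b11111111         # one byte, 8 bits
print(n)               # 255
print(hex(n), oct(n))  # 0xff 0o377  (2 hex digits vs 3 octal digits)
assert 0xFF == 0o377 == 0b11111111 == 255

# one hex digit packs exactly 4 bits (a nibble)
print(f"{0b1010:x}")   # a
```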

Negative Numbers and 2’s Complement

  • 2’s complement basically means the MSB has a negative contribution

    • for example:


Intuitively, it means shifting half of your numbers to the left:


  • this means that the sequence of counting changes to:
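The shifted counting order can be checked with a short sketch (assuming a 4-bit width, so the MSB carries weight $-8$):

```python
# Interpret a 4-bit pattern as 2's complement: the MSB carries weight -8
# instead of +8, which shifts the upper half of the range to the negatives.
def to_signed(bits: str) -> int:
    weight = -(2 ** (len(bits) - 1))   # MSB contributes negatively
    value = int(bits[0]) * weight
    return value + int(bits[1:], 2)    # the remaining bits stay positive

print([to_signed(f"{i:04b}") for i in range(16)])
# counting order: 0..7, then -8..-1
```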


Negation with 2’s Complement

This is the process where you

  • flip the bits and add one

you will get the negative version of your number.

For example:


This works because (suppose we have a 5-bit binary number):

  • if you add a number and its bit-flipped version, every position becomes $1$, so you get: $11111 = (-16+8+4+2+1) = -1$

  • hence flipping the bits means computing $(-16+8+4+2+1)-whatever\,you\,had = -1 - whatever\,you\,had$

    • therefore, you see where the additional $-1$ comes from; hence to get the exact negative, you need to add the $1$ back
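The flip-and-add-one recipe can be checked with integer bit operations; the width and mask here are illustrative assumptions:

```python
# Negate a 2's-complement value by flipping every bit and adding one.
def negate(x: int, width: int = 8) -> int:
    mask = (1 << width) - 1      # e.g. 0b11111111 for width 8
    flipped = x ^ mask           # flip the bits...
    return (flipped + 1) & mask  # ...and add one (mod 2^width)

print(negate(5))                 # 251, which is -5 in 8-bit 2's complement
assert (negate(5) + 5) & 0xFF == 0   # x + (-x) wraps around to 0
```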


  • If you use 1’s complement, then you just flip each bit to get the negative:


​ where you take the negative sign of the non-one digit

Addition/Subtraction with 2’s Complement

Subtraction basically becomes adding a negative number

For example:



  • though you have 5 bits coming out, it is “not” an overflow, since if we only parse the lower 4 bits, we get the correct answer of 2

This works because (assuming you don’t have an overflow):

  • once the addition reaches the MSB, you either clear the $1$, removing the negative sign, or you don’t clear it, meaning that the negative sign persists.

Representing Other Information

Other number formats are represented:



  • fixed point means splitting the number of bits into two parts: one represents the integer part, the other the fractional part
    • the cutoff between the integer and fractional parts is fixed
  • floating point means using a form of scientific notation, with a sign, an exponent, and a mantissa, so the cutoff can move

Representing Characters and Strings

A common standard is the ASCII table:


For example, converting to Hexadecimal:


How video streaming works:



  • the bitrate is of course capped by internet connections
  • bitrate is intuitively proportional to:
    • FPS
    • #pixel per frame
    • #bits per pixel
      • relates to the quality of a pixel/frame
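The proportionality above is just multiplication; this sketch uses made-up parameters (uncompressed 1080p, 24-bit color), so real streaming bitrates are far lower after compression:

```python
# Rough bitrate estimate: bits per second scales with frames per second,
# pixels per frame, and bits per pixel. (Illustrative numbers only.)
def bitrate_bps(fps: int, width: int, height: int, bits_per_pixel: int) -> int:
    return fps * width * height * bits_per_pixel

# Uncompressed 1080p at 30 FPS with 24-bit color:
print(bitrate_bps(30, 1920, 1080, 24) / 1e9, "Gbit/s")  # ~1.49 Gbit/s
```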


Overflow is when the magnitude of a number exceeds the representable range.

  • this is an inherent, fundamental problem, since we are representing unbounded numbers with a bounded number of bits

For example, adding $9+9$ with 4-bit binary numbers gets:



  • we get an overflow since we expected a 4-bit result, and parsing only these 4 bits gives the wrong answer

Boolean Logic

Every operation shown here has a physical counterpart in hardware.


Most of the axioms can be easily understood using Venn Diagrams as an image.

  • $XY=YX$, and $XX = X$

    • basically $\cap$


  • $X+Y=Y+X$, and $X+X=X$

    • basically $\cup$


  • $Z(X+Y)=ZX+ZY$
    • distributing union and intersections


In summary, you get:



  • $+$ means or, which is equivalent to union
  • $\cdot$ means and, which is equivalent to intersection
  • $\bar{}$ means not, which is equivalent to complement (not in the set)


Simplifying Boolean Expressions

For example,

Consider the following expression:


Now, you can manipulate:

\[X+(\bar{X}\cdot Y) = (X+\bar{X})\cdot(X+Y) = X+Y\]


  • it means you are either born in New York, or have lived here ten years.

For example:

  • Question:

    Simplify $X+Y(X+Z)+XZ$

  • Solution:

    \[X+YX+YZ+XZ = X(1+Y)+YZ+XZ= X+YZ+XZ\]


    • $\cdot$ is omitted

    • obviously $X+YX=X$, which you can see by thinking of sets (absorption)

    If we continue:

    \[X+YZ+XZ = X(1+Z)+YZ = X+YZ\]


    • the same procedure as above is done

Another example:

For example:

  • Question:




    • again $\cdot$ is omitted
  • Solution

    \[XYZ+X(\bar{Y}+\bar{Z}) = X(YZ+\bar{Y}+\bar{Z})=X(YZ+\bar{Y}+(\bar{Z}+\bar{Z}Y))\]


    • $\bar{Z}=\bar{Z}\cdot 1=\bar{Z}(1+Y)=\bar{Z}+\bar{Z}Y$


    \[X(YZ+\bar{Y}+(\bar{Z}+\bar{Z}Y))=X(Y(Z+\bar{Z})+\bar{Y}+\bar{Z})\\ = X(Y+\bar{Y}+\bar{Z})\\ = X(1+\bar{Z})\\ = X\]


  • the precedence agrees with our normal mathematical intuition (e.g. $\cdot$ has precedence over $+$); you can just see it as a convention.
  • the use of this is also enormous, as the number of operations is proportional to the number of hardware/logic gates you need.

We saw some non-intuitive steps above, which hints at the need for a systematic way to check whether two given expressions are the same.

  • obviously, a brute force approach to test each input/output pair would work
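That brute-force check is a few lines of Python; here it verifies the $X+\bar{X}Y = X+Y$ identity from earlier (the lambdas are illustrative):

```python
# Brute-force equivalence check: two Boolean expressions are equal
# iff they agree on every input combination.
from itertools import product

def equivalent(f, g, n):
    return all(f(*bits) == g(*bits) for bits in product((0, 1), repeat=n))

f = lambda x, y: x | ((1 - x) & y)   # X + X'·Y
g = lambda x, y: x | y               # X + Y
print(equivalent(f, g, 2))           # True
```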

Quiz Question

  • Question

    Are the following expression the same?

  • Solution

    This is counter-intuitive, but they are the same!

    \[\begin{align*} \bar{A}+C &= \bar{A}(C+\bar{C})+C \\ &= \bar{A}C+\bar{A}\bar{C}+CA+C\bar{A} \\ &= \bar{A}C+\bar{A}\bar{C}+C\bar{A}+AC \\ &=\bar{A}(C+\bar{C}+C)+AC \\ &=\bar{A}(1)+AC \\ &=\bar{A}+AC \\ \end{align*}\]


In principle, every operation we do boils down to the three operations:

  • and / $\cdot$
    • image-20200910231625832
  • not / $\bar{}$
    • image-20200910231640389
  • or / $+$
    • image-20200910231611179


  • though the diagram above looks simple, involving only one device, that device has a non-trivial implementation using transistors etc. to switch the voltage.

At this point, you may wonder how the concept of set works in this context of hardware

Ask a Question:

  • What is the graphical analogy of intersection in the actual bit operations? Set operations ignore the content; they don’t care whether it is $1$ or $0$, but logic gates do. So to me it’s almost magical how axioms derived using sets work in bit operations.

Schematics Def:

  • a schematic is a gate-level diagram/abstraction of a circuit, showing how those gates are connected to perform a certain operation.

Other gates, such as XOR, can be built from those three gates.

Drawing a Schematic from an Expression

Consider the following two expressions:


We draw their schematics.

For the first expression



  • when wires cross, they are electrically connected only if the junction is drawn with a solid dot
  • the last or gate takes three inputs. In reality, some gates can accept multiple inputs (possible due to their associativity property)

For the second expression



  • for performance, the total number of gates obviously matters, as signals have to travel through them.
    • this is in fact known as gate size
  • the depth of gates matters as well, meaning the maximum number of gates a signal has to pass through (the longest path)
    • this is also known as gate delay/critical path
  • sometimes there is a trade-off between the two

Week 2- Canonical Forms

Canonical Forms are basically standard forms of writing an expression.

There are basically three canonical forms for an expression.

  • Truth Table
  • Logical Gates Schematics
  • Sum of Minterms/Product of Maxterms

Truth Table

Typically, you write truth tables in numerical order.

For example, if you have an expression $X+YZ$, the truth table can look like:



  • you basically fill each input column by halving: the first column is half $0$s then half $1$s, the next alternates twice as fast, and so on, counting up from $0$
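The counting order falls out of `itertools.product` directly; a small sketch for the $X+YZ$ table:

```python
# Generate the truth table for X + Y*Z in the usual numerical order.
from itertools import product

for x, y, z in product((0, 1), repeat=3):   # rows 000, 001, ..., 111
    out = x | (y & z)
    print(x, y, z, "->", out)
# output column: 0 for rows 000, 001, 010; 1 for rows 011 and 100-111
```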

For example,

  • Question:

    Use a truth table to prove that

    \[\overline{XY} \neq \bar{X}\bar{Y}\]
  • Solution:


Logic Gates

Obviously, in the end we need to implement the expression with logic gates:


which means:

  • sometimes you can figure out the expression given the truth table only

In fact, there are also other operators that are useful in reality, but not yet mentioned before:



  • BUF stands for buffer, which logically does nothing

  • used for strengthening electrical signals

  • we basically express those “not”s with a little circle/bubble

    • which is why BUF looks like NOT gate with bubble taken away

    • this means that you can also abbreviate expressions containing “not”s such as:


Lastly, we have those X gates

  • X stands for exclusive



  • XOR means it is true when exactly/only one input is true

  • XNOR would be the complement/inverse of XOR


  • The gates defined above, by definition, can have more than 2 inputs.


  • Due to properties of a transistor, the NAND gate is actually the physically smallest gate you can have.

Replacing Gates with NAND

You can implement a

  • NOT gate using a NAND gate:


  • AND gate built from only NAND gates:


  • OR gate with only NAND gates:



    • you basically used $\bar{X}+\bar{Y}=\overline{X\cdot Y}=NAND(X,Y)$, so an OR is a NAND with both inputs inverted

Again, sometimes people do this in reality because NAND gates are physically really small.
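The three conversions above can be checked exhaustively; this is a behavioral sketch, not a transistor-level model:

```python
# NAND is universal: NOT, AND, OR can all be built from it alone.
def nand(x, y):
    return 1 - (x & y)

def not_(x):    return nand(x, x)                # tie both inputs together
def and_(x, y): return not_(nand(x, y))          # NAND, then invert
def or_(x, y):  return nand(not_(x), not_(y))    # De Morgan: X+Y = (X'·Y')'

assert all(not_(x) == 1 - x for x in (0, 1))
assert all(and_(x, y) == (x & y) and or_(x, y) == (x | y)
           for x in (0, 1) for y in (0, 1))
```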

De Morgan’s Theorem


So simply put, distributing NOT operator changes:

  • AND to OR
  • OR to AND

To remember De Morgan’s Theorem



  • you first push the bubble through the gate
  • change the gate from AND to OR, or vice versa

This means that you can use this theorem to convert schematics directly into their equivalent form.

  • if there are no bubbles, you can insert pairs of them and then convert

For example:

  • Question:

    Convert the following schematic to only include NAND gates.


  • Solution:



    • you start by putting some NAND gates, and then try to convert the other non-trivial ones to NAND


Some terminologies:

  • Complement
    • basically the not operator
  • Literal
    • a variable or its complement, e.g. $A$ or $\bar{A}$
  • Implicant
    • a product term (basically and)
    • e.g.
      • $A\bar{B}C$
  • Minterm
    • a product (and) of each variable or its complement
      • in fact, each variable is complemented if the value assigned to it is 0, and uncomplemented if it is 1
    • e.g. for inputs $A,B,C,D$
      • $A\bar{B}CD$, and etc
  • Maxterm
    • a sum (or) of each variable or its complement
    • e.g. for inputs $A,B,C,D$
      • $A+\bar{B}+C+D$, and etc


  • Each row of a truth table has an associated 1-minterm, such that the minterm is 1/true only in that row/case.

For example:



  • the name of a minterm $m_i$ comes from the row number $i$ of your truth table

  • writing a row’s minterm is trivial: complement every variable whose entry is $0$

  • finding which minterms make up an expression could involve writing out a truth table and reading them off

Similarly, for maxterms:


  • Each row of a truth table has an associated 0-maxterm, such that the maxterm is 0/false only in that row/case

For example:


Sum of 1-Minterms

Since minterms are only true for one specific input combination, you can decompose any expression into a sum of minterms.

A sample usage is:


For example

  • Question:

    Find an equivalent expression of the below:


  • Solution:



    • $f(x,y,z)=m_1+m_3+m_4+m_5=\sum m(1,3,4,5)$

This means that you can pretty easily construct a function once you know the behavior of outputs and inputs.
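A 1-minterm is true on exactly one row, so OR-ing chosen minterms reduces to testing whether the input's row index is in the chosen set; a small sketch for the $\sum m(1,3,4,5)$ example:

```python
# Build a function directly from its truth table as a sum (OR) of 1-minterms.
from itertools import product

def sum_of_minterms(rows):
    def f(*bits):
        index = int("".join(map(str, bits)), 2)   # row number of this input
        return 1 if index in rows else 0          # 1 exactly on chosen rows
    return f

# f(x,y,z) = m1 + m3 + m4 + m5
f = sum_of_minterms({1, 3, 4, 5})
print([f(*bits) for bits in product((0, 1), repeat=3)])
# [0, 1, 0, 1, 1, 1, 0, 0]
```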

Product of 0-Maxterms

Since maxterms/0-maxterms evaluate to $0$ only in their own case and $1$ in all other cases:

  • any Boolean function can be expressed as a product (AND) of its 0-maxterms

For example:



  • the idea is the reverse of the sum of minterms: here we drive the whole product to $0$ once we hit a $0$.

Sum of Products and Product of Sums

In short, you have:



  • SoP means sum of products
  • PoS means product of sums
  • SoMin means sum of minterms
  • PoMax means product of maxterms

  • a star denotes expressions that are unique, namely the PoMax and SoMin are unique for each function
    • which is why they are canonical forms


  • $x+y$ is neither a SoMin nor a PoMax when we have three variables, because the expression only involves two of them.
    • this is because a min/maxterm needs each variable to appear exactly once

Converting between SoP and PoS

Basically, to convert a sum-of-products expression into a product of sums, you first add a negation, so that and and or can be switched:

For example, converting any SoP:


In general, you always need to add a negation so that and and or can be flipped, hence you can convert.



  • the caveat is of course that the intermediate result is the complement of what you want, so you must negate it back at the end.

Karnaugh Map (K-maps)

A truth table redrawn, such that adjacency becomes meaningful.

In the end, it helps you convert certain circuits/outputs and inputs to an expression.



  • the four results becomes a matrix
  • basically just redrawing a truth table

Simplification via K-map

A property of a K-map is:

  • When two cells share an edge and are both $1$, those two terms can be combined to form a single, simpler term.



  • this works simply because a variable has only 2 possible values: if a contiguous group of cells is $1$ across both values of a variable, that variable does not determine the output
  • always cover all entries using prime implicants; the essential prime implicants must be included, but they alone may not cover everything

Rules on a K-map

In the end, it helps you construct the function by using a Sum of products

  1. Add circles until all $1$s on the k-map are circled
    • for each circle, there is a corresponding expression
  2. Circle contiguous groups of $1$s (since the system is binary, sizes of circle must be power of $2$).
    • the bigger the circle, the simpler the term
  3. Get Prime Implicants
    • a Prime Implicant is a circle that cannot be drawn any bigger. Namely, it produces the simplest product term
  4. Get Essential Prime Implicants
    • an Essential Prime Implicant is a circle that uniquely covers a $1$.
      • basically, removing that circle causes some $1$ to be uncovered


  • The above works the same if you are doing a Product of Sums, where you just circle the 0s.

3-Variable Karnaugh Map

Theoretically, you need to go into 3 dimensions. However, this can be avoided by using a Gray ordering:

  • Gray ordering on the edges means ordering the bit patterns so that exactly one bit changes value at a time

This is because, if you actually imagine a 2x2x2 cube, every pair of neighboring cells differs in only one bit

This also means that, since we are technically in 2-D:

  • two terms are adjacent if they differ only in one variable (namely this map wraps around)

Because in a 2x2x2 cube each cell has 3 neighboring cells; on the 2-D map this is achieved via horizontal wrapping, so that edge cells also have 3 neighbors.

For example, your k-map looks like:



  • the top row has only one bit change at a time
    • instead of $01 \to 10$, which changed two bits

And the circling becomes:



  • purple (size 4) and blue (size 2) is the optimal grouping
    • $\bar{A}B+\bar{C}$
  • two green (size 2) and blue (size 2) form a correct grouping, but not an optimal one
    • $\bar{A}B+\bar{B}\bar{C}+B\bar{C}$


  • The number of variables in a term corresponds to $k$, where $2^{k}\times size = 2^{n}$

    • $size$ is the size of a circle, which has to be a power of 2
    • $n$ is the total number of variables
  • For example, the purple circle above has size 4, so the corresponding term only involves 1 literal.

K-Map Design Example: Full Adder

Basically we want to do addition for two bits, and then in the end use it to implement multi-bit addition.

First, we think about an actual addition:



  • we see that we always need a carry-in and a carry-out in addition to our inputs

STEP 1: Therefore, this would be the look of our design/interface:



  • we don’t know how to build the full adder part yet

STEP 2: But, we do know the inputs, hence we can build a truth table:


STEP 3: Use K-map to build and simplify the logic expression:



  • the sum is actually an XOR combination: $a\,XOR\,b\,XOR\,c_{in}$

  • \[(a\,XOR\,b)\,XOR\,c_{in}\]
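The two K-map results above (XOR sum, majority carry-out) can be written as a behavioral one-bit full adder:

```python
# One-bit full adder: sum is a three-way XOR, carry-out is the majority
# function (at least two of the three inputs are 1).
def full_adder(a, b, cin):
    s = a ^ b ^ cin                          # odd number of 1s among inputs
    cout = (a & b) | (a & cin) | (b & cin)   # majority of the three inputs
    return s, cout

assert full_adder(1, 1, 0) == (0, 1)   # 1 + 1 = 10
assert full_adder(1, 1, 1) == (1, 1)   # 1 + 1 + 1 = 11
```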


K-Map Design Example: Voting Circuit

Suppose we have four voters voting on 2 choices; we can build a circuit to decide whether there is a tie or a winner

  • if $T=1$, it means there is a tie, $T=0$ if otherwise
  • if $W=1$, it means choice $1$ wins, and $W=0$ means choice $0$ wins.
    • notice that if you have a tie, $W$ does not matter anymore

STEP 1: Skipped

STEP 2: the truth table design:



  • $X$ don’t cares come in naturally
  • if you have a don’t care, there is usually a “validator” output, which in this case is $T$

STEP 3: the K-map and the actual logic expression:


and the tie $T$ K-map is skipped for your own practice.
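The intended behavior can be written as a behavioral model (not the gate-level expression from the K-map); with four voters, a 2-2 split is a tie and three or more $1$s means choice 1 wins:

```python
# Four voters, two choices: T flags a tie, W names the winner.
# W is a don't-care when T = 1; this model just defaults it to 0.
def vote(v):                       # v is a tuple of four 0/1 votes
    ones = sum(v)
    t = 1 if ones == 2 else 0      # 2-2 split is a tie
    w = 1 if ones >= 3 else 0      # choice 1 wins with 3+ votes
    return t, w

assert vote((1, 1, 0, 0)) == (1, 0)   # tie
assert vote((1, 1, 1, 0)) == (0, 1)   # choice 1 wins
assert vote((1, 0, 0, 0)) == (0, 0)   # choice 0 wins
```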

Don’t Cares

Don’t cares appear in truth table outputs where the output value is unimportant or where the corresponding input combination can never occur.

  • basically, we do not care whether that output is 1 or 0; both are fine.

However, since such outputs (don’t cares) can be treated as either 0’s or 1’s at the designer’s discretion, we can use it to simplify our K-maps.

  • For example:


Week 3 - Delay Model

First realize that computation takes time, because:

  • electricity travels with a velocity
  • flipping from $0$ to $1$ also has time lag due to electrical properties such as capacitance.


Delay Time

Hence we need to use some sort of a delay model:

  • propagation delay $t_{pd}$
    • the maximum delay from any input wire to any output wire
  • contamination delay $t_{cd}$
    • the minimum delay from any input wire to any output wire

Obviously, in reality, for most computations all we care about is the propagation delay. But when dealing with computer memories, we may also need the contamination delay.

Critical (Long) Path & Short Path

  • critical path
    • is the longest delay from any input to any output
  • short path
    • is the shortest delay from any input to any output

For example:



  • critical path uses $t_{pd}$ and short path uses $t_{cd}$
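Both paths are just sums of per-gate delays; the gate delay numbers and the path list below are made-up values for illustration:

```python
# Sum gate delays along each input-to-output path: the critical path uses
# each gate's propagation delay t_pd, the short path its contamination
# delay t_cd. All numbers here are hypothetical (picoseconds).
T_PD = {"AND": 60, "OR": 55, "NOT": 30}
T_CD = {"AND": 35, "OR": 30, "NOT": 20}

paths = [["NOT", "AND", "OR"],    # gates along each input-to-output path
         ["AND", "OR"],
         ["OR"]]

critical = max(sum(T_PD[g] for g in p) for p in paths)  # longest with t_pd
short    = min(sum(T_CD[g] for g in p) for p in paths)  # shortest with t_cd
print(critical, short)  # 145 30
```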


A glitch is a transient unwanted pulse on a circuit’s output.



  • this means that we cannot read the value during the glitch time; somehow we need to make sure to avoid it
  • a glitch is triggered when one path switches before the other paths can compensate
    • see the example in the next section

Glitch Example

Consider having $A=0$, $B$ had a glitch, and $C=1$ for the following schematics:


and suppose we have a transient $B$, such that we would have a timing diagram:



  • the horizontal axis would be time, since it is a timing diagram
  • and we see that there are delays, which propagates throughout the diagram
  • the glitch happens since your shortest path switches to $0$ before any other path recovers

Consider the other case where $B$ rises from $0$ to $1$. In this case, no glitch happens.

Preventing Single Input Glitches

The idea is to somehow:

  • keep the behavior of the original circuit
    • have the same k-map
  • add an “input” that is independent of the glitching bit to smooth the signal

STEP 1: reverse engineer the necessary part of the K-map:



  • we realize that the glitch occurs when $B$ jumps between the existing circles
  • in order to not change the original behavior, we simply add a circle on top of the original design
    • notice this added circle is independent of $B$

STEP 2: add the additional circle to the schematic:


and we have fixed the glitch as the blue part will be glitch-free and keeps the original design intact.

Week 4 - Combinational Building Blocks

Ripple Carry Adder

The most basic algorithm for addition



  • the critical path is $O(n)$, where $n$ is the width of the $n$-bit adder.
    • basically from the rightmost adder to the leftmost adder/output
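The chaining is easy to see in code; a behavioral sketch with bit lists (LSB first is an assumption of this sketch):

```python
# Ripple-carry adder: chain n full adders, each waiting on the previous
# carry, so the critical path grows linearly with the width n.
def full_adder(a, b, cin):
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)

def ripple_add(a_bits, b_bits):         # lists of bits, LSB first
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):    # the carry ripples one stage at a time
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 0b0110 (6) + 0b0011 (3) = 0b1001 (9), LSB first:
print(ripple_add([0, 1, 1, 0], [1, 1, 0, 0]))  # ([1, 0, 0, 1], 0)
```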

Overflow Detection

Though it is simple for unsigned numbers, for signed additions we need to look at the MSBs (A, B):



  • if you have a positive number and a negative number, in a 2s-complement system you can never overflow
  • if you have both positive or both negative, only the red parts are giving wrong answers, hence overflow.

By inspection, we see that:


Therefore, our overflow detection for the ripple carry adder adds the $XOR$:


Adder/Subtractor Variant

Consider we would like to have a unit being able to do both add and subtract:

  • $sub=0$ means $A+B$

  • $sub=1$ means $A-B=A+(-B)$

    • this will be done by negation, namely flip all the bits and add one

    • to flip the bits:



      • we see that this is basically a $XOR$ gate.

Therefore, our implementation looks like:


However, notice that the order of flipping and adding matters:

  • $1011 \to 0100 \to 0101$ (flip then add; the correct way)

  • $1011 \to 1100 \to 0011$ (add then flip)
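The XOR-flip plus carry-in trick can be modeled at the integer level (a behavioral sketch, not gate-level; 4-bit width assumed):

```python
# Add/subtract unit: XOR each bit of B with the `sub` control bit, and feed
# `sub` in as the initial carry-in; this computes A + B, or A + ~B + 1 = A - B.
def add_sub(a: int, b: int, sub: int, width: int = 4) -> int:
    mask = (1 << width) - 1
    b_in = b ^ (mask * sub)         # sub=1 flips every bit of B (the XOR gates)
    return (a + b_in + sub) & mask  # sub also supplies the "+1" as carry-in

assert add_sub(0b1011, 0b0100, sub=0) == 0b1111   # 11 + 4 = 15
assert add_sub(0b1011, 0b0100, sub=1) == 0b0111   # 11 - 4 = 7
```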

Carry Lookahead Adder

This has a shorter critical path, but uses more gates.

Consider the carry outs:

\[\begin{align*} C_{i+1}&=A_iB_i+A_iC_i + B_iC_i \\ &= A_iB_i + C_i(A_i + B_i) \\ &= G_i + C_i P_i \end{align*}\]

where:

  • Generate $G$: sets carry out regardless of carry in
  • Propagate $P$: sets carry out if there is a carry in
  • basically pre-determines if there would be a carry out

This will have a delay time of $O(\log_2(n))$

Expanding the Carry Logic

<img src="\lectures\images\typora-user-images\image-20200929230600594.png" alt="image-20200929230600594" style="zoom: 80%;" />

where:

  • we have made the carries depend only on the $A$ and $B$ inputs, but not on the previous carry-ins.
  • this will generate a parallel tree of gates, which adds width rather than depth

Schematic of the Adder

![Carry Lookahead Adder in VHDL and Verilog with Full-Adders](https://www.nandland.com/vhdl/modules/images/carry-lookahead-adder-4-bit.png)

where:

  • the critical path is reduced from $O(n)\to O(\log_2(n))$

Decoders

Decoders are like adders, in that there are many implementations of decoders based on input size and needs.

The general functionality is:

  • $k$ bit input
  • $2^k$ bits output
  • one-hot outputs: only one output will be $1$, all others are $0$
    • you can imagine this as minterms, so that each unique input combination has a unique output
  • the input determines which output will be $1$

General Design:

<img src="\lectures\images\typora-user-images\image-20200929231627156.png" alt="image-20200929231627156" style="zoom:80%;" />

For example:

<img src="\lectures\images\typora-user-images\image-20200929231544315.png" alt="image-20200929231544315" style="zoom:80%;" />

where:

  • you turned a number into a spatial value

2:4 Decoder

How do we implement a $2:4$ Decoder?
First, a $1:2$ decoder would be:

![image-20200929232459689](\lectures\images\typora-user-images\image-20200929232459689.png)

Then, we compose a $2:4$ Decoder:

<img src="\lectures\images\typora-user-images\image-20200929232042799.png" alt="image-20200929232042799" style="zoom:80%;" />

3:8 Decoder

Now, you can implement this recursively based on the $2:4$ Decoder and the $1:2$ Decoder:

<img src="\lectures\images\typora-user-images\image-20200929232353258.png" alt="image-20200929232353258" style="zoom:80%;" />

where:

  • the concept of minterms applies nicely, so that each wire/output corresponds to a unique combination of the inputs
  • therefore, they can easily be built upon to produce a larger decoder

Logic Implementation Using Decoder

Consider the following truth table:

![image-20201002000443161](\lectures\images\typora-user-images\image-20201002000443161.png)

We can use a decoder to implement it:

![image-20201002000344972](\lectures\images\typora-user-images\image-20201002000344972.png)

where:

  • we are basically using the sum of minterms representation of the expression
  • and we know that sums of minterms are not physically ideal themselves
  • since decoders actually implement minterms for all outputs (i.e. 8 of them), there is redundancy, because we only need 4 in the end (in this case)

So this is usually not a resource-efficient approach.

Encoders

Reversing the decoder: $2^k$ bits of one-hot input, $k$ bits of output.

But what if you are not given one-hot inputs, but arbitrary bit patterns?
One way to “deal” with it is to use a priority encoder.

4:2 Priority Encoder

Basically, a priority encoder only deals with the $1$ at the MSB:

![image-20201002001306381](\lectures\images\typora-user-images\image-20201002001306381.png)

where:

  • outputs are $O_1$ and $O_0$
    • so that $O_1O_0$ tells you the position of the most significant $1$
  • a validator/valid bit is $V$, which tells us whether the output is meaningful
  • inputs are $I_3,I_2,I_1,I_0$

The corresponding expressions for the three outputs (including $V$) would be:

![image-20201002001700535](\lectures\images\typora-user-images\image-20201002001700535.png)

Multiplexer (Mux)

Selects which one of $N$ inputs/wires to let through.

  • this in fact takes $\log_2 n$ delay

For example:

![image-20201002002407432](\lectures\images\typora-user-images\image-20201002002407432.png)

where:

  • $S$ selects which input is let through

If we write out the truth table logic for this case, we have:

![image-20201002002532331](\lectures\images\typora-user-images\image-20201002002532331.png)

And you can figure out the schematic implementation from it. But there is a natural way to implement it with a decoder (see 4:1 Mux).

4:1 Mux

Basically, this means we need:

![image-20201002002710218](\lectures\images\typora-user-images\image-20201002002710218.png)

where:

  • two select bits $S$ are needed

A natural implementation would be using a decoder for the select bits:

  • so that the value of $S_1S_0$ determines which one-hot value to have:

![image-20201002002929171](\lectures\images\typora-user-images\image-20201002002929171.png)

where:

  • $S=S_1S_0$ is 2 bits.

Logical Implementation using Mux

Again, this is not a resource-efficient way to do it, but it is achievable.
Consider the following truth table:

![image-20201002003321361](\lectures\images\typora-user-images\image-20201002003321361.png)

And using a MUX, we can have an 8-input Mux:

<img src="\lectures\images\typora-user-images\image-20201002003411885.png" alt="image-20201002003411885" />

where:

  • $S=ABC$

Or a 4-input Mux:

![image-20201002003710761](\lectures\images\typora-user-images\image-20201002003710761.png)

where:

  • $S=AB$

If we continue, we can also build a 2-input Mux by having $S=A$.

Shifters

Basically, they implement the bitwise shift operation, so that we can shift input bits to the left or to the right.

In general, there are two variants:

  • Barrel: control bits indicate how far to shift left or right
  • L/R with enable: one control bit enables shifting, another indicates the direction

And to deal with the bits falling off, you may either:

  • just fill with either $1$ or $0$
  • wrap around

4-bit Barrel Left Shifter with Wraparound

Basically, consider a 2-bit $dist$ control input to tell you how far to shift.

  • this can be implemented using 4-input Muxes

![image-20201002005137365](\lectures\images\typora-user-images\image-20201002005137365.png)

Week 5

Holding State: Bistable Elements

Here, we are trying to implement a circuit that can retain a $1$ or $0$.

The two circuits below are identical:

<img src="\lectures\images\typora-user-images\image-20201002010648176.png" alt="image-20201002010648176" style="zoom:80%;" />

where:

  • the right hand side one is the more conventional one

Hence, just by feeding $1$ or $0$ into $Q$, we would be able to hold its state:

<img src="\lectures\images\typora-user-images\image-20201002010811557.png" alt="image-20201002010811557" style="zoom:80%;" />

However, the problem is that both $Q$ and $\overline{Q}$ are (the same) outputs. So we cannot control their states, as they are basically “final” variables.

RS Latch

RS Latch means Reset-Set Latch.
It stores a single state, called $Q$. This latch is a way to control your state while persisting it.

  • We will have two inputs, $S$ and $R$, to control the behavior of $Q$, such that:

<img src="\lectures\images\typora-user-images\image-20201002012153156.png" alt="image-20201002012153156" style="zoom:80%;" />

and it will have the following implementation:

<img src="\lectures\images\typora-user-images\image-20201002012253322.png" alt="image-20201002012253322" style="zoom:80%;" />

with the following abstraction:

<img src="\lectures\images\typora-user-images\image-20201002012309550.png" alt="image-20201002012309550" style="zoom:80%;" />

To understand how it works, look at its subsections.

RS Latch - Set

The set command has:

  • $S=1$
    • $Set=1$ will set $Q \to 1$
  • $R=0$

Basically, we want to show that in this case, you will have:

  • $Q=1$
  • $\bar{Q}=0$

Basically, this row:

<img src="\lectures\images\typora-user-images\image-20201002011900421.png" alt="image-20201002011900421" style="zoom:80%;" />

Now, consider putting the $S$ and $R$ values into the circuit:

<img src="\lectures\images\typora-user-images\image-20201002011624227.png" alt="image-20201002011624227" style="zoom:80%;" />

where:

  • the green $0$s indicate that those are implied by the truth table

<img src="\lectures\images\typora-user-images\image-20201002011730722.png" alt="image-20201002011730722" style="zoom:67%;" />

so that whenever $S=1$, the result $\overline{Q}$ has to be $0$. And the rest follows.
RS Latch - Reset

The reset command has:

  • $S=0$
  • $R=1$
    • $Reset=1$ will reset $Q \to 0$

Basically, we want to show that in this case, you will have:

  • $Q=0$
  • $\bar{Q}=1$

<img src="\lectures\images\typora-user-images\image-20201006222447920.png" alt="image-20201006222447920" style="zoom:80%;" />

And this schematic becomes:

![image-20201006222528207](\lectures\images\typora-user-images\image-20201006222528207.png)

where:

  • the green $0$s indicate that those are implied by the truth table

RS Latch - Hold

This represents the persistence usage of the RS Latch; it holds the previous value.

The hold command has:

  • $S=0$
  • $R=0$

<img src="\lectures\images\typora-user-images\image-20201006222659479.png" alt="image-20201006222659479" style="zoom:80%;" />

Therefore, the schematic:

<img src="\lectures\images\typora-user-images\image-20201006222948051.png" alt="image-20201006222948051" style="zoom:80%;" />

where:

  • the value of $Q$ is essentially the previous value of $Q$, which is its only source of a value

RS Latch - Invalid State

This is the undesirable state, which could cause a bug in your $Q$.

Consider:

  • $S=1$
  • $R=1$
    • so basically set and reset at the same time

<img src="\lectures\images\typora-user-images\image-20201006224224837.png" alt="image-20201006224224837" style="zoom: 80%;" />

Then you have a bug:

<img src="\lectures\images\typora-user-images\image-20201006224248353.png" alt="image-20201006224248353" style="zoom: 80%;" />

where:

  • obviously you cannot have $Q=\bar{Q}$.

D Latch

Still stores only one bit of state $Q$, but it avoids the invalid case of the RS Latch.

  • basically, it makes that input combination never achievable

There will still be two inputs:

  • $CLK$, which means clock/enable/control
  • $D$, which means data input

This means that:

  • $CLK=1$ means $D$ is enabled to pass through to $Q$ (transparent)
  • $CLK=0$ means $Q$ holds its previous value regardless of $D$
(Opaque) Commonly, its schematic: <img src="\lectures\images\typora-user-images\image-20201006224602535.png" alt="image-20201006224602535" style="zoom:80%;" /> ### D Latch - Update/Transparent Basically, when: - $CLK=1$, you are allowed to pass in $D$. So: <img src="\lectures\images\typora-user-images\image-20201006225129482.png" alt="image-20201006225129482" style="zoom:67%;" /> where: - we utilized an RS Latch to implement the D Latch. - and so $Q=D$ So the schematic looks like: <img src="\lectures\images\typora-user-images\image-20201006225209915.png" alt="image-20201006225209915" style="zoom:67%;" /> where: - notice that $1$ AND $anything$ gives $anything$. ### D Latch - Hold/Opaque Basically, you are not allowing $D$ to pass through: - $CLK=0$ - basically, this forces both $R$ and $S$ to be $0$. This corresponds to: <img src="\lectures\images\typora-user-images\image-20201006225533512.png" alt="image-20201006225533512" style="zoom: 67%;" /> And the schematic: <img src="\lectures\images\typora-user-images\image-20201006225556089.png" alt="image-20201006225556089" style="zoom: 67%;" /> ### D Latch Toggle Frequency This is **one problem** with the D Latch. Consider this setup: <img src="\lectures\images\typora-user-images\image-20201006230448664.png" alt="image-20201006230448664" style="zoom: 67%;" /> where: - we feed $\bar{Q}_{prev}$ into $D$. Therefore this schematic will cycle. - however, since there are **many gates involved ($t_{pd}$)**, how do we know the **rate of change of** $Q$? - in other words, ***if $D$ changes rapidly, we might not be able to predict/control what $Q$ would be.*** ## D Flip-Flop Still holds a single bit $Q$. **Same inputs** as before: - $CLK$ - $D$ But its mechanism is different: - it **samples $D$** on the **rising edge** of $CLK$ (so on $0 \to 1$ only) - therefore, it does not continuously allow updates, which prevents unpredictability of $Q$.
- you hence have a **controlled rate of change/sampling rate** of $Q$, set by the bit $CLK$ - so we can avoid glitches as well - this is in fact the Hertz rating that a CPU has. For example, a **2GHz CPU means it can sample $2\times10^{9}$ times per second** - **otherwise**, $Q$ holds its previous value Its actual design uses two D Latches, $L1$ and $L2$: <img src="\lectures\images\typora-user-images\image-20201006231702599.png" alt="image-20201006231702599" style="zoom:67%;" /> where: - $L1$ and $L2$ just denote that we **have two D Latches connected.** - the **actual output read** is of course the one on the **rightmost side** and its symbols are: ![image-20201006232540815](\lectures\images\typora-user-images\image-20201006232540815.png) ### D Flip-Flop - Rising CLK Consider first the case when $CLK=0$: <img src="\lectures\images\typora-user-images\image-20201006231911397.png" alt="image-20201006231911397" style="zoom:67%;" /> where: - $L1$ and $L2$ just mean that we are using 2 latches to implement this. - the data $D$ is **stuck at $N1$** Now, if we switch to $CLK=1$, we have: <img src="\lectures\images\typora-user-images\image-20201006232003945.png" alt="image-20201006232003945" style="zoom:67%;" /> where: - now, the data $D$ stuck at $N1$ **passes through to $Q$.** **Therefore, when $CLK$ rises, $D$ passes through to $Q$.** ### D Flip-Flop - Timing Diagram Here, we assume that gates have no delay: <img src="\lectures\images\typora-user-images\image-20201013223139860.png" alt="image-20201013223139860" style="zoom:50%;" /> where: - obviously, when $CLK=0$, $N_1$ will listen exactly to $D$. and completing the other part: <img src="\lectures\images\typora-user-images\image-20201013223325850.png" alt="image-20201013223325850" style="zoom:50%;" /> where: - now, when $CLK=1$, except for the beginning part, $N_1$ just holds its previous output.
Similarly, looking at $Q$, you have: <img src="\lectures\images\typora-user-images\image-20201013223742289.png" alt="image-20201013223742289" style="zoom:50%;" /> where: - when $CLK=1$, the output $Q$ will basically **let the previous $N_1$** through Lastly, when $CLK=0$: <img src="\lectures\images\typora-user-images\image-20201013223854556.png" alt="image-20201013223854556" style="zoom:50%;" /> where: - when $CLK=0$, $Q$ will be its previous value in fact, the output $Q$ is **clearly dependent on the sampling edges**: <img src="\lectures\images\typora-user-images\image-20201013224058904.png" alt="image-20201013224058904" style="zoom:50%;" /> # Week 6 ## D Latch vs D Flip-Flop <img src="\lectures\images\typora-user-images\image-20201013224521829.png" alt="image-20201013224521829" style="zoom:50%;" /> where: - you see the Flip-Flop updating only at the clock edges, whereas the Latch responds during the whole half-cycle that $CLK=1$ ## Additional Flip-Flop Controls - **Enable** Flip-Flops - basically, you can have an $EN$ input that **controls when new data is passed or held back**, in addition to $CLK$ - $EN=1$ means $D$ is passed through to $Q$ on the clock edge - $EN=0$ means the flip-flop state does not change - **Resettable** flip-flop - has an additional input to **force the state to a particular value** - $PRE=1$ (active high) or $\overline{PRE}=0$ (active low) - A signal being '**active low**' means that the signal **performs its function when its logic level is 0**. - A signal being '**active high**' means that the signal **performs its function when its logic level is 1**.
- $CLR=1$ or $\overline{CLR}=0$ - and there are also variations of those, as you can make: - synchronous resets, which happen at the clock edge only - asynchronous resets, which happen immediately ## Examples of Using Flip-Flops ### Parity Machine Design - **Question:** $even=1$ when there has been an even number of $1$s on $x$ since the last $\overline{RST}$: - **Solution:** we need to start with a $1$ <img src="\lectures\images\typora-user-images\image-20201013225513443.png" alt="image-20201013225513443" style="zoom:50%;" /> where: - basically, all we needed to figure out was the $D$ input, which works out to be $even \oplus x$ - $CLK$ is something **not for us to control**; it is the sampling rate controlled externally - $\overline{RST} = 0$ gives $PRE=1$. So after the last reset, we **assume that there have been zero (an even number of) $1$s.** - it is actually a choice made by the designer, viewed as the **initial condition** - this is basically to satisfy the requirement "since last $\overline{RST}$" - last but not least, that little circle below the $\overline{PRE}$ is just a convention ### Pattern Recognizer: State Transition Diagram - **Question:** $y=1$ whenever the last four bits on $x$ have been $1101$ - **Solution:** 1. First, let us start with a DFA: <img src="\lectures\images\typora-user-images\image-20201013231330702.png" alt="image-20201013231330702" style="zoom:50%;" /> 2. Now, to work towards our design, we could have: <img src="\lectures\images\typora-user-images\image-20201013231853449.png" alt="image-20201013231853449" style="zoom: 67%;" /> 3. Then, this means we just need to have: ![image-20201013232153697](\lectures\images\typora-user-images\image-20201013232153697.png) 4.
Now, we can just represent **each state as the triplet flip-flop result $Q$**: <img src="\lectures\images\typora-user-images\image-20201013232501691.png" alt="image-20201013232501691" style="zoom:67%;" /> where: - the **key idea/implementation step** would be: - the triplet $Q_2 Q_1 Q_0$ would represent a **state** - $000$ would be our **start state**, which we can customize using $\overline{PRE}$ - input $X$ would represent the **alphabet** - a **transition** would be having a gate logic on making $\delta(Q_2 Q_1 Q_0,X) \to Q_2'Q_1'Q_0'$ - and **since $D_i$ are exactly the elements that can change $Q_i$**, we are basically figuring out $\delta(Q_2 Q_1 Q_0,X) \to D_2 D_1 D_0$ in the above table - practically, it means **how to generate the triplet output $D_2 D_1 D_0$ from the quartet input $Q_2 Q_1 Q_0 X$** 5. Now, we just need to figure out the expression using a K-Map: - since there are **three outputs**, we will just have three K-Maps: <img src="\lectures\images\typora-user-images\image-20201016003737927.png" alt="image-20201016003737927" style="zoom: 67%;" /> ![image-20201016004632938](\lectures\images\typora-user-images\image-20201016004632938.png) <img src="\lectures\images\typora-user-images\image-20201016005139460.png" alt="image-20201016005139460" style="zoom:50%;" /> 6. Lastly, we figure out the actual output we need, which is, whenever you are at state $100$, it is true. - obviously this refers to the **accepting state** - and we can simply have $OUT=Y=Q_2$ in this case, because only that state has $Q_2 =1$ 7. Assembling all the above in a schematic, which now becomes trivial as we have figured out all the logics: - **accept state** $\to Y=1$ is just $Q_2$ - **start state** $\to 000$ can be set by $\overline{PRE}$ or $CLR$ - **transition** $\to$ the logical expressions we have figured out on step (5). - **input string** $\to X$, which we read bit by bit (actual rate and content is controlled by the $CLK$). 
<img src="\lectures\images\typora-user-images\image-20201016010956145.png" alt="image-20201016010956145" style="zoom: 50%;" /> where: - to save on spaghetti wires, we assume that all the $CLR$ and $CLK$ lines are connected. ### Template for Designing State Transition Diagram Basically you can approach this by thinking about the DFA: 0. Identify the input (alphabet) 1. Draw the DFA 2. Write out the state transition table - which contains $r_i, a, r_t, accept$, where $r_i$ is the source state, $r_t$ is the target state, and $a$ is the symbol processed 3. Select state encodings 4. Augment the transition table with state encodings 5. Figure out the logical expressions for the transitions 6. Figure out the logical expression for the output 7. Draw the schematic - don't forget to **configure the start state and the $CLK$** ### Pattern Recognizer Implementation: Shift Register Another design only works in specific cases, and it happens to fit this question, where the last 4 bits should be $1101$: - this exploits the fact that a Flip-Flop can hold its **last state**: <img src="\lectures\images\typora-user-images\image-20201016013514783.png" alt="image-20201016013514783" style="zoom:50%;" /> Therefore, a **shift register** design is natural for this question: <img src="\lectures\images\typora-user-images\image-20201016013316629.png" alt="image-20201016013316629" style="zoom:50%;" /> where: - notice that the **"order" is reversed**, where $1101$ as the last four bits becomes $1011$ in the diagram.
- this ***only works for specific questions*** - the above is also called a **shift register**: you basically have **each bit shifting in** - essentially the ***same as the vending machine mechanism*** ### CLK Counter: Frequency Divider This is a tricky one, where we implement an FSM whose **output simulates the $CLK$ but with a different frequency.** Consider the task of dividing a given frequency by 3: ![image-20201017183434163](\lectures\images\typora-user-images\image-20201017183434163.png) where: - we see only one rise for **every three periods** of $CLK$. This is solved by a state machine whose transitions depend only on the rising $CLK$ edge and the previous state: <img src="\lectures\images\typora-user-images\image-20201017183614181.png" alt="image-20201017183614181" /> Then the **transition table** becomes trivial: ![image-20201017183908840](\lectures\images\typora-user-images\image-20201017183908840.png) using a ***binary encoding*** for each state ![image-20201017183655757](\lectures\images\typora-user-images\image-20201017183655757.png) then the output is trivially: $$Y = \overline{Q_1} \cdot \overline{Q_0}$$


- since $Y=1$ only at state $S_0$, which has **encoding** $00$. and the **circuit** becomes: ![image-20201017183823662](\lectures\images\typora-user-images\image-20201017183823662.png) ## Multibit Storage: Registers For example, if we need to store a $4$-bit number: Basically, since each bit takes its own place, **we will just have $4$ Flip-Flops**: <img src="\lectures\images\typora-user-images\image-20201016013148135.png" alt="image-20201016013148135" style="zoom:50%;" /> where: - the condensed symbol on the right is often used - a **register is basically a bunch of flip-flops** ## Finite State Machines Abstraction Therefore, the fundamental building blocks of a finite state machine would be: - a **register** (flip-flops) representing the **state** - a **next state logic** representing the **transitions** - an **output logic** representing the **logic for accepting states** <img src="\lectures\images\typora-user-images\image-20201016013919247.png" alt="image-20201016013919247" style="zoom:50%;" /> <img src="\lectures\images\typora-user-images\image-20201016013928582.png" alt="image-20201016013928582" style="zoom:50%;" /> and usually, you will have: <img src="\lectures\images\typora-user-images\image-20201016014133881.png" alt="image-20201016014133881" style="zoom:50%;" /> where: - if the input for the **output logic** depends only on the current state, then it is called a **Moore Finite State Machine** - basically the standard DFA - if the input for the **output logic** depends on the current state **and the input**, then it is called a **Mealy Finite State Machine** ## Moore vs. Mealy FSM Again, they have exactly the same building blocks, but their output behavior is different.
Therefore, you **can always transform a Moore FSM to a Mealy FSM.** - **Moore FSM:** - outputs depend only on current state <img src="\lectures\images\typora-user-images\image-20201016014540624.png" alt="image-20201016014540624" style="zoom:50%;" /> - this would be the standard **state based design** - **Mealy FSM:** - outputs depend on current state and inputs <img src="\lectures\images\typora-user-images\image-20201016014603891.png" alt="image-20201016014603891" style="zoom: 50%;" /> - this would be an **edge/transition based design** - the timing for updating the output might differ from that of a Moore FSM ### Example: Mealy Parity Check Machine The task is to check if there has been an even number of $1$s: The state transition diagram becomes: <img src="\lectures\images\typora-user-images\image-20201016015015329.png" alt="image-20201016015015329" style="zoom:50%;" /> where: - notice the **"transition" becomes a tuple** $(input,output)$ - this means we know the output **without technically reaching the accepting state** Implementation: ![image-20201106003646495](\lectures\images\typora-user-images\image-20201106003646495.png) where: - **notice that** the output is "out-of-sync" with the clock, such that the **output could technically update *before* $Q$ updates** ## Traffic Light Design This is the really canonical design you need to know. Consider the traffic at a crossing: ![image-20201020223117321](\lectures\images\typora-user-images\image-20201020223117321.png) where: - we have the following rules: whenever a street has at least one car, $T=1$, and we let its light $L$ be green. Once there is no car ($T=0$), the light transitions to yellow and then red.
Basically, this is the **interface** we need to implement: <img src="\lectures\images\typora-user-images\image-20201020222210801.png" alt="image-20201020222210801" style="zoom: 67%;" /> - **Solution:** First, we do the **state transition diagram**: <img src="\lectures\images\typora-user-images\image-20201020222709697.png" alt="image-20201020222709697" style="zoom:50%;" /> where: - the obvious problem with this is that, if there is always some traffic on $A$ or $B$, then the **light will never switch**. But this is a **simplified model** anyway. Now, we need to do the state transition table: - here, we changed the encoding to a **one-hot encoding**, where each of the 4 states gets its own bit, so each state's code has exactly one $1$, in the position of that state: <img src="\lectures\images\typora-user-images\image-20201020223407620.png" alt="image-20201020223407620" style="zoom:80%;" /> Next, we implement the **transition logic, which now becomes easy**, because there is exactly only **one 1 in each column:** <img src="\lectures\images\typora-user-images\image-20201020223739323.png" alt="image-20201020223739323" style="zoom:50%;" /> but also notice that: - those logical expressions correspond **exactly to the transition diagram we had** - e.g. for $D_3$, it means we enter $Q_3$ whenever we were in $Q_2$ and had $T_B=0$, which is **exactly what we had on the diagram**. Last but not least, the output logic table is, according to the state diagram: <img src="\lectures\images\typora-user-images\image-20201020224131928.png" alt="image-20201020224131928" style="zoom:50%;" /> And the **output logic becomes**: <img src="\lectures\images\typora-user-images\image-20201020224218435.png" alt="image-20201020224218435" style="zoom:50%;" /> # Week 7 <mark>TODO: Look at the class PowerPoint and do the MIPS tutorial</mark> ## Synchronous Logic Design In general, it means having: 1. **Flip-flops (or registers) contain the state** of the system 2.
State **changes synchronously at the clock edge** (the system is synchronized to the clock) This could be either the FSM we had, or another similar design, a **Pipeline**: <img src="\lectures\images\typora-user-images\image-20201020224705695.png" alt="image-20201020224705695" style="zoom:50%;" /> where: - remember, registers are just groups of flip-flops - a Pipeline and an FSM are essentially the same thing, except that you **have two registers instead of one** ## Timing Constraints ### Flip-Flop Input Timing Constraints Basically, we need to make sure that data does not change **during the rise** of the $CLK$. <img src="\lectures\images\typora-user-images\image-20201020224919913.png" alt="image-20201020224919913" style="zoom:67%;" /> where: - **setup time** $t_{setup}$: time before the clock edge that data must be stable - **hold time** $t_{hold}$: time after the clock edge that data must be stable - the particular times for the above ***depend on how a flip-flop is designed/implemented*** ### Flip-Flop Output Timing Constraints However, since **flip-flops themselves also have delays**, we also need to make sure that data is read at the correct time: ![image-20201020234606427](\lectures\images\typora-user-images\image-20201020234606427.png) where: - **propagation delay** $t_{pcq}$: time after the clock edge, such that **after it** the (new) $Q$ is **guaranteed to be stable** - **contamination delay $t_{ccq}$**: time after the clock edge, such that **before it** the (old) $Q$ is **guaranteed to be stable** ### Total Setup Time Constraint Basically, we want to have your **data ready, at the latest, before** $t_{setup}$: - this is again to comply with the flip-flop input timing constraint <img src="\lectures\images\typora-user-images\image-20201020235100069.png" alt="image-20201020235100069" style="zoom: 50%;" /> This means:

$$T_{clock} > t_{pcq} + t_{pd} + t_{setup}$$

where: - your **latest data is guaranteed to be stable and correct** only after the above period of time. - the **best-case scenario** is of course that you **can get data through on every clock cycle**, but if you have a huge CL design, then you might need to slow down/split the sampling. Or, you want to **minimize your combinational logic delay** such that:

$$t_{pd} < T_{clock} - (t_{pcq} + t_{setup})$$
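Plugging numbers into the setup-time constraint makes it concrete. The delay values below are invented purely for illustration (they are not from the course):

```python
# Hypothetical delays, in picoseconds (ps)
t_pcq   = 50    # clk-to-Q propagation delay of the sending register
t_pd    = 400   # worst-case propagation delay of the combinational logic
t_setup = 60    # setup time of the receiving register

# The clock period must cover all three in sequence:
t_clock_min = t_pcq + t_pd + t_setup       # smallest legal clock period
f_max_hz = 1.0 / (t_clock_min * 1e-12)     # corresponding max frequency

print(t_clock_min)               # 510 (ps)
print(round(f_max_hz / 1e9, 2))  # 1.96 (GHz)
```

With these numbers the register overhead ($t_{pcq}+t_{setup}$) eats about a fifth of the cycle, which is why minimizing $t_{pd}$ matters so much.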

### Total Hold Time Constraint Basically, we also **don't want $D$ to change too soon**, i.e. the ***new data must not come in too fast***: - so you make sure the **earliest the next data can arrive is *after* the hold time**: <img src="\lectures\images\typora-user-images\image-20201020230239606.png" alt="image-20201020230239606" style="zoom:50%;" /> and:

$$t_{hold} < t_{ccq} + t_{cd}$$

or, you need to design your combinational logic such that:

$$t_{cd} > t_{hold} - t_{ccq}$$
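Likewise, a quick numeric check of the hold constraint (delay values again invented for illustration):

```python
# Hypothetical delays, in picoseconds (ps)
t_ccq  = 30   # contamination delay of the sending register
t_cd   = 25   # contamination delay of the combinational logic
t_hold = 40   # hold time of the receiving register

# New data must not race through before the hold window closes:
hold_ok = t_hold < t_ccq + t_cd   # 40 < 55, so the constraint holds

# Equivalently, the logic's contamination delay must exceed:
min_t_cd = t_hold - t_ccq
print(hold_ok, min_t_cd)          # True 10
```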

### Clock Skew This is the problem that **the clock itself does not arrive at all registers at the same time**. - this could make data be read in at different times, which could be catastrophic - **skew $t_{skew}$:** the time difference between two clock arrivals. ### Total Setup Time Constraint w/ Skew Again, we need to make sure that **data, at the latest, comes in before the setup time of the second register, accounting for skew**: <img src="\lectures\images\typora-user-images\image-20201020230955069.png" alt="image-20201020230955069" style="zoom: 50%;" /> and the **constraint** equation would be:

$$T_{clock} > t_{pcq} + t_{pd} + t_{setup} + t_{skew}$$

### Total Hold Time Constraint w/ Skew Again, you need to make sure the **new data does not come in too fast**: - here, it means **your shortest delay must be longer than the hold time of the second register plus the skew:** <img src="\lectures\images\typora-user-images\image-20201020232142207.png" alt="image-20201020232142207" style="zoom:50%;" /> So we have:

$$t_{hold} + t_{skew} < t_{ccq} + t_{cd}$$

as the **constraint**. # Week 8 Short Tutorial on MIPS: https://minnie.tuhs.org/CompArch/Resources/mips_quick_tutorial.html Extensive Tutorial on MIPS: http://ellard.org/dan/www/Courses/cs50-asm.ps ## MIPS Basically, MIPS is an assembly language you can use to communicate with your processor. However, not all processors have interfaces for MIPS. Therefore, we might need to use a software simulation. In general, this is what happens **when you program**: - C $\to$ Assembly Code $\to$ Machine Code <img src="\lectures\images\typora-user-images\image-20201027222336775.png" alt="image-20201027222336775" style="zoom:50%;" /> where: - You **write MIPS code** (**human-readable assembly code**), which will first be translated to the **assembly code** on the bottom right - essentially, the two codes are the same, but the top right is more human-readable - Then, **assembly code** (e.g. MIPS code) is ***converted into executable machine code*** by a utility program referred to as an **assembler**. The conversion process is referred to as *assembly*, as in *assembling* the source code. - An **assembler** program creates object code by translating combinations of mnemonics and syntax for operations and addressing modes into their numerical equivalents. ### RISC vs. CISC Architecture - An **instruction set, also called an ISA** (instruction set architecture), is the part of a computer that pertains to programming, which is more or less machine language. The instruction set provides commands to the processor, to tell it what it needs to do (**an implementation of an ISA would be a CPU**). - The instruction set consists of addressing modes, instructions, native data types, registers, memory architecture, interrupt and exception handling, and external I/O. **MIPS is a Reduced Instruction Set Computer (RISC)**.
Others include ARM, PowerPC, SPARC, HP-PA, and Alpha. - RISC is a computer with a small, highly optimized set of instructions (see definition above), rather than the more specialized set often found in other types of architecture, such as in a complex instruction set computer (CISC). **Complex Instruction Set Computer (CISC)** is one alternative. **Intel’s x86** is the most prominent example; also Motorola 68000 and DEC VAX. RISC’s underlying principles, due to Hennessy and Patterson: - Simplicity favors regularity - Make the common case fast - Smaller is faster (e.g. less delay time) - Good design demands good compromises **Difference between RISC and CISC**: <img src="https://2.bp.blogspot.com/-PAHTUSBJA8s/Wzck5pNEhbI/AAAAAAAAAN4/NvbTkpo17dwrH6LM6XmKZEJtkYUINBxLQCLcBGAs/s1600/Capture.PNG" alt="Difference Between RISC and CISC Architectures and its Applications - PRO-Coder" style="zoom: 80%;" /> where: - **one key difference for us** is that, to deal with data in memory, you need to explicitly `load` and then `store` using RISC (e.g. MIPS), whereas for CISC, it will probably be just one line. ### Example: GCD Algorithm The algorithm is as follows (not as efficient as the one we had in AP, which uses `MOD`): <img src="\lectures\images\typora-user-images\image-20201028001937947.png" alt="image-20201028001937947" style="zoom: 67%;" /> And the set of instructions would be as follows: 1. Call the two numbers a and b 2. If a and b are equal, stop: a is the greatest common divisor 3. Subtract the smaller from the larger 4. Repeat steps 2–4 Then the **actual MIPS code for the above** looks like: <img src="\lectures\images\typora-user-images\image-20201028002121255.png" alt="image-20201028002121255" style="zoom:50%;" /> #### MIPS Opcode <img src="\lectures\images\typora-user-images\image-20201027224835017.png" alt="image-20201027224835017" style="zoom:50%;" /> where: - `beq` means **branch on equal** - `bne` means **branch not equal** - e.g.
`bne $v0, $zero, .L1` means: if the values ***in registers*** `$v0` and `$zero` are not equal, go to the code with label `.L1` #### MIPS Arithmetic <img src="\lectures\images\typora-user-images\image-20201027235016562.png" alt="image-20201027235016562" style="zoom: 50%;" /> #### MIPS Operands <img src="\lectures\images\typora-user-images\image-20201027235128342.png" alt="image-20201027235128342" style="zoom:50%;" /> #### MIPS Labels <img src="\lectures\images\typora-user-images\image-20201027235226320.png" alt="image-20201027235226320" style="zoom: 67%;" /> #### MIPS Comments <img src="\lectures\images\typora-user-images\image-20201027225059208.png" alt="image-20201027225059208" style="zoom: 50%;" /> #### MIPS Control Flow <img src="\lectures\images\typora-user-images\image-20201027225202137.png" alt="image-20201027225202137" style="zoom:50%;" /> where: - **by default, like all other programs, it runs from top to bottom** if you do not use control flow (e.g. `beq`, which is like an `if` statement) ### MIPS ISA Overview MIPS is a 32-bit architecture (which means it only works with 32 bits at a time). (There is also a 64-bit version, but that is not what we are using.) This means we can **only use/have the following**: - **registers are 32 bits wide** - **all instructions** are 32 bits wide - e.g. each line of code must be 32 bits wide - **word size (data size)** is 32 bits - i.e., the natural hardware operand size is 32 bits As explained above, MIPS is a **load/store architecture** - instructions manipulate operands in ***registers or immediates*** - immediates are basically places for constants, which will be addressed in [Immediate Operands](#Immediate Operands) - to **manipulate data in memory**, the programmer must write **load/store instructions that copy data from memory to a register** (and back) - much easier to implement in hardware (H&P: simplicity ⇒ speed) > For CISC, such as an Intel x86 processor, which does not need to load the data explicitly, such operations tend to be faster.
However, the hardware is more difficult to implement. ### MIPS Register Set - Registers have `$` before the **name** (used by programmers), and the name indicates the purpose. - the names will be translated to register numbers at assembly time - Register usage is governed by convention. Only `$0` behaves differently. <img src="\lectures\images\typora-user-images\image-20201027225742415.png" alt="image-20201027225742415" style="zoom:80%;" /> where: - you have registers $0-31$ (i.e. **there are 32 registers you can manipulate**), and **each of them can store 4 Bytes=32 bits (i.e. 32 Flip-Flops)** - though it appears that **each type of data is bounded** (e.g. only 4 registers for storing function arguments), in fact **you can just put pointers (which are exactly 4 bytes = 32 bits on this architecture) inside those registers**, and those pointers can point to an array of arguments ### 3-Operand MIPS Instruction 3-Operand would mean (here): - 2 input/source registers - 1 output/destination register <img src="\lectures\images\typora-user-images\image-20201027230710787.png" alt="image-20201027230710787" style="zoom: 67%;" /> ### Immediate Operands - **Immediate** - a small constant encoded directly in the instruction, so you can use it directly when coding - basically, immediate just means constant - they are used often since they spare a register <img src="\lectures\images\typora-user-images\image-20201027230924833.png" alt="image-20201027230924833" style="zoom: 67%;" /> where: - `addi` has an `i`, which specifically means `immediate` as specified above > **Note:** > > - immediates get a **16-bit field in the instruction (built-in)**; consequently they can range from -32768 to 32767 However, what if you need to use a constant that needs the full 32 bits? - then you need to **use one register** with the following **two steps**: 1. load the upper half of the data (16 bits) into the upper half of the register 2. fill in the lower half, by `OR`ing the above with the lower 16 bits of data.
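These two steps are plain bit arithmetic. As a sketch in Python (in MIPS the steps are done with `lui` and `ori`; the constants here are chosen to match the `0xCAFE0B0E` example in these notes):

```python
# Step 1 (lui): place the 16-bit constant in the upper half of the register
t0 = 0xCAFE << 16      # 0xCAFE0000
# Step 2 (ori): fill in the lower 16 bits
t0 = t0 | 0x0B0E

assert t0 == 0xCAFE0B0E
print(hex(t0))         # 0xcafe0b0e
```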
<img src="\lectures\images\typora-user-images\image-20201027231350877.png" alt="image-20201027231350877" style="zoom: 67%;" /> - remember, the `ori` works because, if you OR `1100` with `0011`, you get `1111` - in the end, you get the value ``` 0xCAFE0B0E ``` > **This means that if you want to store a `long`**, which would be 64 bits, you would need **two registers**, each storing half of the data. ### Pseudo-Instruction Basically, these are expressions that are expanded by the assembler - so that instead of writing all the code out, some work gets done by the assembler - however, this means that a processor, when given pseudo-instructions like this directly, will have no idea what they mean. - a processor only knows/implements the real instructions specified by the ISA ***For example:*** - the below code will do the same thing as the two steps done above <img src="\lectures\images\typora-user-images\image-20201027231907288.png" alt="image-20201027231907288" style="zoom:67%;" /> where: - the example must be a pseudo-instruction, since this command itself exceeds 32 bits in length. ### Memory Operands Basically, if you don't have enough space in registers for data, you **can store it in memory, and then have a pointer to it**. <img src="\lectures\images\typora-user-images\image-20201027232408974.png" alt="image-20201027232408974" style="zoom:50%;" /> where: - since we are dealing with a 32-bit system, each data word uses **4 bytes=4 memory cells**. #### Address Format in MIPS Basically, we only have one way to represent a pointer/address:

$$\text{address} = \text{base address (in a register)} + \text{offset (in an immediate)}$$

***For example:*** - the below **address/pointer** in MIPS: ```mips 4($t1) ``` would mean: ```mips address = reg[$t1] + 4 ``` **graphically**: <img src="\lectures\images\typora-user-images\image-20201030005315233.png" alt="image-20201030005315233" style="zoom:50%;" /> > **Note:** > > - There can be negative offsets as well, so that you can move backwards. > - However, since the offset is an immediate, it is a signed 16-bit value. #### Loading and Storing Data from Memory In short, to load/store, you will need: - `lw` = **load** word (from memory) - `sw` = **store** word (to memory) <img src="\lectures\images\typora-user-images\image-20201030005809870.png" alt="image-20201030005809870" style="zoom: 50%;" /> where: - your register file is 128B because you have **32 registers, each with a capacity of 4 bytes** (32 bits). #### Loading Bytes Basically, your register holds 4 bytes (32 bits), but what if you just want to load **one byte from your address?** Here, MIPS gives you two choices: - `lbu` = load byte unsigned - load by filling the upper remaining bits with `0` - `lb` = load byte (signed) - load by filling the upper remaining bits with copies of the MSB ***For example:*** - <img src="\lectures\images\typora-user-images\image-20201030010721988.png" alt="image-20201030010721988" style="zoom: 67%;" /> where: - the data in memory was `0xF0` (a single byte; one byte = 8 bits = two hex digits, i.e. $2^8 = 256$ possible values) ### Control Transfer Instructions We need to be able to: - jump to a different place in the program **on some condition** (e.g. `if`) - this is usually called **branching** - jump to a different place in the program **unconditionally** (e.g.
`goto`) - this is usually called a **jump** #### Jump Instruction This is unconditional, and hence it is simple: ```mips j mylabel ``` where: - once you hit this line, it will jump to the code labeled `mylabel:` <img src="\lectures\images\typora-user-images\image-20201030011251452.png" alt="image-20201030011251452" style="zoom: 67%;" /> but another way is to: ```mips jr $t3 ``` where: - `$t3` would be the register containing the **address of the target instruction in your program** <img src="\lectures\images\typora-user-images\image-20201030011446102.png" alt="image-20201030011446102" style="zoom:50%;" /> where: - this works because **every instruction you are coding** is itself some data, which has an **address**. #### Jump and Link The above two jumps just jump straight to that position, and will not come back. However, sometimes you need to jump to some place, execute some code, and **come back**. - this is achieved by **using `jal <label>`** = jump and link, **and coming back with `jr $ra`** - so **`jal` would be the instruction to use for calling a sub-function** <img src="\lectures\images\typora-user-images\image-20201030011921919.png" alt="image-20201030011921919" style="zoom:50%;" /> where `jal` does two things: - store the **address of your next line of code into register `$ra`** - jump to `myfunc` **Therefore, it is always used with `jr $ra`**, so that you come back. #### Control Instructions Here, you basically have another condition to check: - `bne` = branch if not equal - `beq` = branch if equal <img src="\lectures\images\typora-user-images\image-20201030012320852.png" alt="image-20201030012320852" style="zoom:50%;" /> where: - it takes **two registers to compare for equality**, and **jumps to the `myloop` label** > **Note**: > > - There is actually another way, which is to use `b <mylabel>`. This is a pseudo-instruction in MIPS. > - this is basically the same as a jump, since the assembler expands it to `beq $0, $0, <mylabel>` ## MIPS Syntax This section is written by myself.
It contains a summary of a tutorial I found online.

### Data Types/Literals

**Data types**:

- Instructions are all 32 bits
- byte (8 bits), halfword (2 bytes), word (4 bytes)
- a character requires 1 byte of storage
- an **integer** requires 1 word (4 bytes) of storage

**Literals**:

- **numbers** entered as is. e.g. 4
- **characters** enclosed in **single quotes**. e.g. 'b'
- **strings** enclosed in **double quotes**. e.g. "A string"

### Registers

Basically, we have:

- 32 general-purpose registers
  - see the section [MIPS Register Set](#MIPS Register Set)
- registers are preceded by `$` in assembly language instructions, with two formats for addressing:
  - using the register number e.g. `$0` through `$31`
  - using the equivalent **names** e.g. `$t1`, `$sp`
- the stack grows from high memory to low memory

### Program Structure

Basically, a program would just be a plain text file.

- you use the extension `.s` so that SPIM can run it

**Data Declarations**

- placed in the section of the program **identified with the assembler directive `.data`**
- **declares variable names** used in the program; storage is allocated in main memory (RAM)

**Code**

- placed in the section of text **identified with the assembler directive `.text`**
- contains **program code** (instructions)
- **starting point for code execution given label `main:`**
- ending point of main code should use an exit `system call` (see below under System Calls)

**Comments**

- anything following `#` on a line
- e.g. `# This stuff would be considered a comment`

***For example:***

```assembly
# Comment giving name of program and description of function
# Template.s
# Bare-bones outline of MIPS assembly language program

        .data     # variable declarations follow this line
                  # ...

        .text     # instructions follow this line
main:             # indicates start of code (first instruction to execute)
                  # ...

# End of program, leave a blank line afterwards to make SPIM happy
```

#### Data Declaration Example

Format for declarations:

```assembly
# after .data:
name:   storage_type   value(s)
```

where you can:

- create storage for a variable of **specified type** with given **name** and **specified value**
- value(s) usually gives initial value(s); for storage type `.space`, it **gives the number of bytes to be allocated**

***For example:***

```assembly
# after .data
var1:   .word 3        # create a single integer variable with initial value 3
array1: .byte 'a','b'  # create a 2-element character array with elements initialized
                       # to a and b
array2: .space 40      # allocate 40 consecutive bytes, with storage uninitialized
                       # could be used as a 40-element character array, or a
                       # 10-element integer array; a comment should indicate which!
```

### Common Instructions

#### Load/Store

This is used because:

- RAM access is only allowed with `load` and `store` instructions
- **all other instructions use register** operands

`load`:

- ```assembly
  lw register_destination, RAM_source
  # copy word (4 bytes) at source RAM location to destination register.
  ```

- ```assembly
  lb register_destination, RAM_source
  # copy byte at source RAM location to low-order byte of destination register,
  # and sign-extend to higher-order bytes
  ```

`store` word:

- ```assembly
  sw register_source, RAM_destination
  # store word in source register into RAM destination
  ```

- ```assembly
  sb register_source, RAM_destination
  # store byte (low-order) in source register into RAM destination
  ```

`load` immediate:

- ```assembly
  li register_destination, value
  # load immediate value into destination register
  ```

***For example:***

```assembly
        .data
var1:   .word 23      # declare storage for var1; initial value is 23

        .text
__start:
        lw $t0, var1  # load contents of RAM location into register $t0: $t0 = var1
        li $t1, 5     # $t1 = 5 ("load immediate")
        sw $t1, var1  # store contents of register $t1 into RAM: var1 = $t1
        done
```

#### Indirect and Based Addressing

This is basically the pointer equivalent of `C`:

`load` address:

- ```assembly
  la $t0, var1
  ```

  - copy the RAM address of `var1` (presumably a label defined in the program) into register `$t0`
  - equivalent of having `t0 = &var1`, i.e. **`$t0` now holds an address**

indirect addressing:

- ```assembly
  lw $t2, ($t0)
  ```

  - load word at the RAM address contained in `$t0` into `$t2`
  - equivalent of having `t2 = *t0`, where `t0` **contains an address** (**source**)

- ```assembly
  sw $t2, ($t0)
  ```

  - store word in register `$t2` into RAM at the address contained in `$t0`
  - equivalent of having `*t0 = t2`, where `t0` **contains an address** (**destination**)

Therefore, you could also do the **pointer arithmetic** equivalent in assembly:

- the only difference is that `+1` in `C` on a typed pointer will advance a number of bytes dependent on the type, but assembly has no idea of types.
- Therefore, assembly always advances in units of `1 Byte`

***For example***

- ```assembly
  lw $t2, 4($t0)
  ```

  - load word at RAM **address (`$t0` + 4)** into register `$t2`
  - basically advances 4 bytes from the address in `$t0`
  - e.g. 4 bytes could mean an integer

**In general**, this is useful for:

- **arrays**; access elements as offsets from the base address
- **stacks**; easy to access elements at offsets from the stack pointer or frame pointer

***For example***

```assembly
        .data
array1: .space 12      # declare 12 bytes of storage to hold array of 3 integers

        .text
__start:
        la $t0, array1 # load base address of array into register $t0
        li $t1, 5      # $t1 = 5 ("load immediate")
        sw $t1, ($t0)  # first array element set to 5; indirect addressing
        li $t1, 13     # $t1 = 13
        sw $t1, 4($t0) # second array element set to 13
        li $t1, -7     # $t1 = -7
        sw $t1, 8($t0) # third array element set to -7
        done
# now you have initialized the array `array1` with three integers
```

#### Arithmetic Instructions

- **most** use **3 operands**
- **all operands are registers**; no RAM or indirect addressing
- operand size is word (4 bytes)
  - since each register stores 4 bytes

Most of the common ones are listed below:

```assembly
add  $t0,$t1,$t2  # $t0 = $t1 + $t2; add as signed (2's complement) integers
sub  $t2,$t3,$t4  # $t2 = $t3 - $t4
addi $t2,$t3, 5   # $t2 = $t3 + 5; "add immediate" (no sub immediate)
addu $t1,$t6,$t7  # $t1 = $t6 + $t7; add as unsigned integers
subu $t1,$t6,$t7  # $t1 = $t6 - $t7; subtract as unsigned integers

mult $t3,$t4      # multiply 32-bit quantities in $t3 and $t4, and store 64-bit
                  # result in special registers Lo and Hi: (Hi,Lo) = $t3 * $t4
div  $t5,$t6      # Lo = $t5 / $t6   (integer quotient)
                  # Hi = $t5 mod $t6 (remainder)
mfhi $t0          # move quantity in special register Hi to $t0: $t0 = Hi
mflo $t1          # move quantity in special register Lo to $t1: $t1 = Lo
                  # used to get at result of product or quotient

move $t2,$t3      # $t2 = $t3
```

where:

- **`mult` and `div` are always used with `mfhi` and `mflo`**
  - because the results go into the special registers `Hi` and `Lo`
- **`move` is basically assigning "variables"**

#### Control Structures

Examples are shown in the other section, and here is a summary of common branching syntax:

```assembly
b   target           # unconditional branch to program label target, equivalent to goto
beq $t0,$t1,target   # branch to target if $t0 = $t1
blt $t0,$t1,target   # branch to target if $t0 < $t1
ble $t0,$t1,target   # branch to target if $t0 <= $t1
bgt $t0,$t1,target   # branch to target if $t0 > $t1
bge $t0,$t1,target   # branch to target if $t0 >= $t1
bne $t0,$t1,target   # branch to target if $t0 != $t1
```

where:

- **`target` would be a label**

For subroutine (sub-function) calls, use:

- the "jump and link" instruction

```assembly
jal sub_label  # "jump and link"
```

#### Using Stacks

This is very useful for recursive function calls, since you need to **store your `$ra` in nested functions** so that `$ra` will not get lost.

- to do this, you need to use the **stack pointer register `$sp`**

<img src="\lectures\images\typora-user-images\image-20201031140647946.png" alt="image-20201031140647946" style="zoom:67%;" />

where:

- The stack **grows downward** in terms of memory addresses.
- The address of the **top element** of the stack **is stored** (by convention) in the "stack pointer" register, `$sp`.

So, to use a stack:

- **To *push* elements onto the stack**:
  1. **Move the stack pointer `$sp` down** to make room for the new data.
  2. Store the elements into the stack.
- **Access elements on the stack**
  - You can access any element in the stack (not just the top one) if you know where it is **relative to `$sp`**
- **To *pop* elements from the stack**
  - pop, or "erase," elements simply by **adjusting the stack pointer upwards**.
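The push/access/pop discipline above can be mimicked outside of SPIM. Below is a minimal Python sketch (the starting `$sp` value and the dictionary-based memory are illustrative assumptions, not MIPS details): `$sp` moves down on push, elements are read at byte offsets from `$sp`, and popping only moves the pointer back up.

```python
# Minimal model of the MIPS stack discipline (illustrative only):
# $sp moves DOWN on push, UP on pop, and any element can be read
# at a byte offset from $sp. Memory here is a dict of word slots.
WORD = 4                  # each register/stack slot is 4 bytes

memory = {}               # byte address -> stored word (assumption)
sp = 0x7FFFFFFC           # an illustrative starting stack address

def push(value):
    """sub $sp,$sp,4 ; sw value,0($sp)"""
    global sp
    sp -= WORD
    memory[sp] = value

def access(offset):
    """lw $s0, offset($sp) -- read relative to the stack pointer"""
    return memory[sp + offset]

def pop():
    """lw value,0($sp) ; addi $sp,$sp,4 -- note the data is not
    erased, only the pointer moves back up"""
    global sp
    value = memory[sp]
    sp += WORD
    return value

push(5)                   # like pushing $t1
push(13)                  # like pushing $t2
assert access(4) == 5     # the element under the top, at 4($sp)
assert pop() == 13        # the top of the stack comes off first
```

Note how `pop` never clears `memory[sp]`: just like in MIPS, "erasing" is only a pointer adjustment.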
***For example:*** - Pushing elements in `$t1` and `$t2`: - as they are stored in register, they **must be 4 bytes data** ```assembly sub $sp,$sp, 8 sw $t1, 4($sp) sw $t2, 0($sp) ``` <img src="\lectures\images\typora-user-images\image-20201031141050698.png" alt="image-20201031141050698" style="zoom:50%;" /> - Accessing elements ```assembly lw $s0, 4($sp) ``` <img src="\lectures\images\typora-user-images\image-20201031141140959.png" alt="image-20201031141140959" style="zoom: 50%;" /> - Popping elements ```assembly addi $sp,$sp, 4 ``` <img src="\lectures\images\typora-user-images\image-20201031141214515.png" alt="image-20201031141214515" style="zoom: 50%;" /> #### System Calls - used to read or print values or strings from **input/output window**, and indicate **program end** - use `syscall` operating system routine call To use `syscall`, you need to specify the follows: - first **supply** appropriate values in registers **`$v0` and `$a0`-`$a1`** - both are used by `syscall` <img src="\lectures\images\typora-user-images\image-20201031141522720.png" alt="image-20201031141522720" style="zoom:67%;" /> where: - The `print_string` service expects the address to start **a null-terminated** character string. - The type `.asciiz` creates a null-terminated character string. (see [Example: Implement `strlen`](#Example: Implement `strlen`)) - The `read_int`, `read_float` and `read_double` services read **an entire line of input up to and including the newline character**. - The `read_string` service has the same semantics as the UNIX library routine `fgets`. - It reads up to **n-1 characters into a buffer** and **terminates the string with a `null` character**. - If fewer than n-1 characters are in the current line, it reads up to and including the newline and terminates the string with a null character. - The `sbrk` service returns the address to a block of memory containing n additional bytes. This would be used for dynamic memory allocation. 
- The `exit` service **stops a program from running**.

> **Note**:
>
> - `$v0` is used to specify which `syscall` service to call

### Subroutines with Stack

A subroutine is code that, after executing, **resumes whatever invoked it**. This will be very useful for:

- Code reuse (recurring computations, function libraries)
- Isolation/Abstraction (function callers don't need to understand the implementation)
- Recursion

#### Calling Conventions

Some registers are preserved across function calls, while some are not.

![image-20201103235752697](\lectures\images\typora-user-images\image-20201103235752697.png)

where:

- for example, data stored in `$s0`-`$s7` should be "made intact" after a subroutine call
  - this means that the data there **before a subroutine call** should be the **same as the data after the call**
- for data that are not "preserved" **by convention**, you can just use them freely in your subroutine without restoring them
- since this is a **convention, you can choose not to obey it**. However, it will break any code that calls your subroutine expecting the convention to hold.

This means:

<img src="\lectures\images\typora-user-images\image-20201104002138261.png" alt="image-20201104002138261" style="zoom:80%;" />

where:

- for `$ra` especially, within **one level of `jal`, it will be fine**.
  - e.g. if **`bar` above *does not* alter `$ra`**, then you **don't need to do anything in `bar`**
- however, if you have **more than one level of subroutine**, e.g. a recursive call, you need to **use your stack to save it** before it gets altered

#### Example: Recursive `strlen`

Basically, it looks like:

<img src="\lectures\images\typora-user-images\image-20201104002850291.png" alt="image-20201104002850291" style="zoom: 50%;" />

where:

- here, in `strlen`, since it is recursive, there could be **more than 1 level of subroutine (itself) being called**.
  - i.e. the first `jal` works as expected, since it first stores your `$ra`. But the **second `jal` will overwrite the previous** `$ra`
  - therefore, imagine yourself in the second `jal` call. Then you need to **save where you need to return, and proceed**
- in `strlen_basecase`, since it **does not call more than 1 level of subroutine or alter any preserved value**, you don't need to save anything.

# Week 9

## Using SPIM

Basically, the software QtSpim is used for writing MIPS and simulating a CPU that executes it.

- since MIPS is a RISC ISA, the CPU in your laptop most likely cannot execute it natively.

<img src="\lectures\images\typora-user-images\image-20201030013630223.png" alt="image-20201030013630223" style="zoom:50%;" />

where:

- the address of **each instruction is exactly 4 bytes** apart (32 bits apart)
- the rightmost column is the **program's code you wrote**
- the second column from the right is the **program's code as interpreted by the assembler**

> **Note**:
>
> - For most editors, if you name your **program ending with `.s` as extension**, then you will get the correct syntax highlighting for MIPS code.

### Example: Implement `strlen`

This is basically the context:

<img src="\lectures\images\typora-user-images\image-20201030013938188.png" alt="image-20201030013938188" style="zoom:50%;" />

where:

- `"Hello w.."` is the string stored in your memory, ending with `\0`.
  - each character is 1 Byte.
- the register `$a0` is representative of your **function argument**
  - assume that the pointer to that string is passed in the argument `$a0`
- and we **need** to **return the length in register `$v0`**

```assembly
# inside strlen.s
        .text            # directive to the assembler, that instructions start after this line

strlen:                  # stub for now, always "returns" 0
        li   $v0, 0
        jr   $ra

main:                    # like most compilers, the program starts with main
        # push $ra onto stack
        addi $sp,$sp, -4
        sw   $ra, 0($sp)

        # invoke strlen that we need to write, on hello
        la   $a0, hello  # take the pointer to hello, and put it in register $a0
        jal  strlen      # call strlen function, which will take that $a0

        # print result to screen
        move $a0,$v0     # get the data from $v0, and put it into $a0
        li   $v0, 1
        syscall          # prints the data using the argument $a0

        # restore $ra from stack
        lw   $ra, 0($sp)
        addi $sp,$sp, 4
        jr   $ra

        .data            # directive to the assembler, that data comes after this line
hello:  .asciiz "Hello, world!!!"
```

and to make this work, we still need to actually implement `strlen`.

- the idea is that we take a counter.
- If the next byte of string is not `0`, we: - increment the counter - move the pointer to the next byte - done when we get a `0` ```assembly strlen: # initialize count li $v0, 0 # remember, program runs from top to bottom strlen_top: # load character from mem lbu $t0, 0($a0) # check if done beq $t0,$0, strlen_return # increment counter & advance pointer addi $v0,$v0, 1 addi $a0,$a0, 1 # go back to the top of the loop b strlen_top # pseudo instruction strlen_return: jr $ra ``` where: - if you simply had `b strlen`, then you **will have re-initialized your counter** # Week 10 ## MIPS Memory Layout <img src="\lectures\images\typora-user-images\image-20201110231726919.png" alt="image-20201110231726919" style="zoom: 50%;" /> where: - it is basically what we know from `C` - the `text segment` is indicative of the **`.text` declaration** - where our code goes - the `static data` is indicative of the `.data` declaration - where our static variables go - the `stack + heap` is the place for dynamic data - `stack` is used by the `$sp` (by user) , also used for general subroutines calls (by MIPS) - `heap` is used when you do `syscall` for memory allocation (similar as `malloc` in C) ## Interpreting a Call Stack Essentially, the **information stored in a call stack** would be identical to the **current path** in our **call tree**. 
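This correspondence is easy to check in any language with function calls. The Python sketch below (the function names and the manual path list are purely illustrative) maintains its own "call stack" of frame names and confirms that, whenever the innermost function runs, that stack equals the root-to-leaf path in the call tree.

```python
# Checks the claim: the call stack at any instant equals the
# current root-to-node path in the call tree. All names made up.
active_path = []      # our manual "call stack" of frame names
snapshots = []        # stack contents observed inside bar()

def call(name, body=lambda: None):
    active_path.append(name)   # like jal: push a frame
    body()
    active_path.pop()          # like jr $ra: pop the frame

def bar():
    snapshots.append(list(active_path))

def foo():
    call("bar", bar)

def main():
    for _ in range(3):         # main calls foo several times
        call("foo", foo)

call("main", main)
# every time bar ran, the stack held exactly main -> foo -> bar
assert snapshots == [["main", "foo", "bar"]] * 3
assert active_path == []       # all frames popped once main returned
```

Sibling calls never appear together on the stack: only the ancestors of the currently executing function do.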
Consider a simple code that does:

```pseudocode
main(){
    # looping 10 times
    for i=1,2,...10:
        foo()
}

foo(){
    bar()
}
```

and the call tree looks like:

<img src="\lectures\images\typora-user-images\image-20201110233713605.png" alt="image-20201110233713605" style="zoom:50%;" />

where:

- at any point, your call stack stores information identical to the structure of the current path in your call tree

## More Recursive Examples

### Recursive `array_min` function

Suppose you have an array of integers, and you need to find the minimum value of the elements in the array

- think about it, it would be easy to solve using **recursion** (note that this is *not* tail recursion, since `min` still runs after the recursive call returns)

```c
int array_min(int len, int *ptr){
    if (len == 0)
        return INT_MAX;  /* MAXVAL */
    else
        return min(*ptr, array_min(len - 1, ptr + 1));  /* ptr+1 in C advances 4 bytes */
}
```

MIPS:

```assembly
        .text
array_min:
        # if len is zero
        beqz $a0, array_min_basecase

        addi $sp,$sp, -8
        sw   $ra, 0($sp)
        sw   $s0, 4($sp)     # s0 should be kept unchanged as well, since it is preserved

        # recurse
        # load first int in array to s0 (because we have a pointer)
        lw   $s0, 0($a1)     # altered s0, which could happen recursively
        addi $a0,$a0, -1     # len = len - 1
        addi $a1,$a1, 4      # ptr += 4
        jal  array_min       # subroutine call

        # return min(s0, v0)
        slt  $t0,$s0, $v0    # if s0 is smaller, t0 = 1
        bnez $t0, array_min_s0_smaller

array_min_return_v0:
        # return v0
        # restore
        lw   $ra, 0($sp)
        lw   $s0, 4($sp)
        addi $sp,$sp, 8
        jr   $ra

array_min_s0_smaller:
        move $v0,$s0
        b    array_min_return_v0

array_min_basecase:
        li   $v0, 0x7fffffff # INT_MAX
        jr   $ra

main:
        addi $sp,$sp, -4
        sw   $ra, 0($sp)

        li   $a0, 5          # len of test_array
        la   $a1, test_array
        jal  array_min

        move $a0,$v0
        jal  print_int

        lw   $ra, 0($sp)
        addi $sp,$sp, 4
        jr   $ra

print_int:
        li   $v0, 1          # syscall service 1 = print integer
        syscall
        jr   $ra

        .data
test_array: .word 3,5,-9,3,1
```

## Instruction Encodings

Basically, this talks about how we get instructions into machine code.
- this gets encoded by the **assembler**
- the machine code is finally **read and executed** by your processor

***For example***

<img src="\lectures\images\typora-user-images\image-20201113022535876.png" alt="image-20201113022535876" style="zoom:50%;" />

where:

- notice that **each instruction** has a size of 4 bytes, which is again, 32 bits.
- and **PC** stands for the program counter, which basically is the **address of the current instruction**
  - this is sometimes also referred to as the instruction pointer (**IP**)

### MIPS Instruction Formats

All MIPS instructions are **encoded** into one of three formats.

- **R-Type**
  - register operands
- **I-Type**
  - immediate operands
- **J-Type**
  - for jumping

where:

- remember from the above that each encoded instruction is also 32 bits wide

#### R-Type

The structure looks like this:

<img src="\lectures\images\typora-user-images\image-20201113024158904.png" alt="image-20201113024158904" style="zoom: 33%;" />

where:

- `op` means the code for the operation you are doing
  - works together with the `funct` field
- `rs`/`rt` would be source registers
- `rd` would be the destination register
  - notice each of them has 5 bits, which makes sense since there are $2^5 = 32$ registers in total
  - e.g. `$ra` would represent `register id=31`
- `shamt` is used **only** for the shift instructions.
  - it will be `0` for all the other operations
- `funct` can be understood as a little **extension** to the opcode field, to tell the processor what operation to perform

***For example:***

Consider the assembly code:

```assembly
add $s0,$s1, $s2
sub $t0,$t3, $t5
```

Has the following values:

<img src="\lectures\images\typora-user-images\image-20201113024332905.png" alt="image-20201113024332905" style="zoom:50%;" />

where:

- e.g. `$s1` has register number `17` (and `$s0` is `16`)
- e.g. every R-Type instruction has `op=0`; the `add` instruction is then identified **with** `funct=32`
- e.g. `shamt` is `0` because we are not shifting

Lastly, encoding into binary:

<img src="\lectures\images\typora-user-images\image-20201113024543119.png" alt="image-20201113024543119" style="zoom: 50%;" />

where:

- `000000 10001 10010 10000 00000 100000` is in **hex** `0x02328020`

#### I-Type

These are basically the instructions that involve **immediates**:

<img src="\lectures\images\typora-user-images\image-20201113024919050.png" alt="image-20201113024919050" style="zoom: 33%;" />

where:

- `op` alone is enough to identify the operation
- both `rs` and `rt` could be source/destination
- `imm` stores the **immediate** that you have in the instruction
  - e.g. `addi $t0,$t0, 1`, so `1` is stored in `imm`
- in fact, the `branch` instructions also belong to this category

***For example:***

<img src="\lectures\images\typora-user-images\image-20201113025217788.png" alt="image-20201113025217788" style="zoom:50%;" />

where:

- notice that for `lw` and `sw`, the **pointer offset** is an immediate that is also stored in `imm`
- the line `lw $t2, 32($0)` has:
  - `$t2` = `rt` (destination)
  - `$0` = `rs` (source)
- but the line `sw $s1, 4($t1)` has:
  - `$s1` = `rt` (source)
  - `$t1` = `rs` (source)
    - because you need to read the address contained in this register, ***not*** write to this register

then, the encoding becomes:

<img src="\lectures\images\typora-user-images\image-20201113025243753.png" alt="image-20201113025243753" style="zoom:50%;" />

#### J-Type

When you jump:

<img src="\lectures\images\typora-user-images\image-20201113030113150.png" alt="image-20201113030113150" style="zoom:33%;" />

where:

- `addr` contains the **address** used to form the **jump target** (the next instruction to execute)

<img src="\lectures\images\typora-user-images\image-20201113030421025.png" alt="image-20201113030421025" style="zoom:50%;" />

## Architecture vs Microarchitecture

- MIPS (ISA) Architecture means the **hardware/software interface**
  - e.g. the programs we write
- MIPS Microarchitecture means the **implementation of the interface**
  - e.g. the processor that executes the ISA
  - so that when processors get better, the idea is that the same ISA would still be supported (hence no change in software)

Here, we cover two microarchitectures:

1. **Single Cycle Processor**
2. **Pipelined Processor**

which support the following interface (ISA):

<img src="\lectures\images\typora-user-images\image-20201113030752518.png" alt="image-20201113030752518" style="zoom:50%;" />

### Arithmetic Logic Unit (ALU)

This component will be used by the processors. It performs various **arithmetic and logical computations**.

- every processor will have at least one of these.

<img src="\lectures\images\typora-user-images\image-20201113031405777.png" alt="image-20201113031405777" style="zoom:50%;" />

where:

- each input/output will be `N` bits
- the `3`-bit `F` is basically the control signal

> **Implementation**
>
> - Up to this point, you should be able to figure out exactly how each abstraction works, and how the circuit works:
>
> <img src="\lectures\images\typora-user-images\image-20201113031550228.png" alt="image-20201113031550228" style="zoom:50%;" />

### State Elements of MIPS Processor

Other important components we need to know are:

<img src="\lectures\images\typora-user-images\image-20201113031744468.png" alt="image-20201113031744468" style="zoom:50%;" />

where, at a high level:

- the left-most `PC` contains the address of the current instruction

#### Instruction Memory and Data Memory

The idea is:

- **Address Input** produces **Data Output**
  - where the **data** could be instructions/actual data

<img src="\lectures\images\typora-user-images\image-20201113032027310.png" alt="image-20201113032027310" style="zoom:50%;" />

where:

- **Instruction Memory** can only read:
  - gets an address from `A`
  - outputs the instruction encoding at that address **to** `RD`
- **Data Memory** can both **read and write:**
  - gets an
address to read/write from `A` (controlled by `WE`, the write-enable signal)
  - if reading, `WE=0`, and the read content goes out **to** `RD`
  - if writing, `WE=1`, and the content to write goes in **from** `WD`

#### Register File

The storage of **all the registers** you are using in MIPS

- contains 32, `32`-bit registers

<img src="\lectures\images\typora-user-images\image-20201113032729498.png" alt="image-20201113032729498" style="zoom:50%;" />

where:

- remember that the addresses of registers are just `0-31`, hence they only need `5` bits
- since an instruction takes at most 3 register operands, we could:
  - **read** two **registers** using the `A1` and `A2` addresses, with read data output to `RD1` and `RD2`
  - **write** contents to a destination **register** with the `A3` address, with the data to write supplied on `WD3`
- note, if you are reading and writing in the **same cycle**, you should always ensure that **write happens after read**

#### Program Counter

A register that holds a pointer (address) to the **current** instruction.

<img src="\lectures\images\typora-user-images\image-20201113034347078.png" alt="image-20201113034347078" style="zoom: 67%;" />

where:

- the **next instruction** would be the `PC'`
- usually, if we have linear execution without branching, it will just be `PC+4` to move on to the next instruction

# Week 11

## Single Cycle Processor Implementation

The final processor will be the **union of all the below instruction implementations**.

- there could be a more efficient way to do it, but this approach is simple and straightforward

### Executing the `lw` Instruction

First, remember we are **executing** something like:

<img src="\lectures\images\typora-user-images\image-20201113034728490.png" alt="image-20201113034728490" style="zoom: 67%;" />

So:

1. **Fetch** this **instruction** using the address from the **instruction memory**

   <img src="\lectures\images\typora-user-images\image-20201113034826372.png" alt="image-20201113034826372" style="zoom: 50%;" />

   note:

   - technically, fetching the instruction from instruction memory does not take a separate clock cycle.

2. Read the `rs` register from the **register file**:

   - the **address** of `rs` would be bits `25:21` of the instruction
   - since we are dealing with `lw`, `rs` stores the target pointer to load from

   <img src="\lectures\images\typora-user-images\image-20201113034941036.png" alt="image-20201113034941036" style="zoom: 50%;" />

3. Sign-extend the **immediate** (16 bits) to become 32 bits

   <img src="\lectures\images\typora-user-images\image-20201113035235122.png" alt="image-20201113035235122" style="zoom: 50%;" />

4. **Compute** the memory address

   - calculate the base address (`rs`) + offset (from the immediate)

   <img src="\lectures\images\typora-user-images\image-20201113035433847.png" alt="image-20201113035433847" style="zoom: 50%;" />

5. Use the computed memory address to **read data from Data Memory**, back to the register file for writing

   - remember, we write/save the loaded data to another register, `rt`
   - to write, **give `WE=1`**
   - the target register to write is stored in `rt`, bits `20:16` of the instruction

   <img src="\lectures\images\typora-user-images\image-20201113035607562.png" alt="image-20201113035607562" style="zoom:50%;" />

6.
Now, the computation is completed, and we need to **decide** on the **next instruction**:

   - in this case, we simply add `4` to get the **address of the next instruction** (since `lw` is ***not*** a jump)

   <img src="\lectures\images\typora-user-images\image-20201113040236783.png" alt="image-20201113040236783" style="zoom:50%;" />

> **Note**:
>
> - The above did not discuss the **`op` code**, which actually **contributes to the control signals** at the point when the instruction is read:
>
>   <img src="\lectures\images\typora-user-images\image-20201113040506224.png" alt="image-20201113040506224" style="zoom:50%;" />
>
> - The initial value of `PC` is set by the OS (not for us to control)

### Executing the `sw` Instruction

<img src="\lectures\images\typora-user-images\image-20201117233130952.png" alt="image-20201117233130952" style="zoom: 67%;" />

where for `sw`:

- we are executing `sw $rt, imm($rs)`

<img src="\lectures\images\typora-user-images\image-20201117233016706.png" alt="image-20201117233016706" style="zoom: 67%;" />

where:

1. Assume this circuit knows that we are doing `sw`
2. Fetch the **instruction data** from **Instruction Memory** using the address given by `PC`
3. Extract `rs` into `A1` of the **Register File**, and **Sign Extend** the `imm` into 32 bits
   - we first compute `imm($rs)`
4. **Add** the data from `A1` and `imm`, obtaining the computed address
5. **Use** the above address to know **where** to write in **Data Memory**
6. **Read** the data of `$rt` (by putting `rt` into `A2` of the **Register File**), and send it to `WD` of **Data Memory**
   - now you need `write=1` enabled for data memory

Notice:

- for `sw`, we are writing to memory. So there is **no** change to the contents of your register file/no loop back from data memory to the regfile.
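The `lw`/`sw` walkthroughs above can be condensed into a few lines of code. The Python sketch below (the register numbers, addresses, and memory contents are made-up examples) decodes the I-type fields, sign-extends the 16-bit immediate, computes `reg[rs] + imm`, and then reads or writes data memory; the `PC+4` update is left out.

```python
# Tiny sketch of the single-cycle lw/sw steps (illustrative state):
# decode I-type fields, sign-extend imm, compute reg[rs] + imm,
# then read (lw) or write (sw) data memory.
def sign_extend16(imm):
    """Step 3: sign-extend a 16-bit immediate to a full integer."""
    return imm - 0x10000 if imm & 0x8000 else imm

def execute(instr, reg, mem):
    op = (instr >> 26) & 0x3F            # bits 31:26
    rs = (instr >> 21) & 0x1F            # bits 25:21 (base register)
    rt = (instr >> 16) & 0x1F            # bits 20:16
    imm = instr & 0xFFFF                 # bits 15:0
    addr = reg[rs] + sign_extend16(imm)  # step 4: base + offset
    if op == 0x23:                       # lw: memory -> register file
        reg[rt] = mem[addr]
    elif op == 0x2B:                     # sw: register file -> memory
        mem[addr] = reg[rt]

reg = [0] * 32
mem = {0x20: 42}      # one word of data memory (made up)
reg[9] = 0x1C         # $t1 holds a base address

# lw $t2, 4($t1): op=0x23, rs=9 ($t1), rt=10 ($t2), imm=4
execute((0x23 << 26) | (9 << 21) | (10 << 16) | 4, reg, mem)
assert reg[10] == 42  # $t2 received the loaded word

# sw $t2, -4($t1): a negative offset, encoded in 16 bits
execute((0x2B << 26) | (9 << 21) | (10 << 16) | (-4 & 0xFFFF), reg, mem)
assert mem[0x18] == 42  # stored at reg[$t1] - 4
```

The second call shows why sign extension matters: without it, the offset `0xFFFC` would be treated as `+65532` instead of `-4`.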
### Executing R-Type Instructions

Imagine the below for an `add` instruction:

<img src="\lectures\images\typora-user-images\image-20201117233737737.png" alt="image-20201117233737737" style="zoom: 67%;" />

where:

- An example of an R-Type would be: `add $rd,$rs, $rt`

<img src="\lectures\images\typora-user-images\image-20201117233904678.png" alt="image-20201117233904678" style="zoom: 67%;" />

where:

- the multiplexer comes in because sometimes the addition needs an immediate, but sometimes it comes from a register. So we chose to MUX it.
  - however, remember that for **R-Types**, you don't have immediates, so that MUX will **always** be `0`.
- also, since R-Type instructions don't need to talk to the memory at all, we take another MUX such that the result **goes back to the Register File** for **writing**.
  - the **address of the register to write to** is contained in the `rd` field of the R-Type, being bits `15:11`.
- lastly, since we don't have a jump/branch, the next instruction address just **increases** by `4` as before.

### Executing `beq` Instructions

<img src="\lectures\images\typora-user-images\image-20201118014343617.png" alt="image-20201118014343617" style="zoom: 80%;" />

where:

- for `beq $rs,$rt, imm`, `imm` encodes the branch target
- Determine whether `rs` and `rt` are equal, and jump/branch to that target (from `imm`) if true.

> **Note:**
>
> - The **calculation for the next address** is entirely different between `beq` and `j`
>   - for `beq`, next `PC ← PC + 4 + SignExt18b({imm, 00})`
>   - for `j`, next `PC ← {(PC + 4)[31:28], address, 00}`

<img src="\lectures\images\typora-user-images\image-20201118014300231.png" alt="image-20201118014300231" style="zoom: 67%;" />

where:

- to determine if `rs` and `rt` are equivalent, we **subtract** them
  - this means the later ALU control needs to send `110` as the control signal.
- the `beq` branch target in the above is calculated by `PC ← PC + 4 + SignExt18b({imm, 00})`
  1. first we sign-extend `imm` (the order between this and the next step does not matter)
  2. then we shift left by `2` bits (i.e. append `00`)
     - this multiplies the offset by 4, because each instruction is 4 bytes; the offset in `imm` counts **instructions**, not bytes
  3. Add it to the current `PC` + `4`.
     - whether or not to branch depends on the result of the subtraction. Hence we have a **MUX right before `PC'`**

### Implementing the Controller

This is the piece that decides **which components to activate for each instruction**.

- remember that all the above instructions execute in 1 cycle

Components of the controller include:

#### Single Cycle Controller

<img src="\lectures\images\typora-user-images\image-20201118015642556.png" alt="image-20201118015642556" style="zoom:50%;" />

where:

- remember, for certain I-Type and J-Type instructions, the ALU decoder doesn't need to decide anything, since the ALU will always be adding/subtracting
  - e.g. **always add** for `lw`/`sw`, for computing the address with `imm`
  - e.g. **always subtract** for `beq`, for computing the difference between `$rs` and `$rt`

##### Main Decoder

In short, we need to be able to:

- **get** the `OP` and `FUNCT` fields
- **figure out** what to set the control signals to, such as `WriteEnable`, `ALU`, etc.

Reminder:

![image-20201118020417263](\lectures\images\typora-user-images\image-20201118020417263.png)

***For R-Types:***

<img src="\lectures\images\typora-user-images\image-20201117235850462.png" alt="image-20201117235850462" style="zoom: 67%;" />

***For `lw`***:

<img src="\lectures\images\typora-user-images\image-20201118000427655.png" alt="image-20201118000427655" style="zoom:67%;" />

where:

- The ALU Op in this case basically says to the ALU: ignore the `funct` field and just `add` whatever you get

***For `sw`***:

<img src="\lectures\images\typora-user-images\image-20201120065747554.png" alt="image-20201120065747554" style="zoom:80%;" />

where:

- `MemToReg` is a don't-care, because we don't have anything output from memory.
as we are just saving a word

***For `beq`***:

<img src="\lectures\images\typora-user-images\image-20201118000843868.png" alt="image-20201118000843868" style="zoom: 67%;" />

where:

- the `branch` control signal would be `1`, but whether or not to branch **depends** on the result of the subtraction.
  - i.e. do **not branch** if `$rs` is not equal to `$rt`

##### ALU Decoder

<img src="\lectures\images\typora-user-images\image-20201118001038195.png" alt="image-20201118001038195" style="zoom: 50%;" />

where:

- `00` and `01` mean "ignore the `funct` field and do add/subtract".
- Therefore, **ALU Control** basically **only cares** about `funct` when **`ALU OP=10`**

#### Extension:

##### Support `addiu`

<img src="\lectures\images\typora-user-images\image-20201118001152073.png" alt="image-20201118001152073" style="zoom:67%;" />

##### Support `jump`

<img src="\lectures\images\typora-user-images\image-20201118001237332.png" alt="image-20201118001237332" style="zoom:67%;" />

## Single Cycle Processor Performance

In short, we are measuring **how long we have to wait** for each program.

<img src="\lectures\images\typora-user-images\image-20201118001816195.png" alt="image-20201118001816195" style="zoom:50%;" />

where:

- $instruction$ refers to the **dynamic instruction count**: the number of instructions actually executed at runtime.
  - it is different from the (static) number of instructions in the code you write. 
- Our $CPI$ would be `1`, since we take exactly `1` clock cycle for each instruction
- The clock period is the reciprocal of the clock frequency, e.g. the `5.5GHz` CPUs you see in the market

### Critical Path and Clock Period

<img src="\lectures\images\typora-user-images\image-20201118001932259.png" alt="image-20201118001932259" style="zoom: 50%;" />

where:

- this would be the **MIN** clock period you can have (obviously, you need to wait for the propagation delays)
- $t_{mem-I}$ means memory read time for **instructions**, and $t_{mem-D}$ means memory read time for **data**, which in essence are both memory reads.

### Program Execution Time

For the above **single-cycle processor**:

<img src="\lectures\images\typora-user-images\image-20201118002108296.png" alt="image-20201118002108296" style="zoom:50%;" />

where:

- in fact, we could execute it faster if we increase the $\frac{\text{Instructions}}{\text{Clock Cycle}}$ ratio, which means we need to work on more than one instruction per cycle.
- this could be achieved if we use the **pipelined MIPS processor**

## Pipelined MIPS Processor Implementation

This is **another processor** that implements the MIPS ISA, and it uses pipelining to **increase** the $\frac{\text{Instructions}}{\text{Clock Cycle}}$ ratio.

- while this would make program execution time shorter, the design is **more complicated**

> **Intuition:**
>
> - In short, the single cycle processor has the problem of limiting the clock to the slowest instruction execution (i.e. 
`lw`):
>
> <img src="\lectures\images\typora-user-images\image-20201120070838851.png" alt="image-20201120070838851" style="zoom:50%;" />
>
> - In fact, this could be solved using the **multi-cycle processor**, such that:
>   - we break each instruction into **smaller components**, and execute one component per cycle
>   - therefore, we can save time by not executing unneeded components
>
> <img src="\lectures\images\typora-user-images\image-20201120071056383.png" alt="image-20201120071056383" style="zoom:50%;" />
>
> - A **pipeline** would work even faster than the above two, by:
>   - once a **single unit of execution** finishes, we **start** executing the **next** instruction
>   - in the **long run**, we would have **$\approx 1$ instruction done per cycle**
>   - this introduces complexities because we need to manage the dependencies
>
> <img src="\lectures\images\typora-user-images\image-20201120071454998.png" alt="image-20201120071454998" style="zoom:50%;" />
>
> a real-life analogy for the pipelined processor is:
>
> <img src="\lectures\images\typora-user-images\image-20201120072031580.png" alt="image-20201120072031580" style="zoom:33%;" />

In fact, this is exactly what happens in a pipelined processor:

<img src="\lectures\images\typora-user-images\image-20201120073416493.png" alt="image-20201120073416493" style="zoom:50%;" />

where:

- a **potential hazard** you see would be: what if the first instruction is a `beq`? Then we don't know which address to fetch the next instruction from.
  - these are also things that we need to manage in our design
- notice that some components are left-aligned, whereas some are right-aligned. 
This is because we cannot read **and** write to regfile at the **same time**:

<img src="\lectures\images\typora-user-images\image-20201120073935060.png" alt="image-20201120073935060" style="zoom:33%;" />

### Pipeline Datapath

<img src="\lectures\images\typora-user-images\image-20201120074157599.png" alt="image-20201120074157599" style="zoom:50%;" />

where:

- the **highlighted vertical bars** in the pipeline processor are **registers** themselves, such that we can do the **pipelining** in each clock cycle.
- **Decode** just means **read data from register file**

> **Naming for each stage:**
>
> - Sometimes we use **acronyms** for each stage:
>   - `Fetch`=`F`
>   - `Decode`=`D`
>   - `Execute`=`E` or `X`
>   - `Memory`=`M`
>   - `Writeback`=`W`

---

***For example***

- Consider the `add` instruction:

<img src="\lectures\images\typora-user-images\image-20201120074714420.png" alt="image-20201120074714420" style="zoom: 67%;" />

where:

- this means we would need $5$ **clock cycles** to finish executing `add`
- for `add`, the `Memory` stage is optional: since we are not doing anything with memory, we could have skipped it. 
However, **forcing every instruction** to do the $5$ stages has the **advantage** of:

- making sure at each cycle, **only one instruction completes**
- reducing design complexity

---

### Corrected Datapath

One error in the above design is the write register operation:

<img src="\lectures\images\typora-user-images\image-20201120075652403.png" alt="image-20201120075652403" style="zoom: 67%;" />

where:

- `A3`, the **register** to write to, comes in **after the `Decode` stage**
- `WD3`, the **data** to write, comes in **after the `Writeback` stage**
- having them **out-of-sync** means you might have data from another instruction written into a register of the next instruction

To **fix** this, we simply do:

<img src="\lectures\images\typora-user-images\image-20201120080056357.png" alt="image-20201120080056357" style="zoom: 50%;" />

### Pipeline Control

The idea of control is the same; the only difference is that we also pipe the control signals into the stage registers:

<img src="\lectures\images\typora-user-images\image-20201120080844160.png" alt="image-20201120080844160" style="zoom:50%;" />

where:

- we **compute** the control signals of an instruction **at once**
- **pipe** them into the registers
- **use** them when needed

> **Note:**
>
> - The **last character** of the name of a control signal indicates the **stage the signal's value belongs to**, which will be **different** at each stage:
>   - `RegWriteD` is the signal at the `Decode` stage
>   - `RegWriteE` is the signal at the `Execute` stage
>   - `RegWriteM` is the signal at the `Memory` stage
>   - etc. 
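Before looking at the timing diagrams, the payoff of pipelining can be sanity-checked with a toy cycle-count model. This is a sketch: only the five stage names come from the notes, the instruction count below is illustrative.

```python
# Toy model: an N-instruction program on a k-stage pipeline finishes in
# N + (k - 1) cycles, because a new instruction enters every cycle and
# the first instruction needs k cycles to drain through all stages.
STAGES = ["F", "D", "E", "M", "W"]  # Fetch, Decode, Execute, Memory, Writeback

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    return n_instructions + (n_stages - 1)

def one_stage_per_cycle_no_overlap(n_instructions, n_stages=len(STAGES)):
    # For comparison: each instruction runs its k stages back-to-back
    # with no overlap (a multi-cycle design without pipelining).
    return n_instructions * n_stages

n = 1000
print(pipelined_cycles(n))                 # 1004 cycles
print(one_stage_per_cycle_no_overlap(n))   # 5000 cycles
# Long-run CPI approaches 1: 1004 / 1000 ≈ 1.004
```

As `n` grows, the `k - 1` fill/drain cycles become negligible, which is where the "≈ 1 instruction done per cycle" claim comes from.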
### Abstraction for Pipeline Timing

<img src="\lectures\images\typora-user-images\image-20201120081711876.png" alt="image-20201120081711876" style="zoom:67%;" />

where:

- the **shading** indicates which **resource is in use**
  - the half shading shows that a `write` to the regfile happens before a `read` in the same cycle
- since we have 5 stages, we could **execute 5 "computations" at a time**

## Hazards in Pipelined Processor

### Data Hazards

One problem of the above design is due to the fact that we **Writeback** to the register file in the final step. This means if we do:

<img src="\lectures\images\typora-user-images\image-20201120082041027.png" alt="image-20201120082041027" style="zoom:50%;" />

where:

- doing `add` means `$s0` will only get **updated** in the **fifth cycle**
- but `and` needs to **read** the computed `$s0` on the **third cycle**

In fact, both `and` and `or` would not work:

<img src="\lectures\images\typora-user-images\image-20201120082141163.png" alt="image-20201120082141163" style="zoom:50%;" />

#### Remedies for Data Hazards

One approach is to simply insert `nop` commands:

<img src="\lectures\images\typora-user-images\image-20201120082400454.png" alt="image-20201120082400454" style="zoom:50%;" />

where:

- we therefore **force the spacing** between instructions to be correct
- but this causes **lower** processor performance

The preferred approach is called **forwarding/bypassing**:

<img src="\lectures\images\typora-user-images\image-20201120082826402.png" alt="image-20201120082826402" style="zoom: 50%;" />

where:

- the idea is that we **insert the correct value in manually**, since the computation of the data is always done at least one cycle earlier
- the only problem for the implementation would be to know exactly **which register** is having the `r/w` conflict
  - in fact, we do this by **only** checking the data on the **memory stage and the writeback stage**

#### Implementation of Forwarding/Bypassing

Since the data hazard only happens when data is ready in stage 3 but not 
written in stage 4 or 5, we just need to check

- data on stage 4 = `Memory`
- data on stage 5 = `Writeback`
- unless you are doing `lw`, whose data would only be there on stage 5.
  - this will be addressed in the section [`lw` Data Hazards](#`lw` Data Hazards)

Therefore, the data **after the read from register file** could come from:

1. `00` normal case, data from register file
2. `10` forwarding case, data from the `Memory` stage
3. `01` forwarding case, data from the `Writeback` stage

<img src="\lectures\images\typora-user-images\image-20201120084458456.png" alt="image-20201120084458456" style="zoom:50%;" />

And the control signals in total would **need information** from:

- which register in the regfile is being **read** (potentially causing a hazard)
- which register in the regfile is being **written** to
  - in the `Memory` stage
  - in the `Writeback` stage
  - remember we also need to know if we are **actually writing** to a register, hence we need the `RegWrite` control signal as well

<img src="\lectures\images\typora-user-images\image-20201120084638810.png" alt="image-20201120084638810" style="zoom:67%;" />

where:

- `10` means $\text{register read is non-zero } \& \text{ register read == register written at Memory stage } \& \text{ register actually written in Memory}$
- `01` means $\text{register read is non-zero } \& \text{ register read == register written at Writeback stage } \& \text{ register actually written in Writeback}$
- `00` otherwise

#### `lw` Data Hazards

The above works assuming data is ready at both the `Memory` stage and the `Writeback` stage, but there is one exception: `lw`

- now, since `lw` **only has data ready** at the last cycle, this means we cannot prepare the data for the next instruction
- therefore, we have **no choice** but to **stall**
  - or insert a `nop`; at this point it is equivalent

<img src="\lectures\images\typora-user-images\image-20201120090155810.png" alt="image-20201120090155810" style="zoom:50%;" />

#### Implementation of Stalling

Basically, we place a 
"**bubble**" in between the `lw` and the `and` instruction in the above example. - it is essentially the ***same as we delay the instruction by one `nop`*** <img src="\lectures\images\typora-user-images\image-20201124232624511.png" alt="image-20201124232624511" style="zoom: 67%;" /> where: - basically, instead of just doing nothing, we **repeat** what has been done at the previous stage. - just ***imagine sticking in a `nop` between `lw` and `and`,*** so that the rest of the instructions flow correctly - the **net result** is that each **instruction after `lw` is delayed** for one stage. This is visibly seen for the instruction `sub`: - `sub` starts on cycle `4` on the diagram for section [`lw` Data Hazards](#`lw` Data Hazards) - `sub` starts on cycle `5` after the fix **Remember,** for `lw` hazard, we have need to check: - whether if the **next** instruction **reads data** (`rs` or `rt`) from the **register `lw` is writing to `rt`** > **Recall that:** > > - `lw $rt, offset($rs)` > - `add $rd,$rs, $rt` **To implement it, we have:** <img src="\lectures\images\typora-user-images\image-20201124233245849.png" alt="image-20201124233245849" style="zoom: 80%;" /> where: - essentially we check: - $\text{Actually Writing from Memory to RegFile && (Register Read by Next instruction==Register Written To)}$ - and since we have two possible register to read from: $\text{(Register Read by Next instruction<mark>Register Written To)=(rs at Decode/Read</mark>rt writing to)\vert \vert (rt at Decode/Read==rt writing to)}$ - if we have a `lw`, then we have a `MemToRegE=1`. - since we need to **stall** the **next two instructions** (to be equivalent of inserting a `nop`), we need to have: - `StallD` to stall/redo the same decode stage (by letting the `D` **clock** not advance) - `StallF` to stall/redo the same fetch stage (by letting the `F` **clock** not advance) - `FlushE` means make the data in the register all `0`. 
    - This signifies a `nop`, since essentially, all `Execute` (ALU calculation) work **starting from the stall** is pushed back by one cycle, meaning there is **no calculation** going on in this cycle

# Week 12

## Hazards in Pipelined Processor (Continued)

### Control Hazards

The problem comes from **branching**:

- in short, we **cannot determine** whether or not to branch before the **beginning of cycle 4**
- as a result, we might need to **flush/"undo" the next three instructions**

<img src="\lectures\images\typora-user-images\image-20201125013008262.png" alt="image-20201125013008262" style="zoom: 67%;" />

where:

- remember, to know if we are branching (`beq`), we do two things:
  - do the **subtraction** to see if `rs` and `rt` are equal
  - if we are branching, the **control signal** `branch=1`

#### Early Branch Resolution

Instead of flushing all three next instructions, we could:

- **determine** the branching in the **`Decode` stage**
- then you would only need to **flush one instruction**

<img src="\lectures\images\typora-user-images\image-20201124234836906.png" alt="image-20201124234836906" style="zoom:67%;" />

To **determine branching in the decode stage**, we need to **change our processor** a bit:

<img src="\lectures\images\typora-user-images\image-20201125013254681.png" alt="image-20201125013254681" />

where:

- while this would work, we have **additional data hazards**
  - remember, what if the **data read from register (stage 2)** was being written by the **previous instruction (stage 5)**
  - for example, consider the worst case of first `lw` (updating ***only on*** **cycle 5**), then `beq` for the same register (reading at **cycle 3**)
  - this hints at the need to **stall twice** to get the data ready if we have `lw` then `beq`

#### Handling Data Hazards in Early Branch Resolution

In short, we forward/bypass again. 
- Before, we implemented forwarding/bypassing to:
  - get correct data to `Execute` (`i+1`th instruction) from `Memory` (`i`th instruction)
  - get correct data to `Execute` (`i+2`th instruction) from `Writeback` (`i`th instruction)
- Now, we need to also forward/bypass to:
  - get correct data **to `Decode`** (`i+1`th instruction after stall = `i+2`th instruction) **from `Memory`** (`i`th instruction)
    - i.e. from the **beginning of `Memory` to late in `Decode`**
  - we **don't** need `W-D` forwarding, because, remember, **`W` (write) happens before `D` (read)!**

<img src="\lectures\images\typora-user-images\image-20201124235034034.png" alt="image-20201124235034034" style="zoom:80%;" />

where:

- Therefore, to check the `Memory` and `Decode` stages, we check:

<img src="\lectures\images\typora-user-images\image-20201125020743896.png" alt="image-20201125020743896" style="zoom:67%;" />

- if the **register read `rs` at the `D` stage** is the same as the **register written at the `M` stage**, then we forward
- if the **register read `rt` at the `D` stage** is the same as the **register written at the `M` stage**, then we forward

**However, we also need to deal with the problem that:**

- since `beq` needs the data in the register at the `Decode` stage, we need to check:

<img src="\lectures\images\typora-user-images\image-20201125025758083.png" alt="image-20201125025758083" style="zoom:67%;" />

- if **`beq`** reads **data from `rs/rt` at the `D` stage**, and that data **would** be the **ALU calculated data from the `E` stage** (of the previous instruction)
  - e.g. `add` followed by `beq`
- if **`beq`** reads **data from `rs/rt` at the `D` stage**, and that data **would** be the **data read from Memory at the `M` stage** (of the previous instruction)
  - e.g. `lw` followed by `beq`

**Therefore, the total stall condition would be**:

<img src="\lectures\images\typora-user-images\image-20201125025826679.png" alt="image-20201125025826679" style="zoom: 67%;" />

- technically **`FlushE`** has the **extra condition of `PCSrcD=1`**. 
Because if we have decided to branch, then we need to **flush the next instruction** as well.

> **In general, we see that these problems happen because**:
>
> - we have an **eager consumer**, e.g. `beq`
> - we have a **late producer**, e.g. `lw`

#### Fully Bypassed Processor

![image-20201124235547114](\lectures\images\typora-user-images\image-20201124235547114.png)

### Stall/Bypass Summary for Data Hazards

<img src="\lectures\images\typora-user-images\image-20201125000311098.png" alt="image-20201125000311098" style="zoom: 50%;" />

where:

- `other` includes instructions such as `add`, `sub`, etc.
  - then normal forwarding would suffice (without a `stall`)
- once we have a `beq`, the previous instruction needs to be **stalled** (if `beq` consumes its result)
  - because `beq` needs data at the late `Decode` stage, so we need the data to be prepared by the **previous instruction** at the latest by the **start of the `Decode` stage**
  - remember, we can only forward at the **start** of `Decode/Execute/...` due to our **implementation**.
- additionally, for `beq`, we will often **flush** the next instruction (if `beq` is `True` and we need to branch)
  - because we cannot figure out whether or not to branch in time for the **`Fetch` stage of the next instruction**. 
Therefore, if it turns out we need to branch, we need to discard the one instruction we have just computed
    - since the `Decode` stage and the `Fetch` stage are ***only one stage apart***
- once we have a `lw`, the next instruction needs to be **stalled** (if the next instruction consumes it)
  - because data from `lw` cannot be **ready for the `Execute` stage** of the next instruction
    - since the `Execute` stage and the `Memory` stage are ***only one stage apart***
- `bubble` means a **single stall**

> **The twice stall comes from:**
>
> - If you have `lw $s0 ...`, then `beq $s0 ...`, but **stalled only once**:
>
> <img src="\lectures\images\typora-user-images\image-20201125022213958.png" alt="image-20201125022213958" style="zoom:67%;" />
>
> - since the forwarding is **in this case only** available at the early stage of `Writeback` (no forwarding at late `Memory`), we need to **stall twice**.

## Pipelined Processor Performance

First, we consider how our CPI has changed. The ideal CPI would be 1, but since we have **stalls**, what would be our CPI now?

**On average in real life,** we have the following **dynamic instructions per program**:

- $54\%$ `R-Type`
- $25\%$ `load`
- $10\%$ `store`
- $11\%$ `branch`

If we have $40\%$ of the loads used immediately by the next instruction, and $25\%$ of branches being taken (actually jumped), then the **average** CPI would be:

<img src="\lectures\images\typora-user-images\image-20201125001302693.png" alt="image-20201125001302693" style="zoom:67%;" />

where:

- one **ignored** calculation was the scenario of having **`lw` and then immediately `beq`**, which would have **two stalls**: `CPI=3`
- the average CPI would be the **weighted sum of the above**:

$$
\text{Overall CPI}=0.54\times 1.0 +0.25\times 1.4+0.10\times 1.0+0.11\times 1.25\approx 1.13
$$

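The weighted sum above can be checked numerically. A small sketch, using the instruction mix and stall penalties stated in the notes:

```python
# Per-class CPI: 40% of loads stall 1 extra cycle; 25% of branches are
# taken and flush 1 instruction, costing 1 extra cycle on average.
mix = {"r_type": 0.54, "load": 0.25, "store": 0.10, "branch": 0.11}
cpi = {
    "r_type": 1.0,
    "load":   1.0 + 0.40 * 1,   # = 1.4
    "store":  1.0,
    "branch": 1.0 + 0.25 * 1,   # = 1.25
}

overall = sum(mix[k] * cpi[k] for k in mix)
print(round(overall, 4))  # 1.1275, i.e. ≈ 1.13
```

As the notes mention, this ignores the `lw`-then-`beq` case (two stalls), which would nudge the average slightly higher.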
### Pipelined Processor Critical Path

For a pipelined processor, we just need to think about the **longest computation time of each *stage***.

<img src="\lectures\images\typora-user-images\image-20201125002055006.png" alt="image-20201125002055006" style="zoom:67%;" />

where:

- the factor of $2$ for `Decode` and `Writeback` will be **useful** (but sometimes not necessary) for **avoiding more hazards**, because we needed to **read** from a register ***after*** **writing** to it.
  - this implementation usually won't take up much time, but for a **safe estimation**, we use a factor of $2$

> **Extension:**
>
> - In fact, one implementation to achieve the **read after write** would be:
>   - if the **register** I am **reading** is the **same** as the **register** I am **writing**, then I **output the data from `WD3`** directly instead of from the register
>
> <img src="\lectures\images\typora-user-images\image-20201125003702342.png" alt="image-20201125003702342" style="zoom:50%;" />
>
> where:
>
> - this would be implemented inside the register file, and in fact, it will probably not need $2 \times$ the time of the `Decode`/`Writeback` stage.

### Pipelined Performance Calculation

<img src="\lectures\images\typora-user-images\image-20201125002649852.png" alt="image-20201125002649852" style="zoom:67%;" />

## Pipelined Speedup

Now, we can **compare** the relative performances of **different processors**.

- **Speedup:** a measure of relative performance:

$$
\text{Speedup} = \frac{\text{Runtime Baseline}}{\text{Runtime Optimized}}
$$

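As a sketch of how a speedup comparison works: runtime is instruction count × CPI × clock period, and the ratio of two runtimes gives the speedup. The instruction count and clock periods below are made-up placeholders, not the numbers from the lecture's example.

```python
def runtime(instructions, cpi, clock_period_ns):
    """Program runtime in nanoseconds."""
    return instructions * cpi * clock_period_ns

n = 100_000_000  # assumed dynamic instruction count

# Single-cycle: CPI = 1 but a long critical path sets a long clock period.
# Pipelined: CPI rises slightly (stalls), but the clock period shrinks
# to roughly the slowest single stage. Periods here are illustrative.
single_cycle = runtime(n, cpi=1.0,    clock_period_ns=0.95)
pipelined    = runtime(n, cpi=1.1275, clock_period_ns=0.25)

speedup = single_cycle / pipelined
print(round(speedup, 2))  # ≈ 3.37 with these assumed numbers
```

Note the speedup is less than the "ideal" 5× of a 5-stage pipeline, both because CPI > 1 and because the stage-register overhead keeps the pipelined clock period above one fifth of the single-cycle one.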
***For example:***

<img src="\lectures\images\typora-user-images\image-20201125030013900.png" alt="image-20201125030013900" style="zoom:67%;" />

# Week 13

## Caches

This is another way of optimizing our processor. In short, the problem arises because the **performance of memory** might be sub-optimal.

- for example, it often happens that, since **memory is large**, we sometimes need **100 clock cycles** to read a piece of data, rather than 1.
- remember, the design of memory is separate from our CPU: we use it like an interface

<img src="\lectures\images\typora-user-images\image-20201201231831075.png" alt="image-20201201231831075" style="zoom:50%;" />

> **As a result:**
>
> - The **clock speed/performance** of instruction and data **memory** will ***limit* the performance of our CPU**.
>
> - In fact, the gap between the performances can be shown below:
>
>   - where the term "**memory wall**" means that the performance of your CPU will be upper bounded by the performance of that memory
>
> <img src="\lectures\images\typora-user-images\image-20201201232222455.png" alt="image-20201201232222455" style="zoom: 67%;" />

Therefore, **caches** are used to improve the performance, for many cases (but not all).

### Memory Technologies

It turns out that our memory design can only satisfy **two** of the three below:

<img src="\lectures\images\typora-user-images\image-20201201232541652.png" alt="image-20201201232541652" style="zoom: 67%;" />

#### Memory Hierarchy

The idea is that:

- we can **combine** some small-capacity but fast memories with some large but slow memories, and ...
- and to find a piece of data, we check
  - first our Cache (`SRAM`)
  - then our Memory (`DRAM`)
  - finally, our Disk

<img src="\lectures\images\typora-user-images\image-20201201233229162.png" alt="image-20201201233229162" style="zoom: 50%;" />

where:

- **density** means the number of gigabits we can store per `cm²`

> **Note:**
>
> - Both `SRAM` and `DRAM` are fast, but volatile. 
> - this means once we unplug the power, the data there is gone.
> - For computers in real life:
>   - `RAM` usually refers to `DRAM`
>   - `L1-3 Cache` usually refers to `SRAM`

---

***For example:***

<img src="\lectures\images\typora-user-images\image-20201201233823781.png" alt="image-20201201233823781" style="zoom:67%;" />

where:

- the entire thing is only **caches**
- the bottom part will be the `L3 Cache`

---

### Deciding What to Cache

***For example:***

- Consider the program to find the substring `hip` in the string below

<img src="\lectures\images\typora-user-images\image-20201201234908406.png" alt="image-20201201234908406" style="zoom:67%;" />

- It turns out that commonly:
  - **temporal locality**: data `X` used **recently** is likely to be used again
    - cache data that was used recently
  - **spatial locality**: if data `X` is used, then **data near** `X` in terms of memory address is likely to be used as well
    - cache data that is near the data you touched

#### Caches Exploit Locality

Since the cache is small:

- Temporal Locality:
  - copy newly accessed data
- Spatial Locality:
  - copy neighboring data along with the accessed data

> **Note**
>
> - the above is just the average case. You could perfectly well come up with a program that has nothing to do with temporal and spatial locality.

### Caching Technology

Caches make up the highest level of the memory hierarchy

- caches are typically as fast as one clock cycle
- with luck (if the data needed is in the cache), performance is fast

Therefore, we would want to:

- **design our cache such that it can supply most of the data needed**

**Questions** to think about are:

- What data should the cache hold?
  - recently accessed data, based on spatial + temporal locality
- How should that data be found?
  - using an address "hash" (address as key)
- What data in the cache should be replaced when the cache is full? 
  - least recently used data should be replaced

#### Cache Performance

- **Hit**
  - Data is found in the level of memory hierarchy
- **Miss**
  - Data not found, need to look in next level

<img src="\lectures\images\typora-user-images\image-20201202000333953.png" alt="image-20201202000333953" style="zoom:67%;" />

- **Expected Access Time** $E_{L}$ for a memory level $L$ with latency $t_L$ and **miss** rate $M_L$ is:

$$
E_L = t_L + M_L \cdot E_{L+1}
$$

where:

- $t_L$ of the current level refers to the time needed to **check whether the data is here**

***For example:***

- **Question:**

<img src="\lectures\images\typora-user-images\image-20201202000724578.png" alt="image-20201202000724578" style="zoom: 50%;" />

<img src="\lectures\images\typora-user-images\image-20201202000813419.png" alt="image-20201202000813419" style="zoom:50%;" />

- **Solution:**

  First, the hit rate is simple:

$$
\text{Hit rate} = \frac{750}{1000} = 0.75
$$

then we know that the miss rate is $M_L = 0.25$, hence:

$$
E_L = 1.00 \times 1 + 0.25 \times 100 = 26
$$

where:

  • $1.00$ comes from the fact that, no matter where the data eventually is found, you always look at/start from the top memory level

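The recursion $E_L = t_L + M_L \cdot E_{L+1}$ can be written as a small helper; the numbers below reproduce the worked example (a 1-cycle cache with miss rate 0.25 backed by a 100-cycle memory that always hits):

```python
def expected_access_time(latencies, miss_rates):
    """Evaluate E_L = t_L + M_L * E_{L+1} from the last level backwards.
    The last level is assumed to always hit (its miss rate is 0)."""
    e = 0.0
    for t, m in reversed(list(zip(latencies, miss_rates))):
        e = t + m * e
    return e

# Cache: 1 cycle, misses 25% of the time; memory: 100 cycles, always hits.
print(expected_access_time([1, 100], [0.25, 0.0]))  # 26.0 cycles
```

The same helper extends to deeper hierarchies, e.g. `expected_access_time([1, 10, 100], [0.1, 0.2, 0.0])` for an L1/L2/memory stack with assumed latencies.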
Direct-Mapped Cache

This is the simplest design of a cache:

  • for each address in main memory, there will be exactly one location in the cache where that data can be stored
  • therefore, there will be collisions



  • Since we are mapping 32 bits to a 3-bit “hash”, we need to also store the remaining bits of the address as well.

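A sketch of the address split for this direct-mapped cache (8 sets, 4-byte blocks; the example address is made up for illustration):

```python
def split_address(addr):
    """Split a 32-bit byte address for a direct-mapped cache with
    8 sets and 4-byte blocks: 2-bit byte offset, 3-bit set index,
    and the remaining 27 bits stored as the tag."""
    byte_offset = addr & 0b11          # bits 1:0
    set_index   = (addr >> 2) & 0b111  # bits 4:2, the 3-bit "hash"
    tag         = addr >> 5            # bits 31:5, stored alongside the data
    return tag, set_index, byte_offset

tag, set_index, offset = split_address(0x0000004C)  # 0b100_1100
print(tag, set_index, offset)  # tag=2, set=3, byte offset=0
```

Two addresses with the same set index but different tags collide, which is why the tag must be stored and compared on every access.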
Direct-Mapped Cache Implementation



  • tag is basically used to store the remainder of the address

  • V is the valid bit, telling you if the data there is valid before even checking the address tag

  • a row in the cache above is called a set

    • for example, we have 8 sets above, indexed from 0-7
  • a column in the cache above is called a way

    • basically, how many “caches” there are in parallel

    • for example, the cache below has 2 ways and 6 sets:


Direct-Mapped Cache Behavior

For example:

  • Consider the program:


    the lw reads from memory, hence this is where the cache comes in


Note that:

  • if you are doing operations that only involve registers in regfile, you don’t need to cache, obviously, as that is part of the CPU.
  • Since it started empty, we will miss the first three accesses, but the remaining ones are going to be hits


  • where:
    • the miss rate would be $\frac{3}{15}$ (assuming only temporal locality)

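The 3/15 miss rate can be reproduced with a minimal simulation. A sketch: the three load addresses are assumed here to map to three distinct sets, as in the example.

```python
# Direct-mapped cache: 8 sets, each holding one 4-byte block.
# A set is modeled as just its stored tag (None = valid bit off).
cache = [None] * 8

def access(addr):
    """Return True on a hit; on a miss, install the block."""
    set_index = (addr >> 2) & 0b111
    tag = addr >> 5
    if cache[set_index] == tag:
        return True
    cache[set_index] = tag
    return False

# The loop body reads the same three addresses, repeated 5 times.
addrs = [0x0, 0x4, 0xC]  # assumed addresses, mapping to sets 0, 1, 3
results = [access(a) for _ in range(5) for a in addrs]
misses = results.count(False)
print(f"{misses} misses out of {len(results)} accesses")  # 3 out of 15
```

Only the first pass misses (compulsory misses); every later pass hits thanks to temporal locality.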
Direct-Mapped Cache Conflict Misses

Now, consider the program:


however, notice the address lw used from memory are:


As a result, you will be repeatedly accessing the same set in the cache:

After first iteration:


After second iteration:


Therefore, the trend is that, for our memory accesses:



  • the problem can be thought of as each set being “a queue”; here, there will be 8 queues, each of size 1 only

  • this could be improved if we use a 2-way set associative cache/fully associative cache

    • so that you have 4 queues each of size 2, or 1 queue of size 8, preventing this problem
    • but on average, the performance would be the same, as you increase the probability of colliding

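The “queues” intuition can be made concrete with a tiny set-associative simulator using LRU replacement. The two addresses below are chosen to collide in the direct-mapped case; they are illustrative, not the lecture's trace.

```python
from collections import OrderedDict

def simulate(addresses, n_sets, n_ways, block_bits=2):
    """Count misses for a set-associative cache with LRU replacement;
    block_bits=2 models 4-byte blocks."""
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for addr in addresses:
        line = addr >> block_bits      # block address
        index = line % n_sets
        s = sets[index]
        if line in s:
            s.move_to_end(line)        # refresh LRU order on a hit
        else:
            misses += 1
            if len(s) >= n_ways:
                s.popitem(last=False)  # evict the least recently used way
            s[line] = True
    return misses

# 0x4 and 0x24 share a set in the 8-set direct-mapped cache (same index,
# different tags), so alternating between them evicts every time.
trace = [0x4, 0x24] * 5
print(simulate(trace, n_sets=8, n_ways=1))  # 10: every access misses
print(simulate(trace, n_sets=4, n_ways=2))  # 2: compulsory misses only
```

With the same total capacity, the 2-way cache turns the size-1 queue into a size-2 queue, so both hot blocks fit.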
General Schema for Cache


2-Way Set Associative Cache

In short, all the caches below will have 4 bytes of space per block for data (as a shared trait), and a total of 32 bytes

  • direct-mapped cache
    • in fact, as we know and the name suggests, it has only one way (8 sets), hence direct-mapped
  • 2-way set associative cache
    • this will have two ways (4 sets)
  • fully associative cache
    • as you will see in later section, this has only one set (8 ways)


  • In fact, all the above will ignore spatial locality.

Therefore, 2-way set associative cache looks like:



  • now, since we only have 4 sets, we can index it with 2 bits
  • therefore, the tag will contain (32 - 2 byte-offset bits) - 2 set-index bits = 28 bits

2-Way Set Associative Cache Implementation

The difference will just be:

  • need to compare twice since we have two possible blocks for each set
  • need to have a MUX that tells us, when hit, which one to put out
    • here, it is simply Hit1



  • since we have doubled the #ways, but the total size of cache is constant, we have only 4 sets

  • since we have 2 ways, we need to compare the tag twice (and v twice)

2-Way Set Associative Cache Behavior

Again, consider the program that we had problem with:


And both data will be stored


As a result:



  • the collision probability per set and the number of comparisons per access have gone up.

Fully Associative Cache

Now, we have a “queue” of size 8:



  • now, since we just have 1 set, we don’t need a set index anymore
  • therefore, we will have 30 bits for tag


  • this would make 0% conflict miss
  • only compulsory and capacity misses remain
    • compulsory misses - in the beginning, we haven’t even read the data once. Then this will of course be missed.
    • capacity misses - if we are working with memory data that is larger than the total size of our cache
    • for example, we have $8$ blocks of 4 Bytes. If we are working with 10 lw all from different locations, then eventually we will miss because it exceeded our total capacity.
  • however, the #comparisons has gone up due to the associativity/collision rate

Direct-Mapped-16 Cache

This one is different from all the above, in that it stores 16 bytes per block (4 words can be stored)

  • and still a total of 32 bytes, for the sake of comparison
  • this adds spatial locality into consideration


Direct-Mapped-16 Cache Implementation

Now, we are dealing with:

  • A direct mapped cache, so just one way
  • but we are storing 16 bytes per block



  • set index starts at the 5th bit of the address, instead of the 3rd bit
  • block offset is used to select which sub-word within the block, i.e. which data you want
  • tag will be (32 - 2 byte-offset bits) - 1 set-index bit - 2 block-offset bits = 27 bits


Added Spatial Locality:

  • When we get a miss:
    • we store 4 pieces of data that have the same tag+set number, with block offset either 00,01,10,11

Direct-Mapped-16 Cache Example

Consider the program:


and with spatial locality:



  • the four data brought in had addresses (tag + block offset + byte offset):
    • 0 + 00 + 00
    • 0 + 01 + 00
    • 0 + 10 + 00
    • 0 + 11 + 00

As a result:



  • in the end, it also reduced the compulsory misses in the beginning

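The effect of block size can be checked with the same kind of simulation: on a sequential scan, a miss with 16-byte blocks brings in four neighboring words, so most later accesses hit. A sketch with an illustrative trace:

```python
def count_misses(addresses, n_blocks, block_size_bytes):
    """Count misses for a direct-mapped cache with the given block size."""
    block_bits = block_size_bytes.bit_length() - 1  # log2 of block size
    lines = [None] * n_blocks
    misses = 0
    for addr in addresses:
        block_addr = addr >> block_bits   # neighbors share a block address
        index = block_addr % n_blocks
        if lines[index] != block_addr:
            misses += 1
            lines[index] = block_addr     # install the whole block
    return misses

# Sequentially read 8 consecutive words (byte addresses 0x0 .. 0x1C);
# both caches total 32 bytes, as in the comparison above.
trace = [i * 4 for i in range(8)]
print(count_misses(trace, n_blocks=8, block_size_bytes=4))   # 8 misses
print(count_misses(trace, n_blocks=2, block_size_bytes=16))  # 2 misses
```

The 4-byte-block cache misses on every word (all compulsory misses), while the 16-byte-block cache pays one compulsory miss per block and then hits on the three neighboring words.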
Intel On-Chip Cache Trajectory

Over time, caches are:

  • growing in size
  • organized into deeper hierarchies
  • moving physically closer to the CPU
    • which is good for faster CPU accesses



  • in general, it has been found that separating data and instruction storage into two caches speeds up performance
  • a trace cache caches a sequence of instructions before execution needs them (by guesswork)