# COMPUTER ARCHITECTURE

# 6 Central Processing Unit - CPU



## 6 Central Processing Unit – objectives and outcomes:

### A basic understanding of:

- architecture (basic electronic circuits) and operation of the CPU
- synchronization of circuits with the clock signal
- micro-programmed (SW) or hard-wired (HW) implementation of the CPU
- parallelism:
  - reasons for its existence
  - parallelism at the instruction level
    - pipeline
- the execution of instructions in the CPU



- Basic structure and operation of the CPU
- ARM CPU features summary
- Structure of CPU: ARM case
- Execution of instructions
- Parallel execution of instructions
- Pipelined CPU
- An example of a 5-stage pipelined CPU
- Multiple issue processors



## 6.1 Basic structure and operation of the CPU

The CPU (Central Processing Unit) is the unit that executes instructions, so its performance largely determines the performance of the whole computer.

In addition to the CPU, most computers also have other processors, mainly in the input/output part of the computer.

The basic principles of operation are the same for all types of processors.

- The CPU is a digital system of a specific type (built from digital electronic circuits).
- Two groups of digital circuits:
  - Combinational digital circuits
    - The state of the outputs depends only on the current state of the inputs






  - Memory (sequential) digital circuits
    - The state of the outputs depends on the current state of the inputs and on the previous states of the inputs
    - Memory elements remember the state
    - The effect of previous inputs is captured in the internal state of the circuit

Example: 3-bit counter (figure: D flip-flops with SET and CLR inputs, outputs Q0 to Q2)
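As a rough illustration of how combinational and memory (sequential) circuits work together, the sketch below models a 3-bit counter in Python. It assumes a synchronous design in which all flip-flops share one clock, which may differ from the wiring in the figure; the class and function names are made up for this example.

```python
class DFlipFlop:
    """One-bit memory cell: Q takes the value of D only on the active clock edge."""

    def __init__(self):
        self.q = 0

    def clock_edge(self, d):
        self.q = d & 1


def next_state_logic(q0, q1, q2):
    """Combinational part: outputs depend only on the current inputs."""
    d0 = q0 ^ 1              # lowest bit toggles on every clock edge
    d1 = q1 ^ q0             # toggles when q0 is 1
    d2 = q2 ^ (q1 & q0)      # toggles when q1 and q0 are both 1
    return d0, d1, d2


class Counter3:
    """Sequential part: three flip-flops hold the internal state."""

    def __init__(self):
        self.flip_flops = [DFlipFlop() for _ in range(3)]

    def value(self):
        q0, q1, q2 = (ff.q for ff in self.flip_flops)
        return q0 | (q1 << 1) | (q2 << 2)

    def tick(self):
        # Compute all D inputs first, then update every flip-flop "at the edge".
        d_inputs = next_state_logic(*(ff.q for ff in self.flip_flops))
        for ff, d in zip(self.flip_flops, d_inputs):
            ff.clock_edge(d)


counter = Counter3()
sequence = []
for _ in range(9):
    sequence.append(counter.value())
    counter.tick()
```

The state changes only inside `tick()`, which plays the role of the active clock edge; between edges the stored value stays the same regardless of what the combinational logic computes.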



- Memory (sequential) circuits:
  - Flip-flop: a one-bit memory cell
  - Register
  - Counter
  - Memory



- Memory (sequential) digital circuits can be:
  - Asynchronous: the state of the circuit changes "immediately" after a change of the input signals.
  - Synchronous: the state of the circuit, as a function of the input signals, can change only at an edge of the clock signal.

### The CPU is built from

- combinational and
- memory (sequential) synchronous digital circuits.
- The current state of the memory circuits represents the state of the CPU.



- The operation of the CPU at any moment depends on the current state of the CPU inputs and on the current internal state of the CPU.
- The number of possible internal states of the CPU depends on the size (capacity) of the CPU.
- The number of bits that represent the internal state of the CPU ranges from a few tens up to 10,000 or even more.
- The digital circuits that form a CPU are today usually on a single chip.






- The basic operation of the CPU in a Von Neumann computer can be described in two steps:
  - 1. Fetching the instruction from memory (instruction-fetch cycle); the address of the instruction is in the program counter (PC)
  - 2. Executing the fetched instruction (execution cycle), which can be divided into several sub-operations:
    - Analysing (decoding) the instruction
    - Transferring the operands into the CPU (if they are not already in CPU registers)
    - Executing the operation specified by the instruction
    - PC $\leftarrow$ PC + 1, or PC $\leftarrow$ target address for branch instructions
    - Saving the result (if necessary)
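The two steps and their sub-operations can be sketched as a small loop. This is a toy model under stated assumptions, not the ARM or MiMo instruction set: the opcodes `ADD`, `BNZ` and `HALT`, the tuple encoding of instructions, and the function name `run` are all invented for illustration.

```python
def run(program, registers, max_steps=1000):
    """Execute a toy program; return the list of fetched instruction addresses."""
    pc = 0                                   # toy rule: start at address 0 after RESET
    fetched = []
    for _ in range(max_steps):
        instruction = program[pc]            # step 1: fetch, address comes from PC
        fetched.append(pc)
        opcode, *operands = instruction      # step 2: decode
        next_pc = pc + 1                     # default: PC <- PC + 1
        if opcode == "ADD":                  # rd <- rs1 + rs2, result saved in rd
            rd, rs1, rs2 = operands
            registers[rd] = registers[rs1] + registers[rs2]
        elif opcode == "BNZ":                # branch if the register is not zero
            reg, target = operands
            if registers[reg] != 0:
                next_pc = target             # PC <- target address
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
        pc = next_pc
    return fetched


# Sum 3 + 2 + 1 into r0, counting r1 down to zero (r2 holds the constant -1).
program = [
    ("ADD", "r0", "r0", "r1"),
    ("ADD", "r1", "r1", "r2"),
    ("BNZ", "r1", 0),
    ("HALT",),
]
registers = {"r0": 0, "r1": 3, "r2": -1}
fetched = run(program, registers)
```

After the run, `registers["r0"]` holds the sum and `fetched` shows how the branch makes the PC return to address 0 until the counter reaches zero.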







Interrupts or traps:

- are extraordinary events
- must be transparent to the interrupted program
- instead of the next instruction, a branch to the first instruction of the ISR (Interrupt Service Routine) is executed.





- The address of the first instruction after power-on (RESET) is determined by a fixed rule.
- After step 2 is completed, the CPU starts again with step 1; this repeats as long as the CPU operates.
- The exception is an interrupt or trap request.
- On such a request, instead of fetching the next instruction, the CPU jumps to an address determined by the type of interrupt or trap.
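The choice of the next instruction address described above can be sketched as follows. The two addresses and the function name are hypothetical; real CPUs define their own reset and interrupt vectors and their own way of saving the return address.

```python
RESET_ADDRESS = 0x0000   # hypothetical address of the first instruction after RESET
ISR_ADDRESS = 0x0100     # hypothetical address of the interrupt service routine


def select_next_pc(pc, interrupt_request, branch_target=None):
    """Return (next_pc, saved_return_address) after one instruction completes."""
    # Address the program would continue at without an interrupt.
    sequential = pc + 1 if branch_target is None else branch_target
    if interrupt_request:
        # Keep the continuation address so the interrupted program can resume
        # later as if nothing had happened (transparency).
        return ISR_ADDRESS, sequential
    return sequential, None


pc = RESET_ADDRESS
pc, saved = select_next_pc(pc, interrupt_request=False)   # ordinary step
after_normal_step = pc
pc, saved = select_next_pc(pc, interrupt_request=True)    # request arrives
```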



- Each of these steps is composed of several elementary steps, and implementing a CPU essentially means implementing these elementary steps.
- Each elementary step is carried out in one or more periods of the clock signal (the CPU clock).







Figure: clock signal as a function of time t (for example in [ns]), with the positive edge and the negative edge marked.



The frequency f of a periodic signal is the number of periods (cycles) in 1 second.

The unit of frequency is the hertz [Hz]: 1 Hz = 1 [period/s] = 1 [1/s] = 1 [s<sup>-1</sup>]

The duration of one period is T = 1 / f

$$f = 1.25\,[GHz] \Longrightarrow T = \frac{1}{f} = \frac{1}{1.25 \cdot 10^{9}\,[1/s]} = \frac{1}{1.25} \cdot 10^{-9}\,[s] = 0.8 \cdot 10^{-9}\,[s] = 0.8\,[ns]$$
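The same arithmetic in a few lines of Python, as a quick check of the worked example; the function names are chosen for this sketch only.

```python
def clock_period_ns(frequency_hz):
    """Duration of one clock period in nanoseconds: T = 1 / f."""
    return 1e9 / frequency_hz


def duration_ns(clock_periods, frequency_hz):
    """Time needed for a given number of clock periods."""
    return clock_periods * clock_period_ns(frequency_hz)


period = clock_period_ns(1.25e9)       # the 1.25 GHz example above
five_periods = duration_ns(5, 1.25e9)  # e.g. an elementary sequence of 5 periods
```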



- The state of the CPU, like the state of all synchronous digital circuits, changes only at an edge of the clock signal (the transition of the clock signal from one level to the other).
- The edge at which changes happen in the CPU is called the active edge.
- A CPU can also change state at both the positive and the negative edge, which means that both edges are active. In that case, two changes of the CPU state can be performed in one clock cycle.

Why is the clock signal needed at all? There are two points of view:



- Clock signal: synchronization of combinational circuits with different speeds
  - In a synchronous memory (sequential) circuit, the clock signal (usually its edge) determines the moment at which the internal state of the circuit changes.
  - Once the input signals of the memory circuit have become stable, the internal state can change at the active edge.





- Clock signal: synchronization of operations with different speeds in the computer
  - For example, access to memory in one clock cycle (read operation):





- The state of the CPU changes at the edges of the internal clock. A shorter clock period (higher frequency) means a faster CPU.
- How far the clock period can be shortened (the frequency increased) is limited by the speed of the digital circuits and by the number of circuits (the length of the connections) through which a signal propagates.
- The minimum duration of an elementary step in the CPU is one clock period (or half a period if both edges are active, but this requires more complex circuits).
- The duration of the fetch and execution cycles is always an integer number of periods.
- The number of periods needed to execute an instruction can vary greatly.

## 6.2 ARM CPU – features summary

Many more details are explained in the lab sessions.

- RISC architecture (with some exceptions)
- 3-operand register-register (load/store) computer
  - Memory operands are accessed only with LOAD and STORE instructions
- 32-bit computer (FRI-SMS, ARM9, architecture ARMv5)
  - 32-bit memory address
  - 32-bit data bus
  - 32-bit registers
  - 32-bit ALU
- 16 general-purpose 32-bit registers
- Length of memory operands: 8, 16 and 32 bits
- Signed numbers are represented in two's complement
- Real numbers follow the IEEE-754 standard (if an FP unit is present)



- Multi-byte memory operands are stored in little-endian order.
- Instructions and operands must be aligned in memory (stored at their natural addresses).
- All instructions are 32 bits (4 bytes) long.
- ARM uses all three general addressing modes:
  - Immediate: ADD r1, r1, #1
  - Direct (register): ADD r1, r1, r2
  - Indirect (register), used by LOAD/STORE: LDR r1, [r0]
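Little-endian storage and natural alignment can be illustrated with a short Python sketch. The value `0x12345678`, the example addresses and the helper name `is_aligned` are arbitrary choices for this illustration.

```python
value = 0x12345678

# Little endian: the least significant byte is stored at the lowest address.
little = value.to_bytes(4, byteorder="little")
bytes_in_memory = [hex(b) for b in little]   # order of increasing address

# Reading the four bytes back in the same order restores the operand.
restored = int.from_bytes(little, byteorder="little")


def is_aligned(address, size_in_bytes):
    """An operand is aligned if its address is a multiple of its size."""
    return address % size_in_bytes == 0


word_ok = is_aligned(0x1004, 4)    # 32-bit operand at a multiple of 4
word_bad = is_aligned(0x1002, 4)   # 32-bit operand at a non-multiple of 4
half_ok = is_aligned(0x1002, 2)    # 16-bit operand at a multiple of 2
```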



- Instructions for conditional branches use PC-relative addressing.
- Example of the format of an ALU instruction:

| Bits 31..20    | Bits 19..16 | Bits 15..12 | Bits 11..4 | Bits 3..0 |
|----------------|-------------|-------------|------------|-----------|
| Operation code | Rs1         | Rd          | unused     | Rs2       |



## 6.3 Structure of the CPU (example of ARM CPU LEGv8 & [Mini]MiMo)

- 6.3.1 Data path (unit)
  - ALU
  - software-accessible registers
- 6.3.2 Control unit
  - Realization
    - Micro-programmed (SW) or
    - Hard-wired (HW)







Central processing unit - structure

### 6.3.1 Data path (unit)

The simplified structure of the CPU data paths including instruction and operand memories



All data paths are M-bit, arrows indicate the direction of transfer

<sup>RA - 6</sup> A simplified version of the ARMv8 (Source: [Patt] Sec. 4)



- MUX (multiplexer): a digital circuit that selects one of several input signals and connects it to the output.
- The control signal determines which input signal is selected.
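A multiplexer is easy to model in software. The sketch below shows a 2-to-1 multiplexer for single bits written as a logic expression, and a general version that picks one of several inputs; both function names are made up for this example.

```python
def mux2(a, b, select):
    """2-to-1 multiplexer for single bits: output is a if select == 0, else b."""
    return (a & (select ^ 1)) | (b & select)


def mux(inputs, select):
    """N-to-1 multiplexer: the control signal picks one of the inputs."""
    return inputs[select]


# Truth table of the 2-to-1 multiplexer, as (a, b, select, output) tuples.
truth_table = [(a, b, s, mux2(a, b, s))
               for a in (0, 1) for b in (0, 1) for s in (0, 1)]

# Example: a 4-to-1 multiplexer choosing among four 32-bit values.
selected = mux([0x0, 0x4, 0xFFFF, 0x1234], select=3)
```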



# Model of CPU: MiMo

Model of CPU implemented with logic gates in Logisim

MiMo – Microprogrammed Model of CPU (course OR VSP)





### **MiMo** – Microprogrammed Model of CPU FPGA implementation



Mini MiMo - Hardwired Simple CPU Model V0.6 EVO

Model of CPU: Mini MiMo (course RA VSP)

Model of CPU implemented with logic gates in Logisim

Mini MiMo – Simple Hardwired Model of CPU (16 instr., assembler in Excel, ...)



https://github.com/LAPSyLAB/RALab-STM32H7/tree/main/MiniMiMo\_HW\_CPE\_Model




#### Mini MiMo Datapath



All thicker data paths are more than 1-bit wide

### ALU – datapath and control signals



### ALU – datapath and control signals: case of Mini MiMo CPU








### 6.3.2 Control Unit (CU)

- The control unit is a digital circuit (memory + combinational) that determines the control signals on the basis of the contents of the instruction register.
- The control signals trigger elementary steps in the datapath and thereby the execution of the instruction.
- IR (Instruction Register) = 32-bit register into which the machine instruction read from memory is transferred during the instruction-fetch cycle.
- 2 possible ways of implementing the CU:
  - Micro-programmed (SW: simple, slower)
  - Hard-wired (HW: complex, faster)
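The difference between the two implementation styles can be hinted at in software: a micro-programmed control unit looks the control signals up in a stored table, while a hard-wired one computes them directly with logic. The sketch below is only an analogy under strong assumptions; the two opcodes, the four control signal names and the two-phase scheme (0 = FETCH, 1 = EXECUTE) form a toy model and are not taken from the MiMo or Mini MiMo designs.

```python
SIGNALS = ("mem_read", "ir_load", "alu_add", "reg_write")  # hypothetical names

# Micro-programmed style: control signals are stored as data (a control "ROM").
MICROCODE = {
    ("ADD", 0): (1, 1, 0, 0),   # FETCH: read memory, load IR
    ("ADD", 1): (0, 0, 1, 1),   # EXECUTE: add, write result to register
    ("NOP", 0): (1, 1, 0, 0),   # FETCH is the same for every instruction
    ("NOP", 1): (0, 0, 0, 0),   # EXECUTE: nothing to do
}


def microprogrammed_cu(opcode, phase):
    """Look the control signals up in the stored table."""
    return dict(zip(SIGNALS, MICROCODE[(opcode, phase)]))


def hardwired_cu(opcode, phase):
    """Compute the same control signals directly with logic expressions."""
    fetch = int(phase == 0)
    execute_add = int(phase == 1 and opcode == "ADD")
    return {
        "mem_read": fetch,
        "ir_load": fetch,
        "alu_add": execute_add,
        "reg_write": execute_add,
    }


# Both styles must produce identical signals for every opcode and phase.
both_agree = all(
    microprogrammed_cu(op, ph) == hardwired_cu(op, ph)
    for op in ("ADD", "NOP") for ph in (0, 1)
)
```

Changing an instruction in the table version means editing data, while in the logic version it means changing the expressions themselves, which mirrors the "simple, slower" versus "complex, faster" trade-off listed above.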

### CPU: datapath, control unit, and control signals




A simplified version of the ARMv8 (Source: [Patt] Sec. 4)

### Control unit (Micro-programmed implementation – e.g. MiMo model)








### Control unit (Hard-wired)



![](_page_41_Picture_0.jpeg)

### Control unit (Hard-wired): case Mini MiMo

#### **Machine instruction XXX**

- 1. FETCH control signals
- 2. EXECUTE control signals

![](_page_41_Figure_5.jpeg)

Phase: 0 = FETCH, 1 = EXECUTE

![](_page_42_Figure_0.jpeg)

### CU Implementation approaches - Comparison

### Control unit (Micro-programmed)

### Control unit (Hard-wired)

![](_page_43_Figure_3.jpeg)


#### CPU: datapath, control unit, and control signals

![](_page_44_Figure_1.jpeg)


A simplified version of the ARMv8 (Source: [Patt] Sec. 4)

© 2024, Škraba, Rozman, FRI

![](_page_45_Figure_0.jpeg)


A simplified version of the ARMv8 (Source: [Patt] Sec. 4)

![](_page_46_Figure_0.jpeg)

A simplified version of the ARMv8 (Source: [Patt] Sec. 4)

![](_page_47_Figure_0.jpeg)

A simplified version of the ARMv8 (Source: [Patt] Sec. 4)

### CPU: datapath, control unit, and control signals

#### Execution of branch instructions

![](_page_48_Figure_2.jpeg)


A simplified version of the ARMv8 (Source: [Patt] Sec. 4)

![](_page_49_Picture_0.jpeg)

## 6.4 Execution of instructions

An example of the execution of a typical ALU instruction:

**ADD R10, R1, R3**  @ $R10 \leftarrow R1 + R3$

## Instruction Format:

| Bits 31..20    | Bits 19..16       | Bits 15..12          | Bits 11..4 | Bits 3..0         |
|----------------|-------------------|----------------------|------------|-------------------|
| Operation code | Source register 1 | Destination register | unused     | Source register 2 |

## Machine instruction:

| Bits 31..20  | Bits 19..16 | Bits 15..12 | Bits 11..4 | Bits 3..0 |
|--------------|-------------|-------------|------------|-----------|
| 111000001000 | 0001        | 1010        | 00000000   | 0011      |
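The field layout can be checked with a few shifts and masks. The sketch below packs ADD R10, R1, R3 into a 32-bit word and unpacks it again. The 12-bit operation code value is an assumption taken from the standard ARM encoding of a register ADD with the "always" condition, and the helper names are invented for this example.

```python
ADD_OPCODE = 0b111000001000   # assumed: condition 1110 (always), register ADD, S = 0


def encode_alu(opcode, rs1, rd, rs2):
    """Pack the fields: opcode 31..20, Rs1 19..16, Rd 15..12, unused 11..4, Rs2 3..0."""
    return (opcode << 20) | (rs1 << 16) | (rd << 12) | rs2


def decode_alu(word):
    """Unpack a 32-bit ALU instruction word into its fields."""
    return {
        "opcode": (word >> 20) & 0xFFF,
        "rs1": (word >> 16) & 0xF,
        "rd": (word >> 12) & 0xF,
        "rs2": word & 0xF,
    }


word = encode_alu(ADD_OPCODE, rs1=1, rd=10, rs2=3)   # ADD R10, R1, R3
fields = decode_alu(word)
bit_string = format(word, "032b")
```

Decoding in hardware corresponds to wiring these bit ranges of the instruction register to the register file and the control unit.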

![](_page_50_Figure_0.jpeg)

![](_page_51_Figure_0.jpeg)

![](_page_52_Figure_0.jpeg)

![](_page_53_Figure_0.jpeg)

- Execution of the instruction ADD takes, for example, 5 clock periods (CPI<sub>ALU</sub> = 5):
  - T1: Read the instruction from memory
  - T2: Transfer the instruction from memory into the instruction register
  - T3: Decode the instruction and access the operands in registers R1 and R3
  - T4: Execute the operation (addition)
  - T5: Save the result in register R10 (writeback)

![](_page_54_Figure_0.jpeg)

![](_page_55_Figure_0.jpeg)

![](_page_56_Figure_0.jpeg)

![](_page_57_Figure_0.jpeg)

#### Execution of the instruction ADD: Summary

![](_page_57_Figure_2.jpeg)


## CPU – instr. execution: case Mini MiMo CPU

Mini MiMo - Hardwired Simple CPU Model V0.6 EVO

![](_page_58_Figure_3.jpeg)

## Case Mini MiMo CPU: sum of two numbers

![](_page_59_Figure_2.jpeg)

## Challenge (HW1 or optional extension to HW1)

#### Program/edit Mini MiMo model

# Challenge: Dare to create your own CPU ?

![](_page_60_Figure_3.jpeg)

![](_page_60_Figure_4.jpeg)

from 22/23

https://github.com/LAPSyLAB/RALab-STM32H7/tree/main/MiniMiMo\_HW\_CPE\_Model

https://github.com/LAPSyLAB/RALab-STM32H7/tree/main/LogisimEVO\_vezja/Prispevki

![](_page_61_Picture_0.jpeg)

## 6.5 Parallel execution of instructions

- In a typical CPU architecture, the execution of a machine instruction takes at least 3 or 4 clock periods, usually more.
- The average number of instructions executed by the CPU in one second (IPS, Instructions Per Second):

$$IPS = \frac{f_{CPU}}{CPI}$$

IPS is a very large number, so we divide it by 10<sup>6</sup> and get MIPS:

$$MIPS = \frac{f_{CPU}}{CPI \cdot 10^6}$$

MIPS = Million Instructions Per Second

$f_{CPU}$ = frequency of the CPU clock

CPI = Cycles Per Instruction (average number of clock periods for the execution of one instruction)

![](_page_62_Picture_0.jpeg)

MIPS, the number of instructions (in millions) executed by the CPU in one second, can be increased in two ways: by increasing $f_{CPU}$ and/or by reducing the CPI:

$$\uparrow MIPS = \frac{\uparrow f_{CPU}}{\downarrow CPI \cdot 10^6}$$

- Using faster electronic elements increases $f_{CPU}$ (more periods in one second).
- Using a larger number of elements can reduce the CPI (fewer clock cycles per instruction), so that more instructions are executed per clock cycle.
- Faster electronic components alone do not allow a large increase in speed, and they also cause other problems.
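A small numeric check of the MIPS formula, showing that doubling the clock frequency and halving the CPI have the same effect. The frequency of 1.25 GHz and CPI of 5 reuse values that appear earlier in this chapter; combining them here is only an illustrative assumption.

```python
def mips(f_cpu_hz, cpi):
    """Million instructions per second: f_CPU / (CPI * 10^6)."""
    return f_cpu_hz / (cpi * 1e6)


baseline = mips(1.25e9, 5)        # 1.25 GHz clock, 5 periods per instruction
faster_clock = mips(2.5e9, 5)     # double the clock frequency
lower_cpi = mips(1.25e9, 2.5)     # halve the CPI instead
```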

![](_page_63_Picture_0.jpeg)

![](_page_63_Figure_2.jpeg)

#### 50 Years of Microprocessor Trend Data

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2021 by K. Rupp.

Source: https://raw.githubusercontent.com/karlrupp/microprocessor-trend-data/master/50yrs/50-years-processor-trend.png

![](_page_64_Picture_0.jpeg)

#### Moore's law

## Increasing the number of transistors - Moore's Law

In 1965, Electronics magazine published an article by Gordon E. Moore in which he predicted that the number of transistors that manufacturers are able to put on a chip would double every year.

In 1975, the prediction was adjusted to a period of two years (the number of transistors doubles every two years).

Although it was intended at the time as an empirical rule that should hold for the next few years, it is still considered valid today and is known as Moore's Law.

![](_page_65_Picture_0.jpeg)

Moore's Law - increasing the number of transistors

Figure (Intel, Moore's Law): relative manufacturing cost per component versus number of components per integrated circuit. Decades later, Moore's Law remains true, driven largely by Intel's unparalleled silicon expertise.

According to Moore's Law, the number of transistors on a chip roughly doubles every two years. As a result the scale gets smaller and smaller. For decades, Intel has met this formidable challenge through investments in technology and manufacturing resulting in the unparalleled silicon expertise that has made Moore's Law a reality.

![](_page_66_Picture_0.jpeg)

Gordon E. Moore co-founded Intel in 1968 and was its executive vice president; he later became the honorary chairman of Intel.

Over a past period of 20 years, within the same technology, the maximum speed of logic elements increased about 10-fold.

In the same period, the maximum number of elements on a single chip increased about 500-fold, and in memory chips by as much as 5000-fold.

![](_page_67_Figure_0.jpeg)

### Moore's law – Transistor count through time

| Processor                                        | Transistor count               | Year | Designer           | Process   | Area (mm²) | Transistor density (tr./mm²) |
|--------------------------------------------------|--------------------------------|------|--------------------|-----------|------------|------------------------------|
| MP944 (20-bit, 6-chip, 28 chips total)           | 74,442 (5,360 excl. ROM & RAM) | 1970 | Garrett AiResearch | ?         | ?          | ?                            |
| Intel 4004 (4-bit, 16-pin)                       | 2,250                          | 1971 | Intel              | 10,000 nm | 12 mm²     | 188                          |
| TMX 1795 (8-bit, 24-pin)                         | 3,078                          | 1971 | Texas Instruments  | ?         | 30.64 mm²  | 100.5                        |
| Intel 8008 (8-bit, 18-pin)                       | 3,500                          | 1972 | Intel              | 10,000 nm | 14 mm²     | 250                          |
| NEC µCOM-4 (4-bit, 42-pin)                       | 2,500                          | 1973 | NEC                | 7,500 nm  | ?          | ?                            |
| Toshiba TLCS-12 (12-bit)                         | 11,000+                        | 1973 | Toshiba            | 6,000 nm  | 32.45 mm²  | 340+                         |
| Intel 4040 (4-bit, 16-pin)                       | 3,000                          | 1974 | Intel              | 10,000 nm | 12 mm²     | 250                          |
| Motorola 6800 (8-bit, 40-pin)                    | 4,100                          | 1974 | Motorola           | 6,000 nm  | 16 mm²     | 256                          |
| Intel 8080 (8-bit, 40-pin)                       | 6,000                          | 1974 | Intel              | 6,000 nm  | 20 mm²     | 300                          |
| TMS 1000 (4-bit, 28-pin)                         | 8,000                          | 1974 | Texas Instruments  | 8,000 nm  | 11 mm²     | 730                          |
| MOS Technology 6502 (8-bit, 40-pin)              | 4,528                          | 1975 | MOS Technology     | 8,000 nm  | 21 mm²     | 216                          |
| Intersil IM6100 (12-bit, 40-pin; clone of PDP-8) | 4,000                          | 1975 | Intersil           | ?         | ?          | ?                            |
| CDP 1801 (8-bit, 2-chip, 40-pin)                 | 5,000                          | 1975 | RCA                | ?         | ?          | ?                            |
| RCA 1802 (8-bit, 40-pin)                         | 5,000                          | 1976 | RCA                | 5,000 nm  | 27 mm²     | 185                          |
| Zilog Z80 (8-bit, 4-bit ALU, 40-pin)             | 8,500                          | 1976 | Zilog              | 4,000 nm  | 18 mm²     | 470                          |
| Intel 8085 (8-bit, 40-pin)                       | 6,500                          | 1976 | Intel              | 3,000 nm  | 20 mm²     | 325                          |

...

| Processor | Transistor count | Year | Designer | Process | Area (mm²) | Transistor density (tr./mm²) |
|-----------|------------------|------|----------|---------|------------|------------------------------|
| AMD Instinct MI300A (multi-chip module, 24 cores, 128 GB GPU memory + 256 MB (LLC/L3) cache) | 146,000,000,000 | 2023 | AMD | 5 nm (CCD, GCD), 6 nm (IOD) | 1,017 mm² | 144,000,000 |
| AMD Epyc Bergamo (4th gen/97X4 series) 9-chip module (up to 128 cores and 256 MB (L3) + 128 MB (L2) cache) | 82,000,000,000 | 2023 | AMD | 5 nm (CCD), 6 nm (IOD) | ? | ? |
| Apple M2 Ultra (two M2 Max dies) | 134,000,000,000 | 2023 | Apple | 5 nm | ? | ? |
| Apple M2 Max (12-core 64-bit ARM64 SoC, SIMD, caches) | 67,000,000,000 | 2023 | Apple | 5 nm | ? | ? |
| Apple M2 Pro (12-core 64-bit ARM64 SoC, SIMD, caches) | 40,000,000,000 | 2023 | Apple | 5 nm | ? | ? |
| Sapphire Rapids quad-chip module (up to 60 cores and 112.5 MB of cache) | 44,000,000,000–48,000,000,000 | 2023 | Intel | 10 nm ESF (Intel 7) | 1,600 mm² | 27,500,000–30,000,000 |
| Apple A17 | 19,000,000,000 | 2023 | Apple | 3 nm | 103.8 mm² | 183,044,315 |
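As a rough plausibility check of the "doubling every two years" rule against the tables above, the sketch below projects the transistor count of the Intel 4004 (2,250 transistors in 1971) forward to 2023 and compares it with the Apple M2 Ultra entry. The function name is made up, and a strict doubling from a single starting chip is a simplifying assumption rather than how the law is usually fitted.

```python
def projected_transistors(start_count, start_year, year, doubling_period_years=2):
    """Transistor count predicted by doubling every `doubling_period_years`."""
    doublings = (year - start_year) / doubling_period_years
    return start_count * 2 ** doublings


# Intel 4004 (1971, 2,250 transistors) projected to 2023: 26 doublings.
projection_2023 = projected_transistors(2_250, 1971, 2023)

# Compare with the Apple M2 Ultra from the table (134,000,000,000 transistors).
ratio_to_m2_ultra = projection_2023 / 134_000_000_000
```

The projection lands at about 1.5 * 10<sup>11</sup> transistors, the same order of magnitude as the largest 2023 processors in the table.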

### Moore's law – Transistor count regarding the type of device

| Year | Component                   | Name                   | Number of MOSFETs (in trillions) | Remarks |
|------|-----------------------------|------------------------|----------------------------------|---------|
| 2022 | Flash memory                | Micron's V-NAND module | 5.3                              | stacked package of sixteen 232-layer 3D NAND dies |
| 2020 | any processor               | Wafer Scale Engine 2   | 2.6                              | wafer-scale design of 84 exposed fields (dies) |
| 2024 | GPU                         | Nvidia B100            | 0.208                            | Uses two reticle limit dies, with 104 billion transistors each, joined together and acting as a single large monolithic piece of silicon |
| 2023 | microprocessor (commercial) | M2 Ultra               | 0.134                            | SoC using two dies joined together with a high-speed bridge |
| 2020 | DLP                         | Colossus Mk2 GC200     | 0.059                            | An IPU in contrast to CPU and GPU |

DLP = "Deep learning processor"

![](_page_70_Picture_0.jpeg)

## How to make effective use of a large number of elements?

- An efficient way to increase the speed of the CPU:
  - The CPU performs several operations in parallel, which requires a larger number of logic elements.

## Parallelism can be exploited on several levels:

- Parallelism at the level of instructions:
  - Some instructions in a program can be executed simultaneously (in parallel)
  - CPU in the form of a **pipeline**:
    - Exploits parallelism at the level of instructions
    - An important advantage: the programs stay the same!
    - It is limited, so other options are also needed

![](_page_71_Picture_0.jpeg)

- The next higher level of parallelism is parallelism at the level of threads:
  - Multithreading
  - Multi-core processors
- Parallelism at the level of CPUs (MIMD: multiprocessors, multicomputers)
- Data-level parallelism (GPU, SIMD, vector units)


### Intel Core i7 Haswell

- Feature size: 22 nm (= 22 \* 10<sup>-9</sup> m)
- Number of transistors: 1.6 billion (= 1,600,000,000)
- Chip size: 160 mm<sup>2</sup> (from 10x to 26x mm<sup>2</sup>)
- Clock frequency: from 2.0 GHz to 4.4 GHz
- Number of cores (CPUs): 4
- Graphics processor
- Socket: LGA 1150
- TDP (Thermal Design Power): from 11.5 W to 84 W
- Price: $\approx$ 300-400 \$

#### Intel 80x86

### Structure of 4-core processor Intel Core i7 (Haswell)





#### Example: parallelism at the level of instructions, threads and cores (Intel 80x86)




CPU chip on the socket with the contacts (LGA775)



Contacts to connect chip to the motherboard

The upper side

Lower side with the contacts and the capacitors








### Intel chip Core i7 (Haswell)





#### Example: parallelism at the level of instructions, threads and cores (Intel 80x86)

Intel Core i7 (Ice Lake, 2019) chip



Case: CPU-level parallelism: MIMD Computers

Examples of MIMD (Multiple Instruction, Multiple Data) computers (figure):

- Multiprocessors (closely connected): CPUs with caches, connected through an interconnection to a shared memory (common variables) and the I/O system
- Multicomputers (loosely connected): CPUs with caches, each with its own memory, connected through an interconnection



Case: CPU-level parallelism: GPU, SIMD, Vector units

### Parallel processing of data

Tesla K40 has a total of 1920 processing elements (15 CU \* 128 PE per CU).

| SMX                                                 |      |      |                             |      |                |      |                             |       |                |                             |      |      |                |      |      |      |         |       |     |
|-----------------------------------------------------|------|------|-----------------------------|------|----------------|------|-----------------------------|-------|----------------|-----------------------------|------|------|----------------|------|------|------|---------|-------|-----|
| Instruction Cache                                   |      |      |                             |      |                |      |                             |       |                |                             |      |      |                |      |      |      |         |       |     |
| Warp Scheduler                                      |      |      |                             |      | Warp Scheduler |      |                             |       | Warp Scheduler |                             |      |      | Warp Scheduler |      |      |      |         |       |     |
| Dispatch Unit Dispatch Unit                         |      |      | Dispatch Unit Dispatch Unit |      |                | Dis  | Dispatch Unit Dispatch Unit |       |                | Dispatch Unit Dispatch Unit |      |      |                |      |      |      |         |       |     |
| Register File (65,536 x 32-bit)                     |      |      |                             |      |                |      |                             |       |                |                             |      |      |                |      |      |      |         |       |     |
| * * * * * *                                         |      |      |                             | +    | +              | +    | +                           | +     | +              | +                           | +    | +    | +              | +    | +    | +    | +       | +     |     |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Core                                                | Core | Core | DP Unit                     | Core | Core           | Core | DP Unit                     | LD/ST | SFU            | Core                        | Core | Core | DP Unit        | Core | Core | Core | DP Unit | LD/ST | SFU |
| Interconnect Network 64 KB Shared Memory / L1 Cache |      |      |                             |      |                |      |                             |       |                |                             |      |      |                |      |      |      |         |       |     |
| 48 KB Read-Only Cache                               |      |      |                             |      |                |      |                             |       |                |                             |      |      |                |      |      |      |         |       |     |
| Tex                                                 |      |      | Tex                         |      | Tex            |      |                             | Tex   |                | Tex                         |      | Tex  |                | Tex  |      |      | Tex     |       |     |
| Tex                                                 |      |      | Tex                         |      | Tex            |      |                             | Tex   |                | Tex                         |      |      | Tex            |      | Tex  |      |         | Tex   |     |
|                                                     |      |      |                             |      |                |      |                             |       |                |                             |      |      |                |      |      |      |         |       |     |



### https://doc.sling.si/workshops/programming-gpu/GPE/teslak40/



# 6.6 Pipelined CPU (data unit)

- A pipelined CPU is an implementation of the CPU in which several instructions are executed simultaneously, so that the elementary steps of the instructions overlap.
- In a pipelined CPU, instructions are executed much like products on an industrial assembly line (e.g. cars) or loads in a laundry processing facility.



The execution of an instruction can be divided into smaller elementary steps (sub-operations). Each sub-operation takes only a fraction of the total time required to execute an instruction.



The CPU is divided into stages (pipeline segments) that correspond to the sub-operations of an instruction.

Each sub-operation is executed by a particular stage (segment) of the pipeline.

The stages are interconnected: instructions enter on one side, travel through the stages where the sub-operations are executed, and exit on the other side of the pipeline.

At any moment, as many instructions are being executed in parallel as there are stages in the pipeline.





1. instruction enters the pipeline

























## Comparison of non-pipelined and 5-stage pipelined CPU



### Comparison of operation of non-pipelined and pipelined CPU





Central processing unit - execute instructions

- The execution of an instruction can be divided, <u>for example</u>, into 5 general elementary steps (5-stage pipeline):
  - Reading the instruction (IF, Instruction Fetch)
  - Decoding the instruction and accessing registers (ID, Instruction Decode)
  - Executing the instruction (EX, Execute)
  - Memory access (MA, Memory Access), only for LOAD and STORE instructions
  - Saving the result in a register (WR, Write Register)
- If all instructions can be unified into these common elementary steps, their execution can also be sped up:
  - several instructions can be executed at the same time (each in its own elementary step) -> pipeline
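The overlap of the five elementary steps can be sketched with a small Python script. This is only an illustration under the ideal assumption that no instruction ever has to wait; the function name `pipeline_diagram` is hypothetical and not part of the course material.

```python
# Ideal 5-stage pipeline: instruction i occupies stage s in clock period i + s.
STAGES = ["IF", "ID", "EX", "MA", "WR"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage used in clock period c + 1."""
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = ["--"] * total_cycles          # "--" means the instruction is not in the pipeline
        for s, name in enumerate(stages):
            row[i + s] = name                # each instruction starts one period after the previous one
        rows.append(row)
    return rows


for number, row in enumerate(pipeline_diagram(4), start=1):
    print(f"instr {number}: " + " ".join(row))
```

Reading the output column by column shows that, once the pipeline is full, every stage is busy with a different instruction in every clock period.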



The performance of a pipelined CPU is determined by the rate at which instructions exit the pipeline.

Since the stages are linked together, instructions must be shifted from one stage to the next at the same time.

- The shifts typically occur every clock cycle.

The duration of one clock period $t_{CPE}$ cannot be shorter than the time required to execute the slowest sub-operation in the pipeline.
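The effect of the slowest sub-operation on the clock period can be illustrated with a short Python sketch. The stage delays below are made-up numbers in arbitrary time units, and the helper names are hypothetical; the model assumes an ideal pipeline without stalls, in which the first instruction needs one period per stage and every further instruction exits one period later.

```python
def clock_period(stage_delays):
    """The clock period cannot be shorter than the slowest stage."""
    return max(stage_delays)


def pipelined_time(stage_delays, num_instructions):
    """Total time for an ideal pipeline: (stages + instructions - 1) clock periods."""
    periods = len(stage_delays) + num_instructions - 1
    return periods * clock_period(stage_delays)


# Two hypothetical pipelines with the same total work per instruction (10 units).
balanced = [2, 2, 2, 2, 2]
unbalanced = [1, 1, 1, 6, 1]

for name, delays in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(name, "clock period:", clock_period(delays),
          "time for 100 instructions:", pipelined_time(delays, 100))
```

Both pipelines do the same amount of work per instruction, but the unbalanced one is three times slower because a single slow stage dictates the clock period.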

## Case: 5-stage pipelined CPU



IF = Instruction Fetch

1. Clock period


© 2024, Škraba, Rozman, FRI

# Execution of instructions in non-pipelined and pipelined CPU

### Non-pipelined CPU



| Clock period | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      | 9      | 10     |
|--------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 1. instr.    | step 1 | step 2 | step 3 | step 4 | step 5 |        |        |        |        |        |
| 2. instr.    |        |        |        |        |        | step 1 | step 2 | step 3 | step 4 | step 5 |

### Pipelined CPU



| Clock period | 1      | 2      | 3      | 4      | 5      | 6      |
|--------------|--------|--------|--------|--------|--------|--------|
| 1. instr.    | step 1 | step 2 | step 3 | step 4 | step 5 |        |
| 2. instr.    |        | step 1 | step 2 | step 3 | step 4 | step 5 |


- Today, all higher-performance processors are designed as pipelined processors.
- When developing a pipelined CPU, it is important that all sub-operations take about the same time (a balanced pipeline).

An ideally balanced pipelined CPU with N stages or segments has N times the performance of a non-pipelined CPU.

Each individual instruction is not executed any faster, but N instructions are in the pipeline and are executed at the same time.



At the output of the pipeline, we get N times more executed instructions than from a non-pipelined CPU in the same time.

The average number of clock cycles per instruction (CPI) is ideally N times lower than in the non-pipelined CPU.

The execution time of each instruction (latency) is equal to N x t<sub>CPE</sub>; at the same clock period, this is the same as in the non-pipelined CPU.
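These statements can be checked with a short calculation. It assumes (an idealisation not spelled out in the slides) that $k$ instructions run back to back without any stalls and that both CPUs use the same clock period $t_{CPE}$; the symbols $T$ and $S$ are introduced here only for illustration.

$$
\begin{aligned}
T_{\text{non-pipelined}}(k) &= k \cdot N \cdot t_{CPE} \\
T_{\text{pipelined}}(k) &= (N + k - 1) \cdot t_{CPE} \\
S(k) &= \frac{T_{\text{non-pipelined}}(k)}{T_{\text{pipelined}}(k)} = \frac{k \cdot N}{N + k - 1} \;\longrightarrow\; N \quad \text{for } k \to \infty
\end{aligned}
$$

The first instruction needs $N$ clock periods to pass through the pipeline, and each following instruction exits one period later. For $N = 5$ and $k = 1000$ this gives $S = 5000 / 1004 \approx 4.98$, while the latency of a single instruction stays at $N \cdot t_{CPE}$.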



- Can we make the CPU much faster (N times faster) simply by choosing a sufficiently large number of stages N?
  - No. Instructions that are in the pipeline at the same time (each in its own stage) can depend on each other, so an instruction cannot always be executed in the next clock period.
- These events are called **pipeline hazards**.



- There are three types of pipeline hazards:
  - **structural hazards**: several stages of the pipeline require the same unit in the same clock period,
  - **data hazards**: an instruction needs the result of a previous instruction, but the result is not yet available,
  - **control hazards**: caused by instructions that change the value of the PC (control instructions: jumps, branches, calls, ...)
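A data hazard of the kind described above can be detected by comparing the destination register of an instruction with the source registers of the instructions that follow it. The Python sketch below is a simplified illustration: the instruction encoding, the `raw_hazards` helper and the `window` of 3 following instructions (an assumed pipeline without forwarding, where registers are read in ID and written in WR) are all assumptions, not the course's implementation.

```python
from collections import namedtuple

# text: assembly text, dest: written register (or None), srcs: registers that are read
Instr = namedtuple("Instr", ["text", "dest", "srcs"])


def raw_hazards(program, window=3):
    """Return (producer index, consumer index, register) for every nearby read-after-write pair."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                hazards.append((i, j, producer.dest))
            if consumer.dest == producer.dest:
                break  # register is overwritten; later readers depend on the newer value
    return hazards


program = [
    Instr("ADD R1,R2,R3", "R1", ("R2", "R3")),
    Instr("SUB R7,R8,R1", "R7", ("R8", "R1")),
    Instr("LOAD R4,[R5]", "R4", ("R5",)),
    Instr("ADD R6,R4,R7", "R6", ("R4", "R7")),
]

for i, j, reg in raw_hazards(program):
    print(f"'{program[j].text}' needs {reg} produced by '{program[i].text}'")
```

Structural and control hazards need different checks (shared units and changes of the PC, respectively); this sketch covers data hazards only.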



#### Pipelined CPU - pipeline hazards: common solutions





- Because of pipeline hazards, at least part of the pipeline has to stall until the hazard is resolved (during that time the pipeline does not accept new instructions).
- The increase in speed, therefore, **is not** *N times*.
- As the number of stages *N* increases, pipeline hazards occur more frequently and the pipeline is no longer as effective as with a lower number of stages.
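The loss caused by stalls can be estimated with a simple model, sketched below in Python. It assumes that the non-pipelined CPU needs N clock periods per instruction, that both CPUs use the same clock period, and that hazards add a given average number of stall cycles per instruction; the stall values are made-up numbers for illustration only.

```python
def effective_cpi(stall_cycles_per_instruction):
    """Ideal pipelined CPI is 1; every stall cycle adds to it."""
    return 1.0 + stall_cycles_per_instruction


def pipeline_speedup(n_stages, stall_cycles_per_instruction):
    """Speed-up over a non-pipelined CPU with CPI = n_stages and the same clock period."""
    return n_stages / effective_cpi(stall_cycles_per_instruction)


print(pipeline_speedup(5, 0.0))    # no hazards: the ideal speed-up of 5
print(pipeline_speedup(5, 0.25))   # hypothetical 0.25 stall cycles per instruction
print(pipeline_speedup(10, 1.5))   # deeper pipeline, but hypothetically many more stalls
```

With these hypothetical numbers the 10-stage pipeline is no faster than the 5-stage one, which matches the observation that more stages do not automatically mean more speed.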



N number of stages



# 6.7 Cases of 5-stage pipelined CPU

- General 5-stage pipeline
- FRI SMS Atmel 9260 ARMv5


## General 5-stage pipeline

- The basis is the execution of instructions in five steps, as described in the previous section.
- The execution of an instruction is divided into 5 sub-operations in accordance with the steps from the previous section, and the CPU is divided into five stages or segments:
  - Stage IF (Instruction Fetch): read the instruction
  - Stage ID (Instruction Decode): decode the instruction and access registers
  - Stage EX (Execute): execute the operation
  - Stage MA (Memory Access): access memory
  - Stage WR (Write Register): save the result



Each stage of the pipeline must execute its sub-operation in a single clock cycle (period).

The IF and MA stages may need to access memory in the same clock period, which causes a structural hazard.

To eliminate this kind of structural hazard, the cache must be divided into separate instruction and operand caches (Harvard architecture principle).
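The conflict between the IF and MA stages can be counted with a small Python sketch. It is a simplified model with several assumptions: one instruction is fetched per clock period, an instruction reaches MA three periods after its fetch, only LOAD and STORE use memory in MA, and stalls are not modelled (the function only counts the periods in which a conflict would arise). The name `memory_conflicts` is hypothetical.

```python
def memory_conflicts(ops, split_caches):
    """Count clock periods in which IF and MA would need the same cache."""
    if split_caches:
        return 0  # separate instruction and operand caches: both accesses proceed in parallel
    conflicts = 0
    for cycle in range(len(ops)):        # instruction number `cycle` is in stage IF
        in_ma = cycle - 3                # instruction that is in stage MA in the same period
        if in_ma >= 0 and ops[in_ma] in ("LOAD", "STORE"):
            conflicts += 1
    return conflicts


ops = ["LOAD", "ADD", "STORE", "ADD", "SUB", "ADD", "LOAD", "ADD"]
print("unified cache:", memory_conflicts(ops, split_caches=False))
print("split caches: ", memory_conflicts(ops, split_caches=True))
```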



#### **Pipelined CPU**



With simultaneous access to an instruction (stage IF) and an operand (stage MA) in the same cache, a structural hazard occurs in the pipeline.





The structural hazard that would occur due to simultaneous access of stages IF and MA to memory is eliminated by using the Harvard architecture for the caches.



In the IF stage of a pipelined CPU, the instruction cache is accessed every clock period, whereas in a non-pipelined CPU it is accessed only every five clock periods (for instructions that take 5 clock periods).

The rate of information transfer between the cache and the CPU must therefore be five times higher in the pipelined CPU than in the non-pipelined CPU.

When designing a pipelined CPU, it is important to ensure that no CPU unit (registers, ALU, ...) is required to perform two different operations in the same clock period.

## Case: structure of 5-stage pipelined CPU (ALU instruction: e.g. ADD R1,R2,R3)



## Case: structure of 5-stage pipelined CPU

(LOAD/STORE instruction: Calculation of address in EX, access in MA)








## MiMo v2 - 5-stage pipeline in Logisim

- mimo\_32bit\_v2.1 - Zaklenitev.circ (stalling)
- mimo\_32bit\_v2.2 - Premoscanje.circ (forwarding)
- mimo\_32bit\_v2.3 - Predikcije.circ (predictions)
- mimo\_32bit\_v2.circ



#### https://github.com/LAPSyLAB/MiMo Student Release/tree/main/MiMo v2 Pipelined versions

## Case: structure of 5-stage pipelined CPU

- The pipeline has 5 stages; between them are intermediate registers that store the results of the sub-operations of each stage and all data needed in the following stages.
- In stage IF, the instruction is read and transferred to the instruction register, and the content of the program counter PC is increased by 4 (instructions are 4 bytes long).
- The program counter must be increased in stage IF because one instruction is usually fetched from the instruction cache in each clock period.



- The instruction currently being executed (pointed to by the PC content) is stored in the intermediate registers (IR) because it is needed for branch instructions in the EX stage.
- Branch instructions usually write a new address into the PC (the branch or target address), which is calculated by the ALU in stage EX.
- The operand address for LOAD/STORE instructions (indirect addressing) is also calculated by the ALU in stage EX.
- Each stage executes its own instruction; therefore the intermediate registers IR of all stages always hold the instructions that are read from the instruction cache every clock period.
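The movement of instructions through the intermediate registers can be sketched in Python. This is a strongly simplified model, not the structure of any real CPU: it only tracks which instruction each intermediate register holds, assumes no branches or stalls, and uses hypothetical names (`run_fetch`, a dictionary as instruction memory).

```python
def run_fetch(instruction_memory, cycles):
    """Simulate IF and the shifting of instructions through the intermediate registers.

    Returns, for every clock period, the PC after the period and the contents of the
    four intermediate registers between IF/ID, ID/EX, EX/MA and MA/WR.
    """
    pc = 0
    regs = [None] * 4
    trace = []
    for _ in range(cycles):
        fetched = instruction_memory.get(pc)   # None once the PC is past the program
        regs = [fetched] + regs[:-1]           # all instructions move one stage forward
        pc += 4                                # instructions are 4 bytes long
        trace.append((pc, list(regs)))
    return trace


memory = {0: "ADD", 4: "SUB", 8: "LOAD"}       # address -> instruction (illustrative)
for pc, regs in run_fetch(memory, 4):
    print(f"PC={pc:2d}  IF/ID, ID/EX, EX/MA, MA/WR = {regs}")
```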

Case: Structure of 5-stage pipelined CPU: FRI SMS - Atmel 9260, ARMv5 architecture





## 6.8 Multiple issue processors

- With a pipelined CPU and resolved pipeline hazards, we can achieve CPI values close to 1.
- To reduce the CPI below 1, several instructions must be fetched, issued and executed in each clock period.
- Such processors are called multiple-issue processors and can be divided into two groups:
  - superscalar processors: the instructions that are executed in parallel are determined by logic in the processor (dynamic decision)
  - VLIW processors: the instructions that are executed in parallel are determined by the program (compiler) (static decision)



A **superscalar processor** is a pipelined processor that is capable of fetching, decoding and executing several instructions simultaneously.

- The number of instructions fetched and issued in one clock period is adjusted dynamically during program execution and is determined by the processor's logic.
- A processor that can issue a maximum of *n* instructions per clock period is called an *n*-issue superscalar processor.
- Parallel (superscalar) execution requires additional interfaces and additional stages for determining interdependencies, validation and possible retrieval of results.
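The dynamic decision of how many instructions to issue in a clock period can be illustrated with a small Python sketch. It is a simplified model under several assumptions: instructions are issued in program order, an instruction cannot be issued in the same clock period as an earlier instruction that writes one of its registers, results are available in the next clock period, and enough functional units exist. The name `issue_schedule` and the instruction encoding are hypothetical.

```python
def issue_schedule(program, width):
    """Group instructions (name, dest, srcs) into packets issued in the same clock period."""
    packets, current, written = [], [], set()
    for name, dest, srcs in program:
        depends = dest in written or any(src in written for src in srcs)
        if len(current) == width or depends:
            packets.append(current)          # close the packet; this instruction waits one period
            current, written = [], set()
        current.append(name)
        written.add(dest)
    if current:
        packets.append(current)
    return packets


program = [
    ("I0", "R1", ("R2", "R3")),   # ADD R1,R2,R3
    ("I1", "R7", ("R5", "R9")),   # SUB R7,R5,R9  (independent of I0)
    ("I2", "R4", ("R1", "R7")),   # needs the results of I0 and I1
    ("I3", "R6", ("R4", "R2")),   # needs the result of I2
    ("I4", "R8", ("R2", "R3")),   # independent of I3
]

packets = issue_schedule(program, width=2)
print(packets)
print("issue CPI:", len(packets) / len(program))
```

In this model a 2-issue processor needs 3 clock periods to issue the 5 instructions, so the CPI drops below 1 but does not reach the ideal value of 0.5 because of the dependencies.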



#### Superscalar processor



### Simplified scheme of a superscalar processor based on a 5-stage pipeline

The MA stage is also one of the functional units in the EX stage (either a combined LOAD/STORE functional unit or separate functional units for LOAD and STORE).



Simplified case of a superscalar CPU: Intel Core i7

1. Instruction fetch (16 bytes)
2. Predecode stage (bytes -> x86 instr.)
3. μ-op decode (x86 instr. -> μ-op)
4. Loop stream detection
5. Issue μ-op -> ROB and RP
6. Execute μ-op
7. Retire (finalize)






AMD Zen 2

# ARM Cortex-M7 $\rightarrow$ Dual-issue



ARM Cortex M7

Case of dual-issue simpler pipeline





#### VLIW processor

**VLIW (Very Long Instruction Word)** processors execute long instructions, each of which consists of several ordinary machine instructions that the processor executes in parallel using a variety of functional units.

In a long instruction, each functional unit executes its own instruction.

A VLIW instruction consists of one instruction for each functional unit:

| Instruction for 1st functional unit | Instruction for 2nd functional unit | Instruction for 3rd functional unit | ... | Instruction for n-th functional unit |
|-------------------------------------|-------------------------------------|-------------------------------------|-----|--------------------------------------|

### Case of VLIW instruction composition:

| ALU | ALU | FPU | LOAD | STORE |
|-----|-----|-----|------|-------|



- The compiler searches the program for mutually independent instructions that can be executed in parallel in the functional units, and merges them into long instructions.
- The number of instructions that are fetched and issued in one clock period is determined by the compiler and does not change during execution (static decision).
- If the compiler cannot find enough instructions for all functional units in a long instruction, the missing instructions are replaced by the NOP (No OPeration) instruction.
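The packing of independent instructions into long instructions can be sketched in Python. This is a simple greedy, in-order illustration, not how a real VLIW compiler schedules code: it assumes the slot layout ALU, ALU, FPU, LOAD, STORE from the composition case above, starts a new long instruction whenever no matching slot is free or the instruction depends on a register written in the current long instruction, and fills unused slots with NOP. The function name `pack` and the instruction tuples are hypothetical.

```python
SLOTS = ["ALU", "ALU", "FPU", "LOAD", "STORE"]   # one slot per functional unit


def pack(program):
    """Pack (unit, dest, srcs, text) instructions into long instructions, in program order."""
    bundles = []
    bundle, written = [None] * len(SLOTS), set()
    for unit, dest, srcs, text in program:
        free = [i for i, u in enumerate(SLOTS) if u == unit and bundle[i] is None]
        depends = dest in written or any(src in written for src in srcs)
        if not free or depends:
            bundles.append(bundle)                       # close the current long instruction
            bundle, written = [None] * len(SLOTS), set()
            free = [i for i, u in enumerate(SLOTS) if u == unit]
        bundle[free[0]] = text
        if dest is not None:
            written.add(dest)
    if any(slot is not None for slot in bundle):
        bundles.append(bundle)
    return [[slot or "NOP" for slot in b] for b in bundles]


program = [
    ("ALU", "R1", ("R2", "R3"), "ADD R1,R2,R3"),
    ("ALU", "R7", ("R5", "R9"), "SUB R7,R5,R9"),
    ("LOAD", "R4", ("R6",), "LOAD R4,[R6]"),
    ("ALU", "R8", ("R7", "R1"), "SUB R8,R7,R1"),
    ("STORE", None, ("R8", "R6"), "STORE R8,[R6]"),
]

for long_instruction in pack(program):
    print(" | ".join(long_instruction))
```

The second and third long instructions consist mostly of NOPs, which shows why the quality of the compiler's schedule determines how well the functional units are used.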

Figure: the compiler finds independent instructions corresponding to the functional units and creates "long instructions" (VLIW processor words). If a corresponding and independent instruction is not found, a NOP is inserted ("-" in the VLIW instructions below).

- Dependent (cannot be executed in parallel): ADD R1,R2,R3 and SUB R7,R8,R1
- Independent (can be executed in parallel): ADD R1,R2,R3 and SUB R7,R5,R9

Example sequence of long VLIW instructions:

| ALU | ALU | FPU | LOAD | STORE |
|-----|-----|-----|------|-------|
| A   | -   | F   | L    | -     |
| A   | -   | F   | L    | -     |
| A   | A   | F   | L    | S     |
| -   | A   | -   | L    | -     |

A = ALU instruction, F = FPU instruction, L = LOAD instruction, S = STORE instruction, - = NOP instruction

Comparison: Superscalar vs. VLIW processor

### Superscalar processor

- Dynamic acquisition of several instructions (the CPU decides during execution)
  - **Complex realization**: instruction window, reorder buffer, register renaming, issuing more instructions at once

### VLIW processor

- Static schedule in long instructions (the compiler decides before execution)

