Computer Engineering – CSE – HCMUT
Parallel computing & Distributed systems
Chapter 2: Parallel Computer Architectures
Adapted from Prof. Thoai Nam/HCMUT
1
Parallel and Distributed Computing (c) Cuong Pham-Quoc/HCMUT
Outline
• Flynn’s Taxonomy
• Classification of Parallel Computers Based on
Architectures
2
Flynn’s Taxonomy
• Based on notions of instruction and data streams
– SISD (a Single Instruction stream, a Single Data stream)
– SIMD (a Single Instruction stream, Multiple Data streams)
– MISD (Multiple Instruction streams, a Single Data stream)
– MIMD (Multiple Instruction streams, Multiple Data streams)
• Popularity
– MIMD > SIMD > MISD
3
SISD
• SISD
– Conventional sequential machines
4
IS : Instruction Stream
DS : Data Stream
CU : Control Unit
PU : Processing Unit
MU : Memory Unit
[Diagram: the CU sends the IS to the PU; the PU exchanges the DS with the MU; I/O attaches to the system.]
SIMD
• SIMD
– Vector computers, processor arrays
– Special purpose computations
5
PE : Processing Element, LM : Local Memory
[Diagram: the CU broadcasts a single IS to PE1..PEn; each PEi exchanges a DS with its local memory LMi. The program is loaded from the host; the data sets are loaded from the host into the local memories.]
SIMD architecture with distributed memory
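In software terms, SIMD means one operation applied across many data elements at once. A minimal sketch in C (the function name and data are illustrative, not from the slides); a vectorizing compiler can map each iteration of this loop onto a SIMD lane:

```c
#include <stddef.h>

/* One instruction stream (multiply-add), many data streams (array elements).
 * A vectorizing compiler maps each loop iteration onto a SIMD lane. */
void saxpy(size_t n, float alpha, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];   /* same operation on every element */
}
```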
SIMD array
6
SIMD for convolution
7
SIMD for matrix multiplication
8
MISD
• MISD
– Systolic arrays
– Special purpose computations
9
[Diagram: a memory holding the program and data feeds one DS through PU1, PU2, ..., PUn in series; each PUi receives its own IS from its CUi; I/O attaches to the memory.]
MISD architecture (the systolic array)
MIMD
• MIMD
– General purpose parallel computers
10
[Diagram: each CUi issues its own IS to its PUi; every PUi exchanges a DS with the shared memory; each processor has its own I/O.]
MIMD architecture with shared memory
Classification based on Architecture
• Pipelined Computers
• Dataflow Architectures
• Data Parallel Systems
• Multiprocessors
• Multicomputers
11
Pipeline computers
• Instructions are divided into a number of steps (segments, stages)
• Several instructions can be in the machine at the same time, each being executed in a different stage
12
Pipeline computers
• IF: instruction fetch
• ID: instruction decode and register fetch
• EX: execution and effective address calculation
• MEM: memory access
• WB: write back
13
Cycle            1    2    3    4    5    6    7    8    9
Instruction i    IF   ID   EX   MEM  WB
Instruction i+1       IF   ID   EX   MEM  WB
Instruction i+2            IF   ID   EX   MEM  WB
Instruction i+3                 IF   ID   EX   MEM  WB
Instruction i+4                      IF   ID   EX   MEM  WB
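The diagram's timing generalizes: with k stages and ideal overlap, n instructions take k + (n - 1) cycles, so the five instructions above finish in 5 + 4 = 9 cycles. A small sketch (the helper name is mine, not from the slides):

```c
/* Ideal pipeline latency: the first instruction occupies k cycles;
 * every later instruction completes exactly one cycle after its
 * predecessor, assuming no hazards or stalls. */
unsigned pipeline_cycles(unsigned k_stages, unsigned n_instr) {
    if (n_instr == 0)
        return 0;
    return k_stages + (n_instr - 1);
}
```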
Hazards
• Situations that prevent starting the next instruction in the
next cycle
• Structural hazards
– A required resource is busy
• Data hazard
– Need to wait for previous instruction to complete its data
read/write
• Control hazard
– Deciding on control action depends on previous instruction
14
Pipeline computer
• Five stages, one step per stage
– IF: Instruction fetch from memory
– ID: Instruction decode & register read
– EX: Execute operation or calculate address
– MEM: Access memory operand
– WB: Write result back to register
15
Structural hazard
• Conflict for use of a resource
• In MIPS pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that cycle
• Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/
data memories
– Or separate instruction/data caches
16
Data hazards
• An instruction depends on completion of data access by a
previous instruction
add $s0, $t0, $t1    # $s0 written here...
sub $t2, $s0, $t3    # ...and read here, before the write has completed
17
Forwarding (aka Bypassing)
• Use result when it is computed
– Don’t wait for it to be stored in a register
– Requires extra connections in the datapath
18
Code scheduling
• Reorder code to avoid use of load result in the next
instruction
• C code for A = B + E; C = B + F;
19
Without scheduling:
lw   $t1, 0($t0)
lw   $t2, 4($t0)
add  $t3, $t1, $t2    # stalls: $t2 load still in flight
sw   $t3, 12($t0)
lw   $t4, 8($t0)
add  $t5, $t1, $t4    # stalls: $t4 load still in flight
sw   $t5, 16($t0)

Reordered (loads hoisted, no stalls):
lw   $t1, 0($t0)
lw   $t2, 4($t0)
lw   $t4, 8($t0)
add  $t3, $t1, $t2
sw   $t3, 12($t0)
add  $t5, $t1, $t4
sw   $t5, 16($t0)
Example of branch prediction
20
[Figure: pipelined execution of a branch; when the prediction is correct the pipeline proceeds without a bubble, and when it is incorrect the wrongly fetched instructions are flushed and fetching restarts at the correct target.]
Realistic branch prediction
• Static branch prediction
– Based on typical branch behavior
– Example: loop and if-statement branches
• Predict backward branches taken
• Predict forward branches not taken
• Dynamic branch prediction
– Hardware measures actual branch behavior
• e.g., record recent history of each branch
– Assume future behavior will continue the trend
• When wrong, stall while re-fetching, and update history
21
Control hazards
• Branch determines flow of control
– Fetching next instruction depends on branch outcome
– Pipeline can’t always fetch correct instruction
• Still working on ID stage of branch
• Solution
– Need to compare registers and compute target early in the
pipeline
– Add hardware to do it in ID stage
22
Branch prediction
• Longer pipelines can’t readily determine branch outcome
early
– Stall penalty becomes unacceptable
• Predict outcome of branch
– Only stall if prediction is wrong
• In MIPS pipeline
– Can predict branches not taken
– Fetch instruction after branch, with no delay
23
Dataflow architecture
• Data-driven model
– A program is represented as a directed acyclic graph in which a
node represents an instruction and an edge represents the
data dependency relationship between the connected nodes
– Firing rule
• A node can be scheduled for execution if and only if its input data
become valid for consumption
• Dataflow languages
– Id, SISAL, Silage, LISP,...
– Single-assignment, applicative (functional) languages
– Explicit parallelism
24
Dataflow graph - DFG
25
z = (a + b) * c
[Diagram: tokens a and b flow into a '+' node; its output and token c flow into a '*' node, which produces z.]
The dataflow representation of an arithmetic expression
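The firing rule can be mimicked in ordinary code: a node may execute only once all of its input tokens are ready. A toy sketch in C for z = (a + b) * c (all names are illustrative, not from the slides):

```c
#include <stdbool.h>

/* A token carries a value plus a "ready" flag (data availability). */
typedef struct { double val; bool ready; } token_t;

/* Firing rule: the node runs only if both input tokens are ready. */
static bool try_fire(token_t in1, token_t in2,
                     double (*op)(double, double), token_t *out) {
    if (!in1.ready || !in2.ready)
        return false;                    /* not yet fireable */
    out->val = op(in1.val, in2.val);
    out->ready = true;
    return true;
}

static double add(double x, double y) { return x + y; }
static double mul(double x, double y) { return x * y; }

/* z = (a + b) * c: the '*' node cannot fire until '+' has produced its token. */
double eval_dfg(double a, double b, double c) {
    token_t ta = {a, true}, tb = {b, true}, tc = {c, true};
    token_t sum = {0.0, false}, z = {0.0, false};
    try_fire(sum, tc, mul, &z);          /* fails: sum is not ready yet */
    try_fire(ta, tb, add, &sum);         /* '+' fires */
    try_fire(sum, tc, mul, &z);          /* now '*' fires */
    return z.val;
}
```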
Dataflow computers
• Execution of instructions is driven by data availability
– What is the difference between this and normal (control flow)
computers?
• Advantages
– Very high potential for parallelism
– High throughput
– Free from side effects
• Disadvantages
– Time lost waiting for unneeded arguments
– High control overhead
– Difficulty manipulating data structures
26
Dataflow computers: generic architecture
27
Dataflow representation
28
[Diagram: the unrolled dataflow graph of the loop below for i = 1..4: each iteration has a '/' node (di, ei → ai), a '*' node (ai, fi → bi), and a '+' node (bi, ci-1 → ci); the ci chain links consecutive iterations, starting from c0.]
input d,e,f
c0 = 0
for i from 1 to 4 do
begin
ai := di / ei
bi := ai * fi
ci := bi + ci-1
end
output a, b, c
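A direct C transcription of this loop makes the dependence structure visible: each ai and bi depends only on its own iteration's inputs, while ci chains to ci-1 (indices are shifted so that c[0] holds c0):

```c
/* a[i] = d[i]/e[i] and b[i] = a[i]*f[i] are independent across iterations;
 * only the running sum c[i+1] = b[i] + c[i] is inherently sequential. */
void loop_body(const double d[4], const double e[4], const double f[4],
               double a[4], double b[4], double c[5]) {
    c[0] = 0.0;                          /* c0 = 0 */
    for (int i = 0; i < 4; i++) {
        a[i] = d[i] / e[i];
        b[i] = a[i] * f[i];
        c[i + 1] = b[i] + c[i];
    }
}
```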
Execution: control flow machine
• Assume all the external inputs are available before entering the do loop
– + : 1 cycle, * : 2 cycles, / : 3 cycles
• How long does it take to execute this program on a dataflow computer with 4 processors?
29
[Timeline: a1, b1, c1, a2, b2, c2, ..., a4, b4, c4 executed back to back.]
Sequential execution on a uniprocessor in 24 cycles
Execution: dataflow machine
• Data-driven execution on a 4-processor dataflow computer
in 9 cycles
– Speedup: S = 24 / 9 ≈ 2.67×
• Can we further reduce the execution time of this program?
30
[Timeline: the divisions a1..a4 start in parallel, the multiplications b1..b4 follow, and the addition chain c1 → c2 → c3 → c4 completes at cycle 9.]
Data parallelism
• Programming model
– Operations performed in parallel on each element of data
structure
– Logically single thread of control, performs sequential or
parallel steps
– Conceptually, a processor associated with each data
element
31
Data parallelism
• SIMD Architectural model
– Array of many simple, cheap processors with little memory
each
• Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, cheap global
synchronization
32
Vector processor
• Instruction set includes operations on vectors as well as
scalars
• 2 types of vector computers
– Processor arrays
– Pipelined vector processors
33
Processor array
• A sequential computer connected to a set of identical processing elements that simultaneously perform the same operation on different data
34
[Diagram: a front-end computer (CPU, program and data memory, I/O processor) drives the processor array over the instruction path; the array consists of processing elements, each paired with a data memory, joined by an interconnection network over the data path; both sides have I/O.]
Processor array
Pipeline vector processor
• Streams vectors from memory to the CPU
• Uses pipelined arithmetic units to manipulate the data
– e.g., Cray-1, Cyber-205
35
Multiprocessor
• Consists of many fully programmable processors each
capable of executing its own program
• Shared address space architecture
• Classified into 2 types
– Uniform Memory Access (UMA) Multiprocessors
– Non-Uniform Memory Access (NUMA) Multiprocessors
36
UMA processor
• Uses a central switching mechanism to reach a centralized shared memory
• All processors have equal access time to global memory
• Tightly coupled system
• Problem: cache consistency
37
[Diagram: processors P1..Pn, each with its cache Ci, reach the memory banks and I/O through the switching mechanism.]
Pi : Processor i, Ci : Cache i
UMA processor
• Crossbar switching mechanism
38
[Diagram: a crossbar switch gives every processor (with its cache and I/O) a direct path to every memory module.]
UMA processor
• Shared bus mechanism
39
[Diagram: all processors (with caches and I/O) and all memory modules attach to a single shared bus.]
NUMA processor
• Distributed shared memory formed by combining the local memories of all processors
• Memory access time depends on whether the accessed location is local to the processor
• Should shared (particularly nonlocal) data be cached?
40
[Diagram: nodes, each containing a processor with cache and a local memory, connected by a network.]
Current types of multiprocessors
• PVP (Parallel Vector Processor)
– A small number of proprietary vector processors connected
by a high-bandwidth crossbar switch
• SMP (Symmetric Multiprocessor)
– A small number of COTS (commodity off-the-shelf) microprocessors connected by a high-speed bus or crossbar switch
• DSM (Distributed Shared Memory)
– Similar to SMP
– The memory is physically distributed among nodes.
41
Parallel vector processor
42
[Diagram: vector processors (VP) connected to shared-memory modules (SM) by a crossbar switch.]
VP : Vector Processor, SM : Shared Memory
Symmetric multiprocessor
43
[Diagram: microprocessor/cache nodes (P/C) connected to shared-memory modules (SM) by a bus or crossbar switch.]
P/C : Microprocessor and Cache, SM : Shared Memory
Distributed shared memory
44
[Diagram: nodes, each with a microprocessor and cache (P/C), local memory (LM), cache directory (DIR), and network interface circuitry (NIC) on a memory bus (MB), connected by a custom-designed network.]
MB : Memory Bus, P/C : Microprocessor & Cache, LM : Local Memory, DIR : Cache Directory, NIC : Network Interface Circuitry
Multicomputers
• Consists of many processors, each with its own memory
• No shared memory
• Processors interact via message passing → loosely coupled system
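The send/receive interaction can be sketched with an explicit mailbox standing in for the interconnection network (a toy, single-threaded model; all names are mine, not from the slides): each node touches only its own memory plus the mailbox, never another node's variables.

```c
/* A bounded FIFO mailbox: the only channel between two nodes.
 * send/recv fail gracefully when the "network" is full or empty. */
#define MBOX_CAP 8

typedef struct {
    int buf[MBOX_CAP];
    int head, tail, count;
} mailbox_t;

int mbox_send(mailbox_t *m, int msg) {
    if (m->count == MBOX_CAP)
        return -1;                       /* mailbox full */
    m->buf[m->tail] = msg;
    m->tail = (m->tail + 1) % MBOX_CAP;
    m->count++;
    return 0;
}

int mbox_recv(mailbox_t *m, int *msg) {
    if (m->count == 0)
        return -1;                       /* nothing to receive */
    *msg = m->buf[m->head];
    m->head = (m->head + 1) % MBOX_CAP;
    m->count--;
    return 0;
}
```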
45
[Diagram: microprocessor/cache nodes (P/C), each with its own memory (M), connected by a message-passing interconnection network.]
P/C : Microprocessor & Cache, M : Memory
Current types of multicomputers
• MPP (Massively Parallel Processing)
– Total number of processors > 1000
• Cluster
– Each node in system has less than 16 processors
• Constellation
– Each node in system has more than 16 processors
46
Massively parallel processing
47
[Diagram: nodes, each with a microprocessor and cache (P/C), local memory (LM), and network interface circuitry (NIC) on a memory bus (MB), connected by a custom-designed network.]
P/C : Microprocessor & Cache, MB : Memory Bus, NIC : Network Interface Circuitry, LM : Local Memory
Clusters
48
[Diagram: nodes, each with a microprocessor and cache (P/C), memory (M), and, behind a bridge on the I/O bus (IOB), a local disk (LD) and network interface circuitry (NIC); the nodes are connected by a commodity network (Ethernet, ATM, Myrinet, InfiniBand (VIA)).]
MB : Memory Bus, P/C : Microprocessor & Cache, M : Memory, LD : Local Disk, IOB : I/O Bus, NIC : Network Interface Circuitry
Constellations
49
[Diagram: nodes of at least 16 processors each: several P/Cs share SM modules through a hub, with a local disk (LD), NIC, and I/O controller (IOC) per node; the nodes are connected by a custom or commodity network.]
P/C : Microprocessor & Cache, SM : Shared Memory, NIC : Network Interface Circuitry, IOC : I/O Controller, LD : Local Disk