Introduction
• OpenMP is one of the most common parallel programming models in
use today
• OpenMP is an API for writing multithreaded applications
– A set of compiler directives (#pragma) and library routines for parallel
application programmers
– Supports C/C++ and Fortran
Parallel and Distributed Computing (c) Cuong Pham-Quoc/HCMUT
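Data sharing: firstprivate clause
• Each private copy is initialized from the value of the original (shared) variable
• Each thread gets its own copy of incr with an initial value of 0
• Because incr is counted per thread, the values stored in A depend on how the iterations are divided among the threads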
incr = 0;
#pragma omp parallel for firstprivate(incr)
for (i = 0; i <= MAX; i++) {
    if ((i % 2) == 0) incr++;
    A[i] = incr;
}
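The initialization behaviour can be checked with a smaller loop. The following is a minimal sketch, not part of the original slides: the function name fill_with_offset, the constant N, and the start value 100 are all assumptions made for illustration. Each thread only reads its private copy, so the result is the same for any number of threads.

```c
#define N 8   /* hypothetical array size, chosen for illustration */

/* Fills out[i] = 100 + i using a firstprivate start value. */
void fill_with_offset(int out[N])
{
    int base = 100;   /* value every private copy starts from */
    int i;

    #pragma omp parallel for firstprivate(base)
    for (i = 0; i < N; i++) {
        /* base is a per-thread copy initialized to 100, so reading it
           is safe; with private(base) it would be uninitialized here */
        out[i] = base + i;
    }
}
```

Calling fill_with_offset on an 8-element array should leave 100, 101, ..., 107 in it. Build with OpenMP enabled (for example -fopenmp with gcc) to run the loop in parallel.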
Data sharing: lastprivate clause
• The original (shared) variable is updated with the value from the sequentially last iteration
void sq2(int n, double *lastterm) {
    double x; int i;
    #pragma omp parallel for lastprivate(x)
    for (i = 0; i < n; i++) {
        x = a[i]*a[i] + b[i]*b[i];
        b[i] = sqrt(x);
    }
    *lastterm = x;
}
"x" has the value it held in the sequentially last iteration (i.e., for i = n-1)
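A self-contained variant of the same idea is sketched below. The function name last_square and the size N are hypothetical, chosen only so the copy-out value is easy to predict.

```c
#define N 16  /* hypothetical problem size */

/* Returns the value assigned to x in the sequentially last iteration. */
int last_square(void)
{
    int x = -1;   /* overwritten by the lastprivate copy-out */
    int i;

    #pragma omp parallel for lastprivate(x)
    for (i = 0; i < N; i++) {
        x = i * i;   /* each thread writes its own private x */
    }
    /* x now holds the value from iteration i == N-1, regardless of
       which thread executed that iteration or when it finished */
    return x;
}
```

With N set to 16, the function should return 15*15 = 225 for any thread count.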
Data sharing: test
• Consider this example of private and firstprivate
• Are A, B, and C local to each thread or shared inside the parallel region?
• What are their initial values inside the parallel region, and what are their values after it?
variables: A = 1, B = 1, C = 1
#pragma omp parallel private(B) firstprivate(C)
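One way to explore the question is to run a small experiment. The sketch below is an illustration, not the slide author's solution; the function name data_sharing_test and the result layout are assumptions. It relies on the OpenMP rules that A is shared, that the firstprivate copy of C starts at 1, that the private copy of B starts uninitialized, and (in OpenMP 3.0 and later) that the original B and C are left untouched by the region.

```c
/* result[0]: C as seen inside the region
   result[1..3]: A, B, C as seen after the region */
void data_sharing_test(int result[4])
{
    int A = 1, B = 1, C = 1;
    int c_inside = 0;

    #pragma omp parallel private(B) firstprivate(C)
    {
        #pragma omp single
        c_inside = C;   /* one thread records the initial value of its C */

        B = 10;         /* private B is uninitialized: write before reading */
        C = C + B;      /* changes only this thread's private copy */
    }

    result[0] = c_inside;
    result[1] = A;      /* shared and never written: still 1 */
    result[2] = B;      /* original B, unaffected by the private copies */
    result[3] = C;      /* original C, unaffected by the private copies */
}
```

When built with OpenMP enabled, all four entries are expected to be 1. Without OpenMP the pragmas are ignored, the assignments hit the original variables, and result[2] and result[3] differ.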
Data sharing: default clause
• Note that the default storage attribute is default(shared), so there is no need to specify it
– Exception: #pragma omp task
• To change the default: default(private)
– Each variable in the construct is made private, as if specified in a private clause
– Mostly saves typing
– Available in Fortran; C/C++ accept default(private) only from OpenMP 5.1 onward
• default(none): no default for variables in the static extent; the storage attribute of each such variable must be listed explicitly
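A short sketch of default(none) in C follows. The function scale_in_place and its parameters are hypothetical names used for illustration. Every variable referenced in the construct is listed explicitly, so the compiler reports any variable whose sharing was forgotten.

```c
/* Multiplies each element of x by a; every variable is listed explicitly. */
void scale_in_place(double *x, int n, double a)
{
    int i;

    /* default(none) makes the compiler reject any variable that is
       used in the construct but not named in a data-sharing clause */
    #pragma omp parallel for default(none) shared(x, n, a) private(i)
    for (i = 0; i < n; i++) {
        x[i] = a * x[i];
    }
}
```

Removing a name such as a from the shared clause should turn into a compile-time error when OpenMP is enabled, which is the main reason to prefer default(none) while debugging data-environment problems.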
Default clause: example
itotal = 1000;
#pragma omp parallel private(np, each)
{
    np = omp_get_num_threads();
    each = itotal/np;
}
is equivalent to
itotal = 1000;
#pragma omp parallel default(private) shared(itotal)
{
    np = omp_get_num_threads();
    each = itotal/np;
}
Exercise 5: Mandelbrot set area
• Mandelbrot set: the set of complex numbers c for which the function $f_c(z) = z^2 + c$ does not diverge when iterated from z = 0
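The definition can be turned into a simple membership test. The sketch below is an assumption-laden illustration and not the code of mandel.c: the function name in_mandelbrot, the iteration limit parameter, and the use of separate real and imaginary parts are choices made here. It uses the standard fact that once |z| exceeds 2 the iteration diverges.

```c
/* Returns 1 if c = cre + i*cim has not escaped after max_iter iterations
   (treated as "in the set"), and 0 if the orbit is known to diverge. */
int in_mandelbrot(double cre, double cim, int max_iter)
{
    double zre = 0.0, zim = 0.0;   /* start from z = 0 */

    for (int k = 0; k < max_iter; k++) {
        /* z = z*z + c, written out in real and imaginary parts */
        double t = zre * zre - zim * zim + cre;
        zim = 2.0 * zre * zim + cim;
        zre = t;

        if (zre * zre + zim * zim > 4.0)   /* |z| > 2: diverges */
            return 0;
    }
    return 1;
}
```

For example, c = 0 and c = -1 stay bounded, while c = 1 produces 1, 2, 5, ... and escapes. An area estimate can then count how many sample points in a grid pass this test.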
• The supplied programs (mandel.c & mandel_serial.c)
compute the area of a Mandelbrot set
– mandel.c: parallel version
– mandel_serial.c: serial version, the correct one
• The parallel program has been parallelized with OpenMP, but we were lazy and didn't do it right
• Find and fix the errors (hint: the problem is with the data environment):
– A wrong result is produced (different from the result generated by the serial version)
STREAMING SIMD EXTENSIONS (SSE)
SIMD Architecture
• SIMD = Single Instruction, Multiple Data
• A data-parallel architecture
• Applies the same instruction to many data elements
– Saves control logic
– A related architecture is the vector architecture
– SIMD and vector architectures offer high performance for vector/matrix-based operations
Vector operations
• Vector addition Z = X + Y
for (i = 0; i < n; i++) z[i] = x[i] + y[i];
• Vector scaling Y = a*X
for (i = 0; i < n; i++) y[i] = a*x[i];
• Dot product r = X · Y
r = 0; for (i = 0; i < n; i++) r += x[i]*y[i];
SISD vs. SIMD operations
• C = A + B
for (i=0; i<n; i++) c[i] = a[i] + b[i];
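To make the contrast concrete, here is a minimal sketch of the same loop in scalar (SISD) form and in SIMD form using SSE intrinsics. The function names add_scalar and add_sse are hypothetical. The SIMD version processes four floats per iteration with unaligned loads and stores, and falls back to scalar code for any leftover elements.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* SISD: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD: four additions per loop iteration, then a scalar tail. */
void add_sse(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats, no alignment needed */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                     /* fewer than 4 elements left */
        c[i] = a[i] + b[i];
}
```

Both functions should produce identical results, since each element is still computed by a single floating-point addition.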
x86 architecture SIMD support
• Current x86 processors from both AMD and Intel support SIMD operations in the ISA and in the microarchitecture
• ISA SIMD support
– MMX, 3DNow!, SSE, SSE2, SSE3, SSE4, AVX
• See the flags field in the output of "cat /proc/cpuinfo" on a Linux machine
• Use "sysctl -a | grep cpu.feat" on macOS
– SSE (Streaming SIMD Extensions): a SIMD instruction set extension to the x86 architecture
• Instructions for operating on multiple data elements simultaneously (vector operations)
• Microarchitecture support
– Many functional units
– 8 128-bit vector registers: XMM0, XMM1, ..., XMM7 (16 registers, XMM0 to XMM15, in 64-bit mode)
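The same information can also be queried from inside a program. The sketch below assumes GCC or Clang on an x86 target, where the builtins __builtin_cpu_init and __builtin_cpu_supports are available; other compilers need a different mechanism. The wrapper name cpu_has_sse2 is hypothetical.

```c
/* Returns 1 if the running CPU reports SSE2 support, else 0.
   Assumes GCC or Clang targeting x86. */
int cpu_has_sse2(void)
{
    __builtin_cpu_init();                        /* initialize the CPU feature data */
    return __builtin_cpu_supports("sse2") != 0;  /* feature name must be a string literal */
}
```

Other feature names accepted by the builtin include "sse", "sse3", "sse4.1", and "avx", so a program can choose a code path at run time instead of relying only on compile-time flags.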
SSE programming
• Vector registers support three data types:
– Integer (16 bytes, 8 shorts, 4 ints, 2 long long ints, or 1 dqword)
– Single-precision floating point (4 floats)
– Double-precision floating point (2 doubles)
SSE programming in C/C++
• Map to intrinsics
– An intrinsic is a function known to the compiler that maps directly to a sequence of one or more assembly language instructions. Intrinsic functions are inherently more efficient than called functions because no calling linkage is required.
• Intrinsics provide a C/C++ interface to processor-specific enhancements
• Supported by major compilers such as gcc
SSE intrinsics
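• Header files to access SSE intrinsics
– #include <mmintrin.h> //MMX
– #include <xmmintrin.h> //SSE
– #include <emmintrin.h> //SSE2
– #include <pmmintrin.h> //SSE3
– #include <tmmintrin.h> //SSSE3
– #include <smmintrin.h> //SSE4.1
• MMX, SSE, and SSE2 are supported on almost all x86 processors in use
• SSE4 is less widely supported on older processors
• When compiling, use -msse, -mmmx, -msse2 (this produces machine-dependent code)
– Some of these are enabled by default in gcc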
SSE intrinsics
• Data types (mapped to an xmm register)
__m128: float
__m128d: double
__m128i: integer
• Data movement and initialization
_mm_load_ps, _mm_loadu_ps, _mm_load_pd, _mm_loadu_pd, etc.
_mm_store_ps, _mm_setzero_ps
• Arithmetic intrinsics
_mm_add_ss, _mm_add_ps, _mm_add_pd, _mm_mul_pd
• More details: see the MSDN library from Microsoft
https://msdn.microsoft.com/en-us/library/kcwz153a(v=vs.90).aspx
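As an illustration of how these intrinsics combine, the following sketch computes a dot product of two float arrays. It is not taken from the course's example files. The function name dot_sse is hypothetical, and it also uses _mm_mul_ps and _mm_storeu_ps, which are standard SSE intrinsics not named in the list above.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Dot product of x and y (n floats each) using four partial sums. */
float dot_sse(const float *x, const float *y, int n)
{
    __m128 acc = _mm_setzero_ps();        /* four partial sums, all 0.0f */
    int i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);  /* unaligned loads of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }

    float lanes[4];
    _mm_storeu_ps(lanes, acc);            /* copy the 4 partial sums to memory */
    float r = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    for (; i < n; i++)                    /* scalar tail for leftover elements */
        r += x[i] * y[i];
    return r;
}
```

Because the partial sums are added in a different order than in the scalar loop, the result can differ from the scalar one in the last bits for general floating-point data.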
Example
• Check out ex1.c at https://www.dropbox.com/s/jf6lnc7df1wclru/ex1.c?dl=0
• Check out sapxy.c at https://www.dropbox.com/s/r1g1roydtspa4xv/sapxy.c?dl=0
• Please consult MSDN from Microsoft about the instructions used in those programs
SSE intrinsics
• Data alignment issue
– Some intrinsics may require memory to be aligned to 16 bytes.
– They may not work when memory is not aligned.
• See sapxy1.c at https://www.dropbox.com/s/ik7xiyy8q1gu0w5/sapxy1.c?dl=0
• Writing a more generic SSE routine
– Check memory alignment
– The slow path may not have any performance benefit with SSE
• See sapxy2.c at https://www.dropbox.com/s/tt4xznt5impan0v/sapxy2.c?dl=0
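The alignment check can be sketched as follows. This is an independent illustration and makes no claim about the contents of sapxy1.c or sapxy2.c. The names is_aligned16 and saxpy are assumptions, and the routine computes y = a*x + y. The aligned intrinsics _mm_load_ps and _mm_store_ps are used only when both pointers are 16-byte aligned; otherwise a plain scalar loop (the slow path) runs.

```c
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Nonzero if p is aligned to a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *x, float *y)
{
    int i = 0;

    if (is_aligned16(x) && is_aligned16(y)) {
        /* fast path: each step advances 16 bytes, so alignment is kept */
        __m128 va = _mm_set1_ps(a);            /* a in all four lanes */
        for (; i + 4 <= n; i += 4) {
            __m128 vx = _mm_load_ps(x + i);    /* requires 16-byte alignment */
            __m128 vy = _mm_load_ps(y + i);
            _mm_store_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
        }
    }

    /* slow path and tail: scalar code, no alignment requirement */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

An alternative design is to use _mm_loadu_ps and _mm_storeu_ps everywhere, which removes the check at the cost of possibly slower unaligned accesses on some processors.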
Summary
• Contemporary CPUs have SIMD support for vector operations
– On x86, SSE is one programming interface to it
• SSE can be accessed from high-level languages through intrinsic functions
• SSE programming requires careful attention to memory alignment
– Both for correctness and for performance
One more example
• Check out the division program at https://www.dropbox.com/s/k1ny1gv1pkogikj/division.zip?dl=0
References
• Intel® 64 and IA-32 Architectures Software Developer's Manuals (combined volumes 1-4)
– https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
• MSDN library, Microsoft
– https://msdn.microsoft.com/en-us/library/26td21ds(v=vs.90).aspx