Memory - Politechnika Gdańska

Memory in CUDA
Parallel Processing CUDA/CELL
Michał Wójcik
Department of Computer Systems Architecture (KASK)
Faculty of Electronics, Telecommunications and Informatics (ETI)
Politechnika Gdańska (Gdańsk University of Technology, PG)
January 13, 2014
Michał Wójcik (KASK, ETI, PG)
Memory
January 13, 2014
1 / 11
CUDA memory
Figure: Types of CUDA memory
CUDA memory types
Global Memory
  Typically implemented as Dynamic Random Access Memory (DRAM)
  Long access latencies (hundreds of clock cycles)
  Finite access bandwidth
Constant Memory
  Short latencies
  High bandwidth
  Read-only for kernels; declared statically, no dynamic allocation
Registers
  On-chip memory
Shared Memory
  On-chip memory
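The capacities of these memory types differ between GPUs. As a minimal sketch, the CUDA runtime can report them for the installed device; the helper name `PrintMemoryLimits` and the choice of device 0 are assumptions made for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the capacity of each memory type for device 0.
// Returns false if the device properties cannot be queried.
bool PrintMemoryLimits() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    printf("Global memory:           %zu bytes\n", prop.totalGlobalMem);
    printf("Constant memory:         %zu bytes\n", prop.totalConstMem);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    return true;
}
```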
CUDA variable and function declaration
Table: Variables in CUDA
Variable declaration                     Memory    Scope   Lifetime
Automatic variables other than arrays    Register  Thread  Kernel
Automatic array variables                Global    Thread  Kernel
__device__ __shared__ int SharedVar;     Shared    Block   Kernel
__device__ int GlobalVar;                Global    Grid    Application
__device__ __constant__ int ConstVar;    Constant  Grid    Application
Table: Functions in CUDA
Function declaration             Executed on the  Only callable from the
__device__ float DeviceFunc()    device           device
__global__ void KernelFunc()     device           host
__host__ float HostFunc()        host             host
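The two tables can be illustrated together in one small program. This is only a sketch: the identifiers are taken from the tables, while the function bodies, the array size of 64 and the launch configuration are assumptions made for the example.

```cuda
#include <cuda_runtime.h>

__constant__ float ConstVar;  // constant memory, grid scope, application lifetime
__device__ int GlobalVar;     // global memory, grid scope, application lifetime

// __device__ function: executed on the device, callable only from device code.
__device__ float DeviceFunc(float x) { return x * ConstVar; }

// __global__ function (kernel): executed on the device, launched from the host.
// Assumes it is launched with exactly one block of 64 threads.
__global__ void KernelFunc(float *out) {
    __shared__ float SharedVar[64];  // shared memory, block scope, kernel lifetime
    int i = threadIdx.x;             // automatic scalar: register, thread scope
    SharedVar[i] = DeviceFunc((float)i);
    __syncthreads();                 // make all writes visible to the whole block
    out[i] = SharedVar[63 - i];      // read a value written by another thread
    if (i == 0) GlobalVar = 1;       // visible to every thread of the grid
}

// __host__ function: executed on the host, callable only from the host.
__host__ float HostFunc() {
    float scale = 2.0f, result[64], *d_out;
    cudaMemcpyToSymbol(ConstVar, &scale, sizeof(float));  // set the constant from the host
    cudaMalloc(&d_out, sizeof(result));
    KernelFunc<<<1, 64>>>(d_out);
    cudaMemcpy(result, d_out, sizeof(result), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result[0];  // SharedVar[63] = 63 * 2.0f
}
```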
Reducing global memory traffic
Global memory is large but slow; shared memory is small but fast
Divide the data in global memory into tiles
Copy each tile from global memory into shared memory before using it
Tiling is efficient when multiple threads use the same portion of the data
The threads of a block should copy a tile from global to shared memory simultaneously and cooperatively
Tiles must be small enough not to exceed the shared memory size
Locality: focusing on a small subset of the input allows a much smaller shared memory to serve most of the accesses that would otherwise go to global memory
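The steps above can be sketched with the usual tiled matrix multiplication. The kernel name `TiledMatMul`, the tile size of 16 and the restriction to square row-major matrices whose width is a multiple of the tile size are assumptions chosen to keep the example short.

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size; width must be a multiple of it

// Computes P = M * N for square, row-major width x width matrices.
__global__ void TiledMatMul(const float *M, const float *N, float *P, int width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;
    float acc = 0.0f;

    for (int t = 0; t < width / TILE_WIDTH; ++t) {
        // Cooperative copy: each thread loads one element of each tile.
        Ms[ty][tx] = M[row * width + t * TILE_WIDTH + tx];
        Ns[ty][tx] = N[(t * TILE_WIDTH + ty) * width + col];
        __syncthreads();  // both tiles are complete before anyone reads them

        // Every loaded element is reused by TILE_WIDTH threads of the block.
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += Ms[ty][k] * Ns[k][tx];
        __syncthreads();  // everyone is done before the tiles are overwritten
    }
    P[row * width + col] = acc;
}
```

The kernel expects a launch with `dim3 block(TILE_WIDTH, TILE_WIDTH)` and `dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH)`. With this tile size each global memory element is fetched once per block instead of 16 times.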
Memory coalescing
All threads in a warp execute the same instruction
When that instruction accesses consecutive global memory locations, the accesses are combined (coalesced) into a single request
Coalescing allows the DRAMs to deliver data at a rate close to the peak global memory bandwidth
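The difference between the two access patterns can be sketched with a pair of copy kernels. Both kernel names and the `stride` parameter are hypothetical and exist only for this illustration; no timing figures are claimed here.

```cuda
#include <cuda_runtime.h>

// Coalesced: neighbouring threads of a warp read neighbouring elements,
// so the reads of one warp can be combined into few memory requests.
__global__ void CopyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read elements `stride` apart (wrapping at n),
// so the reads of one warp are scattered and cannot be combined as well.
__global__ void CopyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);
        out[i] = in[j];
    }
}
```

Both kernels perform the same number of loads per thread; only the mapping from thread index to address differs, which is what makes them convenient to compare in a profiler.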
Memory coalescing in reading matrix
Figure: Coalescing example
In each loop iteration, every thread copies one cell from global memory to shared memory.
Data prefetching (1)
Global memory has limited bandwidth, and its data accesses take a long time to complete.
The CUDA threading model tolerates long memory access latency by allowing some warps to make progress while others wait for their access results.
This mechanism is not sufficient when all threads have only a very small number of independent instructions between a memory access instruction and the consumer of the accessed data.
Data prefetching (2)
Without prefetching
Loop {
    Load current tile to shared memory
    __syncthreads()
    Compute current tile
    __syncthreads()
}

With prefetching
Load first tile from global memory into registers
Loop {
    Deposit tile from registers to shared memory
    __syncthreads()
    Load next tile from global memory into registers
    Compute current tile
    __syncthreads()
}
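The prefetching variant can be sketched as a tiled matrix multiplication kernel. The kernel name `TiledMatMulPrefetch`, the tile size of 16 and the restriction to square row-major matrices whose width is a multiple of the tile size are assumptions made for this illustration.

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size; width must be a multiple of it

// Computes P = M * N; the next tile is fetched while the current one is used.
__global__ void TiledMatMulPrefetch(const float *M, const float *N, float *P, int width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;
    int numTiles = width / TILE_WIDTH;

    // Load first tile from global memory into registers.
    float mReg = M[row * width + tx];
    float nReg = N[ty * width + col];
    float acc = 0.0f;

    for (int t = 0; t < numTiles; ++t) {
        // Deposit tile from registers to shared memory.
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load next tile from global memory into registers; these loads are
        // independent of the computation below and can overlap with it.
        if (t + 1 < numTiles) {
            mReg = M[row * width + (t + 1) * TILE_WIDTH + tx];
            nReg = N[((t + 1) * TILE_WIDTH + ty) * width + col];
        }

        // Compute current tile.
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    P[row * width + col] = acc;
}
```

The price of prefetching is two extra registers per thread (`mReg`, `nReg`), which matters when register usage limits the number of resident threads.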
Memory as limiting factor
Example for the G80
  128K (= 131,072) registers in total
  Each SM has 8K (= 8192) registers
  Each SM can accommodate up to 768 threads; with all of them resident, each thread can use at most ⌊8192/768⌋ = 10 registers
  Each SM has 16 kB of shared memory
  Each SM can accommodate up to 8 blocks; with all of them resident, each block can use at most 16 kB/8 = 2 kB of shared memory
Increasing the register usage per thread or the shared memory usage per block reduces the number of threads/blocks that can run on an SM
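The arithmetic above can be turned into a small host-side calculation. This is a simplified sketch: the function name `ResidentBlocksPerSM` is hypothetical, the limits are the G80 figures quoted above, and real hardware additionally rounds register and shared memory allocations to some granularity, which is ignored here.

```cuda
#include <algorithm>

// Per-SM limits of the G80 as quoted in the text.
const int kRegsPerSM = 8192;
const int kSharedBytesPerSM = 16 * 1024;
const int kMaxThreadsPerSM = 768;
const int kMaxBlocksPerSM = 8;

// Hypothetical helper: number of blocks that fit on one SM at the same time.
// Assumes threadsPerBlock > 0 and regsPerThread > 0.
__host__ int ResidentBlocksPerSM(int threadsPerBlock, int regsPerThread,
                                 int sharedBytesPerBlock) {
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;
    int byRegs = kRegsPerSM / (regsPerThread * threadsPerBlock);
    int byShared = sharedBytesPerBlock > 0
                       ? kSharedBytesPerSM / sharedBytesPerBlock
                       : kMaxBlocksPerSM;
    // The tightest of the four limits decides.
    return std::min({kMaxBlocksPerSM, byThreads, byRegs, byShared});
}
```

For blocks of 256 threads using 2 kB of shared memory, 10 registers per thread allow 3 resident blocks (768 threads), while 11 registers per thread allow only 2 blocks (512 threads): one extra register costs a third of the resident threads.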
Bibliography
[1] David B. Kirk and Wen-mei W. Hwu.
Programming Massively Parallel Processors: A Hands-on Approach.
Morgan Kaufmann, 2010.