Memory - Politechnika Gdańska
Memory in CUDA
Parallel Processing CUDA/CELL (Przetwarzanie Równoległe CUDA/CELL)
Michał Wójcik
Department of Computer Architecture (KASK), Faculty of Electronics, Telecommunications and Informatics (ETI), Gdańsk University of Technology (PG)
January 13, 2014

CUDA memory

[Figure: Types of CUDA memory]

CUDA memory types

- Global memory
  - Typically implemented as dynamic random-access memory (DRAM)
  - Long access latencies (hundreds of clock cycles)
  - Finite access bandwidth
- Constant memory
  - Short latencies
  - High bandwidth
  - Declared statically in the source code; no dynamic allocation
- Registers
  - On-chip memory
- Shared memory
  - On-chip memory

CUDA variable and function declaration

Table: Variables in CUDA

Variable declaration                     Memory    Scope   Lifetime
Automatic variables other than arrays    Register  Thread  Kernel
Automatic array variables                Global    Thread  Kernel
__device__ __shared__ int SharedVar;     Shared    Block   Kernel
__device__ int GlobalVar;                Global    Grid    Application
__device__ __constant__ int ConstVar;    Constant  Grid    Application

Table: Functions in CUDA

Function declaration             Executed on the  Only callable from the
__device__ float DeviceFunc()    device           device
__global__ void KernelFunc()     device           host
__host__ float HostFunc()        host             host

Reducing global memory traffic

- Trade-off: global memory is large but slow, shared memory is small but fast.
- Divide the data in global memory into tiles.
- Copy the tiles from global memory into shared memory.
- This is efficient when multiple threads use the same portion of the data.
- The threads of a block should copy a tile from global to shared memory simultaneously and cooperatively.
- A tile that is too large exceeds the available shared memory.
- Locality: when the computation focuses on a small subset of the input at a time, a much smaller shared memory can serve most of the accesses that would otherwise go to global memory.

Memory coalescing

- All threads in a warp execute the same instruction.
- When that instruction makes the threads access consecutive global memory locations, the hardware combines the accesses into a single request.
- Coalescing allows the DRAM to deliver data at a rate close to the peak global memory bandwidth.

Memory coalescing in reading matrix

[Figure: Coalescing example]

In each iteration of the loop, every thread copies one cell from global memory to shared memory.

Data prefetching (1)

- Global memory has limited bandwidth, and its data accesses take a long time to complete.
- The CUDA threading model tolerates long memory access latency by letting some warps make progress while others wait for their access results.
- This mechanism is not sufficient when every thread has only a few independent instructions between a memory access instruction and the instruction that consumes the accessed data.

Data prefetching (2)

Without prefetching:

    Loop {
        Load current tile to shared memory
        __syncthreads()
        Compute current tile
        __syncthreads()
    }

With prefetching:

    Load first tile from global memory into registers
    Loop {
        Deposit tile from registers to shared memory
        __syncthreads()
        Load next tile from global memory into registers
        Compute current tile
        __syncthreads()
    }

Memory as limiting factor

Example for G80:

- 128K (= 131,072) registers in total.
- Each SM has 8K (= 8,192) registers.
- Each SM can accommodate up to 768 threads; with all 768 resident, each thread can use at most 8,192 / 768 ≈ 10.67, that is 10 registers.
- Each SM has 16 KB of shared memory.
- Each SM can accommodate up to 8 blocks; with all 8 resident, each block can use at most 16 KB / 8 = 2 KB of shared memory.
- Higher register usage per thread or higher shared memory usage per block reduces the number of threads or blocks that can run on an SM.

Bibliography

[1] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
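
Sketch: variable and function qualifiers

The qualifiers listed under "CUDA variable and function declaration" can be seen together in one small program. This is a minimal sketch, not material from the lecture: the names scaleFactor, blocksDone, scaled, scaleAndReverse and runQualifierDemo, the block size of 128 and the test data are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 128  // assumed block size for this demo

__constant__ float scaleFactor;  // constant memory; scope: grid; lifetime: application
__device__ int blocksDone;       // global memory;   scope: grid; lifetime: application

// __device__ function: executed on the device, callable only from the device.
__device__ float scaled(float x) { return x * scaleFactor; }

// __global__ function (kernel): executed on the device, callable from the host.
__global__ void scaleAndReverse(const float* in, float* out)
{
    __shared__ float tile[BLOCK];              // shared memory; scope: block; lifetime: kernel
    int i = blockIdx.x * BLOCK + threadIdx.x;  // automatic scalar, normally kept in a register
    tile[threadIdx.x] = scaled(in[i]);
    __syncthreads();                           // the whole tile must be written before it is read
    out[i] = tile[BLOCK - 1 - threadIdx.x];    // reverse the elements within each block
    if (threadIdx.x == 0) atomicAdd(&blocksDone, 1);
}

// Hypothetical host helper: returns the number of wrong output elements,
// or -1 if the block counter does not match the grid size.
int runQualifierDemo()
{
    const int numBlocks = 4, n = numBlocks * BLOCK;
    std::vector<float> hIn(n), hOut(n);
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    float hScale = 2.0f;
    int zero = 0, done = 0;
    // Constant and __device__ variables are set from the host through their symbols.
    cudaMemcpyToSymbol(scaleFactor, &hScale, sizeof(float));
    cudaMemcpyToSymbol(blocksDone, &zero, sizeof(int));

    float *dIn, *dOut;
    cudaMalloc((void**)&dIn, n * sizeof(float));
    cudaMalloc((void**)&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scaleAndReverse<<<numBlocks, BLOCK>>>(dIn, dOut);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&done, blocksDone, sizeof(int));
    cudaFree(dIn); cudaFree(dOut);

    if (done != numBlocks) return -1;
    int wrong = 0;
    for (int b = 0; b < numBlocks; ++b)
        for (int t = 0; t < BLOCK; ++t)
            if (hOut[b * BLOCK + t] != 2.0f * hIn[b * BLOCK + BLOCK - 1 - t]) ++wrong;
    return wrong;
}

int main()
{
    printf("mismatches: %d\n", runQualifierDemo());
    return 0;
}
```

The constant and the counter outlive the kernel launch, which is why the host can read blocksDone back afterwards; the shared array exists only while a block of the kernel is running.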
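
Sketch: tiling with coalesced loads

The steps under "Reducing global memory traffic" and "Memory coalescing" can be sketched with a tiled matrix multiplication. The lecture does not give this code; the kernel name matMulTiled, the helper maxErrorTiled, the tile width of 16 and the test data are assumptions. With a tile width of 16, the two shared arrays take 2 × 16 × 16 × 4 bytes = 2 KB per block.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size

// Computes P = M * N for square, row-major width x width matrices.
__global__ void matMulTiled(const float* M, const float* N, float* P, int width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    float acc = 0.0f;
    int numTiles = (width + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int t = 0; t < numTiles; ++t) {
        int mCol = t * TILE_WIDTH + tx;
        int nRow = t * TILE_WIDTH + ty;
        // Each thread copies one cell per tile. Threads with consecutive tx
        // read consecutive addresses, so the loads can be coalesced.
        Ms[ty][tx] = (row < width && mCol < width) ? M[row * width + mCol] : 0.0f;
        Ns[ty][tx] = (nRow < width && col < width) ? N[nRow * width + col] : 0.0f;
        __syncthreads();  // the tile is complete before anyone reads it

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += Ms[ty][k] * Ns[k][tx];
        __syncthreads();  // everyone is done before the tile is overwritten
    }
    if (row < width && col < width)
        P[row * width + col] = acc;
}

// Hypothetical host helper: runs the kernel and returns the largest absolute
// difference from a plain CPU triple loop.
float maxErrorTiled(int width)
{
    size_t count = (size_t)width * width, bytes = count * sizeof(float);
    std::vector<float> hM(count), hN(count), hP(count);
    for (size_t i = 0; i < count; ++i) {
        hM[i] = (float)(i % 7);  // small integers keep float sums exact
        hN[i] = (float)(i % 5);
    }
    float *dM, *dN, *dP;
    cudaMalloc((void**)&dM, bytes);
    cudaMalloc((void**)&dN, bytes);
    cudaMalloc((void**)&dP, bytes);
    cudaMemcpy(dM, hM.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, hN.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (width + TILE_WIDTH - 1) / TILE_WIDTH;
    dim3 block(TILE_WIDTH, TILE_WIDTH), grid(blocks, blocks);
    matMulTiled<<<grid, block>>>(dM, dN, dP, width);
    cudaMemcpy(hP.data(), dP, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);

    float maxErr = 0.0f;
    for (int r = 0; r < width; ++r)
        for (int c = 0; c < width; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < width; ++k)
                ref += hM[r * width + k] * hN[k * width + c];
            maxErr = fmaxf(maxErr, fabsf(ref - hP[r * width + c]));
        }
    return maxErr;
}

int main()
{
    printf("max abs error: %g\n", maxErrorTiled(100));
    return 0;
}
```

Each element of M and N is read from global memory once per tile instead of once per multiplication, which is the traffic reduction the tiling is meant to achieve. The width of 100 is deliberately not a multiple of the tile width, so the boundary checks are exercised.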
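
Sketch: prefetching the next tile

The "With prefetching" loop from "Data prefetching (2)" might look as follows for the same matrix multiplication. Again this is only a sketch under assumptions: matMulPrefetch, loadOrZero and maxErrorPrefetch are hypothetical names, and the tile width of 16 is an arbitrary choice. The comments in the kernel repeat the steps of the pseudocode.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile size

// Reads A[r][c] from a row-major width x width matrix, or 0 outside it.
__device__ float loadOrZero(const float* A, int r, int c, int width)
{
    return (r < width && c < width) ? A[r * width + c] : 0.0f;
}

__global__ void matMulPrefetch(const float* M, const float* N, float* P, int width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;
    int numTiles = (width + TILE_WIDTH - 1) / TILE_WIDTH;

    // Load first tile from global memory into registers.
    float mReg = loadOrZero(M, row, tx, width);
    float nReg = loadOrZero(N, ty, col, width);

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        // Deposit tile from registers to shared memory.
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        // Load next tile from global memory into registers. The loads are
        // independent of the computation below, so they can overlap with it.
        if (t + 1 < numTiles) {
            mReg = loadOrZero(M, row, (t + 1) * TILE_WIDTH + tx, width);
            nReg = loadOrZero(N, (t + 1) * TILE_WIDTH + ty, col, width);
        }

        // Compute current tile.
        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    if (row < width && col < width)
        P[row * width + col] = acc;
}

// Hypothetical host helper: largest absolute difference from a CPU reference.
float maxErrorPrefetch(int width)
{
    size_t count = (size_t)width * width, bytes = count * sizeof(float);
    std::vector<float> hM(count), hN(count), hP(count);
    for (size_t i = 0; i < count; ++i) {
        hM[i] = (float)(i % 7);  // small integers keep float sums exact
        hN[i] = (float)(i % 5);
    }
    float *dM, *dN, *dP;
    cudaMalloc((void**)&dM, bytes);
    cudaMalloc((void**)&dN, bytes);
    cudaMalloc((void**)&dP, bytes);
    cudaMemcpy(dM, hM.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, hN.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (width + TILE_WIDTH - 1) / TILE_WIDTH;
    dim3 block(TILE_WIDTH, TILE_WIDTH), grid(blocks, blocks);
    matMulPrefetch<<<grid, block>>>(dM, dN, dP, width);
    cudaMemcpy(hP.data(), dP, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);

    float maxErr = 0.0f;
    for (int r = 0; r < width; ++r)
        for (int c = 0; c < width; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < width; ++k)
                ref += hM[r * width + k] * hN[k * width + c];
            maxErr = fmaxf(maxErr, fabsf(ref - hP[r * width + c]));
        }
    return maxErr;
}

int main()
{
    printf("max abs error: %g\n", maxErrorPrefetch(100));
    return 0;
}
```

The two staging variables mReg and nReg cost extra registers per thread, so prefetching has to be weighed against the register budget discussed under "Memory as limiting factor".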
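
Worked arithmetic: G80 resource limits

The figures under "Memory as limiting factor" follow from simple divisions, written out below. The last two lines go one step beyond the slide and rest on two assumptions that the slide does not state: a block size of 256 threads, and registers being allocated for whole blocks at a time.

```latex
\begin{align*}
\text{number of SMs} &= \frac{131{,}072}{8{,}192} = 16 \\
\text{registers per thread (768 threads resident)} &= \left\lfloor \frac{8{,}192}{768} \right\rfloor = \lfloor 10.67 \rfloor = 10 \\
\text{shared memory per block (8 blocks resident)} &= \frac{16\ \text{KB}}{8} = 2\ \text{KB} \\
\text{registers per block at 11 registers per thread} &= 256 \times 11 = 2{,}816 \\
\text{resident blocks per SM} &= \left\lfloor \frac{8{,}192}{2{,}816} \right\rfloor = 2 \quad (= 512 \text{ threads instead of } 768)
\end{align*}
```

Under those assumptions, a single extra register per thread would cost one third of the resident threads on each SM.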