Week 01 Exercises
Exercise_01_01
Question
01. Question: raw to digi conversion kernel
- Find the kernel that converts raw data to digis.
- Where is it launched?
- What is the execution configuration?
- How do we access individual threads in the kernel?
To search the source code, use CMSSW dxr (the software cross-reference).
Solution
01.a. Find the kernel that converts raw data to digis.
It is the RawToDigi_kernel, defined in RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu:
// Kernel to perform Raw to Digi conversion
__global__ void RawToDigi_kernel(const SiPixelROCsStatusAndMapping *cablingMap,
                                 const unsigned char *modToUnp,
                                 const uint32_t wordCounter,
                                 const uint32_t *word,
                                 const uint8_t *fedIds,
                                 uint16_t *xx,
                                 uint16_t *yy,
                                 uint16_t *adc,
                                 uint32_t *pdigi,
                                 ...
01.b. Where is it launched?
It is launched in the SiPixelRawToClusterGPUKernel::makeClustersAsync function:
void SiPixelRawToClusterGPUKernel::makeClustersAsync(bool isRun2,
                                                     const SiPixelClusterThresholds clusterThresholds,
                                                     const SiPixelROCsStatusAndMapping *cablingMap,
                                                     const unsigned char *modToUnp,
                                                     const SiPixelGainForHLTonGPU *gains,
                                                     const WordFedAppender &wordFed,
                                                     SiPixelFormatterErrors &&errors,
                                                     const uint32_t wordCounter,
                                                     ...
// Launch rawToDigi kernel
RawToDigi_kernel<<<blocks, threadsPerBlock, 0, stream>>>(
    cablingMap,
    modToUnp,
    wordCounter,
    word_d.get(),
    fedId_d.get(),
    digis_d.view().xx(),
    digis_d.view().yy(),
    digis_d.view().adc(),
    ...
01.c. What is the execution configuration?
For the RawToDigi_kernel the execution configuration is
<<<blocks, threadsPerBlock, 0, stream>>>
where
const int threadsPerBlock = 512;
const int blocks = (wordCounter + threadsPerBlock - 1) / threadsPerBlock;  // fill it all
In this case each block contains 512 threads, and the number of blocks is rounded up so that there is at least one thread per raw data word (wordCounter words in total); the third argument requests no dynamic shared memory, and the fourth runs the kernel on the CUDA stream stream.
01.d. How do we access individual threads in the kernel?
Inside the kernel each thread computes its global index from the built-in variables threadIdx, blockIdx, and blockDim:
int32_t first = threadIdx.x + blockIdx.x * blockDim.x;
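To illustrate the whole pattern, here is a minimal standalone sketch. The kernel processWords and its per-word increment are hypothetical (not part of CMSSW); it only mirrors the launch configuration above. Because the grid is rounded up, each thread checks that its global index is within wordCounter before doing any work.
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical kernel: one thread per input word, using the same
// <<<blocks, threadsPerBlock>>> pattern as RawToDigi_kernel.
__global__ void processWords(const uint32_t *word, uint32_t *out, uint32_t wordCounter) {
  int32_t first = threadIdx.x + blockIdx.x * blockDim.x;  // global thread index
  if (first < static_cast<int32_t>(wordCounter)) {        // the grid may be larger than the data
    out[first] = word[first] + 1;                         // placeholder per-word work
  }
}

int main() {
  const uint32_t wordCounter = 1000;
  std::vector<uint32_t> h_word(wordCounter, 42), h_out(wordCounter);

  uint32_t *word_d, *out_d;
  cudaMalloc(&word_d, wordCounter * sizeof(uint32_t));
  cudaMalloc(&out_d, wordCounter * sizeof(uint32_t));
  cudaMemcpy(word_d, h_word.data(), wordCounter * sizeof(uint32_t), cudaMemcpyHostToDevice);

  // Same execution configuration as in makeClustersAsync:
  // round up so that every word gets at least one thread.
  const int threadsPerBlock = 512;
  const int blocks = (wordCounter + threadsPerBlock - 1) / threadsPerBlock;
  processWords<<<blocks, threadsPerBlock>>>(word_d, out_d, wordCounter);

  cudaMemcpy(h_out.data(), out_d, wordCounter * sizeof(uint32_t), cudaMemcpyDeviceToHost);
  printf("out[0] = %u\n", h_out[0]);
  cudaFree(word_d);
  cudaFree(out_d);
  return 0;
}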
Exercise_01_02
Question
02. Question: host and device functions
- Give an example of __global__, __device__, and __host__ __device__ functions in CMSSW.
- Can you find an example where host and device code diverge? How is this achieved?
Solution
02.a. Give an example of __global__, __device__, and __host__ __device__ functions in CMSSW.
For a __global__ function, see the RawToDigi_kernel in the previous exercise.
A __device__ function in RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu:
__device__ pixelgpudetails::DetIdGPU getRawId(const SiPixelROCsStatusAndMapping *cablingMap,
                                              uint8_t fed,
                                              uint32_t link,
                                              uint32_t roc) {
  uint32_t index = fed * MAX_LINK * MAX_ROC + (link - 1) * MAX_ROC + roc;
  pixelgpudetails::DetIdGPU detId = {
      cablingMap->rawId[index], cablingMap->rocInDet[index], cablingMap->moduleId[index]};
  return detId;
}
A __host__ __device__ function in HeterogeneousCore/CUDAUtilities/interface/OneToManyAssoc.h:
__host__ __device__ __forceinline__ void add(CountersOnly const &co) {
  for (int32_t i = 0; i < totOnes(); ++i) {
#ifdef __CUDA_ARCH__
    atomicAdd(off.data() + i, co.off[i]);
#else
    auto &a = (std::atomic<Counter> &)(off[i]);
    a += co.off[i];
#endif
  }
}
02.b. Can you find an example where host and device code diverge? How is this achieved?
In the CUDA C Programming Guide we can read that:
The __device__ and __host__ execution space specifiers can be used together however, in which case the function is compiled for both the host and the device. The __CUDA_ARCH__ macro introduced in Application Compatibility can be used to differentiate code paths between host and device:
__host__ __device__ void func()
{
#if __CUDA_ARCH__ >= 800
// Device code path for compute capability 8.x
#elif __CUDA_ARCH__ >= 700
// Device code path for compute capability 7.x
#elif __CUDA_ARCH__ >= 600
// Device code path for compute capability 6.x
#elif __CUDA_ARCH__ >= 500
// Device code path for compute capability 5.x
#elif __CUDA_ARCH__ >= 300
// Device code path for compute capability 3.x
#elif !defined(__CUDA_ARCH__)
// Host code path
#endif
}
Based on this we can see how execution diverges in the add function shown above: the #ifdef __CUDA_ARCH__ guard selects atomicAdd on the device and a std::atomic update on the host, with the choice made at compile time for the device and host compilation passes respectively.
__host__ __device__ __forceinline__ void add(CountersOnly const &co) {
  for (int32_t i = 0; i < totOnes(); ++i) {
#ifdef __CUDA_ARCH__
    atomicAdd(off.data() + i, co.off[i]);
#else
    auto &a = (std::atomic<Counter> &)(off[i]);
    a += co.off[i];
#endif
  }
}
Exercise_01_03
Exercise
03. Exercise: Write a kernel in which
- if we're running on the device, each thread prints which block and thread it is associated with, for example block 1 thread 3
- if we're running on the host, each thread just prints host.
- Test your program! How can you "hide" your GPU? Try using CUDA_VISIBLE_DEVICES from the command line.
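A minimal sketch of one possible solution (the names report and reportKernel are arbitrary choices, not given by the exercise): a __host__ __device__ function diverges via __CUDA_ARCH__, printing the block and thread indices on the device and just host on the host.
#include <cstdio>

// Host/device behaviour selected at compile time via __CUDA_ARCH__.
__host__ __device__ void report() {
#ifdef __CUDA_ARCH__
  printf("block %d thread %d\n", blockIdx.x, threadIdx.x);
#else
  printf("host\n");
#endif
}

__global__ void reportKernel() { report(); }

int main() {
  report();                  // host call: prints "host"
  reportKernel<<<2, 4>>>();  // device call: 2 blocks of 4 threads each print their indices
  cudaDeviceSynchronize();   // flush device printf output before exiting
  return 0;
}
To "hide" the GPU, run the program with an empty or invalid device list, e.g. CUDA_VISIBLE_DEVICES= ./a.out; the CUDA runtime then sees no devices, the kernel launch fails, and only the host line is printed.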
Exercise_01_04
Exercise
04. Exercise: Fine-grained vs. coarse-grained parallelism: give examples in the MatMulKernel kernel of coarse-grained and fine-grained data parallelism (as defined in the CUDA abstraction model) as well as sequential execution.
// Thread block size
#define BLOCK_SIZE 16

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;
    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;

// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}

// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}

// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width = BLOCK_SIZE;
    Asub.height = BLOCK_SIZE;
    Asub.stride = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                + BLOCK_SIZE * col];
    return Asub;
}
Solution
04.a. Give examples in the MatMulKernel kernel of coarse-grained data parallelism.
Coarse-grained data parallelism in the CUDA programming model means partitioning the problem into coarse sub-problems that can be solved independently, in parallel, by blocks of threads.
For example, in the MatMulKernel:
// Block row and column
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
// Each thread block computes one sub-matrix Csub of C
Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
The computation of Csub is independent of the computation of the other sub-matrices of C. The work is divided between blocks, and no synchronization is performed between the computations of different sub-matrices of C.
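The coarse-grained decomposition is also visible in how the kernel is launched from the host. A minimal launch sketch, assuming device-resident matrices; the helper name launchMatMul is illustrative, while the grid and block shapes follow the MatMul host code in the CUDA C Programming Guide:
// Host-side launch sketch (assumption: d_A, d_B, d_C are Matrix structs whose
// elements already live in device memory).
void launchMatMul(Matrix d_A, Matrix d_B, Matrix d_C)
{
    // Coarse-grained: one thread block per BLOCK_SIZE x BLOCK_SIZE tile of C.
    // Fine-grained: one thread per element within each tile.
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(d_B.width / dimBlock.x, d_A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
}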
04.b. Give examples in the MatMulKernel kernel of fine-grained data parallelism.
Fine-grained data parallelism in the CUDA programming model refers to the finer pieces of each sub-problem, which can be solved cooperatively in parallel by all threads within a block.
For example, in the MatMulKernel:
// Get sub-matrix Asub of A
Matrix Asub = GetSubMatrix(A, blockRow, m);
// Get sub-matrix Bsub of B
Matrix Bsub = GetSubMatrix(B, m, blockCol);
// Shared memory used to store Asub and Bsub respectively
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load Asub and Bsub from device memory to shared memory
// Each thread loads one element of each sub-matrix
As[row][col] = GetElement(Asub, row, col);
Bs[row][col] = GetElement(Bsub, row, col);
// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();
Loading the sub-matrices of A and B from device memory into shared memory is executed in parallel by all threads within the block, each thread loading one element of each sub-matrix.
04.c. Give examples in the MatMulKernel kernel of sequential execution.
// Multiply Asub and Bsub together
for (int e = 0; e < BLOCK_SIZE; ++e)
    Cvalue += As[row][e] * Bs[e][col];
The computation of Cvalue within each thread is sequential: the thread executes BLOCK_SIZE multiplications and additions in a loop.
On the other hand, the computation of Cvalue is also a good example of fine-grained data parallelism, since each thread in the block computes one element in parallel with the others.
To identify fine-grained parallelism one just needs to look for block-level synchronization:
for (int e = 0; e < BLOCK_SIZE; ++e)
    Cvalue += As[row][e] * Bs[e][col];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();