CUDA C


Introduction

See https://developer.nvidia.com/about-cuda for details.

How to write code in CUDA C

This is a brief explanation of how to use the C++ high-level API functions of the CUDA runtime application programming interface to write code that runs on a single GPU. Since CUDA 6.0 was announced in late 2013, dramatic programming model improvements have been introduced, especially unified memory, which can largely simplify code by automating memory management between CPU and GPU. The following steps describe the traditional way to program; if you are interested in the newer model, please visit CUDA C/C++ unified memory. For a CUDA programming beginner, it is still recommended to begin with the traditional method.
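
As a taste of what unified memory looks like, here is a minimal sketch (the array name v is hypothetical and n is assumed to be defined); a single managed allocation is visible to both CPU and GPU:

// hypothetical unified-memory sketch: one allocation shared by host and device
double *v;
cudaMallocManaged(&v, sizeof(double)*n);   // no separate host/device copies needed
// ... fill v on the host, launch kernels on v directly ...
cudaDeviceSynchronize();                   // make sure the GPU is done before reading v on the host
cudaFree(v);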

Step 1

Allocate a specified amount of device memory for each variable that will be an operand in the GPU computation.

template<class T> __inline__ __host__ cudaError_t cudaMalloc (T **devPtr, size_t size)
(where "devPtr" is the device pointer to that variable)

a realistic example gives: "cudaMalloc((void**)&d_v,sizeof(double)*n)". d_v is a double array with n variables stored in device.

Step 2

Copy data from host memory to the device memory allocated in step 1.

__host__ cudaError_t CUDARTAPI cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)	

a realistic example gives: "cudaMemcpy(d_v,h_v,sizeof(double)*n,cudaMemcpyHostToDevice)". d_v is the pointer to the memory on device and h_v is to memory on host.

Step 3

Call the kernel function. (A kernel function defines the program executed by each individual thread.) The syntax is:

kernel_function<<<grid_size, block_size, shared_memory_size, stream>>>(arguments)

a realistic example gives: "Sum<<<n_1,n_2,sizeof(double)*n>>>(d_v)". n_1 and n_2 are both integers respectively representing the number of blocks and threads in each block. Also they can be replaced with a 2D vector or 3D vector defined by Dim3. In general, we ignore the stream parameter which will be set default automatically.

Step 4

Copy the results back from device memory to host memory. The call is the same as in step 2 except that "cudaMemcpyHostToDevice" changes to "cudaMemcpyDeviceToHost".
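
Continuing the sketch from the previous steps, the transfer back is a single call:

cudaMemcpy(h_v, d_v, sizeof(double)*n, cudaMemcpyDeviceToHost);   // device -> host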

Step 5

Free all the device memory allocated in step 1.

__host__ cudaError_t CUDARTAPI cudaFree(void *devPtr)
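
Putting the five steps together, a complete minimal program could look like the sketch below (the doubling kernel and all names are illustrative, not group code):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void Sum(double *d_v)                    // illustrative kernel: double each element
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_v[i] *= 2.0;
}

int main(void)
{
    int n = 256;
    double *h_v = (double*)malloc(sizeof(double)*n);
    for (int i = 0; i < n; i++)
        h_v[i] = (double)i;

    double *d_v;
    cudaMalloc((void**)&d_v, sizeof(double)*n);                      // step 1: allocate device memory
    cudaMemcpy(d_v, h_v, sizeof(double)*n, cudaMemcpyHostToDevice);  // step 2: host -> device
    Sum<<<n/64, 64>>>(d_v);                                          // step 3: 4 blocks of 64 threads
    cudaMemcpy(h_v, d_v, sizeof(double)*n, cudaMemcpyDeviceToHost);  // step 4: device -> host
    cudaFree(d_v);                                                   // step 5: free device memory

    printf("h_v[10] = %f\n", h_v[10]);   // expect 20.0
    free(h_v);
    return 0;
}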

Appendix

The steps above show only the most basic procedure for implementing a CUDA code. For a better understanding of GPU programming, and eventually mastery of it, it is highly recommended that beginners learn the actual architecture of the GPU and CPU, the different types of GPU memory and their relative efficiency in a program, and the more advanced libraries. A more detailed tutorial is given here: CUDA C/C++ Basics

Compile

To run on our cluster combo (all nodes except the Front End), compile the .cu code using

nvcc -m64 -arch=sm_35 main.cu -o main


CUDA debugger

cuda-gdb ships with the CUDA toolkit. To use the debugger, compile the program with nvcc, adding the flags "-g -G", and run the program. Then get the PID of the process and launch cuda-gdb with the program name and the PID to attach and enter debug mode.
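
A sketch of this workflow (main and <PID> are placeholders):

nvcc -g -G -arch=sm_35 main.cu -o main   # debug build: -g for host symbols, -G for device symbols
./main &                                 # start the program
ps aux | grep main                       # find the PID of the running process
cuda-gdb main <PID>                      # attach cuda-gdb to that process and enter debug mode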

Related information

Useful book
Useful link
Simplified CUDA function