Creating a New Project in CUDA

ammasajan
CERTIFIED EXPERT

This tutorial demonstrates how to create a new project for developing CUDA-enabled applications on the NVIDIA GPU platform.

Prerequisites:

GPU(s) - GeForce, Tesla, etc.
CUDA SDK - installed
CUDA Driver - installed
CUDA Toolkit - installed
CUDA Samples - installed with the Toolkit

Tutorial

1. Log in to the GPU machine (SSH access is also fine)

2. Set the PATH and library path variables - add to ~/.bash_profile

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib
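After adding these lines, a quick sanity check (an extra suggestion, not part of the original steps) confirms the compiler is visible on the PATH:

```shell
# Reload the profile so the new PATH takes effect in this shell
source ~/.bash_profile

# nvcc should now resolve from /usr/local/cuda/bin
which nvcc
nvcc --version
```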

3. Add /usr/local/cuda/lib (for 64-bit machines use /usr/local/cuda/lib64) to the dynamic linker configuration:

Create a file called gpu.conf under the /etc/ld.so.conf.d directory
Add /usr/local/cuda/lib64 (or /usr/local/cuda/lib) to gpu.conf

4. Run ldconfig as the root user
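Steps 3 and 4 can be done in one go. A sketch (run as root; the last line is an optional verification, not part of the original steps):

```shell
# Register the CUDA library directory with the dynamic linker
echo "/usr/local/cuda/lib64" > /etc/ld.so.conf.d/gpu.conf   # use .../lib on 32-bit machines
ldconfig

# Optional check: the CUDA runtime should now appear in the linker cache
ldconfig -p | grep libcudart
```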

5. Optionally, enable the profiler for the GPU:

export CUDA_PROFILE=1

(If you enable the CUDA profiler and then run your application, a file named cuda_profile.log appears in the current directory.)

Example:

# CUDA_PROFILE_LOG_VERSION 1.5
# CUDA_DEVICE 0 Tesla C1060
# TIMESTAMPFACTOR fd4920a156863f8
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 3.744 ] cputime=[ 2.000 ]
method=[ memcpyHtoD ] gputime=[ 3.968 ] cputime=[ 1.000 ]
method=[ _Z6vecAddPiS_S_ ] gputime=[ 6.656 ] cputime=[ 8.000 ] occupancy=[ 0.031 ]
method=[ memcpyDtoH ] gputime=[ 4.416 ] cputime=[ 17.000 ]

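The legacy command-line profiler is driven entirely by environment variables. Besides CUDA_PROFILE, a related variable from the same era can redirect the output file (the name below is as documented for CUDA 2.x-era toolkits; verify against your version):

```shell
export CUDA_PROFILE=1                     # turn profiling on
export CUDA_PROFILE_LOG=./my_profile.log  # write here instead of the default cuda_profile.log
```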



Basic development environment setup is done!


Hint: On 64-bit machines a cudart loader error may occur - to fix it, try these two steps:

ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so
ln -s /usr/lib64/libXi.so.6 /usr/lib64/libXi.so

Check the Installation

Edit /opt/sample/C/common/common.mk and set the CUDA install path to /usr/local/cuda
Go to /opt/sample/C
Running make compiles the samples; if any errors occur, recheck the previous steps

Optionally, execute a sample: /opt/sample/C/bin/linux/release/bandwidthTest

Create a New Project (assuming all the above steps completed successfully)

cd /opt/sample/C/src
cp -R template yourprojectName
cd yourprojectName

Change the Makefile:

# Add source files here
EXECUTABLE	:= yourprojectName
# Cuda source files (compiled with cudacc)
CUFILES		:= yourprojectName.cu


(Make your changes to the yourprojectName.cu and yourprojectName_kernel.cu files.)
make

Execute the GPU Program

../../bin/linux/release/yourprojectName

Sample Code:

Makefile:
 
################################################################################

# Add source files here
EXECUTABLE	:= saj
# Cuda source files (compiled with cudacc)
CUFILES		:= saj.cu
# C/C++ source files (compiled with gcc / c++)
CCFILES		:=


################################################################################
# Rules and targets

include ../../common/common.mk



Source Code

 
/*
Hello world program to compute the sum of two arrays of size N using the GPU
(blockDim, blockIdx and grid concepts are not used, so that anybody can get familiar with CUDA)

@author Sajan Kumar.S
@email: nospam+ammasajan[A.T]gmail[.]com
*/


#include <stdio.h>
#include <stdlib.h>
#define N 20 // 20 elements

__global__ void vecAdd(int *A, int *B, int *C){
        int i=threadIdx.x;

        __shared__ int s_A[N],s_B[N],s_C[N]; // the N value depends on the size of shared memory

        // copy the values to shared mem and attack! :D
        s_A[i]=A[i];
        s_B[i]=B[i];

        __syncthreads();

        s_C[i]=s_A[i]+s_B[i]; // calculate the sum of the elements
        __syncthreads();

        C[i]=s_C[i];
}

int main(){

        int *h_a=0,*h_b=0,*h_c=0;
        int *d_a=0,*d_b=0,*d_c=0;
        int memSize=N*sizeof(int);

        // allocate host memory of size N
        h_a=(int *)malloc(memSize);
        h_b=(int *)malloc(memSize);
        h_c=(int *)malloc(memSize);

        // allocate GPU memory of size N
        cudaMalloc((void **)&d_a,memSize);
        cudaMalloc((void **)&d_b,memSize);
        cudaMalloc((void **)&d_c,memSize);

        // init values of the A and B arrays (clearing the C array)
        for(int i=0;i<N;i++){
                h_a[i]=i+2;
                h_b[i]=i+3;
                h_c[i]=0;
        }

        // copy the values to the GPU arrays A and B
        cudaMemcpy(d_a,h_a,memSize,cudaMemcpyHostToDevice);
        cudaMemcpy(d_b,h_b,memSize,cudaMemcpyHostToDevice);

        // print the A array and B array on the CPU
        printf("\n Array A : \n");
        for(int i=0;i<N;i++)
                printf("%d\t",h_a[i]);
        printf("\n Array B : \n");
        for(int i=0;i<N;i++)
                printf("%d\t",h_b[i]);
        printf("\ncalculating sum : ");
        vecAdd<<<1, N>>>(d_a,d_b,d_c);

        // copy the output C from the GPU to host memory
        cudaMemcpy(h_c,d_c,memSize,cudaMemcpyDeviceToHost);

        printf("\nSum of Arrays: \n");
        for(int i=0;i<N;i++)
                printf("%d\t",h_c[i]);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);

        free(h_a);
        free(h_b);
        free(h_c);

        return 0;
}




References:

1. Developer docs
2. en.wikipedia.org/wiki/CUDA
