
Creating a New Project in CUDA


This tutorial demonstrates how to create a new project for developing CUDA-enabled applications on an NVIDIA GPU platform.

Prerequisites:

GPU(s) - GeForce, Tesla, etc.
CUDA SDK - installed
CUDA driver - installed
CUDA Toolkit - installed
CUDA samples - installed with the Toolkit

Tutorial

1. Log in to the GPU machine (SSH access is also fine)

2. Set the PATH variable - add the following to ~/.bash_profile:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib

3. Add /usr/local/cuda/lib (on 64-bit machines, use /usr/local/cuda/lib64) to the dynamic linker configuration:

Create a file called gpu.conf under the /etc/ld.so.conf.d directory
Add /usr/local/cuda/lib64 to gpu.conf

4. Run ldconfig as the root user

5. Enable the profiler for the GPU (optional):

export CUDA_PROFILE=1

(If you enable the CUDA profiler and run your application, a file named cuda_profile.log will appear in the current directory.)
Example:
 
# CUDA_PROFILE_LOG_VERSION 1.5
# CUDA_DEVICE 0 Tesla C1060
# TIMESTAMPFACTOR fd4920a156863f8
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 3.744 ] cputime=[ 2.000 ] 
method=[ memcpyHtoD ] gputime=[ 3.968 ] cputime=[ 1.000 ] 
method=[ _Z6vecAddPiS_S_ ] gputime=[ 6.656 ] cputime=[ 8.000 ] occupancy=[ 0.031 ] 
method=[ memcpyDtoH ] gputime=[ 4.416 ] cputime=[ 17.000 ]




The basic development environment setup is done!


Hint: On 64-bit machines a cudart loader error may occur; to fix it, create these two symbolic links:

ln -s /usr/local/cuda/lib64/libcudart.so /usr/lib/libcudart.so
ln -s /usr/lib64/libXi.so.6 /usr/lib64/libXi.so

Check Installation

Edit /opt/sample/C/common/common.mk and set the CUDA install path to /usr/local/cuda
Go to /opt/sample/C
Run make to compile the samples; if any errors occur, re-check the previous steps

Execute a sample (optional): /opt/sample/C/bin/linux/release/bandwidthTest

Create a new project (assuming all the above steps completed successfully):

cd /opt/sample/C/src
cp -R template yourprojectName
cd yourprojectName
Change the Makefile:
# Add source files here
EXECUTABLE	:= yourprojectName
# Cuda source files (compiled with cudacc)
CUFILES		:= yourprojectName.cu


(Make your changes in the yourprojectName.cu and yourprojectName_kernel.cu files.)
make

Execute the GPU program

../../bin/linux/release/yourprojectName

Sample Code:

Makefile:
 
################################################################################

# Add source files here
EXECUTABLE	:= saj
# Cuda source files (compiled with cudacc)
CUFILES		:= saj.cu
# C/C++ source files (compiled with gcc / c++)
CCFILES		:=


################################################################################
# Rules and targets

include ../../common/common.mk



Source Code

 
/*
Hello world program to compute the sum of two arrays of size N using the GPU
(blockDim, blockIdx, and grid concepts are not used, so that anybody can get familiar with CUDA)

@author Sajan Kumar.S
@email: nospam+ammasajan[A.T]gmail[.]com
*/


#include <stdio.h>
#include <stdlib.h>
#define N 20 // 20 elements

__global__ void vecAdd(int *A, int *B, int *C){
        int i = threadIdx.x;

        __shared__ int s_A[N], s_B[N], s_C[N]; // N value depends on the size of shared memory

        // copy the values to shared memory
        s_A[i] = A[i];
        s_B[i] = B[i];

        __syncthreads();

        // the direct version without shared memory would be: C[i] = A[i] + B[i];
        s_C[i] = s_A[i] + s_B[i]; // calculate the sum of the elements
        __syncthreads();

        C[i] = s_C[i];
}

int main(){

        int *h_a=0,*h_b=0,*h_c=0;
        int *d_a=0,*d_b=0,*d_c=0;
        int memSize=N*sizeof(int);

        // allocate host memory size of N
        h_a=(int *)malloc(memSize);
        h_b=(int *)malloc(memSize);
        h_c=(int *)malloc(memSize);

        // allocate GPU memory size of N
        cudaMalloc((void **)&d_a,memSize);
        cudaMalloc((void **)&d_b,memSize);
        cudaMalloc((void **)&d_c,memSize);

        // initialize values in arrays A and B (and clear array C)
        for(int i=0;i<N;i++){
                h_a[i]=i+2;
                h_b[i]=i+3;
                h_c[i]=0;
        }

        // copy the values to GPU arrays A and B
        cudaMemcpy(d_a,h_a,memSize,cudaMemcpyHostToDevice);
        cudaMemcpy(d_b,h_b,memSize,cudaMemcpyHostToDevice);

        // printing the A array and B array on CPU
        printf("\n Array A : \n");
        for(int i=0;i<N;i++)
                printf("%d\t",h_a[i]);
        printf("\n Array B : \n");
        for(int i=0;i<N;i++)
                printf("%d\t",h_b[i]);
        printf("\nCalculating sum : ");
        vecAdd<<<1, N>>>(d_a,d_b,d_c);

        // copy the output array C from GPU to host memory
        cudaMemcpy(h_c,d_c,memSize,cudaMemcpyDeviceToHost);

        printf("\nSum of Arrays: \n");
        for(int i=0;i<N;i++)
                printf("%d\t",h_c[i]);

        cudaFree(d_a);
        cudaFree(d_b);
        cudaFree(d_c);

        free(h_a);
        free(h_b);
        free(h_c);

        return 0;
}





Author: ammasajan