C++ - Figuring out how many blocks and threads to use for a CUDA kernel, and how to use them
I have been trying to figure out how to write what I thought would be a simple kernel to take the average of the values in a 2D matrix, but I am having some issues getting my thought process straight on it.
According to my deviceQuery output, my GPU has 16 MPs, 32 cores/MP, maximum block dimensions of 1024x1024x64, and a maximum of 1024 threads per block.
So, I am working on processing some large images, maybe 5000px x 3500px or something like that. One of my kernels is taking an average of some values across all the pixels in the image.
The existing code has the images stored as a 2D array [rows][cols]. So that kernel, in C, looks like you'd expect, with a loop over the rows, a loop over the cols, and the calculation in the middle.
So how do I set up the dimension calculation portion of this code in CUDA? I have looked at the reduction code in the SDK, but that is for a single-dimension array. It doesn't have any mention of how to set the number of blocks and threads when you have something 2D.
I am thinking I'd need to set it up like so, and this is where I'd like someone to chime in and help:
    num_threads = 1024;
    blocksx = num_cols / sqrt(num_threads);
    blocksy = num_rows / sqrt(num_threads);
    num_blocks = (num_rows * num_cols) / (blocksx * blocksy);
    dim3 dimblock(blocksx, blocksy, 1);
    dim3 dimgrid(num_blocks, 1, 1);
Does this seem to make sense for the setup?
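For comparison, a common way to size a 2D launch is to fix the block shape first and then round the grid up so that it covers the whole image. The following is a minimal host-side sketch in plain C++, not CUDA itself: `Dim3` is a stand-in for CUDA's `dim3`, `blocks_for` and `grid_for_image` are hypothetical helper names, and mapping x to columns and y to rows is an assumption.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover n items with
// block_dim threads each, rounding up so a partial block covers the tail.
inline unsigned blocks_for(unsigned n, unsigned block_dim) {
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

// Plain stand-in for CUDA's dim3, so this compiles without the toolkit.
struct Dim3 { unsigned x, y, z; };

// Grid that tiles a num_rows x num_cols image with block.x by block.y tiles.
// Assumption: x spans columns and y spans rows.
inline Dim3 grid_for_image(unsigned num_rows, unsigned num_cols, Dim3 block) {
    return Dim3{ blocks_for(num_cols, block.x), blocks_for(num_rows, block.y), 1 };
}
```

Assuming the 5000px side is the column count and the 3500px side is the row count, a 32x32 block (1024 threads, the per-block limit quoted above) would give a 157 x 110 grid. The last block in each direction is only partly inside the image, so a kernel launched this way would need a bounds check.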
And then in the kernel, to work on a particular row or column, I'd have to use

    rowidx = (blockIdx.x * blockDim.x) + threadIdx.x;
    colidx = (blockIdx.y * blockDim.y) + threadIdx.y;

At least I think that would work for getting a row and column.
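The index arithmetic can be checked on the host before it goes into a kernel. This is a rough sketch in plain C++ that emulates what a single thread would compute: `Idx2` stands in for the built-in `blockIdx`, `blockDim` and `threadIdx`, and `pixel_index` is a hypothetical name. It assumes x maps to columns and y to rows (the reverse of the snippet above, and a frequent choice for row-major images), plus a guard for threads in partial edge blocks.

```cpp
#include <cassert>

// Plain stand-in for CUDA's built-in blockIdx / blockDim / threadIdx.
struct Idx2 { unsigned x, y; };

// Returns the row-major linear index of the pixel this "thread" owns,
// or -1 when the thread falls outside the image (partial edge block).
inline long long pixel_index(Idx2 block_idx, Idx2 block_dim, Idx2 thread_idx,
                             unsigned num_rows, unsigned num_cols) {
    assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y);
    unsigned col = block_idx.x * block_dim.x + thread_idx.x;
    unsigned row = block_idx.y * block_dim.y + thread_idx.y;
    if (row >= num_rows || col >= num_cols) return -1;  // out-of-image guard
    return static_cast<long long>(row) * num_cols + col;
}
```

In a real kernel the same two lines for `row` and `col` would use the CUDA built-ins directly, and the guard would become an early `return`.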
How do I then access that particular row r and column c in the kernel? In the CUDA Programming Guide I found the following code:
    // Host code
    int width = 64, height = 64;
    float* devPtr;
    size_t pitch;
    cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
    MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

    // Device code
    __global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
    {
        for (int r = 0; r < height; ++r) {
            // Rows are pitch bytes apart, so offset a byte pointer
            float* row = (float*)((char*)devPtr + r * pitch);
            for (int c = 0; c < width; ++c) {
                float element = row[c];
            }
        }
    }
Which looks similar to how you'd use malloc in C to declare a 2D array, but it doesn't have any mention of accessing that array in your own kernel. I guess in my code, I will use that cudaMallocPitch call, and then perform a memcpy to get my data into the 2D array on the device?
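The pitch arithmetic from the guide's snippet can be tried out on the host. Below is a small sketch in plain C++, with no CUDA calls: `row_ptr` and `copy_to_pitched` are hypothetical helpers, and the padded buffer only imitates what `cudaMallocPitch` returns. On the device side, the copy itself is normally done with `cudaMemcpy2D`, which takes the source and destination pitches as arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Address of row r in a pitched buffer: rows are pitch BYTES apart,
// so the offset is applied to a byte pointer before casting back.
inline float* row_ptr(void* base, std::size_t pitch, int r) {
    return reinterpret_cast<float*>(static_cast<char*>(base) + r * pitch);
}

// Host-side stand-in for a pitched 2D copy: copies a tightly packed
// rows x cols image into a buffer whose rows are pitch bytes apart.
inline void copy_to_pitched(void* dst, std::size_t pitch,
                            const float* src, int rows, int cols) {
    assert(pitch >= cols * sizeof(float));
    for (int r = 0; r < rows; ++r)
        std::memcpy(row_ptr(dst, pitch, r), src + r * cols, cols * sizeof(float));
}
```

Inside a kernel, element (r, c) would then be read as `row_ptr(devPtr, pitch, r)[c]`, which is the same expression the guide writes inline.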
Any tips appreciated! Thanks!
Recently, I figured out this question in the following fashion.
    // Grid and block size
    const dim3 blocksize(16, 16, 1);
    const dim3 gridsize(numrows, numcols, 1);

    // Kernel call
    rgba_to_greyscale<<<gridsize, blocksize>>>(d_rgbaimage, d_greyimage, numrows, numcols);
gridsize = number of blocks
blocksize = threads per block
Here is the corresponding kernel:
    __global__ void rgba_to_greyscale(const uchar4* const rgbaimage,
                                      unsigned char* const greyimage,
                                      int numrows, int numcols)
    {
        // One pixel per block: the linear index depends only on blockIdx
        int idx = blockIdx.x + blockIdx.y * numrows;
        uchar4 pixel = rgbaimage[idx];
        // Weighted sum of the R, G and B channels
        float intensity = 0.299f * pixel.x + 0.587f * pixel.y + 0.114f * pixel.z;
        greyimage[idx] = static_cast<unsigned char>(intensity);
    }
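One way to convince yourself that this indexing reaches every pixel is to replay the block grid on the host. The sketch below is plain C++, and `blocks_per_pixel` is a hypothetical name; it walks the same numrows x numcols grid of blocks and counts how many blocks land on each linear index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the block grid of the launch above on the host: one block per
// (blockIdx.x, blockIdx.y) pair, with x in [0, numrows) and y in [0, numcols).
// Returns how many blocks touch each linear index.
inline std::vector<int> blocks_per_pixel(int numrows, int numcols) {
    assert(numrows > 0 && numcols > 0);
    std::vector<int> hits(static_cast<std::size_t>(numrows) * numcols, 0);
    for (int by = 0; by < numcols; ++by)
        for (int bx = 0; bx < numrows; ++bx)
            ++hits[bx + by * numrows];  // same formula as the kernel
    return hits;
}
```

Every entry comes out as exactly one block, so no pixel is skipped or shared between blocks. Since the kernel never reads threadIdx, though, all 256 threads of a 16x16 block compute the same idx and repeat the same pixel; a 1x1 block, or an index that also uses threadIdx and blockDim with a much smaller grid, would avoid that duplicated work.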
Good luck!!!