Data Parallel Computing


When modern software applications run slowly, the problem is usually the data: there is simply too much of it to process.

Terms

host = the CPU
devices = GPUs
kernels = the device code, marked with CUDA keywords, for data-parallel functions and their associated helper functions and data structures
grid = all the threads that are generated by a kernel launch
thread = a simplified view of how a processor executes a sequential program in modern computers. It consists of the following:
  • the code of the program
  • the particular point in the code that is being executed
  • the values of its variables and data structures
  • sequential execution
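A minimal sketch tying these terms together (the kernel name whoAmI and the 2 x 4 launch configuration are illustrative, not from the text): the host launches a kernel, the launch generates a grid of threads on the device, and each thread can see its own block and thread indices.

    #include <cstdio>

    // Illustrative kernel: each thread in the grid prints its own coordinates.
    __global__ void whoAmI() {
        printf("block %u, thread %u\n", blockIdx.x, threadIdx.x);
    }

    int main() {
        // Host code launches the kernel: a grid of 2 blocks x 4 threads = 8 device threads.
        whoAmI<<<2, 4>>>();
        cudaDeviceSynchronize();  // wait for the device-side printf output to complete
        return 0;
    }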

  • CUDA API for managing device global memory

    cudaMalloc()
  • Allocates an object in the device global memory
  • Two parameters
    - Address of a pointer to the allocated object
    - Size of the allocated object in bytes
  • The address of the pointer variable should be cast to (void **) because the function expects a generic pointer; the memory allocation function is a generic function that is not restricted to any particular type of object

    cudaFree()
  • Frees an object from device global memory
  • One parameter
    - Pointer to the freed object

    cudaMemcpy()
  • Memory data transfer
  • Requires four parameters
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type/direction of transfer (e.g. cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyHostToHost); a short usage sketch of these calls follows this list
  • The vecAdd function, outlined in Figure 2.6, allocates device memory, requests data transfers, and launches the kernel that performs the actual vector addition.
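    A compact usage sketch of the three memory-management calls listed above. The names h_data and d_data and the element count are illustrative only, not from the text:

    int size = 256 * sizeof(float);
    float h_data[256];      // host array (illustrative)
    float *d_data;          // pointer that will hold the device address

    cudaMalloc((void **) &d_data, size);                        // allocate device global memory
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // copy host -> device
    // ... launch a kernel that reads and writes d_data ...
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // copy device -> host
    cudaFree(d_data);                                           // release the device memory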

    Vector Add (the "Hello World" of parallel programming)


    A vector addition kernel function
    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition 
    __global__
    void vecAddKernel(float* A, float* B, float* C, int n) {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C[i] = A[i] + B[i];
    }
    __global__ indicates that the function is a kernel and that it can be called from a host function to generate a grid of threads on a device.
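    For example, with 256 threads per block (blockDim.x = 256), the thread with blockIdx.x = 2 and threadIdx.x = 5 computes i = 5 + 256 * 2 = 517 and therefore adds A[517] and B[517]. The if (i < n) test keeps threads whose index falls at or beyond n from touching memory outside the vectors.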

    Vector addition kernel launch statement
    void vecAdd(float* A, float* B, float* C, int n)
    {
        // d_A, d_B, d_C allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
        // <<<number of blocks in the grid, number of threads in each block>>>
    }
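    To make the launch configuration concrete (the value n = 1000 is only an illustration): ceil(1000/256.0) = 4, so the grid consists of 4 blocks of 256 threads, 1024 threads in total, and the last 24 threads are disabled by the if (i < n) test in the kernel. The divisor is written as 256.0 so the division is done in floating point and ceil() can round the result up.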
    Final host code in vecAdd
    void vecAdd(float* A, float* B, float* C, int n)
    {
      int size = n * sizeof(float);
      float *d_A, *d_B, *d_C;
    
      // Allocate device memory for the three vectors and copy the inputs to the device
      cudaMalloc((void **) &d_A, size);
      cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
      cudaMalloc((void **) &d_B, size);
      cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
      cudaMalloc((void **) &d_C, size);

      // Launch ceil(n/256) blocks of 256 threads each
      vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

      // Copy the result vector back to the host
      cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    
      // Free device memory for A, B, C
      cudaFree(d_A); cudaFree(d_B); cudaFree (d_C);
    }
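    A small host-side driver can exercise this function. The sketch below is not from the text; the vector length and initialization values are arbitrary, chosen only to show how vecAdd is called from ordinary host code.

    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        int n = 1000;
        int size = n * sizeof(float);

        // Allocate and initialize the host vectors
        float *A = (float *) malloc(size);
        float *B = (float *) malloc(size);
        float *C = (float *) malloc(size);
        for (int i = 0; i < n; i++) { A[i] = i; B[i] = 2 * i; }

        // vecAdd handles all device allocation, transfers, and the kernel launch
        vecAdd(A, B, C, n);

        printf("C[999] = %f\n", C[999]);  // expected 999 + 1998 = 2997
        free(A); free(B); free(C);
        return 0;
    }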

    Error Check

    cudaError_t err = cudaMalloc((void **) &d_A, size);

    if (err != cudaSuccess) {
      printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
      exit(EXIT_FAILURE);
    }
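    In practice every CUDA API call can fail, so the same check is usually applied to each call without repeating the boilerplate. The macro below is a sketch of that idea and is not from the text; the name CHECK_CUDA is arbitrary.

    #include <stdio.h>
    #include <stdlib.h>

    // Hypothetical convenience macro: evaluates a CUDA runtime call and aborts
    // with a descriptive message if it did not return cudaSuccess.
    #define CHECK_CUDA(call)                                          \
      do {                                                            \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
          printf("%s in %s at line %d\n",                             \
                 cudaGetErrorString(err), __FILE__, __LINE__);        \
          exit(EXIT_FAILURE);                                         \
        }                                                             \
      } while (0)

    // Usage:
    // CHECK_CUDA(cudaMalloc((void **) &d_A, size));
    // CHECK_CUDA(cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice));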