• Welcome to Jose's Read Only Forum 2023.
 

OpenCL: The first PowerBASIC example on the planet :)

Started by Petr Schreiber, November 04, 2009, 08:21:40 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Petr Schreiber

All examples in this thread have been tuned to work with current OpenCL implementations

Dear friends,

this forum is focused on cutting edge technology, and OpenCL is one of such a tools.
After terrible complications I finally managed to prepare first PowerBASIC OpenCL example, allowing some basic operations on the arrays.

I think combination of PB and GPU power can be very strong tool for many tasks, so I hope you will like it and have use for it.
The headers are still work-in-progress, so be careful :)

As OpenCL is very recent technology, you will need GeForce 8 (3 years old card) or better + latest drivers.
For Radeon cards, you need to download their StreamSDK and extract atiocl.dll + rename it to OpenCL.DLL. They do not provide the DLL with drivers yet.


Petr


'
' Sum of two vectors calculated on GPU
' [!] GeForce 8 and newer, or Radeon HD 4xxx and newer with OpenCL enabled
' [!] drivers REQUIRED
'
' Code could be further optimized by separating the JIT compiling part
'
' Petr Schreiber, José Roca, 2010
' USE AT OWN RISK
'
' Code heavily based on fantastic C code by Yuan Wan:
' https://www.wiki.ed.ac.uk/display/ecdfwiki/OpenCL+quick+start
'

#INCLUDE "cl.inc"

'%TRUE = -1
'%FALSE=  0

' -- Do you want to do error checking?
%USE_ERROR_CHECKING = 1   ' %TRUE

' -- Operation equates
%OPERATION_SUM      = 1
%OPERATION_SUBTRACT = 2
%OPERATION_MULTIPLY = 3
%OPERATION_DIVIDE   = 4

' -- How many element will the array have
%ELEMENTS = 20

FUNCTION PBMAIN()

 DIM InitialData1(%ELEMENTS-1) AS LONG
 DIM InitialData2(%ELEMENTS-1) AS LONG
 DIM CombineDataX(%ELEMENTS-1) AS LONG

 ARRAY ASSIGN InitialData1() = 37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17
 ARRAY ASSIGN InitialData2() = 35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15

 OpenCL_CombineArraysLONG(%OPERATION_SUM, InitialData1(), InitialData2(), CombineDataX())

 LOCAL sOutput AS STRING, c AS LONG

 FOR c = 0 TO %ELEMENTS-1
   sOutput += FORMAT$(CombineDataX(c))+", "
 NEXT

 sOutput = RTRIM$(sOutput, ", ")

 MSGBOX sOutput, %MB_ICONINFORMATION OR %MB_OK, "OpenCL calculation"

 FUNCTION = 0
END FUNCTION

FUNCTION OpenCL_CombineArraysLONG(operation AS LONG, arrayA() AS LONG, arrayB() AS LONG, destinationArrayC() AS LONG) AS LONG
 #IF %USE_ERROR_CHECKING
   IF operation < %OPERATION_SUM OR operation > %OPERATION_DIVIDE THEN
     MSGBOX "Invalid operation"
     EXIT FUNCTION
   END IF

   IF UBOUND(arrayA) <> UBOUND(arrayB) OR UBOUND(arrayA) <> UBOUND(destinationArrayC) THEN
     MSGBOX "The number of dimensions of arrays must match"
     EXIT FUNCTION
   END IF

   IF LBOUND(arrayA) <> 0 OR LBOUND(arrayB) <> 0 OR LBOUND(destinationArrayC) <> 0 THEN
     MSGBOX "The lBound must be zero"
     EXIT FUNCTION
   END IF
 #ENDIF

 REGISTER c AS LONG

 LOCAL errorCL AS LONG   ' type_cl_int
 LOCAL OpenCLSource AS STRING
 LOCAL uBoundOfArrays AS LONG
 LOCAL lSizeOfArrays AS LONG

 uBoundOfArrays = UBOUND(destinationArrayC)
 lSizeOfArrays  = uBoundOfArrays + 1

 OpenCLSource = "__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)"+$CRLF+ _
                "{                                                                        "+$CRLF+ _
                "  // Index of the elements to add                                        "+$CRLF+ _
                "  unsigned int n = get_global_id(0);                                     "+$CRLF+ _
                "                                                                         "+$CRLF+ _
                "  // Sum the n'th element of vectors a and b and store in c              "+$CRLF+ _
                "  c[n] = a[n] "+CHOOSE$(operation, "+", "-", "*", "/")+" b[n];           "+$CRLF+ _
                "}                                                                        "

 ' Two integer source vectors in Host memory
 DIM HostVector1(uBoundOfArrays) AS LONG
 DIM HostVector2(uBoundOfArrays) AS LONG

 ' Initialize with some interesting repeating data
 FOR c = 0 TO uBoundOfArrays
     HostVector1(c) = arrayA(c)
     HostVector2(c) = arrayB(c)
 NEXT

 LOCAL platforms AS DWORD ' cl_platform_id
 LOCAL platformsFound AS DWORD

 errorCL = clGetPlatformIDs(1, platforms, platformsFound)
 #IF %USE_ERROR_CHECKING
   IF (errorCL <> %CL_SUCCESS) THEN
     MSGBOX "Failed to enumerate platforms!" +  openCL_DecodeError(errorCL)
   END IF
 #ENDIF

 LOCAL numDevices AS DWORD
 LOCAL device_ID AS DWORD
 errorCL = clGetDeviceIDs(platforms, %CL_DEVICE_TYPE_GPU, 1, device_id, numDevices)
 #IF %USE_ERROR_CHECKING
   IF (errorCL <> %CL_SUCCESS) THEN
     MSGBOX "Failed to create a device group!" +  openCL_DecodeError(errorCL)
   END IF
 #ENDIF

 ' Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
 LOCAL GPUContext AS DWORD ' cl_context
 GPUContext = clCreateContext(0, 1, device_id, BYVAL 0, BYVAL 0, errorCL)
 #IF %USE_ERROR_CHECKING
   IF (errorCL <> %CL_SUCCESS) THEN
     MSGBOX "Failed to create a context!" +  openCL_DecodeError(errorCL)
   END IF
 #ENDIF

 ' Get the list of GPU devices associated with this context
 LOCAL ParmDataBytes AS DWORD
 clGetContextInfo(GPUContext, %CL_CONTEXT_DEVICES, 0, BYVAL 0, ParmDataBytes)
 #IF %USE_ERROR_CHECKING
   IF ParmDataBytes = 0 THEN MSGBOX "clGetContextInfo: Incorrect size of devices" : EXIT FUNCTION
 #ENDIF


 LOCAL GPUDevices AS DWORD PTR   ' type_cl_device_id PTR
 GPUDevices = Memory_Alloc(ParmDataBytes)
'  clGetContextInfo(GPUContext, %CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, BYVAL 0)
 clGetContextInfo(GPUContext, %CL_CONTEXT_DEVICES, ParmDataBytes, BYVAL GPUDevices, BYVAL 0)
 #IF %USE_ERROR_CHECKING
   IF GPUDevices = 0 THEN MSGBOX "clGetContextInfo: Zero GPUDevices" : EXIT FUNCTION
 #ENDIF


 ' Create a command-queue on the first GPU device
 LOCAL GPUCommandQueue AS DWORD   ' type_cl_command_queue
 GPUCommandQueue = clCreateCommandQueue(GPUContext, @GPUDevices, 0, errorCL)
 #IF %USE_ERROR_CHECKING
   IF errorCL THEN MSGBOX "clCreateCommandQueue: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
 #ENDIF


 ' Allocate GPU memory for source vectors AND initialize from CPU memory
 LOCAL GPUVector1 AS DWORD   ' type_cl_mem
 LOCAL GPUVector2 AS DWORD   ' type_cl_mem
 GPUVector1 = clCreateBuffer(GPUContext, %CL_MEM_READ_ONLY OR %CL_MEM_COPY_HOST_PTR, 4 * lSizeOfArrays, HostVector1(0), errorCL)
 #IF %USE_ERROR_CHECKING
   IF errorCL THEN MSGBOX "GPUVector1: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
 #ENDIF

 GPUVector2 = clCreateBuffer(GPUContext, %CL_MEM_READ_ONLY OR %CL_MEM_COPY_HOST_PTR, 4 * lSizeOfArrays, HostVector2(0), errorCL)
 #IF %USE_ERROR_CHECKING
   IF errorCL THEN MSGBOX "GPUVector2: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
 #ENDIF


 ' Allocate output memory on GPU
 LOCAL GPUOutputVector AS DWORD   ' type_cl_mem
 GPUOutputVector = clCreateBuffer(GPUContext, %CL_MEM_WRITE_ONLY, 4 * lSizeOfArrays, BYVAL 0, errorCL)
 #IF %USE_ERROR_CHECKING
   IF errorCL THEN MSGBOX "GPUOutputVector: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
 #ENDIF


 ' Create OpenCL program with source code
 LOCAL OpenCLProgram AS DWORD   ' type_cl_program
 OpenCLProgram = clCreateProgramWithSource(GPUContext, 1, OpenCLSource, 0, errorCL)
 #IF %USE_ERROR_CHECKING
   IF errorCL THEN MSGBOX "clCreateProgramWithSource: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
 #ENDIF

 ' Build the program (OpenCL JIT compilation)
 errorCL = clBuildProgram(OpenCLProgram, 0, BYVAL 0, BYVAL 0,  0, BYVAL 0)
 #IF %USE_ERROR_CHECKING
   IF errorCL THEN MSGBOX "clBuildProgram: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
 #ENDIF

 ' Create a handle to the compiled OpenCL function (Kernel)
 LOCAL OpenCLVectorAdd AS DWORD   ' type_cl_kernel
 LOCAL kernelName AS STRING
 kernelName = "VectorAdd"
 OpenCLVectorAdd = clCreateKernel(OpenCLProgram, BYCOPY kernelName, errorCL)
 #IF %USE_ERROR_CHECKING
   IF errorCL THEN MSGBOX "clCreateKernel: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
 #ENDIF

 ' In the next step we associate the GPU memory with the Kernel arguments
 ' This is basically setting the parameters for kernel function
 clSetKernelArg(OpenCLVectorAdd, 0, 4, GPUOutputVector)
 clSetKernelArg(OpenCLVectorAdd, 1, 4, GPUVector1)
 clSetKernelArg(OpenCLVectorAdd, 2, 4, GPUVector2)

 ' Launch the Kernel on the GPU
 DIM WorkSize AS DWORD ' one dimensional Range
 WorkSize = lSizeOfArrays
 clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, BYVAL 0, WorkSize, BYVAL 0, 0, BYVAL 0, BYVAL 0)

 ' Copy the output in GPU memory back to CPU memory
 clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, %CL_TRUE, 0, 4 * lSizeOfArrays, destinationArrayC(0), 0, BYVAL 0, BYVAL 0)

 ' Cleanup
 Memory_Free(GPUDevices)
 clReleaseKernel(OpenCLVectorAdd)
 clReleaseProgram(OpenCLProgram)
 clReleaseCommandQueue(GPUCommandQueue)
 clReleaseContext(GPUContext)
 clReleaseMemObject(GPUVector1)
 clReleaseMemObject(GPUVector2)
 clReleaseMemObject(GPUOutputVector)

 FUNCTION = 0

END FUNCTION

FUNCTION OpenCL_DecodeError(eCode AS LONG) AS STRING
 SELECT CASE eCode
   CASE 0
     FUNCTION = "No problem..."
   CASE -30
     FUNCTION = "CL_INVALID_VALUE"
   CASE -31
     FUNCTION = "CL_INVALID_DEVICE_TYPE"
   CASE -32
     FUNCTION = "CL_INVALID_PLATFORM"
   CASE -33
     FUNCTION = "CL_INVALID_DEVICE"
   CASE -34
     FUNCTION = "CL_INVALID_CONTEXT"
   CASE -35
     FUNCTION = "CL_INVALID_QUEUE_PROPERTIES"
   CASE -36
     FUNCTION = "CL_INVALID_COMMAND_QUEUE"
   CASE -37
     FUNCTION = "CL_INVALID_HOST_PTR"
   CASE -38
     FUNCTION = "CL_INVALID_MEM_OBJECT"
   CASE -39
     FUNCTION = "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR"
   CASE -40
     FUNCTION = "CL_INVALID_IMAGE_SIZE"
   CASE -41
     FUNCTION = "CL_INVALID_SAMPLER"
   CASE -42
     FUNCTION = "CL_INVALID_BINARY"
   CASE -43
     FUNCTION = "CL_INVALID_BUILD_OPTIONS"
   CASE -44
     FUNCTION = "CL_INVALID_PROGRAM"
   CASE -45
     FUNCTION = "CL_INVALID_PROGRAM_EXECUTABLE"
   CASE -46
     FUNCTION = "CL_INVALID_KERNEL_NAME"
   CASE -47
     FUNCTION = "CL_INVALID_KERNEL_DEFINITION"
   CASE -48
     FUNCTION = "CL_INVALID_KERNEL"
   CASE -49
     FUNCTION = "CL_INVALID_ARG_INDEX"
   CASE -50
     FUNCTION = "CL_INVALID_ARG_VALUE"
   CASE -51
     FUNCTION = "CL_INVALID_ARG_SIZE"
   CASE -52
     FUNCTION = "CL_INVALID_KERNEL_ARGS"
   CASE -53
     FUNCTION = "CL_INVALID_WORK_DIMENSION"
   CASE -54
     FUNCTION = "CL_INVALID_WORK_GROUP_SIZE"
   CASE -55
     FUNCTION = "CL_INVALID_WORK_ITEM_SIZE"
   CASE -56
     FUNCTION = "CL_INVALID_GLOBAL_OFFSET"
   CASE -57
     FUNCTION = "CL_INVALID_EVENT_WAIT_LIST"
   CASE -58
     FUNCTION = "CL_INVALID_EVENT"
   CASE -59
     FUNCTION = "CL_INVALID_OPERATION"
   CASE -60
     FUNCTION = "CL_INVALID_GL_OBJECT"
   CASE -61
     FUNCTION = "CL_INVALID_BUFFER_SIZE"
   CASE -62
     FUNCTION = "CL_INVALID_MIP_LEVEL"
   CASE -63
     FUNCTION = "CL_INVALID_GLOBAL_WORK_SIZE"
   CASE ELSE
     FUNCTION = "[!] Unknown error:"+STR$(eCode)
 END SELECT

END FUNCTION

FUNCTION Memory_Alloc(numberOfBytes AS LONG) AS DWORD

 LOCAL dwordPointer AS DWORD
 GLOBALMEM ALLOC numberOfBytes TO dwordPointer

 FUNCTION = dwordPointer

END FUNCTION

FUNCTION Memory_Free(dwordPointer AS DWORD) AS DWORD
 GLOBALMEM FREE dwordPointer TO dwordPointer

 FUNCTION = dwordPointer
END FUNCTION

AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Petr Schreiber

#1
There is no limitation to do just integer math, here is code listing for +-*/ floating point (SINGLE) operations:

'
' Operation on two vectors calculated on GPU
' [!] GeForce 8 and newer, or Radeon HD 4xxx and newer with OpenCL enabled
' [!] drivers REQUIRED
'
' Code could be further optimized by separating the JIT compiling part
'
' Petr Schreiber, José Roca, 2010
' USE AT OWN RISK
'
' Code heavily based on fantastic C code by Yuan Wan:
' https://www.wiki.ed.ac.uk/display/ecdfwiki/OpenCL+quick+start
'

#INCLUDE "cl.inc"

%TRUE = -1
%FALSE=  0

' -- Do you want to do error checking?
%USE_ERROR_CHECKING = %TRUE

' -- Operation equates
%OPERATION_SUM      = 1
%OPERATION_SUBTRACT = 2
%OPERATION_MULTIPLY = 3
%OPERATION_DIVIDE   = 4
MACRO OPERATION_DECODE(operation) = CHOOSE$(operation, "+", "-", "*", "/")

' -- How many element will the array have
%ELEMENTS = 9

FUNCTION PBMAIN()

  DIM InputVectorA(%ELEMENTS-1) AS SINGLE
  DIM InputVectorB(%ELEMENTS-1) AS SINGLE
  DIM OutputVector(%ELEMENTS-1) AS SINGLE

  ARRAY ASSIGN InputVectorA() = 1, 2, 3, 4, 5, 6, 7, 8, 9
  ARRAY ASSIGN InputVectorB() = 9, 8, 7, 6, 5, 4, 3, 2, 1

  LOCAL op AS LONG

  FOR op = %OPERATION_SUM TO %OPERATION_DIVIDE
    OpenCL_CombineArraysSINGLE(op, OutputVector(), InputVectorA(), InputVectorB())
    MSGBOX FORMAT_CombineArraysRESULT(op, OutputVector(), InputVectorA(), InputVectorB()), %MB_ICONINFORMATION OR %MB_OK, "OpenCL calculation"
  NEXT


  FUNCTION = 0
END FUNCTION

FUNCTION OpenCL_CombineArraysSINGLE(operation AS LONG, destinationArrayC() AS SINGLE, arrayA() AS SINGLE, arrayB() AS SINGLE) AS LONG
  #IF %USE_ERROR_CHECKING
    IF operation < %OPERATION_SUM OR operation > %OPERATION_DIVIDE THEN
      MSGBOX "Invalid operation"
      EXIT FUNCTION
    END IF

    IF UBOUND(arrayA) <> UBOUND(arrayB) OR UBOUND(arrayA) <> UBOUND(destinationArrayC) THEN
      MSGBOX "The number of dimensions of arrays must match"
      EXIT FUNCTION
    END IF

    IF LBOUND(arrayA) <> 0 OR LBOUND(arrayB) <> 0 OR LBOUND(destinationArrayC) <> 0 THEN
      MSGBOX "The lBound must be zero"
      EXIT FUNCTION
    END IF
  #ENDIF

  REGISTER c AS LONG

  LOCAL errorCL AS LONG
  LOCAL OpenCLSource AS STRING
  LOCAL uBoundOfArrays AS LONG
  LOCAL lSizeOfArrays AS LONG

  uBoundOfArrays = UBOUND(destinationArrayC)
  lSizeOfArrays  = uBoundOfArrays + 1

  OpenCLSource = "__kernel void VectorAdd(__global float* c, __global float* a,__global float* b)"+$CRLF+ _
                 "{                                                                              "+$CRLF+ _
                 "  // Index of the elements to add                                              "+$CRLF+ _
                 "  unsigned int n = get_global_id(0);                                           "+$CRLF+ _
                 "                                                                               "+$CRLF+ _
                 "  // Sum the n'th element of vectors a and b and store in c                    "+$CRLF+ _
                 "  c[n] = a[n] "+OPERATION_DECODE(operation)+" b[n];                            "+$CRLF+ _
                 "}                                                                              "

  LOCAL platforms AS DWORD ' %CL_platform_id
  LOCAL platformsFound AS DWORD

  errorCL = clGetPlatformIDs(1, platforms, platformsFound)
  #IF %USE_ERROR_CHECKING
    IF (errorCL <> %CL_SUCCESS) THEN
      MSGBOX "Failed to enumerate platforms!" +  openCL_DecodeError(errorCL)
    END IF
  #ENDIF

  LOCAL numDevices AS DWORD
  LOCAL device_ID AS DWORD
  errorCL = clGetDeviceIDs(platforms, %CL_DEVICE_TYPE_GPU, 1, device_id, numDevices)
  #IF %USE_ERROR_CHECKING
    IF (errorCL <> %CL_SUCCESS) THEN
      MSGBOX "Failed to create a device group!" +  openCL_DecodeError(errorCL)
    END IF
  #ENDIF

  ' Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
  LOCAL GPUContext AS DWORD ' %CL_context
  GPUContext = clCreateContext(0, 1, device_id, BYVAL 0, BYVAL 0, errorCL)
  #IF %USE_ERROR_CHECKING
    IF (errorCL <> %CL_SUCCESS) THEN
      MSGBOX "Failed to create a context!" +  openCL_DecodeError(errorCL)
    END IF
  #ENDIF

  ' Get the list of GPU devices associated with this context
  LOCAL ParmDataBytes AS DWORD
  clGetContextInfo(GPUContext, %CL_CONTEXT_DEVICES, 0, BYVAL 0, ParmDataBytes)
  #IF %USE_ERROR_CHECKING
    IF ParmDataBytes = 0 THEN MSGBOX "clGetContextInfo: Incorrect size of devices" : EXIT FUNCTION
  #ENDIF


  LOCAL GPUDevices AS DWORD PTR   ' type_cl_device_id PTR
  GPUDevices = Memory_Alloc(ParmDataBytes)
'  clGetContextInfo(GPUContext, %CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, BYVAL 0)
  clGetContextInfo(GPUContext, %CL_CONTEXT_DEVICES, ParmDataBytes, BYVAL GPUDevices, BYVAL 0)
  #IF %USE_ERROR_CHECKING
    IF GPUDevices = 0 THEN MSGBOX "clGetContextInfo: Zero GPUDevices" : EXIT FUNCTION
  #ENDIF


  ' Create a command-queue on the first GPU device
  LOCAL GPUCommandQueue AS DWORD   ' type_cl_command_queue
  GPUCommandQueue = clCreateCommandQueue(GPUContext, @GPUDevices, 0, errorCL)
  #IF %USE_ERROR_CHECKING
    IF errorCL THEN MSGBOX "clCreateCommandQueue: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF
         

  ' Allocate GPU memory for source vectors AND initialize from CPU memory
  LOCAL GPUVector1 AS DWORD 'type_cl_mem
  LOCAL GPUVector2 AS DWORD 'type_cl_mem
  GPUVector1 = clCreateBuffer(GPUContext, %CL_MEM_READ_ONLY OR %CL_MEM_COPY_HOST_PTR, 4 * lSizeOfArrays, arrayA(0), errorCL)
  #IF %USE_ERROR_CHECKING
    IF errorCL THEN MSGBOX "GPUVector1: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF

  GPUVector2 = clCreateBuffer(GPUContext, %CL_MEM_READ_ONLY OR %CL_MEM_COPY_HOST_PTR, 4 * lSizeOfArrays, arrayB(0), errorCL)
  #IF %USE_ERROR_CHECKING
    IF errorCL THEN MSGBOX "GPUVector2: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF


  ' Allocate output memory on GPU
  LOCAL GPUOutputVector AS DWORD'type_cl_mem
  GPUOutputVector = clCreateBuffer(GPUContext, %CL_MEM_WRITE_ONLY, 4 * lSizeOfArrays, BYVAL 0, errorCL)
  #IF %USE_ERROR_CHECKING
    IF errorCL THEN MSGBOX "GPUOutputVector: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF


  ' Create OpenCL program with source code
  LOCAL OpenCLProgram AS DWORD 'type_cl_program
  OpenCLProgram = clCreateProgramWithSource(GPUContext, 1, OpenCLSource, 0, errorCL)  ' vptr
  #IF %USE_ERROR_CHECKING
    IF errorCL THEN MSGBOX "clCreateProgramWithSource: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF

  ' Build the program (OpenCL JIT compilation)
  errorCL = clBuildProgram(OpenCLProgram, 0, BYVAL 0, BYVAL 0,  0, BYVAL 0)
  #IF %USE_ERROR_CHECKING
    IF errorCL THEN MSGBOX "clBuildProgram: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF

  ' Create a handle to the compiled OpenCL function (Kernel)
  LOCAL OpenCLVectorAdd AS DWORD 'type_cl_kernel
  LOCAL kernelName AS STRING
  kernelName = "VectorAdd"
  OpenCLVectorAdd = clCreateKernel(OpenCLProgram, BYCOPY kernelName, errorCL)   ' -- jenom 1x char
  #IF %USE_ERROR_CHECKING
    IF errorCL THEN MSGBOX "clCreateKernel: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF

  ' In the next step we associate the GPU memory with the Kernel arguments
  ' This is basically setting the parameters for kernel function
  clSetKernelArg(OpenCLVectorAdd, 0, 4, GPUOutputVector)
  clSetKernelArg(OpenCLVectorAdd, 1, 4, GPUVector1)
  clSetKernelArg(OpenCLVectorAdd, 2, 4, GPUVector2)

  ' Launch the Kernel on the GPU
  DIM WorkSize AS DWORD ' one dimensional Range
  WorkSize = lSizeOfArrays
  clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, BYVAL 0, WorkSize, BYVAL 0, 0, BYVAL 0, BYVAL 0)

  ' Copy the output in GPU memory back to CPU memory
  clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, %CL_TRUE, 0, 4 * lSizeOfArrays, destinationArrayC(0), 0, BYVAL 0, BYVAL 0)

  ' Cleanup
  Memory_Free(GPUDevices)
  clReleaseKernel(OpenCLVectorAdd)
  clReleaseProgram(OpenCLProgram)
  clReleaseCommandQueue(GPUCommandQueue)
  clReleaseContext(GPUContext)
  clReleaseMemObject(GPUVector1)
  clReleaseMemObject(GPUVector2)
  clReleaseMemObject(GPUOutputVector)

  FUNCTION = 0

END FUNCTION

FUNCTION OpenCL_DecodeError(eCode AS LONG) AS STRING
  SELECT CASE eCode
    CASE 0
      FUNCTION = "No problem..."
    CASE -30
      FUNCTION = "CL_INVALID_VALUE"
    CASE -31
      FUNCTION = "CL_INVALID_DEVICE_TYPE"
    CASE -32
      FUNCTION = "CL_INVALID_PLATFORM"
    CASE -33
      FUNCTION = "CL_INVALID_DEVICE"
    CASE -34
      FUNCTION = "CL_INVALID_CONTEXT"
    CASE -35
      FUNCTION = "CL_INVALID_QUEUE_PROPERTIES"
    CASE -36
      FUNCTION = "CL_INVALID_COMMAND_QUEUE"
    CASE -37
      FUNCTION = "CL_INVALID_HOST_PTR"
    CASE -38
      FUNCTION = "CL_INVALID_MEM_OBJECT"
    CASE -39
      FUNCTION = "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR"
    CASE -40
      FUNCTION = "CL_INVALID_IMAGE_SIZE"
    CASE -41
      FUNCTION = "CL_INVALID_SAMPLER"
    CASE -42
      FUNCTION = "CL_INVALID_BINARY"
    CASE -43
      FUNCTION = "CL_INVALID_BUILD_OPTIONS"
    CASE -44
      FUNCTION = "CL_INVALID_PROGRAM"
    CASE -45
      FUNCTION = "CL_INVALID_PROGRAM_EXECUTABLE"
    CASE -46
      FUNCTION = "CL_INVALID_KERNEL_NAME"
    CASE -47
      FUNCTION = "CL_INVALID_KERNEL_DEFINITION"
    CASE -48
      FUNCTION = "CL_INVALID_KERNEL"
    CASE -49
      FUNCTION = "CL_INVALID_ARG_INDEX"
    CASE -50
      FUNCTION = "CL_INVALID_ARG_VALUE"
    CASE -51
      FUNCTION = "CL_INVALID_ARG_SIZE"
    CASE -52
      FUNCTION = "CL_INVALID_KERNEL_ARGS"
    CASE -53
      FUNCTION = "CL_INVALID_WORK_DIMENSION"
    CASE -54
      FUNCTION = "CL_INVALID_WORK_GROUP_SIZE"
    CASE -55
      FUNCTION = "CL_INVALID_WORK_ITEM_SIZE"
    CASE -56
      FUNCTION = "CL_INVALID_GLOBAL_OFFSET"
    CASE -57
      FUNCTION = "CL_INVALID_EVENT_WAIT_LIST"
    CASE -58
      FUNCTION = "CL_INVALID_EVENT"
    CASE -59
      FUNCTION = "CL_INVALID_OPERATION"
    CASE -60
      FUNCTION = "CL_INVALID_GL_OBJECT"
    CASE -61
      FUNCTION = "CL_INVALID_BUFFER_SIZE"
    CASE -62
      FUNCTION = "CL_INVALID_MIP_LEVEL"
    CASE -63
      FUNCTION = "CL_INVALID_GLOBAL_WORK_SIZE"
    CASE ELSE
      FUNCTION = "[!] Unknown error:"+STR$(eCode)
  END SELECT

END FUNCTION

FUNCTION FORMAT_CombineArraysRESULT(operation AS LONG, OutputVector() AS SINGLE, InputVectorA() AS SINGLE, InputVectorB() AS SINGLE) AS STRING
  REGISTER c AS LONG
  LOCAL sOutput AS STRING

  FOR c = 0 TO %ELEMENTS-1
    sOutput += FORMAT$(InputVectorA(c))+$TAB
  NEXT
  sOutput = RTRIM$(sOutput, $TAB)+$CRLF+OPERATION_DECODE(operation)+$CRLF
  FOR c = 0 TO %ELEMENTS-1
    sOutput += FORMAT$(InputVectorB(c))+$TAB
  NEXT
  sOutput = RTRIM$(sOutput, $TAB)+$CRLF+"="+$CRLF
  FOR c = 0 TO %ELEMENTS-1
    sOutput += FORMAT$(OutputVector(c))+$TAB
  NEXT
  sOutput = RTRIM$(sOutput, $TAB)

  FUNCTION = sOutput
END FUNCTION

FUNCTION Memory_Alloc(numberOfBytes AS LONG) AS DWORD

  LOCAL dwordPointer AS DWORD
  GLOBALMEM ALLOC numberOfBytes TO dwordPointer

  FUNCTION = dwordPointer

END FUNCTION

FUNCTION Memory_Free(dwordPointer AS DWORD) AS DWORD
  GLOBALMEM FREE dwordPointer TO dwordPointer

  FUNCTION = dwordPointer
END FUNCTION


It is also possible to do calculations with DOUBLEs, but for this you would need really cutting edge hardware, while this example runs on anything OpenCL enabled.
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Petr Schreiber

#2
This version has already separated building from the main function part + added GPU vs CPU comparison.
Interesting is that PB (on 1800 MHz CPU) still beats the GPU code on GeForce 9500GT (lower end card, 500 MHz GPU):

'
' Operation on two vectors calculated on GPU and CPU
' [!] GeForce 8 and newer, or Radeon HD 4xxx and newer with OpenCL enabled
' [!] drivers REQUIRED
'
' Code already optimized by separating the JIT compiling part
'
' Petr Schreiber, José Roca, 2010
' USE AT OWN RISK
'
' Code based on fantastic C code by Yuan Wan:
' https://www.wiki.ed.ac.uk/display/ecdfwiki/OpenCL+quick+start
'

#INCLUDE "cl.inc"

' -- Time measurement
GLOBAL t1, t2 AS DWORD
DECLARE FUNCTION GetTickCount LIB "KERNEL32.DLL" ALIAS "GetTickCount" () AS DWORD

%TRUE = -1
%FALSE=  0

' -- Do you want to do error checking?
%OPENCL_ERROR_CHECKING = %FALSE

' -- Operation equates
%OPERATION_SUM      = 1
%OPERATION_SUBTRACT = 2
%OPERATION_MULTIPLY = 3
%OPERATION_DIVIDE   = 4
MACRO OPERATION_DECODE(operation) = CHOOSE$(operation, "+", "-", "*", "/")
MACRO OPERATION_DECODE_FUNC(operation) = CHOOSE$(operation, "VectorSum", "VectorSub", "VectorMul", "VectorDiv")

' -- How many element will the array have
%ELEMENTS = 10000000
FUNCTION PBMAIN()

  MSGBOX "Press OK when ready to test", %MB_ICONINFORMATION OR %MB_OK, "OpenCL calculation"
  DIM InputVectorA(%ELEMENTS-1) AS SINGLE
  DIM InputVectorB(%ELEMENTS-1) AS SINGLE
  DIM OutputVector(%ELEMENTS-1) AS SINGLE

  REGISTER i AS LONG
  FOR i = 0 TO %ELEMENTS-1
    InputVectorA(i) = INT(1+RND*10)'1'rnd*80000
    InputVectorB(i) = INT(1+RND*10)'2'RND*80000
  NEXT

  LOCAL op AS LONG
  LOCAL Results AS STRING
  OpenCL_Allocate()

  FOR op = %OPERATION_SUM TO %OPERATION_DIVIDE
    Results += "TEST "+OPERATION_DECODE_FUNC(op)+$CRLF
    t1 = GetTickCount
      OpenCL_CombineArraysSINGLE(op, OutputVector(), InputVectorA(), InputVectorB())
    t2 = GetTickCount

    Results += "GPU time elapsed:"+$TAB+FORMAT$((t2-t1)/1000, "#.000")+$CRLF

    t1 = GetTickCount
      PB_CombineArraysSINGLE(op, OutputVector(), InputVectorA(), InputVectorB())
    t2 = GetTickCount
    Results += "CPU time elapsed:"+$TAB+FORMAT$((t2-t1)/1000, "#.000")+$CRLF
    Results += $CRLF+$CRLF
  NEXT
  Results = TRIM$(Results, $CRLF)

  CLIPBOARD RESET
  CLIPBOARD SET TEXT Results
  Results += $CRLF+$CRLF+"(Results available in clipboard as well)"

  MSGBOX TRIM$(Results, $CRLF), %MB_ICONINFORMATION OR %MB_OK, "OpenCL calculation"
  OpenCL_Release()

  FUNCTION = 0
END FUNCTION

TYPE OpenCL_Info
  GPUContext AS DWORD
  GPUDevices AS DWORD PTR

  Program(4) AS DWORD
  Kernel (4) AS DWORD
  Queue  (4) AS DWORD
END TYPE

GLOBAL OpenCL AS OpenCL_Info

FUNCTION OpenCL_Allocate() AS LONG
  LOCAL errorCL AS LONG
  LOCAL platforms AS DWORD ' cl_platform_id
  LOCAL platformsFound AS DWORD

  errorCL = clGetPlatformIDs(1, platforms, platformsFound)
  #IF %OPENCL_ERROR_CHECKING
    IF (errorCL <> %CL_SUCCESS) THEN
      MSGBOX "Failed to enumerate platforms!" +  openCL_DecodeError(errorCL)
    END IF
  #ENDIF

  LOCAL numDevices AS DWORD
  LOCAL device_ID AS DWORD
  errorCL = clGetDeviceIDs(platforms, %CL_DEVICE_TYPE_GPU, 1, device_id, numDevices)
  #IF %OPENCL_ERROR_CHECKING
    IF (errorCL <> %CL_SUCCESS) THEN
      MSGBOX "Failed to create a device group!" +  openCL_DecodeError(errorCL)
    END IF
  #ENDIF

  ' Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
  OpenCL.GPUContext = clCreateContext(0, 1, device_id, BYVAL 0, BYVAL 0, errorCL)
  #IF %OPENCL_ERROR_CHECKING
    IF (errorCL <> %CL_SUCCESS) THEN
      MSGBOX "Failed to create a context!" +  openCL_DecodeError(errorCL)
    END IF
  #ENDIF

  ' Get the list of GPU devices associated with this context
  LOCAL ParmDataBytes AS DWORD
  clGetContextInfo(OpenCL.GPUContext, %CL_CONTEXT_DEVICES, 0, BYVAL 0, ParmDataBytes)
  #IF %OPENCL_ERROR_CHECKING
    IF ParmDataBytes = 0 THEN MSGBOX "clGetContextInfo: Incorrect size of devices" : EXIT FUNCTION
  #ENDIF


  OpenCL.GPUDevices = Memory_Alloc(ParmDataBytes)
'  clGetContextInfo(GPUContext, %CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, BYVAL 0)
  clGetContextInfo(OpenCL.GPUContext, %CL_CONTEXT_DEVICES, ParmDataBytes, BYVAL OpenCL.GPUDevices, BYVAL 0)
  #IF %OPENCL_ERROR_CHECKING
    IF GPUDevices = 0 THEN MSGBOX "clGetContextInfo: Zero GPUDevices" : EXIT FUNCTION
  #ENDIF


 
  DIM OpenCLSource(%OPERATION_SUM TO %OPERATION_DIVIDE) AS STRING

  LOCAL op AS LONG
  DIM kernelName(%OPERATION_SUM TO %OPERATION_DIVIDE) AS STRING


  FOR op = %OPERATION_SUM TO %OPERATION_DIVIDE
    ' Create a command-queue on the first GPU device
    OpenCL.Queue(op) = clCreateCommandQueue(OpenCL.GPUContext, OpenCL.@GPUDevices, 0, errorCL)
    #IF %OPENCL_ERROR_CHECKING
      IF errorCL THEN MSGBOX "clCreateCommandQueue: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
    #ENDIF
  NEXT

  FOR op = %OPERATION_SUM TO %OPERATION_DIVIDE

    OpenCLSource(op) = "__kernel void "+OPERATION_DECODE_FUNC(op)+"(__global float* c, __global float* a,__global float* b)"+$CRLF+ _
                       "{                                                                              "+$CRLF+ _
                       "  // Index of the elements to add                                              "+$CRLF+ _
                       "  unsigned int n = get_global_id(0);                                           "+$CRLF+ _
                       "                                                                               "+$CRLF+ _
                       "  // Sum the n'th element of vectors a and b and store in c                    "+$CRLF+ _
                       "  c[n] = a[n] "+OPERATION_DECODE(op)+" b[n];                                   "+$CRLF+ _
                       "}                                                                              "

    ' Create OpenCL program with source code
    LOCAL OpenCLProgram AS DWORD
    OpenCL.Program(op) = clCreateProgramWithSource(OpenCL.GPUContext, 1, OpenCLSource(op), 0, errorCL)  ' vptr
    #IF %OPENCL_ERROR_CHECKING
      IF errorCL THEN MSGBOX "clCreateProgramWithSource: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
    #ENDIF

    ' Build the program (OpenCL JIT compilation)
    errorCL = clBuildProgram(OpenCL.Program(op), 0, BYVAL 0, BYVAL 0,  0, BYVAL 0)
    #IF %OPENCL_ERROR_CHECKING
      IF errorCL THEN MSGBOX "clBuildProgram: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
    #ENDIF

    ' Create a handle to the compiled OpenCL function (Kernel)
    kernelName(op) = OPERATION_DECODE_FUNC(op)
    OpenCL.Kernel(op) = clCreateKernel(OpenCL.Program(op), BYCOPY kernelName(op), errorCL)
    #IF %OPENCL_ERROR_CHECKING
      IF errorCL THEN MSGBOX "clCreateKernel: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
    #ENDIF

  NEXT

  FUNCTION = 0

END FUNCTION

FUNCTION OpenCL_Release() AS LONG
  LOCAL i AS LONG
  FOR i = %OPERATION_SUM TO %OPERATION_DIVIDE
    clReleaseKernel(OpenCL.Kernel(i))
    clReleaseProgram(OpenCL.Program(i))
    clReleaseCommandQueue(OpenCL.Queue(i))
  NEXT
  Memory_Free(OpenCL.GPUDevices)
  clReleaseContext(OpenCL.GPUContext)

END FUNCTION

FUNCTION OpenCL_CombineArraysSINGLE(operation AS LONG, destinationArrayC() AS SINGLE, arrayA() AS SINGLE, arrayB() AS SINGLE) AS LONG
  #IF %OPENCL_ERROR_CHECKING
    IF operation < %OPERATION_SUM OR operation > %OPERATION_DIVIDE THEN
      MSGBOX "Invalid operation"
      EXIT FUNCTION
    END IF

    IF UBOUND(arrayA) <> UBOUND(arrayB) OR UBOUND(arrayA) <> UBOUND(destinationArrayC) THEN
      MSGBOX "The number of dimensions of arrays must match"
      EXIT FUNCTION
    END IF

    IF LBOUND(arrayA) <> 0 OR LBOUND(arrayB) <> 0 OR LBOUND(destinationArrayC) <> 0 THEN
      MSGBOX "The lBound must be zero"
      EXIT FUNCTION
    END IF
  #ENDIF

  LOCAL errorCL AS LONG
  LOCAL uBoundOfArrays AS LONG
  LOCAL lSizeOfArrays AS LONG

  uBoundOfArrays = UBOUND(destinationArrayC)
  lSizeOfArrays  = uBoundOfArrays + 1



  ' Allocate GPU memory for source vectors AND initialize from CPU memory
  LOCAL GPUVector1 AS DWORD 'type_cl_mem
  LOCAL GPUVector2 AS DWORD 'type_cl_mem
  GPUVector1 = clCreateBuffer(OpenCL.GPUContext, %CL_MEM_READ_ONLY OR %CL_MEM_COPY_HOST_PTR, 4 * lSizeOfArrays, arrayA(0), errorCL)
  #IF %OPENCL_ERROR_CHECKING
    IF errorCL THEN MSGBOX "GPUVector1: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF

  GPUVector2 = clCreateBuffer(OpenCL.GPUContext, %CL_MEM_READ_ONLY OR %CL_MEM_COPY_HOST_PTR, 4 * lSizeOfArrays, arrayB(0), errorCL)
  #IF %OPENCL_ERROR_CHECKING
    IF errorCL THEN MSGBOX "GPUVector2: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF

  ' Allocate output memory on GPU
  LOCAL GPUOutputVector AS DWORD 'type_cl_mem
  GPUOutputVector = clCreateBuffer(OpenCL.GPUContext, %CL_MEM_WRITE_ONLY, 4 * lSizeOfArrays, BYVAL 0, errorCL)
  #IF %OPENCL_ERROR_CHECKING
    IF errorCL THEN MSGBOX "GPUOutputVector: "+openCL_DecodeError(errorCL) : EXIT FUNCTION
  #ENDIF

  ' In the next step we associate the GPU memory with the Kernel arguments
  ' This is basically setting the parameters for kernel function
  clSetKernelArg(OpenCL.Kernel(operation), 0, 4, GPUOutputVector)
  clSetKernelArg(OpenCL.Kernel(operation), 1, 4, GPUVector1)
  clSetKernelArg(OpenCL.Kernel(operation), 2, 4, GPUVector2)

  ' Launch the Kernel on the GPU
  DIM WorkSize AS DWORD ' one dimensional Range
  WorkSize = lSizeOfArrays
  clEnqueueNDRangeKernel(OpenCL.Queue(operation), OpenCL.Kernel(operation), 1, BYVAL 0, WorkSize, BYVAL 0, 0, BYVAL 0, BYVAL 0)

  ' Copy the output in GPU memory back to CPU memory
  clEnqueueReadBuffer(OpenCL.Queue(operation), GPUOutputVector, %CL_TRUE, 0, 4 * lSizeOfArrays, destinationArrayC(0), 0, BYVAL 0, BYVAL 0)

  ' Cleanup
  clReleaseMemObject(GPUVector1)
  clReleaseMemObject(GPUVector2)
  clReleaseMemObject(GPUOutputVector)
  FUNCTION = 0

END FUNCTION

FUNCTION OpenCL_DecodeError(eCode AS LONG) AS STRING
  SELECT CASE eCode
    CASE 0
      FUNCTION = "No problem..."
    CASE -30
      FUNCTION = "CL_INVALID_VALUE"
    CASE -31
      FUNCTION = "CL_INVALID_DEVICE_TYPE"
    CASE -32
      FUNCTION = "CL_INVALID_PLATFORM"
    CASE -33
      FUNCTION = "CL_INVALID_DEVICE"
    CASE -34
      FUNCTION = "CL_INVALID_CONTEXT"
    CASE -35
      FUNCTION = "CL_INVALID_QUEUE_PROPERTIES"
    CASE -36
      FUNCTION = "CL_INVALID_COMMAND_QUEUE"
    CASE -37
      FUNCTION = "CL_INVALID_HOST_PTR"
    CASE -38
      FUNCTION = "CL_INVALID_MEM_OBJECT"
    CASE -39
      FUNCTION = "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR"
    CASE -40
      FUNCTION = "CL_INVALID_IMAGE_SIZE"
    CASE -41
      FUNCTION = "CL_INVALID_SAMPLER"
    CASE -42
      FUNCTION = "CL_INVALID_BINARY"
    CASE -43
      FUNCTION = "CL_INVALID_BUILD_OPTIONS"
    CASE -44
      FUNCTION = "CL_INVALID_PROGRAM"
    CASE -45
      FUNCTION = "CL_INVALID_PROGRAM_EXECUTABLE"
    CASE -46
      FUNCTION = "CL_INVALID_KERNEL_NAME"
    CASE -47
      FUNCTION = "CL_INVALID_KERNEL_DEFINITION"
    CASE -48
      FUNCTION = "CL_INVALID_KERNEL"
    CASE -49
      FUNCTION = "CL_INVALID_ARG_INDEX"
    CASE -50
      FUNCTION = "CL_INVALID_ARG_VALUE"
    CASE -51
      FUNCTION = "CL_INVALID_ARG_SIZE"
    CASE -52
      FUNCTION = "CL_INVALID_KERNEL_ARGS"
    CASE -53
      FUNCTION = "CL_INVALID_WORK_DIMENSION"
    CASE -54
      FUNCTION = "CL_INVALID_WORK_GROUP_SIZE"
    CASE -55
      FUNCTION = "CL_INVALID_WORK_ITEM_SIZE"
    CASE -56
      FUNCTION = "CL_INVALID_GLOBAL_OFFSET"
    CASE -57
      FUNCTION = "CL_INVALID_EVENT_WAIT_LIST"
    CASE -58
      FUNCTION = "CL_INVALID_EVENT"
    CASE -59
      FUNCTION = "CL_INVALID_OPERATION"
    CASE -60
      FUNCTION = "CL_INVALID_GL_OBJECT"
    CASE -61
      FUNCTION = "CL_INVALID_BUFFER_SIZE"
    CASE -62
      FUNCTION = "CL_INVALID_MIP_LEVEL"
    CASE -63
      FUNCTION = "CL_INVALID_GLOBAL_WORK_SIZE"
    CASE ELSE
      FUNCTION = "[!] Unknown error:"+STR$(eCode)
  END SELECT

END FUNCTION

FUNCTION FORMAT_CombineArraysRESULT(operation AS LONG, OutputVector() AS SINGLE, InputVectorA() AS SINGLE, InputVectorB() AS SINGLE) AS STRING
  LOCAL c AS LONG
  LOCAL sOutput AS STRING

  FOR c = 0 TO %ELEMENTS-1
    sOutput += FORMAT$(InputVectorA(c))+$TAB
  NEXT
  sOutput = RTRIM$(sOutput, $TAB)+$CRLF+OPERATION_DECODE(operation)+$CRLF
  FOR c = 0 TO %ELEMENTS-1
    sOutput += FORMAT$(InputVectorB(c))+$TAB
  NEXT
  sOutput = RTRIM$(sOutput, $TAB)+$CRLF+"="+$CRLF
  FOR c = 0 TO %ELEMENTS-1
    sOutput += FORMAT$(OutputVector(c))+$TAB
  NEXT
  sOutput = RTRIM$(sOutput, $TAB)

  FUNCTION = sOutput
END FUNCTION

FUNCTION Memory_Alloc(numberOfBytes AS LONG) AS DWORD

  LOCAL dwordPointer AS DWORD
  GLOBALMEM ALLOC numberOfBytes TO dwordPointer

  FUNCTION = dwordPointer

END FUNCTION

FUNCTION Memory_Free(dwordPointer AS DWORD) AS DWORD
  GLOBALMEM FREE dwordPointer TO dwordPointer

  FUNCTION = dwordPointer
END FUNCTION

' -- CPU

FUNCTION PB_CombineArraysSINGLE(operation AS LONG, OutputVector() AS SINGLE, InputVectorA() AS SINGLE, InputVectorB() AS SINGLE) AS LONG
  REGISTER i AS LONG

  #IF %OPENCL_ERROR_CHECKING
    IF operation < %OPERATION_SUM OR operation > %OPERATION_DIVIDE THEN
      MSGBOX "Invalid operation"
      EXIT FUNCTION
    END IF

    IF UBOUND(InputVectorA) <> UBOUND(InputVectorB) OR UBOUND(InputVectorA) <> UBOUND(OutputVector) THEN
      MSGBOX "The number of dimensions of arrays must match"
      EXIT FUNCTION
    END IF

    IF LBOUND(InputVectorA) <> 0 OR LBOUND(InputVectorB) <> 0 OR LBOUND(OutputVector) <> 0 THEN
      MSGBOX "The lBound must be zero"
      EXIT FUNCTION
    END IF
  #ENDIF

  LOCAL errorCL AS LONG
  LOCAL uBoundOfArrays AS LONG
  LOCAL lSizeOfArrays AS LONG

  uBoundOfArrays = UBOUND(OutputVector)
  lSizeOfArrays  = uBoundOfArrays + 1

  SELECT CASE AS LONG operation
    CASE %OPERATION_SUM
      FOR i = 0 TO uBoundOfArrays
        OutputVector(i) = InputVectorA(i) + InputVectorB(i)
      NEXT

    CASE %OPERATION_SUBTRACT
      FOR i = 0 TO uBoundOfArrays
        OutputVector(i) = InputVectorA(i) - InputVectorB(i)
      NEXT

    CASE %OPERATION_MULTIPLY
      FOR i = 0 TO uBoundOfArrays
        OutputVector(i) = InputVectorA(i) * InputVectorB(i)
      NEXT

    CASE %OPERATION_DIVIDE
      FOR i = 0 TO uBoundOfArrays
        OutputVector(i) = InputVectorA(i) / InputVectorB(i)
      NEXT
  END SELECT
END FUNCTION


Results on my PC:
Quote
TEST VectorSum
GPU time elapsed:   0.234
CPU time elapsed:   0.188


TEST VectorSub
GPU time elapsed:   0.234
CPU time elapsed:   0.172


TEST VectorMul
GPU time elapsed:   0.250
CPU time elapsed:   0.172


TEST VectorDiv
GPU time elapsed:   0.250
CPU time elapsed:   0.172

I would be interested in results retrieved on more powerful graphic card.


Petr

*Note: The execution of GPU code might be damaged by my GPU coding skills, maybe it can be done better
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Patrice Terrier

#3
Petr

See the attached screen shot for my HP HDX18 notebook computer running Windows 7 64-bit

- Processeur Dual Core2 Duo P8400 (2.26 Ghz).
- 4096 MB DDR2 SDRAM.
- nVidia GeForce 9600M GT (with 512 Mb of dedicated DDR2).
- Screen resolution Full HD 1920 x 1080.

Note: cut and paste from clipboard doesn't work.

Once again, Dual Core processors seems to play in another category ;)

...
Patrice Terrier
GDImage (advanced graphic addon)
http://www.zapsolution.com

Petr Schreiber

#4
Thank you Patrice!,

which driver release you use?

The fact is that this kind of example is not something GPU paralelism can take advantage of, as it is basically executing one line of code. But I was curious how fast/slow the conventional tasks are.

I will prepare more complex samples.

Thanks a lot,
Petr

P.S.
QuoteNote: cut and paste from clipboard doesn't work.
Do you mean CLIPBOARD SET TEXT statement not having effect?
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Patrice Terrier

QuoteDo you mean CLIPBOARD SET TEXT statement not having effect?
Yes ... no effect on Windows 7 and probably the same on VISTA.

Here is the version i am using:
nvdrivers 2.3 winvista 64 190.89-beta for notebook

...
Patrice Terrier
GDImage (advanced graphic addon)
http://www.zapsolution.com

Petr Schreiber

#6
Hi Patrice,

thanks for your information. I sent a info to support on this problem.

Patrice, do you think you could install 195.39 for notebooks once available? It offers better performance and fixes some problems.
If you do not want to install this, I understand - drivers are very delicate beasts.

Here comes next example - this time it is array of 4xsingles (classic vectors). It seems my graphic card starts to catch breath here, and it is almost as fast as PB. Let me know. I think you have advantage of superb CPU, while mine is ... working one :)

I think I start to understand how to build  "good food" for hungry GPU. Now I target to break the PB barrier ( run on GPU faster than PB can on my CPU ) :)


Petr
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Petr Schreiber

#7
Hi,

I still did not gave up on OpenCL. I noticed one pleasant thing - with ForceWare 195.55, the performance of the code posted above dramatically increased - GPU is now faster than my CPU.
The speed increase comparing to 195.39 is really nice, the calculation is about twice as fast now.

I attach the results so you can compare :)


Petr
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

José Roca

#8
 
Hi Petr,

I have translated the OpenCL headers to PowerBASIC and modified the first of your examples to suit the syntax that I have used.

If you have the time, I would like to know if it works, as I currently can't test it.

TIA

Petr Schreiber

#9
Hi José,

thank you very much.

The example didn't work, but that was "thanks" to fact latest GeForce drivers do not let you create context without properly retrieving device first (specification says this is "implementation defined").

So I modded the example for valid initialization and post it here. This way it works perfectly with your headers!

AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

José Roca

 
Thanks very much, Petr. I just wanted to ascertain that my translation was correct to incorporate them to the header's collection.

Petr Schreiber

I think the headers are fine,

I will let you know in case I will find some problem with them, but so far all seems very good.


Thanks,
Petr
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Petr Schreiber

#12
All my examples in this thread have been updated.

They don't show much speed improvement, do they? This is mostly because the examples here are trivial. With real life task, expanding to larger code than just C = A+B the speed improvement goes up.

Expect more examples during the summer, as a little tasting I attach results of implementation of PCSM method (heavy ray-obstacle calculation) implemented on CPU and GPU.
You can see modern GPU performs 60x better than solid quad core from AMD. On quad core it is OpenCL CPU implementation with heavy use of SSE using all 4 cores at 100%, on GPU it is OpenCL GPU implementation.
Notice even the low end 9500GT is still 7x faster, at fraction of price and power consumption with exactly the same code.

During my studies I realised accelerating your application using OpenCL in real life tasks can be even more efficient than using inline assembly. The main benefit is in maintainability of the code (high level syntax) and possibility to cross target various devices.
Of course, user of your application must have the appropriate GPGPU hardware -> ATi Radeon HD 4000 and newer or GeForce 8 and newer. In case of GeForce8 it is almost 4 years old metal.


Petr
AMD Sempron 3400+ | 1GB RAM @ 533MHz | GeForce 6200 / GeForce 9500GT | 32bit Windows XP SP3

psch.thinbasic.com

Aslan Babakhanov

Jose,

Can you please re-upload the cl.zip header archive? It seems broken.

Thanks!

José Roca

It no longer exists. cl.inc was incorporated in my Windows API headers package.