The overhead you are seeing when calling the @cuda.jit or @cuda.reduce function repeatedly is likely due to the compilation and setup cost incurred on each call. One way to reduce this overhead is to make sure the kernel is compiled only once, before the timed loop, and then reused for all subsequent calls.
With `@cuda.reduce` there is no public API to precompile for a specific dtype; compilation happens automatically on the first call for a given array type. The practical pattern is to "warm up" the reduction with one untimed call, so the compilation cost falls outside the timed loop. Note also that `np.random.uniform` returns a float64 array, so the `init` value should be the float64 minimum, not float32:

```python
import time

import numpy as np
from numba import cuda


@cuda.reduce
def CudaMax(a, b):
    return max(a, b)


N = 1000
A = np.random.uniform(low=-10000, high=10000, size=N)  # float64 array
A_device = cuda.to_device(A)

# Warm-up call: the first invocation compiles the kernel for float64
CudaMax(A_device, init=np.finfo(np.float64).min)

ts = time.time()
for i in range(1000):
    m = CudaMax(A_device, init=np.finfo(np.float64).min)
print("size:", N, "time:", time.time() - ts)

N = 10_000_000
A = np.random.uniform(low=-10000, high=10000, size=N)
A_device = cuda.to_device(A)

# No new warm-up needed: the kernel is already compiled for float64
ts = time.time()
for i in range(1000):
    m = CudaMax(A_device, init=np.finfo(np.float64).min)
print("size:", N, "time:", time.time() - ts)
```
By compiling the kernel once and then reusing it, you should see a reduction in the total time spent in the loop. Keep in mind that even after warm-up, each call still pays a kernel-launch and device-to-host transfer cost, which dominates for small inputs. Give it a try and see whether it improves performance with small data as well. Let me know if you have any further questions!
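For small arrays like N = 1000, that fixed per-call overhead can easily make the GPU slower than a plain CPU reduction, so it may be worth timing a NumPy baseline under the same loop. A minimal sketch (CPU-only, so it runs without a GPU; the sizes and loop count mirror the benchmark above):

```python
import time

import numpy as np

N = 1000
A = np.random.uniform(low=-10000, high=10000, size=N)

# Time 1000 CPU-side reductions for comparison with the GPU loop above
ts = time.time()
for _ in range(1000):
    m = A.max()
elapsed = time.time() - ts
print("CPU size:", N, "time:", elapsed)
```

If the CPU loop wins at this size, the overhead you are measuring is launch/transfer cost rather than compilation, and batching work or keeping data on the device longer is the more promising direction.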