Recommand · May 31, 2021

CUDA kernel skips loop iterations for a large dataset?

I have a dataset with 990002 rows. Each row contains a list of items (at most 100 items per row). I run a CUDA kernel that scans the dataset to find the distinct items and stores them in a matrix inside the kernel. When I compile and run the program with the full dataset, the kernel is not invoked (I found this while debugging). With a dataset that has fewer rows, the kernel is invoked but some rows are skipped. When I reduce the dataset further, the kernel works correctly. So the program only works correctly on small datasets. I launch 1 block with 1024 threads. Does anybody know what is happening, or what I should do?
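
The kernel source is not shown, so the cause cannot be pinned down from the description alone. Two things that may be worth checking are whether the launch itself reports an error, and whether the row loop really covers every row when there are far more rows than threads. The sketch below is only an illustration under those assumptions: `mark_rows` and `count_visited_rows` are hypothetical names, and writing a flag into `visited` stands in for the real per-row item scan. It uses the same 1 block of 1024 threads as above, with a grid-stride loop and explicit error checks after the launch.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread starts at its global index and advances by
// the total number of launched threads, so all rows are covered even when
// num_rows is much larger than the thread count.
__global__ void mark_rows(int *visited, int num_rows) {
    int stride = blockDim.x * gridDim.x;
    for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < num_rows; row += stride) {
        visited[row] = 1;  // placeholder for the real per-row work
    }
}

// Hypothetical host helper: returns how many rows the kernel visited,
// or -1 if any CUDA call reports an error.
int count_visited_rows(int num_rows) {
    size_t bytes = static_cast<size_t>(num_rows) * sizeof(int);
    int *d_visited = nullptr;
    if (cudaMalloc(&d_visited, bytes) != cudaSuccess) return -1;
    if (cudaMemset(d_visited, 0, bytes) != cudaSuccess) {
        cudaFree(d_visited);
        return -1;
    }

    // Same launch configuration as in the question: 1 block, 1024 threads.
    mark_rows<<<1, 1024>>>(d_visited, num_rows);

    // cudaGetLastError reports launch failures (for example, a kernel that
    // requests more resources than the device allows); cudaDeviceSynchronize
    // reports errors that occur while the kernel is running.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_visited);
        return -1;
    }

    std::vector<int> h_visited(num_rows);
    err = cudaMemcpy(h_visited.data(), d_visited, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_visited);
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int v : h_visited) count += v;
    return count;
}
```

Calling `count_visited_rows(990002)` and comparing the result with the row count gives a quick check that no row is skipped. If the error check prints a message instead, that message is usually the fastest route to finding out why the kernel did not run.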