Tracing a memory error in a CUDA program

ROBIN DONG 2021-05-14 08:57

A program that used CUDA for GPU computation reported a memory error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] an illegal memory access was encountered LightGBM/src/treelearner/cuda_tree_learner.cpp 239
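
The message has the shape one would expect from a host-side check that throws when a status code signals failure: a tag, the error string, then a file path and a line number. Below is a minimal C++ sketch of that pattern. It uses a stand-in status type rather than the real CUDA runtime so that it compiles anywhere; `FakeStatus`, `describe`, `check_status`, and `CHECK_STATUS` are hypothetical names for illustration, not LightGBM's actual code.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for a runtime status code (hypothetical; not the CUDA API).
enum class FakeStatus { Ok, IllegalAddress };

// Human-readable text for a status (hypothetical helper).
inline std::string describe(FakeStatus s) {
  return s == FakeStatus::Ok ? "no error"
                             : "an illegal memory access was encountered";
}

// Throw std::runtime_error if the status is not Ok. The file and line
// identify where the check ran on the host, not where the fault happened.
inline void check_status(FakeStatus s, const char* file, int line) {
  if (s != FakeStatus::Ok) {
    throw std::runtime_error("[CUDA] " + describe(s) + " " + file + " " +
                             std::to_string(line));
  }
}

// Convenience wrapper that records the call site automatically.
#define CHECK_STATUS(expr) check_status((expr), __FILE__, __LINE__)
```

Because kernel launches are asynchronous, a fault inside a kernel typically surfaces at a later host-side check like this one. The file and line in the message therefore say where the error was noticed, not which kernel line caused it.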

For ordinary C++ programs, we use gdb for debugging. For CUDA programs, we should use cuda-gdb. Make sure to compile the CUDA code with the -g flag (nvcc additionally accepts -G to generate debug information for device code), and then run:

/usr/local/cuda-11.0/bin/cuda-gdb python3
(cuda-gdb) run

After a while, cuda-gdb stops and shows the exact position in the code where the illegal memory access was detected:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x1668b2f0 (

Thread 1 "python3" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 10, block (2163,0,0), thread (0,0,0), device 0, sm 0, warp 3, lane 0]
0x000000001668b380 in LightGBM::histogram16<<<(7360,1,1),(16,1,1)>>> () at LightGBM/src/treelearner/kernels/
185            feature = (feature >> ((ind & 1) << 2)) & 0xf;
