Silent CUDA Kernel Failures: Why My Error Checking Mechanism Failed Me

Recently, I hit a really frustrating bug: it only appeared in Release mode and vanished in Debug mode (or any configuration with debugging symbols).

The program wasn’t crashing loudly, but the kernel outputs were consistently 0. Even worse, my error checking mechanism was completely silent.

To catch CUDA kernel launch failures, you typically call cudaGetLastError() right after the launch; it returns an error code if the launch failed. In my case, it returned cudaSuccess.
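For reference, the usual pattern looks roughly like the sketch below. The kernel `fillOnes`, the wrapper `launchAndCheck`, and the launch configuration are hypothetical names chosen for illustration, not code from the actual project; only the `cuda*` calls are real runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel, used only for illustration: writes 1.0f to every element.
__global__ void fillOnes(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

// Launches the kernel and checks for both kinds of failure.
// Returns cudaSuccess only if the launch and the execution both succeeded.
cudaError_t launchAndCheck(float* d_out, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fillOnes<<<blocks, threads>>>(d_out, n);

    // Launch errors (invalid configuration, no kernel image for this GPU, ...)
    // are reported by the first runtime query after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel is running only surface after a sync.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

The two checks are separate on purpose: a kernel launch is asynchronous, so cudaGetLastError() right after it can only tell you whether the launch itself was accepted.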

Because the issue vanished whenever I added debugging symbols, I couldn’t simply attach a debugger or use breakpoints to see what was happening under the hood. There was definitely an issue, but I was flying blind.

I decided to manually insert a cudaGetLastError() call immediately after the kernel launch in the main application. Suddenly, it failed loudly with the error: no kernel image is available for execution on the device.

The root cause was in my CMake configuration. The CMAKE_CUDA_ARCHITECTURES variable was set to "75;80;86;89;90", but the GPU I was testing on has compute capability 7.0 (architecture 70), which was not in the list.
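A quick way to spot this kind of mismatch is to ask the runtime what the GPU actually is. The sketch below is an illustration rather than code from the project, and `archNumberForDevice` is a hypothetical helper name. It converts the device's compute capability into the number format used in CMAKE_CUDA_ARCHITECTURES.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns the compute capability of `device` in the
// form used by CMAKE_CUDA_ARCHITECTURES (major * 10 + minor), or -1 if the
// runtime cannot query the device.
int archNumberForDevice(int device) {
    int major = 0;
    int minor = 0;
    if (cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device) != cudaSuccess) {
        return -1;
    }
    if (cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device) != cudaSuccess) {
        return -1;
    }
    return major * 10 + minor;
}
```

If the returned number is lower than every entry in the architecture list, the binary contains nothing the driver can run or JIT-compile for that device, which matches the error message above.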

After fixing the architecture list, Release Mode was working again. But there was still a lingering question…

Why did my original error checking mechanism fail me?

I found the answer in the CUDA Runtime API - Error Handling documentation:

Description: Returns the last error that has been produced by any of the runtime calls in the same instance of the CUDA Runtime library in the host thread…
Note: Multiple instances of the CUDA Runtime library can be present in an application when using a library that statically links the CUDA Runtime.

My check failed because it was querying a different instance of the CUDA Runtime library than the one that launched the kernel.

The library containing the kernel was compiled separately from the application that executed it and checked the errors. Both linked the CUDA Runtime statically. This created two separate runtime states: one that saw the error, and one that saw “Success”.
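One way to make a check independent of how each side links the runtime is to do it inside the library and hand the result back across the API boundary. The sketch below assumes a made-up library function `libScale` and application function `appScale`; it is not how the project in question is structured, just an illustration of the idea.

```cuda
#include <cuda_runtime.h>

// ---- library side (hypothetical API, built as its own target) ----
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// The launch and the cudaGetLastError() call both live inside the library,
// so they always go through the same CUDA Runtime instance.
cudaError_t libScale(float* d_data, float factor, int n) {
    if (n <= 0) return cudaErrorInvalidValue;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, factor, n);
    return cudaGetLastError();
}

// ---- application side ----
// The application only inspects the returned code; it never asks its own
// runtime instance about a launch it did not perform.
bool appScale(float* d_data, int n) {
    return libScale(d_data, 2.0f, n) == cudaSuccess;
}
```

This only covers launch errors. Errors that occur while the kernel runs would still need a synchronization point on the library side before they can be reported.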

To solve this, I changed both projects to link against the shared version of the CUDA Runtime instead, so the library and the application use a single runtime instance.

See the Issue for this bug