o` (Device Link Time Optimization) at the device code compilation step and `dlink` step
help reduce the potential performance degradation of `-rdc`.
Note that it needs to be used at both steps to be useful.

If you have `rdc` objects, you need an extra `-dlink` (device linking) step before the CPU symbol linking step.
There is also a case where `-dlink` is used without `-rdc`:
when an extension is linked against a static library containing rdc-compiled objects,
like the [NVSHMEM library](https://developer.nvidia.com/nvshmem).

Note: Ninja is required to build a CUDA Extension with RDC linking.

Example:
    >>> # xdoctest: +SKIP
    >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CPP_EXT)
    >>> CUDAExtension(
    ...     name='cuda_extension',
    ...     sources=['extension.cpp', 'extension_kernel.cu'],
    ...     dlink=True,
    ...     dlink_libraries=["dlink_lib"],
    ...     extra_compile_args={'cxx': ['-g'],
    ...                         'nvcc': ['-O2', '-rdc=true']})
rp