. name='cuda_extension', ... sources=['extension.cpp', 'extension_kernel.cu'], ... extra_compile_args={'cxx': ['-g'], ... 'nvcc': ['-O2']}) ... ], ... cmdclass={ ... 'build_ext': BuildExtension ... }) Compute capabilities: By default the extension will be compiled to run on all archs of the cards visible during the building process of the extension, plus PTX. If down the road a new card is installed the extension may need to be recompiled. If a visible card has a compute capability (CC) that's newer than the newest version for which your nvcc can build fully-compiled binaries, Pytorch will make nvcc fall back to building kernels with the newest version of PTX your nvcc does support (see below for details on PTX). You can override the default behavior using `TORCH_CUDA_ARCH_LIST` to explicitly specify which CCs you want the extension to support: TORCH_CUDA_ARCH_LIST="6.1 8.6" python build_my_extension.py TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX" python build_my_extension.py The +PTX option causes extension kernel binaries to include PTX instructions for the specified CC. PTX is an intermediate representation that allows kernels to runtime-compile for any CC >= the specified CC (for example, 8.6+PTX generates PTX that can runtime-compile for any GPU with CC >= 8.6). This improves your binary's forward compatibility. However, relying on older PTX to provide forward compat by runtime-compiling for newer CCs can modestly reduce performance on those newer CCs. If you know exact CC(s) of the GPUs you want to target, you're always better off specifying them individually. For example, if you want your extension to run on 8.0 and 8.6, "8.0+PTX" would work functionally because it includes PTX that can runtime-compile for 8.6, but "8.0 8.6" would be better. Note that while it's possible to include all supported archs, the more archs get included the slower the building process will be, as it will build a separate kernel image for each arch. Note that CUDA-11.5 nvcc will hit internal compiler error while parsing torch/extension.h on Windows. To workaround the issue, move python binding logic to pure C++ file. Example use: #include at::Tensor SigmoidAlphaBlendForwardCuda(....) Instead of: #include torch::Tensor SigmoidAlphaBlendForwardCuda(...) Currently open issue for nvcc bug: https://github.com/pytorch/pytorch/issues/69460 Complete workaround code example: https://github.com/facebookresearch/pytorch3d/commit/cb170ac024a949f1f9614ffe6af1c38d972f7d48 Relocatable device code linking: If you want to reference device symbols across compilation units (across object files), the object files need to be built with `relocatable device code` (-rdc=true or -dc). An exception to this rule is "dynamic parallelism" (nested kernel launches) which is not used a lot anymore. `Relocatable device code` is less optimized so it needs to be used only on object files that need it. Using `-dlto` (Device Link Time Optimization) at the device code compilation step and `dlink` step help reduce the protentional perf degradation of `-rdc`. Note that it needs to be used at both steps to be useful. If you have `rdc` objects you need to have an extra `-dlink` (device linking) step before the CPU symbol linking step. There is also a case where `-dlink` is used without `-rdc`: when an extension is linked against a static lib containing rdc-compiled objects like the [NVSHMEM library](https://developer.nvidia.com/nvshmem). Note: Ninja is required to build a CUDA Extension with RDC linking. Example: >>> # xdoctest: +SKIP >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CPP_EXT) >>> CUDAExtension( ... name='cuda_extension', ... sources=['extension.cpp', 'extension_kernel.cu'], ... dlink=True, ... dlink_libraries=["dlink_lib"], ... extra_compile_args={'cxx': ['-g'], ... 'nvcc': ['-O2', '-rdc=true']}) rb