s improves throughput in some scenarios (tested on Ampere GPU), esp w/ channels_last layout. However, benefits are not always clear and can perform worse on other GPUs. r/