e performant the optimization. Since we rely on using sparse layout tensors, we infer that any materialized value in the sparse layout is non-zero and we do NOT actually verify that all values are not zero! It is important to not conflate a semantically sparse tensor (a tensor where many of its values are zeros) with a sparse layout tensor (a tensor where ``.is_sparse`` returns ``True``). The SparseAdam approximation is intended for `semantically` sparse tensors and the sparse layout is only a implementation detail. A clearer implementation would be to use MaskedTensors, but those are experimental. .. note:: If you suspect your gradients are semantically sparse (but do not have sparse layout), this variant may not be the best for you. Ideally, you want to avoid materializing anything that is suspected to be sparse in the first place, since needing to convert all your grads from dense layout to sparse layout may outweigh the performance gain. Here, using Adam may be the best alternative, unless you can easily rig up your module to output sparse grads similar to ``nn.Embedding(sparse=True)``. If you insist on converting your grads, you can do so by manually overriding your parameters' ``.grad`` fields with their sparse equivalents before calling ``.step()``. Args: params (iterable): iterable of parameters to optimize or dicts defining parameter groups lr (float, optional): learning rate (default: 1e-3) betas (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999)) eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8) zd .. _Adam\: A Method for Stochastic Optimization: https://arxiv.org/abs/1412.6980 ) r$