e performant the optimization.
    Since we rely on sparse layout tensors, we assume that any materialized value in the
    sparse layout is non-zero; we do NOT actually verify that all values are non-zero!
    It is important to not conflate a semantically sparse tensor (a tensor where many
    of its values are zeros) with a sparse layout tensor (a tensor where ``.is_sparse``
    returns ``True``). The SparseAdam approximation is intended for `semantically` sparse
    tensors, and the sparse layout is only an implementation detail. A clearer implementation
    would use MaskedTensors, but those are experimental.
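
    As a minimal sketch of this distinction (the tensor values below are made up for
    illustration), a dense tensor that is mostly zeros is not a sparse layout tensor, and
    a sparse layout tensor can still store an explicit zero as a materialized value:

    .. code-block:: python

        import torch

        # Semantically sparse: mostly zeros, but stored in the dense (strided) layout.
        dense = torch.tensor([[0.0, 0.0, 3.0], [0.0, 0.0, 0.0]])
        print(dense.is_sparse)   # False

        # Sparse layout: only the non-zero entry is materialized.
        sparse = dense.to_sparse()
        print(sparse.is_sparse)  # True
        print(sparse.values())   # the single stored value

        # A sparse layout tensor may still materialize an explicit zero.
        explicit = torch.sparse_coo_tensor(
            indices=torch.tensor([[0, 1], [2, 0]]),
            values=torch.tensor([3.0, 0.0]),
            size=(2, 3),
        ).coalesce()
        print(explicit.values())  # two stored values, one of which is zero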


    .. note::

        If you suspect your gradients are semantically sparse (but do not have sparse
        layout), this variant may not be the best for you. Ideally, you want to avoid
        materializing anything that is suspected to be sparse in the first place, since
        needing to convert all your grads from dense layout to sparse layout may outweigh
        the performance gain. Here, using Adam may be the best alternative, unless you
        can easily rig up your module to output sparse grads similar to
        ``nn.Embedding(sparse=True)``. If you insist on converting your grads, you can do
        so by manually overriding your parameters' ``.grad`` fields with their sparse
        equivalents before calling ``.step()``.
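
    The manual conversion described in the note could look roughly like the sketch below.
    The ``nn.Linear`` model and the input values are hypothetical and chosen only so that
    the weight gradient is semantically sparse; this is an illustration under those
    assumptions, not a recommended pattern:

    .. code-block:: python

        import torch
        from torch import nn

        torch.manual_seed(0)

        # Hypothetical toy model; its weight gradient has dense layout by default.
        model = nn.Linear(4, 2, bias=False)
        optimizer = torch.optim.SparseAdam(model.parameters(), lr=1e-3)
        before = model.weight.detach().clone()

        # Input columns 1 and 2 are zero, so the matching gradient columns are zero.
        x = torch.tensor([[1.0, 0.0, 0.0, 2.0]])
        model(x).sum().backward()

        # Override each dense ``.grad`` with its sparse layout equivalent.
        for p in model.parameters():
            if p.grad is not None and not p.grad.is_sparse:
                p.grad = p.grad.to_sparse()

        optimizer.step()

        # Only entries with a materialized gradient value were updated.
        after = model.weight.detach()
        print(torch.equal(before[:, 1:3], after[:, 1:3]))

    Note that the conversion itself costs time and memory on every step, which is the
    overhead the note warns about.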


    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)

    .. _Adam\: A Method for Stochastic Optimization:
        https://arxiv.org/abs/1412.6980
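
    For orientation, a minimal usage sketch with the arguments listed above spelled out
    at their documented defaults. The embedding sizes and lookup indices are arbitrary
    values picked for illustration:

    .. code-block:: python

        import torch
        from torch import nn

        # ``sparse=True`` makes the embedding produce sparse layout gradients.
        embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, sparse=True)
        optimizer = torch.optim.SparseAdam(
            embedding.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8
        )

        indices = torch.tensor([1, 4])  # only these rows receive gradient values
        optimizer.zero_grad()
        loss = embedding(indices).sum()
        loss.backward()
        optimizer.step()

        print(embedding.weight.grad.is_sparse)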

    )
r$