ther and to the data fit
term as independent as possible of the size `n_samples` of the training set.

The objective function is minimized with an alternating minimization of W
and H. If H is given and update_H=False, it solves for W only.

Note that the transformed data is named W and the components matrix is named H. In
the NMF literature, the naming convention is usually the opposite since the data
matrix X is transposed.

Parameters
----------
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    Constant matrix.

W : array-like of shape (n_samples, n_components), default=None
    If `init='custom'`, it is used as initial guess for the solution.
    If `update_H=False`, it is initialised as an array of zeros, unless
    `solver='mu'`, then it is filled with values calculated by
    `np.sqrt(X.mean() / self._n_components)`.
    If `None`, uses the initialisation method specified in `init`.

H : array-like of shape (n_components, n_features), default=None
    If `init='custom'`, it is used as initial guess for the solution.
    If `update_H=False`, it is used as a constant, to solve for W only.
    If `None`, uses the initialisation method specified in `init`.

n_components : int or {'auto'} or None, default='auto'
    Number of components. If `None`, all features are kept.
    If `n_components='auto'`, the number of components is automatically inferred
    from `W` or `H` shapes.

    .. versionchanged:: 1.4
        Added `'auto'` value.

    .. versionchanged:: 1.6
        Default value changed from `None` to `'auto'`.

init : {'random', 'nndsvd', 'nndsvda', 'nndsvdar', 'custom'}, default=None
    Method used to initialize the procedure.

    Valid options:

    - None: 'nndsvda' if n_components < n_features, otherwise 'random'.
    - 'random': non-negative random matrices, scaled with:
      `sqrt(X.mean() / n_components)`
    - 'nndsvd': Nonnegative Double Singular Value Decomposition (NNDSVD)
      initialization (better for sparseness)
    - 'nndsvda': NNDSVD with zeros filled with the average of X
      (better when sparsity is not desired)
    - 'nndsvdar': NNDSVD with zeros filled with small random values
      (generally faster, less accurate alternative to NNDSVDa
      for when sparsity is not desired)
    - 'custom': If `update_H=True`, use custom matrices W and H which must both
      be provided. If `update_H=False`, then only custom matrix H is used.

    .. versionchanged:: 0.23
        The default value of `init` changed from 'random' to None in 0.23.

    .. versionchanged:: 1.1
        When `init=None` and n_components is less than n_samples and n_features
        defaults to `nndsvda` instead of `nndsvd`.

update_H : bool, default=True
    Set to True, both W and H will be estimated from initial guesses.
    Set to False, only W will be estimated.

solver : {'cd', 'mu'}, default='cd'
    Numerical solver to use:

    - 'cd' is a Coordinate Descent solver that uses Fast Hierarchical
      Alternating Least Squares (Fast HALS).
    - 'mu' is a Multiplicative Update solver.

    .. versionadded:: 0.17
       Coordinate Descent solver.

    .. versionadded:: 0.19
       Multiplicative Update solver.

beta_loss : float or {'frobenius', 'kullback-leibler',             'itakura-saito'}, default='frobenius'
    Beta divergence to be minimized, measuring the distance between X
    and the dot product WH. Note that values different from 'frobenius'
    (or 2) and 'kullback-leibler' (or 1) lead to significantly slower
    fits. Note that for beta_loss <= 0 (or 'itakura-saito'), the input
    matrix X cannot contain zeros. Used only in 'mu' solver.

    .. versionadded:: 0.19

tol : float, default=1e-4
    Tolerance of the stopping condition.

max_iter : int, default=200
    Maximum number of iterations before timing out.

alpha_W : float, default=0.0
    Constant that multiplies the regularization terms of `W`. Set it to zero
    (default) to have no regularization on `W`.

    .. versionadded:: 1.0

alpha_H : float or "same", default="same"
    Constant that multiplies the regularization terms of `H`. Set it to zero to
    have no regularization on `H`. If "same" (default), it takes the same value as
    `alpha_W`.

    .. versionadded:: 1.0

l1_ratio : float, default=0.0
    The regularization mixing parameter, with 0 <= l1_ratio <= 1.
    For l1_ratio = 0 the penalty is an elementwise L2 penalty
    (aka Frobenius Norm).
    For l1_ratio = 1 it is an elementwise L1 penalty.
    For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

random_state : int, RandomState instance or None, default=None
    Used for NMF initialisation (when ``init`` == 'nndsvdar' or
    'random'), and in Coordinate Descent. Pass an int for reproducible
    results across multiple function calls.
    See :term:`Glossary <random_state>`.

verbose : int, default=0
    The verbosity level.

shuffle : bool, default=False
    If true, randomize the order of coordinates in the CD solver.

Returns
-------
W : ndarray of shape (n_samples, n_components)
    Solution to the non-negative least squares problem.

H : ndarray of shape (n_components, n_features)
    Solution to the non-negative least squares problem.

n_iter : int
    Actual number of iterations.

References
----------
.. [1] :doi:`"Fast local algorithms for large scale nonnegative matrix and tensor
   factorizations" <10.1587/transfun.E92.A.708>`
   Cichocki, Andrzej, and P. H. A. N. Anh-Huy. IEICE transactions on fundamentals
   of electronics, communications and computer sciences 92.3: 708-721, 2009.

.. [2] :doi:`"Algorithms for nonnegative matrix factorization with the
   beta-divergence" <10.1162/NECO_a_00168>`
   Fevotte, C., & Idier, J. (2011). Neural Computation, 23(9).

Examples
--------
>>> import numpy as np
>>> X = np.array([[1,1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
>>> from sklearn.decomposition import non_negative_factorization
>>> W, H, n_iter = non_negative_factorization(
...     X, n_components=2, init='random', random_state=0)
)