Decomposition

Abstract Base Class

Decomposition classes are built on an AbstractDecompositionModel, which extends scikit-learn’s BaseEstimator class with methods relevant to decomposition.

class pyuoi.decomposition.base.AbstractDecompositionModel[source]
abstract fit(X)[source]

Placeholder for fit. Subclasses should implement this method to fit the model with X.

Parameters

X (array-like, shape (n_samples, n_features)) – Training data.

Returns

self – Returns the instance itself.

Return type

object

abstract fit_transform(X)[source]

Transform the data X according to the fitted decomposition.

Parameters

X (array-like, shape (n_samples, n_features)) – Data matrix to be decomposed.

Returns

X_new – Transformed data.

Return type

array-like, shape (n_samples, n_components)

abstract transform(X)[source]

Apply dimensionality reduction to X.

Parameters

X (array-like, shape (n_samples, n_features)) – Data matrix to be transformed.

Returns

X_new – The transformed data matrix.

Return type

array-like, shape (n_samples, n_components)
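As an illustrative sketch (not part of the pyuoi API), the hypothetical ToySVD class below shows how a subclass might satisfy this interface, implementing the three abstract methods with a plain truncated SVD from NumPy. Because AbstractDecompositionModel extends BaseEstimator, such a subclass also inherits get_params and set_params.

    import numpy as np
    from pyuoi.decomposition.base import AbstractDecompositionModel

    class ToySVD(AbstractDecompositionModel):
        """Hypothetical example subclass: a plain truncated SVD."""

        def __init__(self, n_components=2):
            self.n_components = n_components

        def fit(self, X):
            # Keep the top right-singular vectors as the components.
            _, _, Vt = np.linalg.svd(X, full_matrices=False)
            self.components_ = Vt[:self.n_components]
            return self

        def transform(self, X):
            # Project the data onto the fitted components.
            return X @ self.components_.T

        def fit_transform(self, X):
            return self.fit(X).transform(X)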

CUR Decomposition

The pyuoi package includes a class that performs ordinary CUR decomposition as well as a class that performs CUR decomposition within the Union of Intersections framework (UoI_CUR).

class pyuoi.decomposition.CUR.CUR(max_k, algorithm='randomized', n_iter=5, tol=0.0, random_state=None)[source]

Performs ordinary column subset selection through a CUR decomposition.

Parameters
  • max_k (int) – The maximum rank of the singular value decomposition.

  • algorithm (string, optional) – SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009).

  • n_iter (int, optional) – Number of iterations for randomized SVD solver. Not used by ARPACK. The default is larger than the default in randomized_svd to handle sparse matrices that may have a large, slowly decaying spectrum.

  • random_state (int, RandomState instance, or None, optional) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • tol (float, optional) – Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.

components_

The selected columns of the design matrix.

Type

ndarray, shape (n_samples, n_components)

column_indices_

The indices of the columns selected by the algorithm.

Type

ndarray, shape (n_components,)

fit(X, c=None)[source]

Performs ordinary column subset selection on a provided matrix.

Parameters
  • X (ndarray, shape (n_samples, n_features)) – The data matrix.

  • c (float or None) – The expected number of columns to select. If None, c will vary with the rank k.

fit_transform(X, c=None)[source]

Fit and transform the data by choosing and extracting specific columns.

Parameters

X (array-like, shape (n_samples, n_features)) – Data matrix from which to select columns.

Returns

X_new – Data matrix comprised of selected columns.

Return type

array-like, shape (n_samples, n_components)

transform(X)[source]

Transform the data by extracting the selected columns.

Parameters

X (array-like, shape (n_samples, n_features)) – Data matrix from which to select columns.

Returns

X_new – Data matrix comprised of selected columns.

Return type

array-like, shape (n_samples, n_components)
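A minimal usage sketch for CUR, assuming random data and arbitrary choices of max_k and c:

    import numpy as np
    from pyuoi.decomposition.CUR import CUR

    X = np.random.rand(100, 20)            # data matrix, shape (n_samples, n_features)

    cur = CUR(max_k=5, algorithm='randomized', random_state=0)
    X_new = cur.fit_transform(X, c=5)      # keep roughly c columns

    print(cur.column_indices_)             # indices of the selected columns
    print(cur.components_.shape)           # selected columns, (n_samples, n_components)
    print(X_new.shape)                     # same columns as returned by fit_transform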

class pyuoi.decomposition.CUR.UoI_CUR(n_boots, max_k, boots_frac, stability_selection=1.0, algorithm='randomized', n_iter=5, tol=0.0, random_state=None)[source]

Performs column subset selection (CUR decomposition) in the Union of Intersections framework.

Parameters
  • n_boots (int) – Number of bootstraps.

  • max_k (int) – The maximum rank of the singular value decomposition.

  • boots_frac (float) – The fraction of data to use in the bootstrap.

  • algorithm (string, optional) – SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009).

  • n_iter (int, optional) – Number of iterations for randomized SVD solver. Not used by ARPACK. The default is larger than the default in randomized_svd to handle sparse matrices that may have a large, slowly decaying spectrum.

  • random_state (int, RandomState instance, or None, optional) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • tol (float, optional) – Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.

components_

The selected columns of the design matrix.

Type

ndarray, shape (n_samples, n_components)

column_indices_

The indices of the columns selected by the algorithm.

Type

ndarray, shape (n_components,)

check_ks_and_cs(ks=None, cs=None)[source]

Process the set of ranks to calculate leverage scores over, and the expected number of columns for each rank.

Parameters
  • ks (ndarray) – The ranks to compute leverage scores over.

  • cs (ndarray) – The expected number of columns to select for each rank.

Returns

  • ks (ndarray) – Processed and checked ranks.

  • cs (ndarray) – Processed expected number of columns.

fit(X, ks=None, cs=None, stratify=None)[source]

Performs column subset selection in the UoI framework on a provided matrix.

Parameters
  • X (ndarray, shape (n_samples, n_features)) – The data matrix.

  • cs (int, float, or None) – The expected number of columns to select for each rank. If None, cs will vary with the rank k.

  • ks (int, list, ndarray, or None) – The ranks over which to take the union. If None, all ranks from 1 to max_k will be used.

  • stratify (array-like or None) – Ensures groups of samples are allotted to bootstraps proportionally. Labels for each group must be an int greater than zero. Must be of size equal to the number of samples, with further restrictions on the number of groups.

Returns

union – A numpy array containing the indices of the selected columns.

Return type

ndarray, shape (n_components,)

fit_transform(X, ks=None, cs=None)[source]

Fit and transform the data by choosing and extracting specific columns.

Parameters

X (array-like, shape (n_samples, n_features)) – Data matrix from which to select columns.

Returns

X_new – Data matrix comprised of selected columns.

Return type

array-like, shape (n_samples, n_components)

transform(X)[source]

Transform the data by extracting the selected columns.

Parameters

X (array-like, shape (n_samples, n_features)) – Data matrix from which to select columns.

Returns

X_new – Data matrix comprised of selected columns.

Return type

array-like, shape (n_samples, n_components)
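A corresponding sketch for UoI_CUR, with arbitrary bootstrap settings; ks and cs are left at their defaults so all ranks up to max_k are considered:

    import numpy as np
    from pyuoi.decomposition.CUR import UoI_CUR

    X = np.random.rand(200, 30)

    uoi_cur = UoI_CUR(n_boots=10, max_k=5, boots_frac=0.9, random_state=0)
    uoi_cur.fit(X)                         # ks and cs default over all ranks up to max_k

    print(uoi_cur.column_indices_)         # columns selected by the UoI procedure
    X_new = uoi_cur.transform(X)           # data restricted to the selected columns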

Non-negative Matrix Factorization

UoINMF can be customized with various NMF, clustering, non-negative least squares, and consensus algorithms. A base class accepts general objects or functions to perform the desired NMF, clustering, regression, and consensus grouping (provided that they have the correct structure). A derived class is also provided that uses:

  • scikit-learn’s NMF object

  • DBSCAN for clustering

  • scipy’s non-negative least squares function

  • the median function for consensus grouping

This derived class accepts keyword arguments corresponding to the keyword arguments of the above algorithms, so that the user does not have to provide instantiated objects.

class pyuoi.decomposition.NMF.UoI_NMF(n_boots, ranks=None, nmf_init='random', nmf_solver='mu', nmf_beta_loss='kullback-leibler', nmf_tol=0.0001, nmf_max_iter=400, db_eps=0.5, db_min_samples=None, db_metric='euclidean', db_metric_params=None, db_algorithm='auto', db_leaf_size=30, use_dissimilarity=True, random_state=None, logger=None, nmf=None, cluster=None, nnreg=None, cons_meth=None)[source]

Performs non-negative matrix factorization in the Union of Intersections framework.

This derived class uses (and accepts the keyword arguments for) scikit-learn’s NMF and DBSCAN objects, scipy’s non-negative least squares function, and a mean function for consensus grouping.

Parameters
  • n_boots (int) – The number of bootstraps to use for model selection.

  • ranks (int, list, or None) – The range of k to use. If ranks is an int, range(2, ranks + 1) will be used. If not specified, range(X.shape[1]) will be used.

  • nmf_init ("random" | "nndsvd" | "nndsvda" | "nndsvdar" | "custom") – Method used to initialize the NMF procedure. Valid options:

    • "random": non-negative random matrices, scaled with sqrt(X.mean() / n_components)

    • "nndsvd": Nonnegative Double Singular Value Decomposition (NNDSVD) initialization (better for sparseness)

    • "nndsvda": NNDSVD with zeros filled with the average of X (better when sparsity is not desired)

    • "nndsvdar": NNDSVD with zeros filled with small random values (generally faster, less accurate alternative to NNDSVDa when sparsity is not desired)

    • "custom": use custom matrices W and H

  • nmf_solver ('cd' | 'mu', optional) – Numerical solver to use for NMF: ‘cd’ is a Coordinate Descent solver, while ‘mu’ is a Multiplicative Update solver.

  • nmf_beta_loss (float or string, optional) – String must be in {‘frobenius’, ‘kullback-leibler’, ‘itakura-saito’}. Beta divergence to be minimized, measuring the distance between X and the dot product WH. Note that values different from ‘frobenius’ (or 2) and ‘kullback-leibler’ (or 1) lead to significantly slower fits. Note that for beta_loss <= 0 (or ‘itakura-saito’), the input matrix X cannot contain zeros. Used only in ‘mu’ solver.

  • nmf_tol (float, optional) – Tolerance of the stopping condition for NMF algorithm.

  • nmf_max_iter (integer, optional) – Maximum number of iterations before timing out in NMF.

  • db_eps (float, optional) – The maximum distance between two samples for them to be considered as in the same neighborhood in the DBSCAN algorithm.

  • db_min_samples (int, optional) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

  • db_metric (string, or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances() for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

  • db_metric_params (dict, optional) – Additional keyword arguments for the metric function.

  • db_algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, optional) – The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.

  • db_leaf_size (int, optional (default = 30)) – Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

  • random_state (int, RandomState instance, or None) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • logger (Logger) – The logger to use for messages when verbose=True in fit. If None is passed, a logger that writes to sys.stdout will be used.
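A usage sketch for the derived UoI_NMF class on non-negative data; the hyperparameter values are placeholders, and the fit_transform and inverse_transform methods are those documented under UoI_NMF_Base below:

    import numpy as np
    from pyuoi.decomposition.NMF import UoI_NMF

    X = np.random.rand(100, 40)            # non-negative data matrix

    uoi_nmf = UoI_NMF(n_boots=10, ranks=[2, 3, 4], db_eps=0.5, random_state=0)
    W = uoi_nmf.fit_transform(X)           # coefficients, shape (n_samples, n_components)
    X_hat = uoi_nmf.inverse_transform(W)   # reconstruction in the original feature space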

class pyuoi.decomposition.NMF.UoI_NMF_Base(n_boots=10, ranks=None, nmf=None, cluster=None, nnreg=None, cons_meth=None, use_dissimilarity=True, random_state=None, logger=None)[source]

Performs non-negative matrix factorization in the Union of Intersections framework.

This base class accepts objects or functions that perform the NMF fitting, clustering, non-negative regression, and consensus grouping.

Parameters
  • n_boots (int) – The number of bootstraps to use for model selection.

  • ranks (int, list, or None) – The range of k to use. If ranks is an int, range(2, ranks + 1) will be used. If not specified, range(X.shape[1]) will be used.

  • nmf (NMF object) – The NMF object to use to perform fitting. Note: this class must take n_components as an argument.

  • cluster (Clustering object) – Clustering object to use. If None, defaults to DBSCAN.

  • nnreg (NNLS object) – Non-negative regressor to use. If None, defaults to scipy.optimize.nnls.

  • cons_meth (function) – The method for computing consensus bases after clustering. If None, uses np.mean.

  • use_dissimilarity (bool) – Whether to use dissimilarity to choose the final rank. If False, all bases across ranks are concatenated and clustered. The final rank in this case is how many clusters are chosen.

  • random_state (int, RandomState instance, or None) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • logger (Logger) – The logger to use for messages when verbose=True in fit. If None is passed, a logger that writes to sys.stdout will be used.

fit(X, verbose=False)[source]

Compute the basis matrix from the provided data matrix using the UoINMF algorithm.

Parameters
  • X (ndarray, shape (n_samples, n_features)) – Data matrix to be decomposed.

  • verbose (bool) – If True, outputs status updates.

fit_transform(X, reconstruction_err=True, verbose=None)[source]

Fit and transform the data according to the fitted UoINMF model.

Parameters
  • X (array-like, shape (n_samples, n_features)) – Data matrix to be decomposed.

  • reconstruction_err (bool) – If True, the reconstruction error is computed and stored as a class attribute.

  • verbose (bool) – If True, outputs status updates.

Returns

W – Transformed data (coefficients of bases).

Return type

array-like, shape (n_samples, n_components)

inverse_transform(W)[source]

Transform data back to its original space.

Parameters

W (array-like, shape (n_samples, n_components)) – Transformed data matrix.

Returns

X – Data matrix of original shape.

Return type

array-like, shape (n_samples, n_features)

set_params(**kwargs)[source]

Set the parameters of this estimator.

transform(X, reconstruction_err=True)[source]

Transform the data according to the fitted UoINMF model.

Parameters
  • X (array-like, shape (n_samples, n_features)) – Data matrix to be decomposed.

  • reconstruction_err (bool) – If True, the reconstruction error is computed and stored as a class attribute.

Returns

W – Transformed data (coefficients of bases).

Return type

array-like, shape (n_samples, n_components)
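Per the base-class parameters above, the following sketch passes custom NMF and clustering objects to UoI_NMF_Base; the hyperparameter values are placeholders, and the provided NMF object is assumed to accept n_components as described in the parameter list.

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.cluster import DBSCAN
    from pyuoi.decomposition.NMF import UoI_NMF_Base

    X = np.random.rand(100, 40)            # non-negative data matrix

    uoi = UoI_NMF_Base(
        n_boots=10,
        ranks=[2, 3, 4],
        nmf=NMF(solver='mu', beta_loss='kullback-leibler', max_iter=400),
        cluster=DBSCAN(eps=0.5, min_samples=5),
        nnreg=None,                        # defaults to scipy.optimize.nnls
        cons_meth=np.median,               # custom consensus function (default is np.mean)
        random_state=0)

    W = uoi.fit_transform(X)               # coefficients of the consensus bases
    X_hat = uoi.inverse_transform(W)       # map back to the original feature space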