decomposition
Abstract Base Class
Decomposition classes are built through an AbstractDecompositionModel, which extends scikit-learn’s BaseEstimator class to include methods that are relevant for decomposition methods.
- class pyuoi.decomposition.base.AbstractDecompositionModel
- abstract fit()
Placeholder for fit. Subclasses should implement this method. Fit the model with X.
- Parameters
X (array-like, shape (n_samples, n_features)) – Training data.
- Returns
self – Returns the instance itself.
- Return type
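The pattern described above, a base class that declares `fit` and leaves the implementation to subclasses, can be sketched with Python's `abc` module. This is an illustrative stand-in and not the pyuoi source: the names `SketchDecompositionModel` and `ColumnMeanModel` are hypothetical, and the sketch omits the scikit-learn `BaseEstimator` parent so that it runs with the standard library alone.

```python
from abc import ABC, abstractmethod


class SketchDecompositionModel(ABC):
    """Hypothetical stand-in for an abstract decomposition base class."""

    @abstractmethod
    def fit(self, X):
        """Fit the model with X and return self."""


class ColumnMeanModel(SketchDecompositionModel):
    """Toy subclass: 'fits' by storing the mean of each column of X."""

    def fit(self, X):
        n_samples = len(X)
        # zip(*X) iterates over the columns of a list-of-rows matrix.
        self.means_ = [sum(col) / n_samples for col in zip(*X)]
        return self  # returning self allows chained calls


model = ColumnMeanModel().fit([[1.0, 2.0], [3.0, 4.0]])
print(model.means_)

try:
    SketchDecompositionModel()
except TypeError as err:
    # A class with unimplemented abstract methods cannot be instantiated.
    print("cannot instantiate:", type(err).__name__)
```

Declaring `fit` as abstract means a subclass that forgets to implement it fails at instantiation time rather than later, when the method is first called.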
CUR Decomposition
The pyuoi package includes a class that performs ordinary CUR decomposition, in addition to a class that performs UoICUR.
- class pyuoi.decomposition.CUR.CUR(max_k, algorithm='randomized', n_iter=5, tol=0.0, random_state=None)
Performs ordinary column subset selection through a CUR decomposition.
- Parameters
max_k (int) – The maximum rank of the singular value decomposition.
algorithm (string, optional) – SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009).
n_iter (int, optional) – Number of iterations for the randomized SVD solver. Not used by ARPACK. The default is larger than the default in randomized_svd to handle sparse matrices that may have a large, slowly decaying spectrum.
random_state (int, RandomState instance, or None, optional) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
tol (float, optional) – Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.
- components_
The selected columns of the design matrix.
- Type
ndarray, shape (n_samples, n_components)
- column_indices_
The indices of the columns selected by the algorithm.
- Type
ndarray, shape (n_components,)
- fit(X, c=None)
Performs column subset selection on a provided matrix.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The data matrix.
c (float) – The expected number of columns to select. If None, c will vary with the rank k.
- fit_transform(X, c=None)
Fit and transform the data by choosing and extracting specific columns.
- Parameters
X (array-like, shape (n_samples, n_features)) – Data matrix from which to select columns.
- Returns
X_new – Data matrix comprised of selected columns.
- Return type
array-like, shape (n_samples, n_components)
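The roles of the rank and of `c` (the expected number of columns) can be illustrated with a minimal NumPy sketch of leverage-score column selection, a scheme commonly used for CUR column subset selection. It assumes that each column is kept independently with probability min(1, c × leverage score), where the leverage scores come from the top-k right singular vectors. Whether `CUR.fit` follows these exact steps is an assumption, and `select_columns` is a hypothetical helper, not a pyuoi function.

```python
import numpy as np


def select_columns(X, k, c, rng):
    """Hypothetical helper: sample columns of X by rank-k leverage scores."""
    # Rows of Vt are the right singular vectors; column j belongs to feature j.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    # Leverage score of feature j: its squared norm in the top-k right
    # singular subspace, divided by k so that the scores sum to 1.
    scores = np.sum(Vt[:k] ** 2, axis=0) / k
    # Keep column j with probability min(1, c * score_j), so the expected
    # number of selected columns is at most c.
    keep_prob = np.minimum(1.0, c * scores)
    mask = rng.random(X.shape[1]) < keep_prob
    return np.flatnonzero(mask), scores


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
column_indices, scores = select_columns(X, k=3, c=5, rng=rng)
components = X[:, column_indices]  # the selected columns of X
print(column_indices, components.shape)
```

In this sketch, `column_indices` and `components` play the roles of the `column_indices_` and `components_` attributes described above.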
- class pyuoi.decomposition.CUR.UoI_CUR(n_boots, max_k, boots_frac, stability_selection=1.0, algorithm='randomized', n_iter=5, tol=0.0, random_state=None)
Performs column subset selection (CUR decomposition) in the Union of Intersections framework.
- Parameters
n_boots (int) – Number of bootstraps.
max_k (int) – The maximum rank of the singular value decomposition.
boots_frac (float) – The fraction of data to use in the bootstrap.
algorithm (string, optional) – SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009).
n_iter (int, optional) – Number of iterations for the randomized SVD solver. Not used by ARPACK. The default is larger than the default in randomized_svd to handle sparse matrices that may have a large, slowly decaying spectrum.
random_state (int, RandomState instance, or None, optional) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
tol (float, optional) – Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.
- components_
The selected columns of the design matrix.
- Type
ndarray, shape (n_samples, n_components)
- column_indices_
The indices of the columns selected by the algorithm.
- Type
ndarray, shape (n_components,)
- check_ks_and_cs(ks=None, cs=None)
Process the set of ranks to calculate leverage scores over, and the expected number of columns for each rank.
- Parameters
ks (ndarray) – The ranks to compute leverage scores over.
cs (ndarray) – The expected number of columns to select for each rank.
- Returns
ks (ndarray) – Processed and checked ranks.
cs (ndarray) – Processed expected number of columns.
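One way to picture this kind of input checking is the following NumPy sketch. `normalize_ks_and_cs` is a hypothetical helper, not the pyuoi method. In particular, the rule used when `cs` is None (here `c = k`) is a placeholder assumption, since the documentation only states that `c` varies with the rank.

```python
import numpy as np


def normalize_ks_and_cs(max_k, ks=None, cs=None):
    """Hypothetical helper: return matching arrays of ranks and column counts."""
    if ks is None:
        # Default documented for fit: all ranks 1, ..., max_k.
        ks = np.arange(1, max_k + 1)
    else:
        # Accept a single int, a list, or an array.
        ks = np.atleast_1d(np.asarray(ks, dtype=int))
    if ks.min() < 1 or ks.max() > max_k:
        raise ValueError("ranks must lie in [1, max_k]")

    if cs is None:
        # Placeholder assumption: let c follow the rank.
        cs = ks.astype(float)
    else:
        # Broadcast a scalar c to every rank; an array must match ks in shape.
        cs = np.broadcast_to(np.asarray(cs, dtype=float), ks.shape).copy()
    return ks, cs


print(normalize_ks_and_cs(max_k=4))
print(normalize_ks_and_cs(max_k=4, ks=[2, 3], cs=10))
```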
- fit(X, ks=None, cs=None, stratify=None)
Performs column subset selection in the UoI framework on a provided matrix.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The data matrix.
cs (int, float or None) – The expected number of columns to select. If None, c will vary with the rank k.
ks (int, list, ndarray, or None) – The ranks to take the union over. If None, all ranks from (1, …, max_k) will be used.
stratify (array-like or None) – Ensures groups of samples are allotted to bootstraps proportionally. Labels for each group must be an int greater than zero. Must be of size equal to the number of samples, with further restrictions on the number of groups.
- Returns
union – A numpy array containing the indices of the selected columns.
- Return type
ndarray, shape (n_components,)
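The effect of `stratify` together with `boots_frac` can be sketched as drawing the same fraction of samples from every group. This is only an illustration: `stratified_subsample` is a hypothetical helper, and sampling without replacement within each group is an assumption of the sketch rather than a documented pyuoi behavior.

```python
import numpy as np


def stratified_subsample(labels, boots_frac, rng):
    """Hypothetical helper: pick a fraction of sample indices from each group."""
    labels = np.asarray(labels)
    chosen = []
    for group in np.unique(labels):
        members = np.flatnonzero(labels == group)
        # Take the same fraction from every group (at least one sample).
        n_take = max(1, int(round(boots_frac * members.size)))
        chosen.append(rng.choice(members, size=n_take, replace=False))
    return np.sort(np.concatenate(chosen))


rng = np.random.default_rng(0)
labels = np.array([1] * 8 + [2] * 4)  # two groups, of sizes 8 and 4
rows = stratified_subsample(labels, boots_frac=0.5, rng=rng)
print(rows, labels[rows])
```

With these inputs the subsample keeps the 2:1 ratio between the two groups, which is the proportionality the `stratify` parameter describes.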
- fit_transform(X, ks=None, cs=None)
Fit and transform the data by choosing and extracting specific columns.
- Parameters
X (array-like, shape (n_samples, n_features)) – Data matrix from which to select columns.
- Returns
X_new – Data matrix comprised of selected columns.
- Return type
array-like, shape (n_samples, n_components)
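The "union of intersections" logic behind UoI_CUR can be sketched with plain Python sets: for each rank, keep the columns selected consistently across bootstraps, then take the union of the survivors over ranks. `union_of_intersections` is a hypothetical helper, and reading `stability_selection` as the fraction of bootstraps in which a column must appear is an assumption of this sketch.

```python
def union_of_intersections(selections, stability_selection=1.0):
    """Hypothetical helper.

    selections maps each rank k to a list of sets, one set of selected
    column indices per bootstrap.
    """
    union = set()
    for boot_sets in selections.values():
        n_boots = len(boot_sets)
        # Count how many bootstraps selected each column at this rank.
        counts = {}
        for cols in boot_sets:
            for j in cols:
                counts[j] = counts.get(j, 0) + 1
        # Intersection step: keep columns chosen in at least the required
        # fraction of bootstraps (1.0 means a strict intersection).
        threshold = stability_selection * n_boots
        kept = {j for j, n in counts.items() if n >= threshold}
        # Union step: accumulate the surviving columns across ranks.
        union |= kept
    return sorted(union)


selections = {
    1: [{0, 3}, {0, 4}, {0, 4}],
    2: [{0, 1, 3}, {1, 3, 5}, {1, 3}],
}
strict = union_of_intersections(selections)
soft = union_of_intersections(selections, stability_selection=0.6)
print(strict, soft)
```

Lowering the threshold admits columns that appear in most, but not all, bootstraps, so the softer setting can only grow the selected set.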
Non-negative Matrix Factorization
UoINMF can be customized with various NMF, clustering, non-negative least squares, and consensus algorithms. A base class accepts general objects or functions to perform the desired NMF, clustering, regression, and consensus grouping (provided that they have the correct structure). A derived class is also provided, which uses:
- scikit-learn’s NMF object
- DBSCAN for clustering
- scipy’s non-negative least squares function
- the mean function for consensus grouping
This derived class accepts keyword arguments that correspond to the keyword arguments of the above algorithms, so that the user does not have to provide instantiated objects.
- class pyuoi.decomposition.NMF.UoI_NMF(n_boots, ranks=None, nmf_init='random', nmf_solver='mu', nmf_beta_loss='kullback-leibler', nmf_tol=0.0001, nmf_max_iter=400, db_eps=0.5, db_min_samples=None, db_metric='euclidean', db_metric_params=None, db_algorithm='auto', db_leaf_size=30, use_dissimilarity=True, random_state=None, logger=None, nmf=None, cluster=None, nnreg=None, cons_meth=None)
Performs non-negative matrix factorization in the Union of Intersections framework.
This derived class uses (and accepts the keyword arguments for) scikit-learn’s NMF and DBSCAN objects, scipy’s non-negative least squares function, and a mean function for consensus grouping.
- Parameters
n_boots (int) – The number of bootstraps to use for model selection.
ranks (int, list, or None) – The range of k to use. If ranks is an int, range(2, ranks + 1) will be used. If not specified, range(X.shape[1]) will be used.
nmf_init ("random" | "nndsvd" | "nndsvda" | "nndsvdar" | "custom") – Method used to initialize the NMF procedure. Valid options:
- "random": non-negative random matrices, scaled with sqrt(X.mean() / n_components)
- "nndsvd": Nonnegative Double Singular Value Decomposition (NNDSVD) initialization (better for sparseness)
- "nndsvda": NNDSVD with zeros filled with the average of X (better when sparsity is not desired)
- "nndsvdar": NNDSVD with zeros filled with small random values (generally faster, less accurate alternative to NNDSVDa for when sparsity is not desired)
- "custom": use custom matrices W and H
nmf_solver ('cd' | 'mu', optional) – Numerical solver to use for NMF: ‘cd’ is a Coordinate Descent solver, while ‘mu’ is a Multiplicative Update solver.
nmf_beta_loss (float or string, optional) – String must be in {‘frobenius’, ‘kullback-leibler’, ‘itakura-saito’}. Beta divergence to be minimized, measuring the distance between X and the dot product WH. Note that values different from ‘frobenius’ (or 2) and ‘kullback-leibler’ (or 1) lead to significantly slower fits. Note that for beta_loss <= 0 (or ‘itakura-saito’), the input matrix X cannot contain zeros. Used only in ‘mu’ solver.
nmf_tol (float, optional) – Tolerance of the stopping condition for NMF algorithm.
nmf_max_iter (integer, optional) – Maximum number of iterations before timing out in NMF.
db_eps (float, optional) – The maximum distance between two samples for them to be considered as in the same neighborhood in the DBSCAN algorithm.
db_min_samples (int, optional) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
db_metric (string, or callable, optional) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances() for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
db_metric_params (dict, optional) – Additional keyword arguments for the metric function.
db_algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, optional) – The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
db_leaf_size (int, optional (default = 30)) – Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
random_state (int, RandomState instance, or None) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
logger (Logger) – The logger to use for messages when verbose=True in fit. If None is passed, a logger that writes to sys.stdout will be used.
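The "random" option of `nmf_init` can be sketched in NumPy as follows. The scale follows the parameter description above, sqrt(X.mean() / n_components). Drawing the entries as absolute values of standard normal samples is an assumption of this sketch, and `random_nmf_init` is a hypothetical helper rather than a pyuoi or scikit-learn function.

```python
import numpy as np


def random_nmf_init(X, n_components, rng):
    """Hypothetical helper: non-negative random starting matrices W and H."""
    n_samples, n_features = X.shape
    # Scale from the parameter description: sqrt(X.mean() / n_components).
    scale = np.sqrt(X.mean() / n_components)
    # Absolute values keep every entry non-negative.
    W = scale * np.abs(rng.standard_normal((n_samples, n_components)))
    H = scale * np.abs(rng.standard_normal((n_components, n_features)))
    return W, H


rng = np.random.default_rng(0)
X = rng.random((6, 4))  # non-negative data
W, H = random_nmf_init(X, n_components=2, rng=rng)
print(W.shape, H.shape)
```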
- class pyuoi.decomposition.NMF.UoI_NMF_Base(n_boots=10, ranks=None, nmf=None, cluster=None, nnreg=None, cons_meth=None, use_dissimilarity=True, random_state=None, logger=None)
Performs non-negative matrix factorization in the Union of Intersections framework.
This base class accepts objects or functions that perform the NMF fitting, clustering, non-negative regression, and consensus grouping.
- Parameters
n_boots (int) – The number of bootstraps to use for model selection.
ranks (int, list, or None) – The range of k to use. If ranks is an int, range(2, ranks + 1) will be used. If not specified, range(X.shape[1]) will be used.
nmf (NMF object) – The NMF object to use to perform fitting. Note: this class must take n_components as an argument.
cluster (Clustering object) – Clustering object to use. If None, defaults to DBSCAN.
nnreg (NNLS object) – Non-negative regressor to use. If None, defaults to scipy.optimize.nnls.
cons_meth (function) – The method for computing consensus bases after clustering. If None, uses np.mean.
use_dissimilarity (bool) – Whether to use dissimilarity to choose the final rank. If False, all bases across ranks are concatenated and clustered. The final rank in this case is how many clusters are chosen.
random_state (int, RandomState instance, or None) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
logger (Logger) – The logger to use for messages when verbose=True in fit. If None is passed, a logger that writes to sys.stdout will be used.
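The consensus step named by `cons_meth` can be illustrated with a short NumPy sketch: bases pooled across bootstraps are grouped by cluster label, and the consensus function is applied within each cluster. `consensus_bases` is a hypothetical helper. Dropping points labeled -1 (the label DBSCAN in scikit-learn assigns to noise) is an assumption about how noise would be handled.

```python
import numpy as np


def consensus_bases(bases, labels, cons_meth=np.mean):
    """Hypothetical helper: one consensus basis per cluster.

    bases: array of shape (n_bases, n_features), pooled over bootstraps.
    labels: cluster label per basis; -1 marks noise.
    """
    bases = np.asarray(bases, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = [c for c in np.unique(labels) if c != -1]
    # Apply the consensus function across the members of each cluster.
    return np.vstack([cons_meth(bases[labels == c], axis=0) for c in cluster_ids])


bases = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8], [5.0, 5.0]])
labels = np.array([0, 0, 1, 1, -1])  # the last basis is treated as noise
H = consensus_bases(bases, labels)
print(H)
```

Passing a different function, such as `np.median`, changes only how the members of each cluster are combined.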
- fit(X, verbose=False)
Compute the basis matrix on the provided data matrix using the UoINMF algorithm.
- Parameters
X (ndarray, shape (n_samples, n_features)) – Data matrix to be decomposed.
verbose (bool) – If True, outputs status updates.
- fit_transform(X, reconstruction_err=True, verbose=None)
Fit and transform the data according to the fitted UoINMF model.
- Parameters
- Returns
W – Transformed data (coefficients of bases).
- Return type
array-like, shape (n_samples, n_components)
- inverse_transform(W)
Transform data back to its original space.
- Parameters
W (array-like, shape (n_samples, n_components)) – Transformed data matrix.
- Returns
X – Data matrix of original shape.
- Return type
array-like, shape (n_samples, n_features)
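In a matrix factorization X ≈ WH, mapping coefficients back to the original space amounts to a matrix product. The NumPy sketch below shows the shapes involved; that `inverse_transform` computes exactly this product is an assumption, and the matrices are made-up illustration data.

```python
import numpy as np

# Bases, shape (n_components, n_features).
H = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 1.0]])
# Coefficients, shape (n_samples, n_components).
W = np.array([[1.0, 0.5],
              [0.0, 1.0]])

# Map the coefficients back to the original feature space.
X_rec = W @ H
print(X_rec.shape)  # (n_samples, n_features)
```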
- transform(X, reconstruction_err=True)
Transform the data according to the fitted UoINMF model.
- Parameters
X (array-like, shape (n_samples, n_features)) – Data matrix to be decomposed.
reconstruction_err (bool) – If True, the reconstruction error is computed and stored as a class attribute.
- Returns
W – Transformed data (coefficients of bases).
- Return type
array-like, shape (n_samples, n_components)
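Given fixed bases, the coefficients for new data can be found by solving one non-negative least squares problem per sample. The sketch below uses `scipy.optimize.nnls`, the default regressor named for `nnreg`. `nnls_transform` is a hypothetical helper, and measuring the reconstruction error with the Frobenius norm is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import nnls


def nnls_transform(X, H):
    """Hypothetical helper: non-negative coefficients W with X ≈ W @ H."""
    W = np.zeros((X.shape[0], H.shape[0]))
    for i, x in enumerate(X):
        # Solve min ||H.T @ w - x||_2 subject to w >= 0 for one sample.
        W[i], _ = nnls(H.T, x)
    return W


H = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 1.0]])
W_true = np.array([[1.0, 0.5],
                   [0.0, 1.0]])
X = W_true @ H  # data that the bases reproduce exactly
W = nnls_transform(X, H)
# Frobenius norm of the residual, one possible reconstruction error.
reconstruction_err = np.linalg.norm(X - W @ H)
print(W, reconstruction_err)
```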