Introduction to Union of Intersections

The Union of Intersections (UoI) is a flexible, modular, and scalable framework for enhancing both the identification of features (model selection) and the estimation of the contributions of those features (model estimation) within a model-fitting procedure.

Methods built into the UoI framework (e.g. regression, classification, dimensionality reduction) leverage stochastic data resampling and a range of sparsity-inducing regularization parameters/dimensions to build families of potential feature sets robust to resamples (i.e., perturbations) of the data, and then average nearly unbiased parameter estimates of selected features to maximize predictive accuracy.

In simpler terms: scientific data is often noisy, and we are generally interested in identifying features in a model that are stable to that noise. UoI uses novel aggregating procedures within a resampling framework to do both sparse (i.e., getting rid of features that aren’t robust) and predictive (i.e., keeping features that are informative) model fitting.

PyUoI contains implementations of the UoI variants for several lasso or elastic-net penalized generalized linear models (linear, logistic, and Poisson regression) as well as dimensionality reduction techniques, including CUR (column subset selection) and non-negative matrix factorization. Below, we detail the UoI framework in these contexts.
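
To give a feel for the package before diving into the math, the snippet below fits UoILasso to a synthetic sparse regression problem. PyUoI estimators follow the scikit-learn fit/predict interface; the import path and the synthetic data here are illustrative, so consult the API documentation for the exact class signatures.

import numpy as np
from pyuoi.linear_model import UoI_Lasso  # import path assumed; see the PyUoI API docs

# Synthetic sparse regression problem: only 3 of 10 features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[[0, 3, 7]] = [2.0, -1.5, 0.5]
y = X @ beta + 0.5 * rng.normal(size=200)

# UoI_Lasso exposes the familiar scikit-learn estimator interface.
uoi = UoI_Lasso()
uoi.fit(X, y)
print(uoi.coef_)  # coefficients of non-informative features should be exactly zero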

Linear Models

PyUoI contains implementations of the UoI versions of lasso, elastic net, logistic regression, and Poisson regression. We introduce the UoI framework for lasso, and briefly elaborate on the cost functions for the other regression types.

Lasso Regression

Consider a linear model with \(M\) features:

\[\begin{align} y &= \sum_{i=1}^{M} x_i \beta_i + \epsilon, \end{align}\]

where \(\left\{ x_i \right\}_{i=1}^M\) are the features, \(y\) is the response variable, and \(\epsilon\) is variability uncaptured by our model.

A common approach to fitting the parameters \(\beta\) relies on the L1 penalty, which we apply to a typical mean squared (reconstruction) error:

\[\begin{align} \boldsymbol{\beta}^* &= \underset{\boldsymbol{\beta}}{\text{argmin }} \Bigg\{ \frac{1}{N}\sum_{i=1}^N \left(y_i - \sum_{j=1}^M x_{ij}\beta_j\right)^2+ \lambda \sum_{j=1}^M |\beta_j| \Bigg\}. \end{align}\]

The addition of the L1 penalty on the parameters has the effect of setting some subset of them exactly equal to zero, thereby performing model selection. The parameter \(\lambda\) specifies how strongly sparsity is enforced, and is typically chosen through cross-validation or a penalized score function across a set of hyperparameters \(\left\{\lambda_k\right\}_{k=1}^{H}\).
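
For concreteness, the short scikit-learn sketch below sweeps a lasso fit across a grid of hyperparameters and picks one by cross-validation, which is the conventional (non-UoI) workflow just described; scikit-learn's alpha plays the role of \(\lambda\) (up to a scaling convention), and the data are synthetic.

import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

# One lasso fit per hyperparameter: larger lambda -> sparser support.
lambdas = np.logspace(-3, 1, 30)
supports = [np.flatnonzero(Lasso(alpha=lam).fit(X, y).coef_) for lam in lambdas]

# Conventional approach: choose a single lambda by cross-validation.
lasso_cv = LassoCV(alphas=lambdas, cv=5).fit(X, y)
print(lasso_cv.alpha_, np.flatnonzero(lasso_cv.coef_))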

In the UoI framework, model selection and model estimation are performed in separate modules. For UoILasso, the procedure is as follows:

  • Model Selection: For each \(\lambda_k\), generate Lasso estimates on \(N_S\) resamples of the data. The support \(S_k\) (i.e., the set of non-zero parameters) for \(\lambda_k\) consists of the features that persist (via an intersection) in all model fits across the resamples.

  • Model Estimation: For each support \(S_k\), apply some nearly unbiased estimator, such as Ordinary Least Squares, on \(N_E\) resamples of the data, using only the features in \(S_k\). The final model is obtained by averaging (i.e., unionizing) across the fitted models that optimally predict (according to a specified metric) held-out data for each resample. This also reduces the variance of the estimates.

Thus, the selection module ensures that, for each \(\lambda_k\), only features that are stable to perturbations of the data (resamples) are allowed into the support \(S_k\). Meanwhile, the estimation module ensures that only the predictive supports are averaged together in the final model. The degree of feature compression via intersections (quantified by \(N_S\)) and the degree of feature expansion via unions (quantified by \(N_E\)) can be balanced to maximize prediction accuracy for the response variable \(y\).
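
The following is a simplified, self-contained sketch of the two modules using scikit-learn solvers; it is meant to illustrate the intersection and union steps, not to reproduce PyUoI's implementation (which differs in its resampling, scoring options, and solvers).

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def uoi_lasso_sketch(X, y, lambdas, n_sel=20, n_est=20, frac=0.8, seed=0):
    """Illustrative intersection (selection) and union (estimation) steps."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    n_train = int(frac * N)

    # Model selection: intersect supports across resamples for each lambda.
    supports = []
    for lam in lambdas:
        support = np.ones(M, dtype=bool)
        for _ in range(n_sel):
            idx = rng.permutation(N)[:n_train]
            coef = Lasso(alpha=lam, max_iter=5000).fit(X[idx], y[idx]).coef_
            support &= coef != 0
        supports.append(support)

    # Model estimation: on each resample, fit OLS on every support, keep the
    # support that best predicts held-out data, then average (union) across resamples.
    estimates = np.zeros((n_est, M))
    for b in range(n_est):
        idx = rng.permutation(N)
        train, test = idx[:n_train], idx[n_train:]
        best_err, best_coef = np.inf, np.zeros(M)
        for support in supports:
            if not support.any():
                continue
            ols = LinearRegression().fit(X[train][:, support], y[train])
            err = np.mean((y[test] - ols.predict(X[test][:, support])) ** 2)
            if err < best_err:
                best_err, best_coef = err, np.zeros(M)
                best_coef[support] = ols.coef_
        estimates[b] = best_coef
    return estimates.mean(axis=0)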

Elastic Net

The elastic net penalty includes an additional L2 penalty on the parameter values:

\[\begin{align} \boldsymbol{\beta}^* &= \underset{\boldsymbol{\beta}}{\text{argmin }} \Bigg\{ \frac{1}{N}\sum_{i=1}^N \left(y_i - \sum_{j=1}^M x_{ij}\beta_j\right)^2+ \lambda \left(\alpha \sum_{j=1}^M |\beta_j| + \frac{1}{2}(1-\alpha) \sum_{j=1}^M \beta_j^2\right) \Bigg\}. \end{align}\]

Thus, there are two hyperparameters, \(\lambda\) and \(\alpha\), to optimize over. In the UoI framework, instead of iterating over a single hyperparameter (e.g., the \(\lambda_k\) in lasso), the selection module iterates over each pair of hyperparameters and otherwise proceeds as before. The estimation module is unchanged, because it operates only on the unique supports produced by the selection module.
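
As a small illustration (again with scikit-learn rather than PyUoI, and with synthetic data), the snippet below sweeps a grid of \((\lambda, \alpha)\) pairs and collects the unique supports that would be handed to the estimation module; note that scikit-learn names \(\lambda\) as alpha and \(\alpha\) as l1_ratio.

import numpy as np
from itertools import product
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 15))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100)

lambdas = np.logspace(-3, 0, 5)
alphas = [0.25, 0.5, 0.75, 1.0]

# Every (lambda, alpha) pair defines one selection sweep.
supports = set()
for lam, a in product(lambdas, alphas):
    coef = ElasticNet(alpha=lam, l1_ratio=a, max_iter=5000).fit(X, y).coef_
    supports.add(tuple(np.flatnonzero(coef)))

# The estimation module only ever sees the unique supports.
print(len(supports), "unique supports from", len(lambdas) * len(alphas), "pairs")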

Logistic Regression

Logistic regression can be divided into two cases: Bernoulli logistic regression, where the response is a binary choice, and multinomial logistic regression, where there are multiple classes to choose from. In the Bernoulli, or binary, case the model is given by

\[\begin{split}\begin{align} P(y=1|\mathbf{x};\boldsymbol{\beta}) = \hat y &= \sigma\left(\beta_0 + \sum_{i=1}^M x_i \beta_i\right)\qquad \text{with}\\ \sigma(r) &= \frac{1}{1+\exp(-r)}. \end{align}\end{split}\]

Meanwhile, for multinomial logistic regression (multiclass), the model is defined as

\[\begin{split}\begin{equation} P(y_i=1|\mathbf{x}, \boldsymbol{\beta}) = \hat y_i = \displaystyle\begin{cases} \frac{1}{C} \exp\left(\sum_{j=1}^M x_j \beta_{ij} \right), & i=1\\ \frac{1}{C} \exp\left(\beta_i + \sum_{j=1}^M x_j \beta_{ij}\right), & i>1 \end{cases}, \end{equation}\end{split}\]

where

\[\begin{align} C &= \exp\left(\sum_{j=1}^M x_j \beta_{1j}\right) + \sum_{i \neq 1} \exp\left(\beta_i + \sum_{j=1}^M x_j \beta_{ij} \right). \end{align}\]

Note that we have set the first class’ intercept to zero to account for overparameterization. The parameters are found by minimizing the penalized negative log-likelihood. The penalized negative log-likelihood for one sample is given by

\[\begin{split}\begin{align} \text{cost}(\mathbf{x}, y_i=1) = \begin{cases} \frac{1}{N}\left(\log C - \sum_{j=1}^M x_j \beta_{ij}\right) + \lambda \sum_{j=1}^M |\beta_{ij}|, & i=1\\ \frac{1}{N}\left(\log C - \beta_i - \sum_{j=1}^M x_j \beta_{ij}\right) + \lambda \sum_{j=1}^M |\beta_{ij}|, & i>1 \end{cases}. \end{align}\end{split}\]

In PyUoI, this cost function is solved using a modified orthant-wise limited memory quasi-Newton method.
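
The expressions above translate directly into code. The sketch below (plain NumPy, not PyUoI's solver) computes the class probabilities with the first class's intercept fixed to zero and evaluates the per-sample penalized negative log-likelihood.

import numpy as np

def multinomial_probs(x, B, b0):
    """Class probabilities for features x, weights B (K x M), intercepts b0 (K,) with b0[0] = 0."""
    logits = b0 + B @ x
    logits -= logits.max()        # numerical stability; leaves the probabilities unchanged
    expl = np.exp(logits)
    return expl / expl.sum()      # the denominator plays the role of C above

def penalized_nll(x, y, B, b0, lam, N):
    """Per-sample penalized negative log-likelihood, as written above (y is the true class index)."""
    probs = multinomial_probs(x, B, b0)
    return -np.log(probs[y]) / N + lam * np.abs(B[y]).sum()

rng = np.random.default_rng(0)
K, M, N = 3, 5, 100
B = rng.normal(size=(K, M))
b0 = np.concatenate(([0.0], rng.normal(size=K - 1)))   # first class's intercept is zero
x = rng.normal(size=M)
print(multinomial_probs(x, B, b0), penalized_nll(x, 2, B, b0, lam=0.1, N=N))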

Poisson Regression

In Poisson regression, we assume that the response variable follows a Poisson distribution whose log-mean is given by a linear combination of the features and the parameters:

\[\begin{split}\begin{align} P(y|\mathbf{x}, \boldsymbol{\beta}) &= \frac{1}{y!} m^y e^{-m} \\ \log m &= \beta_0 + \sum_{i=1}^M x_i \beta_i. \end{align}\end{split}\]

Thus, we seek the parameters that minimize the penalized average negative log-likelihood:

\[\begin{split}\begin{align} \boldsymbol{\beta}^* &= \underset{\boldsymbol{\beta}}{\text{argmin }} \left\{-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \left(\beta_0 + \sum_{j=1}^M x_{ij} \beta_j\right) -\exp\left(\beta_0 + \sum_{j=1}^M x_{ij} \beta_j\right) \right] \right. \\ & \qquad \qquad \qquad + \left. \lambda \left(\alpha \sum_{j=1}^M |\beta_j| + \frac{1}{2}(1-\alpha) \sum_{j=1}^M \beta_j^2\right)\right\}. \end{align}\end{split}\]

Thus, the UoI fitting procedure proceeds similarly to UoIElasticNet, except that a different cost function is optimized. This optimization can be performed with coordinate descent, as for lasso, or with the modified orthant-wise limited-memory quasi-Newton method discussed in the previous section.
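
For reference, the penalized objective above can be written down in a few lines of NumPy (constant \(\log y!\) terms are dropped, as in the expression); this is a transcription of the cost function, not PyUoI's optimizer.

import numpy as np

def poisson_objective(beta0, beta, X, y, lam, alpha):
    """Penalized average negative log-likelihood for Poisson regression, as above."""
    eta = beta0 + X @ beta                       # linear predictor (log of the mean)
    nll = -np.mean(y * eta - np.exp(eta))        # average negative log-likelihood
    penalty = lam * (alpha * np.abs(beta).sum()
                     + 0.5 * (1 - alpha) * np.square(beta).sum())
    return nll + penalty

# Tiny check on synthetic count data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
beta_true = np.array([0.8, 0.0, -0.5, 0.0])
y = rng.poisson(np.exp(0.2 + X @ beta_true))
print(poisson_objective(0.2, beta_true, X, y, lam=0.05, alpha=0.5))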

Dimensionality Reduction

Dimensionality reduction techniques can also be placed within the UoI framework. There are differences, however, from the linear models. We detail the UoI variants of CUR (column subset selection) decomposition and non-negative matrix factorization.

CUR Decomposition (Column Subset Selection)

A common dimensionality reduction technique is column subset selection (CSS), the selection of representative features from a data design matrix. Closely related to CSS is CUR matrix decomposition, where the design matrix is written as a decomposition of representative columns and rows. Here, we detail how CSS procedures (and, by extension, CUR decomposition) naturally fit into the UoI framework.

For a design matrix \(A\) (with \(N\) samples and \(M\) features), CSS is ordinarily performed by operating on the top \(K\) right singular vectors of \(A\), collected in the matrix \(V_K\). Thus, \(K\), the number of singular vectors to extract, is an initial hyperparameter of the problem. To perform CSS, we operate on the leverage scores \(\ell_i\) of each column (feature) of \(A\). The leverage score of column \(i\) is defined as the squared norm of the \(i\)th row of \(V_K\), normalized by \(K\). Column selection is performed via importance sampling, using the leverage scores, scaled by a constant \(c\), as the probabilities of selection. The constant \(c\) specifies the expected number of columns to select, and is an additional hyperparameter of the algorithm.
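
A single CSS draw can be sketched as follows (plain NumPy, synthetic data); it illustrates the leverage-score sampling described above rather than PyUoI's implementation.

import numpy as np

def css_sample(A, K, c, rng):
    """One leverage-score column-sampling draw for design matrix A (N x M)."""
    # Top-K right singular vectors of A: V_K has shape (M, K).
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V_K = Vt[:K].T
    leverage = np.sum(V_K ** 2, axis=1) / K           # leverage scores; they sum to 1
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * leverage)
    return np.flatnonzero(keep)                        # indices of the selected columns

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 12))
print(css_sample(A, K=3, c=5, rng=rng))                # roughly c columns are selected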

The UoI procedure for CSS is described in the pseudocode below. Briefly, the algorithm extracts columns which persist across resamples of the data matrix while combining columns selected across different SVD ranks. The algorithm can accept any number of ranks to unionize over, though the default is to unionize over \(k\in \left\{1, \ldots, K\right\}\) where \(K\) is some maximum rank.

def UoI_CSS(A, K, c, n_bootstraps):
    # iterate over bootstraps
    for j in range(n_bootstraps):
        Aj = Generate resample of the data matrix A
        # iterate over ranks
        for k in range(1, K + 1):
            C[j, k] = CSS(Aj, k, c)
    # intersect the selected columns across bootstraps, then unionize across ranks
    C = union over k of (intersection over j of C[j, k])
    return C
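
A runnable version of this aggregation, reusing the css_sample helper from the sketch above (both are illustrative stand-ins for PyUoI's implementation), could look like this: columns are intersected across bootstraps within each rank and then unionized across ranks.

import numpy as np

def uoi_css_sketch(A, K, c, n_bootstraps, seed=0):
    """Intersect selected columns across bootstraps per rank, then union across ranks."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    per_rank = {k: None for k in range(1, K + 1)}
    for _ in range(n_bootstraps):
        Aj = A[rng.integers(0, N, size=N)]            # bootstrap resample of the rows
        for k in per_rank:
            cols = set(css_sample(Aj, k, c, rng))
            per_rank[k] = cols if per_rank[k] is None else per_rank[k] & cols
    union = set().union(*per_rank.values())
    return np.sort(np.fromiter(union, dtype=int))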

Non-negative Matrix Factorization

A non-negative matrix factorization aims to find a parts-based decomposition of some data matrix \(A \in \mathbb{R}^{m\times n}_+\). This can be posed as a non-convex optimization problem, solving for the matrices \(W \in \mathbb{R}_+^{m\times k}\) and \(H \in \mathbb{R}_+^{k\times n}\) such that:

\[\begin{align} \min_{W\geq 0, H\geq 0} ||A - WH||_F \end{align}\]

where \(||\cdot||_F\) denotes the Frobenius norm. Here, the rows of \(H\) form a set of \(k\) basis "parts," while each row of \(W\) contains the weights with which those parts combine to reconstruct the corresponding row of \(A\). Importantly, \(W\) and \(H\) are both non-negative, so the parts found by NMF are often more interpretable.
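
As a baseline for comparison with the UoI variant, a single NMF fit of the Frobenius objective above (here via scikit-learn, on synthetic non-negative data) looks like this.

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
W_true = rng.random((100, 4))
H_true = rng.random((4, 30))
A = W_true @ H_true + 0.01 * rng.random((100, 30))   # non-negative data matrix

model = NMF(n_components=4, init="nndsvda", max_iter=500)
W = model.fit_transform(A)          # (100, 4) weights
H = model.components_               # (4, 30) parts-based bases
print(np.linalg.norm(A - W @ H))    # Frobenius reconstruction error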

There are a variety of algorithms and approaches to both choose the correct number of bases and estimate the values of \(W\) and \(H\). In the UoI framework, basis estimation and weight estimation are separated into distinct modules (similar to the linear models). NMF is fit to many bootstraps of the data matrix, using a desired approach (in PyUoI, it defaults to a symmetric KL divergence loss with multiplicative update rules). The fitted bases across bootstraps are aggregated to form the final bases, which can then be used to extract the weights.

Specifically, during basis estimation, NMF is fit to many bootstraps of the data matrix across a variety of ranks. The fitted bases tend to form clusters near the bases that would be obtained if the data were noiseless. The final rank is then chosen by evaluating a dissimilarity metric, which prefers ranks that result in tight basis clusters. The final bases are chosen by running a clustering algorithm (DBSCAN) to identify clusters of bases across bootstraps; a consensus procedure, such as the median, extracts a single basis from each cluster. Finally, given a set of bases \(H\), the weights can be determined using non-negative least squares.
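
A condensed sketch of these steps for a fixed rank is given below (scikit-learn NMF and DBSCAN plus SciPy's non-negative least squares); it omits rank selection and uses a Frobenius loss, so it only approximates PyUoI's procedure, and the DBSCAN eps value is data-dependent.

import numpy as np
from scipy.optimize import nnls
from sklearn.cluster import DBSCAN
from sklearn.decomposition import NMF

def uoi_nmf_sketch(A, k, n_bootstraps=20, eps=0.3, seed=0):
    """Cluster bootstrapped bases, take cluster medians, then fit weights with NNLS."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]

    # Fit NMF bases (rows of H) to bootstrap resamples of the data matrix.
    bases = []
    for _ in range(n_bootstraps):
        Aj = A[rng.integers(0, N, size=N)]
        H = NMF(n_components=k, init="random", max_iter=500,
                random_state=int(rng.integers(1 << 31))).fit(Aj).components_
        bases.append(H / np.linalg.norm(H, axis=1, keepdims=True))   # normalize rows
    bases = np.vstack(bases)

    # Cluster the bases; the median of each cluster is a consensus basis.
    labels = DBSCAN(eps=eps, min_samples=n_bootstraps // 2).fit_predict(bases)
    H_final = np.vstack([np.median(bases[labels == c], axis=0)
                         for c in set(labels) if c != -1])

    # Non-negative least squares for the weights, one row of A at a time.
    W = np.vstack([nnls(H_final.T, a)[0] for a in A])
    return W, H_final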

The above procedure is described in the following pseudocode:

def UoI_NMF(A, ranks, n_bootstraps):
    # iterate over ranks and bootstraps
    for k in ranks:
        for i in range(n_bootstraps):
            Ai = Generate bootstrapped resample of the data matrix A
            H_i^k, W_i^k = NMF(Ai, k)

    # choose the rank whose bases are most consistent across bootstraps
    Compute diss(H_i^k, H_j^k) and Gamma(k), and choose the best rank k_hat
    Let Hk denote the bases fitted at rank k_hat

    # cluster the best set of bases
    Cluster the rows of Hk using DBSCAN
    Set the centers (e.g., medians) of the clusters as the final bases H

    # fit W
    Fit W using non-negative least squares between A and H

    return W, H

Above, the dissimilarity between two sets of bases from different data matrices \(H, H'\) with rank \(k\) is given by

\[\begin{align} \text{diss}(H, H') = 1 - \frac{1}{2k} \left( \sum_{j=1}^k \text{max}_i C_{ij} + \sum_{i=1}^k \text{max}_j C_{ij} \right) \end{align}\]

where \(C_{ij}\) is the cross-correlation between the \(i\)th basis of \(H\) and the \(j\)th basis of \(H'\). The discrepancy \(\Gamma(k)\), which aggregates dissimilarities across pairs of bootstraps, is given by

\[\begin{align} \Gamma(k) = \sum_{1 \leq i \leq j \leq N_B} \text{diss}(H_i, H_j) \end{align}\]

where \(N_B\) is the number of bootstraps.
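
Both quantities are straightforward to compute. In the NumPy sketch below, the cross-correlation block is taken from np.corrcoef, and summing over \(i < j\) is equivalent to the definition above because a basis set has zero dissimilarity with itself.

import numpy as np

def dissimilarity(H1, H2):
    """diss(H, H') for two rank-k basis sets (rows of H1 and H2 are the bases)."""
    k = H1.shape[0]
    C = np.corrcoef(H1, H2)[:k, k:]    # cross-correlation between bases of H1 and H2
    return 1 - (C.max(axis=0).sum() + C.max(axis=1).sum()) / (2 * k)

def discrepancy(H_list):
    """Gamma(k): sum of dissimilarities over pairs of bootstrapped basis sets."""
    return sum(dissimilarity(H_list[i], H_list[j])
               for i in range(len(H_list)) for j in range(i + 1, len(H_list)))

rng = np.random.default_rng(0)
H_list = [rng.random((4, 30)) for _ in range(5)]
print(discrepancy(H_list))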