Distributions & Conditioning: conventions and equations

This page documents the mathematical conventions used throughout cgmm: the parameter names, the probability density functions, and the closed-form marginalization and conditioning rules for each distribution family. The code objects (MultivariateNormal, MultivariateStudentT, MixtureDistribution, GaussianMixtureDistribution) implement exactly these formulas, so this is the reference for why a returned object has the parameters it has.

Notation

We split the variables of a \(d\)-dimensional distribution into two blocks: a conditioning block \(X_2\) (the variables we observe, selected by cond_idx) and the target block \(X_1\) (the complement). Let \(q = \dim(X_2)\).

For a scale/covariance matrix \(S\) we write the corresponding block partition

\[\begin{split} S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}, \qquad S_{21} = S_{12}^{\mathsf T}, \end{split}\]

where index 1 is the target block and index 2 is the conditioning block. Three quantities recur in every family:

\[ \textbf{affine mean:}\quad \mu_{1\mid 2} = \mu_1 + S_{12} S_{22}^{-1} (x_2 - \mu_2) \]
\[ \textbf{Schur complement:}\quad S_{11\cdot 2} = S_{11} - S_{12} S_{22}^{-1} S_{21} \]
\[ \textbf{Mahalanobis term:}\quad d^2 = (x_2 - \mu_2)^{\mathsf T} S_{22}^{-1} (x_2 - \mu_2) \]

Note

In the API, cond_idx selects the conditioning block \(X_2\); the target block \(X_1\) is the complement, returned in increasing index order. condition(cond_idx, x_cond) returns a distribution of the same family over \(X_1\).

Multivariate Normal

Object: MultivariateNormal(mean, cov). Parameters: mean \(=\mu\) (shape (d,)) and cov \(=\Sigma\) (shape (d, d), symmetric positive-definite). For the Gaussian family the scale matrix equals the covariance, so scale_ is an alias of cov().

Density

\[ f(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\!\Big(-\tfrac12 (x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\Big). \]

Marginal (marginal(idx)): a sub-block is Gaussian with the sub-vector of \(\mu\) and the sub-matrix of \(\Sigma\).

Conditional (condition(cond_idx, x_cond)):

\[ X_1 \mid X_2 = x_2 \;\sim\; \mathcal N\big(\mu_{1\mid 2},\; S_{11\cdot 2}\big). \]

The conditional covariance does not depend on \(x_2\) (homoskedastic) - this is the special property the heavy-tailed families relax.

Multivariate Student-t

Object: MultivariateStudentT(loc, scale, df). Parameters follow scipy.stats.multivariate_t: location loc \(=\mu\), scale (dispersion) matrix scale \(=\Sigma\), and degrees of freedom df \(=\nu > 0\).

Important

scale is the dispersion matrix, not the covariance. They differ by a factor depending on \(\nu\) (see moments below), which is why scale_ and cov() are exposed separately.

Density in dimension \(p\):

\[ f(x) = \frac{\Gamma\!\big(\tfrac{\nu+p}{2}\big)} {\Gamma\!\big(\tfrac{\nu}{2}\big)\,(\nu\pi)^{p/2}\,|\Sigma|^{1/2}} \left[1 + \tfrac{1}{\nu}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right]^{-(\nu+p)/2}. \]

Moments

\[ \mathbb E[X] = \mu \;\;(\nu > 1), \qquad \operatorname{Cov}[X] = \frac{\nu}{\nu-2}\,\Sigma \;\;(\nu > 2). \]

As \(\nu \to \infty\) the Student-t converges to \(\mathcal N(\mu, \Sigma)\) and \(\operatorname{Cov}[X] \to \Sigma\), recovering the Gaussian.

Marginal (marginal(idx)): same \(\nu\), sub-location and sub-scale.

Conditional (condition(cond_idx, x_cond)):

\[ X_1 \mid X_2 = x_2 \;\sim\; t_{\nu + q}\!\left(\; \mu_{1\mid 2},\; \frac{\nu + d^2}{\nu + q}\, S_{11\cdot 2} \;\right). \]

Two things change relative to the Gaussian, and both are data-dependent:

  • the degrees of freedom rise to \(\nu + q\) (thinner conditional tails), and

  • the scale is the Schur complement inflated by the factor \((\nu + d^2)/(\nu + q)\), which grows with the Mahalanobis distance \(d^2\) of the observed \(x_2\) - i.e. conditioning on an outlier widens the predictive distribution (conditional heteroskedasticity).

Letting \(\nu \to \infty\) sends the inflation factor to \(1\) and the degrees of freedom to infinity, so the Student-t conditional reduces to the Gaussian one.

Multivariate Generalized Hyperbolic

Object: MultivariateGH(lambda_, chi, psi, loc, scale, gamma). The GH is a normal mean-variance mixture (NMVM): with a Generalized Inverse Gaussian mixing scalar \(W \sim \mathrm{GIG}(\lambda, \chi, \psi)\),

\[ X = \mu + W\,\gamma + \sqrt{W}\, A Z, \qquad A A^{\mathsf T} = \Sigma,\;\; Z \sim \mathcal N(0, I). \]

Parameters: GIG index \(\lambda\) (lambda_), GIG parameters \(\chi, \psi \ge 0\) (not both zero), location \(\mu\) (loc), dispersion \(\Sigma\) (scale), and the skewness vector \(\gamma\) (gamma); \(\gamma = 0\) is symmetric. The family subsumes the Gaussian, Student-t, variance-gamma, NIG and hyperbolic laws.

Density in dimension \(d\), with \(Q(x) = (x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\) and \(p = \gamma^{\mathsf T}\Sigma^{-1}\gamma\):

\[ f(x) = c\; \frac{K_{\lambda - d/2}\!\big(\sqrt{(\chi + Q)(\psi + p)}\big)\, e^{(x-\mu)^{\mathsf T}\Sigma^{-1}\gamma}} {\big(\sqrt{(\chi + Q)(\psi + p)}\big)^{d/2 - \lambda}}, \qquad c = \frac{(\psi/\chi)^{\lambda/2}\,(\psi + p)^{d/2 - \lambda}} {(2\pi)^{d/2}\,|\Sigma|^{1/2}\,K_\lambda(\sqrt{\chi\psi})}, \]

where \(K\) is the modified Bessel function of the second kind (evaluated stably via the exponentially-scaled scipy.special.kve).

Moments use the GIG moments of \(W\) (Bessel-K ratios at \(\eta=\sqrt{\chi\psi}\)):

\[ \mathbb E[X] = \mu + \mathbb E[W]\,\gamma, \qquad \operatorname{Cov}[X] = \mathbb E[W]\,\Sigma + \operatorname{Var}[W]\,\gamma\gamma^{\mathsf T}, \]

so cov() differs from scale both by the scalar \(\mathbb E[W]\) and by a rank-contribution along \(\gamma\).

Marginal (marginal(idx)): keeps \((\lambda, \chi, \psi)\) and takes the sub-vectors of \(\mu, \gamma\) and the sub-matrix of \(\Sigma\).

Conditional (condition(cond_idx, x_cond)): the GIG is conjugate to conditioning, so \(X_1 \mid X_2 = x_2\) is again GH with

\[ \lambda^\* = \lambda - q/2,\qquad \chi^\* = \chi + d^2,\qquad \psi^\* = \psi + \gamma_2^{\mathsf T} S_{22}^{-1}\gamma_2, \]
\[ \mu^\* = \mu_{1\mid 2},\qquad \Sigma^\* = S_{11\cdot 2},\qquad \gamma^\* = \gamma_1 - S_{12} S_{22}^{-1}\gamma_2 . \]

These same formulas specialize to every sub-family. In particular the symmetric Student-t is the GH point \(\gamma = 0,\ \psi = 0,\ \lambda = -\nu/2,\ \chi = \nu\); applying the rules above gives \(\lambda^\* = -(\nu+q)/2\) and \(\chi^\* = \nu + d^2\), which is exactly the Student-t conditional \(t_{\nu+q}(\mu_{1\mid 2}, \frac{\nu+d^2}{\nu+q} S_{11\cdot 2})\) from the section above.

Important

\(\psi = 0\) is a degenerate boundary of the GIG (it collapses to an inverse-gamma), where the generic Bessel evaluations break down. MultivariateGH handles the symmetric-t boundary (\(\psi=0,\ \gamma=0\)) with a dedicated branch that delegates to MultivariateStudentT; the skew-t boundary (\(\psi=0,\ \gamma\neq 0\)) is not yet supported - use a small \(\psi>0\).

Mixtures

Objects: MixtureDistribution(weights, components) (family-agnostic) and GaussianMixtureDistribution (a back-compatible subclass of scikit-learn's GaussianMixture). A mixture has weights \(w_k \ge 0\), \(\sum_k w_k = 1\), and components \(f_k\) of one family:

\[ f(x) = \sum_{k} w_k\, f_k(x). \]

Conditioning reweights the components by how well each one explains the observed \(x_2\), then conditions each component. With \(f_k^{(2)}\) the marginal of component \(k\) over the conditioning block,

\[ w_k' \;\propto\; w_k\, f_k^{(2)}(x_2), \qquad \sum_k w_k' = 1, \]
\[ X_1 \mid X_2 = x_2 \;\sim\; \sum_k w_k'\, \big[\,f_k \mid X_2 = x_2\,\big], \]

where each conditioned component \(f_k \mid X_2 = x_2\) follows the per-family rule above. This single reweight-then-condition scheme is what lets the same conditioner serve Gaussian and Student-t (and, later, Generalized Hyperbolic) mixtures unchanged.

Note

MixtureConditioner (aliased as GMMConditioner) implements this scheme. A fitted sklearn GaussianMixture takes a cached fast path and returns a GaussianMixtureDistribution; a MixtureDistribution or any estimator exposing to_mixture_distribution() (such as StudentTMixture) takes the generic path and returns a MixtureDistribution whose components remain in the original family.

Estimation - EM for the Student-t mixture

StudentTMixture fits weights \(\{w_k\}\), locations \(\{\mu_k\}\), scales \(\{\Sigma_k\}\) and degrees of freedom \(\{\nu_k\}\) by maximum likelihood with the EM algorithm (McLachlan & Peel; ECME for \(\nu\)). It subclasses scikit-learn's BaseMixture, so n_init, tol, max_iter, k-means initialization and convergence bookkeeping are inherited; the parameter dof selects whether \(\nu\) is "free" (one per component), "shared" (a single \(\nu\)) or a fixed float.

Latent-variable form

The Student-t is a scale mixture of normals. Introduce, per observation \(x_i\) and component \(k\), a latent positive scalar \(W_{ik}\):

\[ x_i \mid (z_i = k,\, W_{ik}) \sim \mathcal N(\mu_k,\; W_{ik}\,\Sigma_k), \qquad W_{ik} \sim \text{InverseGamma}\!\big(\tfrac{\nu_k}{2}, \tfrac{\nu_k}{2}\big), \]

with \(z_i\) the component label. Marginalizing \(W_{ik}\) recovers \(x_i \mid z_i = k \sim t_{\nu_k}(\mu_k, \Sigma_k)\). Working with the precision \(u = 1/W\) (which is Gamma-distributed) makes the updates closed-form.

E-step

With \(p\) the dimension and \(\delta_{ik} = (x_i-\mu_k)^{\mathsf T}\Sigma_k^{-1}(x_i-\mu_k)\) the squared Mahalanobis distance, the E-step computes three quantities at the current parameters:

\[ \tau_{ik} = \frac{w_k\, t_{\nu_k}(x_i \mid \mu_k, \Sigma_k)} {\sum_j w_j\, t_{\nu_j}(x_i \mid \mu_j, \Sigma_j)} \qquad\text{(responsibility)} \]
\[ u_{ik} \equiv \mathbb E[\,1/W_{ik} \mid x_i, z_i=k\,] = \frac{\nu_k + p}{\nu_k + \delta_{ik}} \]
\[ \mathbb E[\log u_{ik} \mid x_i, z_i=k] = \psi\!\Big(\tfrac{\nu_k + p}{2}\Big) - \log\!\Big(\tfrac{\nu_k + \delta_{ik}}{2}\Big), \]

where \(\psi\) is the digamma function. Note \(u_{ik}\) down-weights outliers (large \(\delta_{ik}\)) - the mechanism behind the t's robustness.

M-step

Let \(n_k = \sum_i \tau_{ik}\). The weights and the weighted-Gaussian location/scale updates are

\[ w_k = \frac{n_k}{n}, \qquad \mu_k = \frac{\sum_i \tau_{ik}\, u_{ik}\, x_i}{\sum_i \tau_{ik}\, u_{ik}}, \qquad \Sigma_k = \frac{1}{n_k}\sum_i \tau_{ik}\, u_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\mathsf T}. \]

(reg_covar is added to the diagonal of \(\Sigma_k\) for numerical stability.) The degrees of freedom \(\nu_k\) are updated by solving the 1-D equation

\[ \log\!\Big(\tfrac{\nu_k}{2}\Big) - \psi\!\Big(\tfrac{\nu_k}{2}\Big) + 1 + \frac{1}{n_k}\sum_i \tau_{ik}\big(\mathbb E[\log u_{ik}] - u_{ik}\big) = 0 \]

for \(\nu_k\) (Brent's method). The left side is strictly decreasing in \(\nu_k\), so the root is unique when it exists; when the data are effectively Gaussian no finite root exists and \(\nu_k\) is driven to a large cap (the \(\nu \to \infty\) limit). For dof="shared" the same equation is solved once with the sums pooled over all components; for a fixed float dof the step is skipped.

Note

Because the \(\nu\) update uses \(u_{ik}\) and \(\mathbb E[\log u_{ik}]\) evaluated at the old parameters, the procedure is a genuine EM and the observed-data log-likelihood is non-decreasing every iteration. The per-iteration values are recorded in lower_bounds_.

Initialization

init_params controls the starting responsibilities: the scikit-learn options ("kmeans", "k-means++", "random", "random_from_data") or "gmm", which warm-starts from a fitted GaussianMixture and tends to converge fastest. The degrees of freedom start large (near-Gaussian) and are pulled down by the data.

Top-level regressors

Two ways to model \(p(y \mid X)\) with these families:

  • Joint + condition - ConditionalMixtureRegressor(family=...) fits a joint mixture over \([X, y]\) and conditions. family="gaussian" is the classic conditional GMM (also available as the back-compatible ConditionalGMMRegressor preset); family="student_t" is exposed as ConditionalStudentTRegressor. The Student-t conditional is heteroskedastic, so predict_cov returns per-sample covariances (requires conditional \(\nu + q > 2\)).

  • Mixture of experts - MixtureOfExpertsRegressor(expert=...) keeps the softmax gating and affine expert means; expert="student_t" swaps the Gaussian emission for a Student-t. Its EM reuses the same latent-scale weights \(u_{ik}\) as above: each expert's weighted least squares and scale update are reweighted by \(\tau_{ik} u_{ik}\) (iteratively-reweighted least squares - outliers are down-weighted), and per-expert \(\nu_k\) solves the same digamma equation. Here the scale \(\Sigma_k\) is constant in \(x\) (only the gating and mean depend on \(x\)), unlike the joint-conditioning route above.

The Generalized Hyperbolic family is fitted by GeneralizedHyperbolicMixture and exposed for regression as ConditionalGHRegressor (also family="gh").

Estimation - EM for the GH mixture (MCECM)

The GH mixture is fitted by an EM that augments the data with both the component label and the latent GIG scale \(W\). Writing \(Q_{ik} = (x_i - \mu_k)^{\mathsf T} \Sigma_k^{-1}(x_i - \mu_k)\) and \(p_k = \gamma_k^{\mathsf T}\Sigma_k^{-1}\gamma_k\), the posterior of the mixing scalar is again GIG,

\[ W \mid (x_i, z_i = k) \sim \mathrm{GIG}\big(\lambda_k - \tfrac d2,\; \chi_k + Q_{ik},\; \psi_k + p_k\big), \]

so the E-step computes the responsibilities \(\tau_{ik}\) together with \(a_{ik}=\mathbb E[W]\), \(b_{ik}=\mathbb E[1/W]\) and \(c_{ik}=\mathbb E[\log W]\) from that GIG (Bessel-K ratios; \(\mathbb E[\log W]\) by differentiating \(\log K_\lambda\)). With \(n_k = \sum_i \tau_{ik}\) and bars denoting \(\tau\)-weighted averages, the M-step has closed forms

\[ \gamma_k = \frac{\overline{b x} - \bar b\,\bar x}{1 - \bar a\,\bar b}, \qquad \mu_k = \bar x - \bar a\,\gamma_k, \qquad \Sigma_k = \frac1{n_k}\sum_i \tau_{ik}\, b_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\mathsf T} - \bar a\,\gamma_k\gamma_k^{\mathsf T}, \]

and the GIG parameters \((\lambda_k, \chi_k, \psi_k)\) are updated by maximizing the GIG expected complete-data log-likelihood under the identifiability constraint \(\mathbb E[W]=1\) (the GH parameterization is otherwise redundant: \(W\) can be rescaled against \(\Sigma,\gamma\)). Imposing \(\mathbb E[W]=1\) reduces the GIG update to a 2-D maximization over \((\lambda, \omega)\) with \(\omega=\sqrt{\chi\psi}\), \(\chi=\omega/R_\lambda(\omega)\), \(\psi=\omega R_\lambda(\omega)\), and \(R_\lambda(\omega)=K_{\lambda+1}(\omega)/K_\lambda(\omega)\). The observed-data log-likelihood is non-decreasing each iteration (recorded in lower_bounds_).