Distributions & Conditioning: conventions and equations¶
This page documents the mathematical conventions used throughout cgmm: the
parameter names, the probability density functions, and the closed-form
marginalization and conditioning rules for each distribution family. The code
objects (MultivariateNormal, MultivariateStudentT,
MixtureDistribution, GaussianMixtureDistribution) implement exactly these
formulas, so this is the reference for why a returned object has the parameters
it has.
Notation¶
We split the variables of a \(d\)-dimensional distribution into two blocks: a
conditioning block \(X_2\) (the variables we observe, selected by cond_idx)
and the target block \(X_1\) (the complement). Let \(q = \dim(X_2)\).
For a scale/covariance matrix \(S\) we write the corresponding block partition
where index 1 is the target block and index 2 is the conditioning block.
Three quantities recur in every family:
Note
In the API, cond_idx selects the conditioning block \(X_2\); the target block
\(X_1\) is the complement, returned in increasing index order. condition(cond_idx, x_cond) returns a distribution of the same family over \(X_1\).
Multivariate Normal¶
Object: MultivariateNormal(mean, cov). Parameters: mean \(=\mu\) (shape (d,))
and cov \(=\Sigma\) (shape (d, d), symmetric positive-definite). For the
Gaussian family the scale matrix equals the covariance, so scale_ is an alias
of cov().
Density
Marginal (marginal(idx)): a sub-block is Gaussian with the sub-vector of
\(\mu\) and the sub-matrix of \(\Sigma\).
Conditional (condition(cond_idx, x_cond)):
The conditional covariance does not depend on \(x_2\) (homoskedastic) - this is the special property the heavy-tailed families relax.
Multivariate Student-t¶
Object: MultivariateStudentT(loc, scale, df). Parameters follow
scipy.stats.multivariate_t: location loc \(=\mu\), scale (dispersion) matrix
scale \(=\Sigma\), and degrees of freedom df \(=\nu > 0\).
Important
scale is the dispersion matrix, not the covariance. They differ by a
factor depending on \(\nu\) (see moments below), which is why scale_ and cov()
are exposed separately.
Density in dimension \(p\):
Moments
As \(\nu \to \infty\) the Student-t converges to \(\mathcal N(\mu, \Sigma)\) and \(\operatorname{Cov}[X] \to \Sigma\), recovering the Gaussian.
Marginal (marginal(idx)): same \(\nu\), sub-location and sub-scale.
Conditional (condition(cond_idx, x_cond)):
Two things change relative to the Gaussian, and both are data-dependent:
the degrees of freedom rise to \(\nu + q\) (thinner conditional tails), and
the scale is the Schur complement inflated by the factor \((\nu + d^2)/(\nu + q)\), which grows with the Mahalanobis distance \(d^2\) of the observed \(x_2\) - i.e. conditioning on an outlier widens the predictive distribution (conditional heteroskedasticity).
Letting \(\nu \to \infty\) sends the inflation factor to \(1\) and the degrees of freedom to infinity, so the Student-t conditional reduces to the Gaussian one.
Multivariate Generalized Hyperbolic¶
Object: MultivariateGH(lambda_, chi, psi, loc, scale, gamma). The GH is a normal
mean-variance mixture (NMVM): with a Generalized Inverse Gaussian mixing scalar
\(W \sim \mathrm{GIG}(\lambda, \chi, \psi)\),
Parameters: GIG index \(\lambda\) (lambda_), GIG parameters \(\chi, \psi \ge 0\)
(not both zero), location \(\mu\) (loc), dispersion \(\Sigma\) (scale), and the
skewness vector \(\gamma\) (gamma); \(\gamma = 0\) is symmetric. The family
subsumes the Gaussian, Student-t, variance-gamma, NIG and hyperbolic laws.
Density in dimension \(d\), with \(Q(x) = (x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\) and \(p = \gamma^{\mathsf T}\Sigma^{-1}\gamma\):
where \(K\) is the modified Bessel function of the second kind (evaluated stably via
the exponentially-scaled scipy.special.kve).
Moments use the GIG moments of \(W\) (Bessel-K ratios at \(\eta=\sqrt{\chi\psi}\)):
so cov() differs from scale both by the scalar \(\mathbb E[W]\) and by a
rank-contribution along \(\gamma\).
Marginal (marginal(idx)): keeps \((\lambda, \chi, \psi)\) and takes the
sub-vectors of \(\mu, \gamma\) and the sub-matrix of \(\Sigma\).
Conditional (condition(cond_idx, x_cond)): the GIG is conjugate to
conditioning, so \(X_1 \mid X_2 = x_2\) is again GH with
These same formulas specialize to every sub-family. In particular the symmetric Student-t is the GH point \(\gamma = 0,\ \psi = 0,\ \lambda = -\nu/2,\ \chi = \nu\); applying the rules above gives \(\lambda^\* = -(\nu+q)/2\) and \(\chi^\* = \nu + d^2\), which is exactly the Student-t conditional \(t_{\nu+q}(\mu_{1\mid 2}, \frac{\nu+d^2}{\nu+q} S_{11\cdot 2})\) from the section above.
Important
\(\psi = 0\) is a degenerate boundary of the GIG (it collapses to an
inverse-gamma), where the generic Bessel evaluations break down. MultivariateGH
handles the symmetric-t boundary (\(\psi=0,\ \gamma=0\)) with a dedicated branch that
delegates to MultivariateStudentT; the skew-t boundary (\(\psi=0,\ \gamma\neq 0\))
is not yet supported - use a small \(\psi>0\).
Mixtures¶
Objects: MixtureDistribution(weights, components) (family-agnostic) and
GaussianMixtureDistribution (a back-compatible subclass of scikit-learn's
GaussianMixture). A mixture has weights \(w_k \ge 0\), \(\sum_k w_k = 1\), and
components \(f_k\) of one family:
Conditioning reweights the components by how well each one explains the observed \(x_2\), then conditions each component. With \(f_k^{(2)}\) the marginal of component \(k\) over the conditioning block,
where each conditioned component \(f_k \mid X_2 = x_2\) follows the per-family rule above. This single reweight-then-condition scheme is what lets the same conditioner serve Gaussian and Student-t (and, later, Generalized Hyperbolic) mixtures unchanged.
Note
MixtureConditioner (aliased as GMMConditioner) implements this scheme.
A fitted sklearn GaussianMixture takes a cached fast path and returns a
GaussianMixtureDistribution; a MixtureDistribution or any estimator exposing
to_mixture_distribution() (such as StudentTMixture) takes the generic path and
returns a MixtureDistribution whose components remain in the original family.
Estimation - EM for the Student-t mixture¶
StudentTMixture fits weights \(\{w_k\}\), locations \(\{\mu_k\}\), scales
\(\{\Sigma_k\}\) and degrees of freedom \(\{\nu_k\}\) by maximum likelihood with the
EM algorithm (McLachlan & Peel; ECME for \(\nu\)). It subclasses scikit-learn's
BaseMixture, so n_init, tol, max_iter, k-means initialization and
convergence bookkeeping are inherited; the parameter dof selects whether \(\nu\)
is "free" (one per component), "shared" (a single \(\nu\)) or a fixed float.
Latent-variable form¶
The Student-t is a scale mixture of normals. Introduce, per observation \(x_i\) and component \(k\), a latent positive scalar \(W_{ik}\):
with \(z_i\) the component label. Marginalizing \(W_{ik}\) recovers \(x_i \mid z_i = k \sim t_{\nu_k}(\mu_k, \Sigma_k)\). Working with the precision \(u = 1/W\) (which is Gamma-distributed) makes the updates closed-form.
E-step¶
With \(p\) the dimension and \(\delta_{ik} = (x_i-\mu_k)^{\mathsf T}\Sigma_k^{-1}(x_i-\mu_k)\) the squared Mahalanobis distance, the E-step computes three quantities at the current parameters:
where \(\psi\) is the digamma function. Note \(u_{ik}\) down-weights outliers (large \(\delta_{ik}\)) - the mechanism behind the t's robustness.
M-step¶
Let \(n_k = \sum_i \tau_{ik}\). The weights and the weighted-Gaussian location/scale updates are
(reg_covar is added to the diagonal of \(\Sigma_k\) for numerical stability.)
The degrees of freedom \(\nu_k\) are updated by solving the 1-D equation
for \(\nu_k\) (Brent's method). The left side is strictly decreasing in \(\nu_k\), so
the root is unique when it exists; when the data are effectively Gaussian no
finite root exists and \(\nu_k\) is driven to a large cap (the \(\nu \to \infty\)
limit). For dof="shared" the same equation is solved once with the sums pooled
over all components; for a fixed float dof the step is skipped.
Note
Because the \(\nu\) update uses \(u_{ik}\) and \(\mathbb E[\log u_{ik}]\) evaluated at
the old parameters, the procedure is a genuine EM and the observed-data
log-likelihood is non-decreasing every iteration. The per-iteration values are
recorded in lower_bounds_.
Initialization¶
init_params controls the starting responsibilities: the scikit-learn options
("kmeans", "k-means++", "random", "random_from_data") or "gmm", which
warm-starts from a fitted GaussianMixture and tends to converge fastest. The
degrees of freedom start large (near-Gaussian) and are pulled down by the data.
Top-level regressors¶
Two ways to model \(p(y \mid X)\) with these families:
Joint + condition -
ConditionalMixtureRegressor(family=...)fits a joint mixture over \([X, y]\) and conditions.family="gaussian"is the classic conditional GMM (also available as the back-compatibleConditionalGMMRegressorpreset);family="student_t"is exposed asConditionalStudentTRegressor. The Student-t conditional is heteroskedastic, sopredict_covreturns per-sample covariances (requires conditional \(\nu + q > 2\)).Mixture of experts -
MixtureOfExpertsRegressor(expert=...)keeps the softmax gating and affine expert means;expert="student_t"swaps the Gaussian emission for a Student-t. Its EM reuses the same latent-scale weights \(u_{ik}\) as above: each expert's weighted least squares and scale update are reweighted by \(\tau_{ik} u_{ik}\) (iteratively-reweighted least squares - outliers are down-weighted), and per-expert \(\nu_k\) solves the same digamma equation. Here the scale \(\Sigma_k\) is constant in \(x\) (only the gating and mean depend on \(x\)), unlike the joint-conditioning route above.
The Generalized Hyperbolic family is fitted by GeneralizedHyperbolicMixture and
exposed for regression as ConditionalGHRegressor (also family="gh").
Estimation - EM for the GH mixture (MCECM)¶
The GH mixture is fitted by an EM that augments the data with both the component label and the latent GIG scale \(W\). Writing \(Q_{ik} = (x_i - \mu_k)^{\mathsf T} \Sigma_k^{-1}(x_i - \mu_k)\) and \(p_k = \gamma_k^{\mathsf T}\Sigma_k^{-1}\gamma_k\), the posterior of the mixing scalar is again GIG,
so the E-step computes the responsibilities \(\tau_{ik}\) together with \(a_{ik}=\mathbb E[W]\), \(b_{ik}=\mathbb E[1/W]\) and \(c_{ik}=\mathbb E[\log W]\) from that GIG (Bessel-K ratios; \(\mathbb E[\log W]\) by differentiating \(\log K_\lambda\)). With \(n_k = \sum_i \tau_{ik}\) and bars denoting \(\tau\)-weighted averages, the M-step has closed forms
and the GIG parameters \((\lambda_k, \chi_k, \psi_k)\) are updated by maximizing the
GIG expected complete-data log-likelihood under the identifiability constraint
\(\mathbb E[W]=1\) (the GH parameterization is otherwise redundant: \(W\) can be
rescaled against \(\Sigma,\gamma\)). Imposing \(\mathbb E[W]=1\) reduces the GIG update
to a 2-D maximization over \((\lambda, \omega)\) with \(\omega=\sqrt{\chi\psi}\),
\(\chi=\omega/R_\lambda(\omega)\), \(\psi=\omega R_\lambda(\omega)\), and
\(R_\lambda(\omega)=K_{\lambda+1}(\omega)/K_\lambda(\omega)\). The observed-data
log-likelihood is non-decreasing each iteration (recorded in lower_bounds_).