# Distributions & Conditioning: conventions and equations This page documents the mathematical conventions used throughout `cgmm`: the parameter names, the probability density functions, and the closed-form marginalization and conditioning rules for each distribution family. The code objects ([`MultivariateNormal`](api.md), `MultivariateStudentT`, `MixtureDistribution`, `GaussianMixtureDistribution`) implement exactly these formulas, so this is the reference for *why a returned object has the parameters it has*. ## Notation We split the variables of a $d$-dimensional distribution into two blocks: a **conditioning** block $X_2$ (the variables we observe, selected by `cond_idx`) and the **target** block $X_1$ (the complement). Let $q = \dim(X_2)$. For a scale/covariance matrix $S$ we write the corresponding block partition $$ S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}, \qquad S_{21} = S_{12}^{\mathsf T}, $$ where index `1` is the target block and index `2` is the conditioning block. Three quantities recur in every family: $$ \textbf{affine mean:}\quad \mu_{1\mid 2} = \mu_1 + S_{12} S_{22}^{-1} (x_2 - \mu_2) $$ $$ \textbf{Schur complement:}\quad S_{11\cdot 2} = S_{11} - S_{12} S_{22}^{-1} S_{21} $$ $$ \textbf{Mahalanobis term:}\quad d^2 = (x_2 - \mu_2)^{\mathsf T} S_{22}^{-1} (x_2 - \mu_2) $$ ```{note} In the API, `cond_idx` selects the conditioning block $X_2$; the target block $X_1$ is the complement, returned in increasing index order. `condition(cond_idx, x_cond)` returns a distribution of the **same family** over $X_1$. ``` ## Multivariate Normal Object: `MultivariateNormal(mean, cov)`. Parameters: `mean` $=\mu$ (shape `(d,)`) and `cov` $=\Sigma$ (shape `(d, d)`, symmetric positive-definite). For the Gaussian family the scale matrix *equals* the covariance, so `scale_` is an alias of `cov()`. **Density** $$ f(x) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2} \exp\!\Big(-\tfrac12 (x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\Big). $$ **Marginal** (`marginal(idx)`): a sub-block is Gaussian with the sub-vector of $\mu$ and the sub-matrix of $\Sigma$. **Conditional** (`condition(cond_idx, x_cond)`): $$ X_1 \mid X_2 = x_2 \;\sim\; \mathcal N\big(\mu_{1\mid 2},\; S_{11\cdot 2}\big). $$ The conditional covariance does **not** depend on $x_2$ (homoskedastic) - this is the special property the heavy-tailed families relax. ## Multivariate Student-t Object: `MultivariateStudentT(loc, scale, df)`. Parameters follow `scipy.stats.multivariate_t`: location `loc` $=\mu$, scale (dispersion) matrix `scale` $=\Sigma$, and degrees of freedom `df` $=\nu > 0$. ```{important} `scale` is the **dispersion matrix, not the covariance**. They differ by a factor depending on $\nu$ (see moments below), which is why `scale_` and `cov()` are exposed separately. ``` **Density** in dimension $p$: $$ f(x) = \frac{\Gamma\!\big(\tfrac{\nu+p}{2}\big)} {\Gamma\!\big(\tfrac{\nu}{2}\big)\,(\nu\pi)^{p/2}\,|\Sigma|^{1/2}} \left[1 + \tfrac{1}{\nu}(x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)\right]^{-(\nu+p)/2}. $$ **Moments** $$ \mathbb E[X] = \mu \;\;(\nu > 1), \qquad \operatorname{Cov}[X] = \frac{\nu}{\nu-2}\,\Sigma \;\;(\nu > 2). $$ As $\nu \to \infty$ the Student-t converges to $\mathcal N(\mu, \Sigma)$ and $\operatorname{Cov}[X] \to \Sigma$, recovering the Gaussian. **Marginal** (`marginal(idx)`): same $\nu$, sub-location and sub-scale. **Conditional** (`condition(cond_idx, x_cond)`): $$ X_1 \mid X_2 = x_2 \;\sim\; t_{\nu + q}\!\left(\; \mu_{1\mid 2},\; \frac{\nu + d^2}{\nu + q}\, S_{11\cdot 2} \;\right). $$ Two things change relative to the Gaussian, and both are data-dependent: - the degrees of freedom rise to $\nu + q$ (thinner conditional tails), and - the scale is the Schur complement inflated by the factor $(\nu + d^2)/(\nu + q)$, which **grows with the Mahalanobis distance** $d^2$ of the observed $x_2$ - i.e. conditioning on an outlier widens the predictive distribution (conditional heteroskedasticity). Letting $\nu \to \infty$ sends the inflation factor to $1$ and the degrees of freedom to infinity, so the Student-t conditional reduces to the Gaussian one. ## Multivariate Generalized Hyperbolic Object: `MultivariateGH(lambda_, chi, psi, loc, scale, gamma)`. The GH is a normal mean-variance mixture (NMVM): with a Generalized Inverse Gaussian mixing scalar $W \sim \mathrm{GIG}(\lambda, \chi, \psi)$, $$ X = \mu + W\,\gamma + \sqrt{W}\, A Z, \qquad A A^{\mathsf T} = \Sigma,\;\; Z \sim \mathcal N(0, I). $$ Parameters: GIG index $\lambda$ (`lambda_`), GIG parameters $\chi, \psi \ge 0$ (not both zero), location $\mu$ (`loc`), dispersion $\Sigma$ (`scale`), and the **skewness vector** $\gamma$ (`gamma`); $\gamma = 0$ is symmetric. The family subsumes the Gaussian, Student-t, variance-gamma, NIG and hyperbolic laws. **Density** in dimension $d$, with $Q(x) = (x-\mu)^{\mathsf T}\Sigma^{-1}(x-\mu)$ and $p = \gamma^{\mathsf T}\Sigma^{-1}\gamma$: $$ f(x) = c\; \frac{K_{\lambda - d/2}\!\big(\sqrt{(\chi + Q)(\psi + p)}\big)\, e^{(x-\mu)^{\mathsf T}\Sigma^{-1}\gamma}} {\big(\sqrt{(\chi + Q)(\psi + p)}\big)^{d/2 - \lambda}}, \qquad c = \frac{(\psi/\chi)^{\lambda/2}\,(\psi + p)^{d/2 - \lambda}} {(2\pi)^{d/2}\,|\Sigma|^{1/2}\,K_\lambda(\sqrt{\chi\psi})}, $$ where $K$ is the modified Bessel function of the second kind (evaluated stably via the exponentially-scaled `scipy.special.kve`). **Moments** use the GIG moments of $W$ (Bessel-K ratios at $\eta=\sqrt{\chi\psi}$): $$ \mathbb E[X] = \mu + \mathbb E[W]\,\gamma, \qquad \operatorname{Cov}[X] = \mathbb E[W]\,\Sigma + \operatorname{Var}[W]\,\gamma\gamma^{\mathsf T}, $$ so `cov()` differs from `scale` both by the scalar $\mathbb E[W]$ and by a rank-contribution along $\gamma$. **Marginal** (`marginal(idx)`): keeps $(\lambda, \chi, \psi)$ and takes the sub-vectors of $\mu, \gamma$ and the sub-matrix of $\Sigma$. **Conditional** (`condition(cond_idx, x_cond)`): the GIG is conjugate to conditioning, so $X_1 \mid X_2 = x_2$ is again GH with $$ \lambda^\* = \lambda - q/2,\qquad \chi^\* = \chi + d^2,\qquad \psi^\* = \psi + \gamma_2^{\mathsf T} S_{22}^{-1}\gamma_2, $$ $$ \mu^\* = \mu_{1\mid 2},\qquad \Sigma^\* = S_{11\cdot 2},\qquad \gamma^\* = \gamma_1 - S_{12} S_{22}^{-1}\gamma_2 . $$ These same formulas specialize to every sub-family. In particular the symmetric Student-t is the GH point $\gamma = 0,\ \psi = 0,\ \lambda = -\nu/2,\ \chi = \nu$; applying the rules above gives $\lambda^\* = -(\nu+q)/2$ and $\chi^\* = \nu + d^2$, which is exactly the Student-t conditional $t_{\nu+q}(\mu_{1\mid 2}, \frac{\nu+d^2}{\nu+q} S_{11\cdot 2})$ from the section above. ```{important} $\psi = 0$ is a degenerate boundary of the GIG (it collapses to an inverse-gamma), where the generic Bessel evaluations break down. `MultivariateGH` handles the symmetric-t boundary ($\psi=0,\ \gamma=0$) with a dedicated branch that delegates to `MultivariateStudentT`; the skew-t boundary ($\psi=0,\ \gamma\neq 0$) is not yet supported - use a small $\psi>0$. ``` ## Mixtures Objects: `MixtureDistribution(weights, components)` (family-agnostic) and `GaussianMixtureDistribution` (a back-compatible subclass of scikit-learn's `GaussianMixture`). A mixture has weights $w_k \ge 0$, $\sum_k w_k = 1$, and components $f_k$ of one family: $$ f(x) = \sum_{k} w_k\, f_k(x). $$ **Conditioning** reweights the components by how well each one explains the observed $x_2$, then conditions each component. With $f_k^{(2)}$ the marginal of component $k$ over the conditioning block, $$ w_k' \;\propto\; w_k\, f_k^{(2)}(x_2), \qquad \sum_k w_k' = 1, $$ $$ X_1 \mid X_2 = x_2 \;\sim\; \sum_k w_k'\, \big[\,f_k \mid X_2 = x_2\,\big], $$ where each conditioned component $f_k \mid X_2 = x_2$ follows the per-family rule above. This single reweight-then-condition scheme is what lets the same conditioner serve Gaussian and Student-t (and, later, Generalized Hyperbolic) mixtures unchanged. ```{note} `MixtureConditioner` (aliased as `GMMConditioner`) implements this scheme. A fitted sklearn `GaussianMixture` takes a cached fast path and returns a `GaussianMixtureDistribution`; a `MixtureDistribution` or any estimator exposing `to_mixture_distribution()` (such as `StudentTMixture`) takes the generic path and returns a `MixtureDistribution` whose components remain in the original family. ``` ## Estimation - EM for the Student-t mixture `StudentTMixture` fits weights $\{w_k\}$, locations $\{\mu_k\}$, scales $\{\Sigma_k\}$ and degrees of freedom $\{\nu_k\}$ by maximum likelihood with the EM algorithm (McLachlan & Peel; ECME for $\nu$). It subclasses scikit-learn's `BaseMixture`, so `n_init`, `tol`, `max_iter`, k-means initialization and convergence bookkeeping are inherited; the parameter `dof` selects whether $\nu$ is `"free"` (one per component), `"shared"` (a single $\nu$) or a fixed float. ### Latent-variable form The Student-t is a *scale mixture of normals*. Introduce, per observation $x_i$ and component $k$, a latent positive scalar $W_{ik}$: $$ x_i \mid (z_i = k,\, W_{ik}) \sim \mathcal N(\mu_k,\; W_{ik}\,\Sigma_k), \qquad W_{ik} \sim \text{InverseGamma}\!\big(\tfrac{\nu_k}{2}, \tfrac{\nu_k}{2}\big), $$ with $z_i$ the component label. Marginalizing $W_{ik}$ recovers $x_i \mid z_i = k \sim t_{\nu_k}(\mu_k, \Sigma_k)$. Working with the precision $u = 1/W$ (which is Gamma-distributed) makes the updates closed-form. ### E-step With $p$ the dimension and $\delta_{ik} = (x_i-\mu_k)^{\mathsf T}\Sigma_k^{-1}(x_i-\mu_k)$ the squared Mahalanobis distance, the E-step computes three quantities at the *current* parameters: $$ \tau_{ik} = \frac{w_k\, t_{\nu_k}(x_i \mid \mu_k, \Sigma_k)} {\sum_j w_j\, t_{\nu_j}(x_i \mid \mu_j, \Sigma_j)} \qquad\text{(responsibility)} $$ $$ u_{ik} \equiv \mathbb E[\,1/W_{ik} \mid x_i, z_i=k\,] = \frac{\nu_k + p}{\nu_k + \delta_{ik}} $$ $$ \mathbb E[\log u_{ik} \mid x_i, z_i=k] = \psi\!\Big(\tfrac{\nu_k + p}{2}\Big) - \log\!\Big(\tfrac{\nu_k + \delta_{ik}}{2}\Big), $$ where $\psi$ is the digamma function. Note $u_{ik}$ *down-weights* outliers (large $\delta_{ik}$) - the mechanism behind the t's robustness. ### M-step Let $n_k = \sum_i \tau_{ik}$. The weights and the weighted-Gaussian location/scale updates are $$ w_k = \frac{n_k}{n}, \qquad \mu_k = \frac{\sum_i \tau_{ik}\, u_{ik}\, x_i}{\sum_i \tau_{ik}\, u_{ik}}, \qquad \Sigma_k = \frac{1}{n_k}\sum_i \tau_{ik}\, u_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\mathsf T}. $$ (`reg_covar` is added to the diagonal of $\Sigma_k$ for numerical stability.) The degrees of freedom $\nu_k$ are updated by solving the 1-D equation $$ \log\!\Big(\tfrac{\nu_k}{2}\Big) - \psi\!\Big(\tfrac{\nu_k}{2}\Big) + 1 + \frac{1}{n_k}\sum_i \tau_{ik}\big(\mathbb E[\log u_{ik}] - u_{ik}\big) = 0 $$ for $\nu_k$ (Brent's method). The left side is strictly decreasing in $\nu_k$, so the root is unique when it exists; when the data are effectively Gaussian no finite root exists and $\nu_k$ is driven to a large cap (the $\nu \to \infty$ limit). For `dof="shared"` the same equation is solved once with the sums pooled over all components; for a fixed float `dof` the step is skipped. ```{note} Because the $\nu$ update uses $u_{ik}$ and $\mathbb E[\log u_{ik}]$ evaluated at the **old** parameters, the procedure is a genuine EM and the observed-data log-likelihood is non-decreasing every iteration. The per-iteration values are recorded in `lower_bounds_`. ``` ### Initialization `init_params` controls the starting responsibilities: the scikit-learn options (`"kmeans"`, `"k-means++"`, `"random"`, `"random_from_data"`) or `"gmm"`, which warm-starts from a fitted `GaussianMixture` and tends to converge fastest. The degrees of freedom start large (near-Gaussian) and are pulled down by the data. ## Top-level regressors Two ways to model $p(y \mid X)$ with these families: - **Joint + condition** - `ConditionalMixtureRegressor(family=...)` fits a joint mixture over $[X, y]$ and conditions. `family="gaussian"` is the classic conditional GMM (also available as the back-compatible `ConditionalGMMRegressor` preset); `family="student_t"` is exposed as `ConditionalStudentTRegressor`. The Student-t conditional is heteroskedastic, so `predict_cov` returns per-sample covariances (requires conditional $\nu + q > 2$). - **Mixture of experts** - `MixtureOfExpertsRegressor(expert=...)` keeps the softmax gating and affine expert means; `expert="student_t"` swaps the Gaussian emission for a Student-t. Its EM reuses the same latent-scale weights $u_{ik}$ as above: each expert's weighted least squares and scale update are reweighted by $\tau_{ik} u_{ik}$ (iteratively-reweighted least squares - outliers are down-weighted), and per-expert $\nu_k$ solves the same digamma equation. Here the scale $\Sigma_k$ is constant in $x$ (only the gating and mean depend on $x$), unlike the joint-conditioning route above. The Generalized Hyperbolic family is fitted by `GeneralizedHyperbolicMixture` and exposed for regression as `ConditionalGHRegressor` (also `family="gh"`). ### Estimation - EM for the GH mixture (MCECM) The GH mixture is fitted by an EM that augments the data with both the component label and the latent GIG scale $W$. Writing $Q_{ik} = (x_i - \mu_k)^{\mathsf T} \Sigma_k^{-1}(x_i - \mu_k)$ and $p_k = \gamma_k^{\mathsf T}\Sigma_k^{-1}\gamma_k$, the posterior of the mixing scalar is again GIG, $$ W \mid (x_i, z_i = k) \sim \mathrm{GIG}\big(\lambda_k - \tfrac d2,\; \chi_k + Q_{ik},\; \psi_k + p_k\big), $$ so the **E-step** computes the responsibilities $\tau_{ik}$ together with $a_{ik}=\mathbb E[W]$, $b_{ik}=\mathbb E[1/W]$ and $c_{ik}=\mathbb E[\log W]$ from that GIG (Bessel-K ratios; $\mathbb E[\log W]$ by differentiating $\log K_\lambda$). With $n_k = \sum_i \tau_{ik}$ and bars denoting $\tau$-weighted averages, the **M-step** has closed forms $$ \gamma_k = \frac{\overline{b x} - \bar b\,\bar x}{1 - \bar a\,\bar b}, \qquad \mu_k = \bar x - \bar a\,\gamma_k, \qquad \Sigma_k = \frac1{n_k}\sum_i \tau_{ik}\, b_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\mathsf T} - \bar a\,\gamma_k\gamma_k^{\mathsf T}, $$ and the GIG parameters $(\lambda_k, \chi_k, \psi_k)$ are updated by maximizing the GIG expected complete-data log-likelihood under the **identifiability constraint $\mathbb E[W]=1$** (the GH parameterization is otherwise redundant: $W$ can be rescaled against $\Sigma,\gamma$). Imposing $\mathbb E[W]=1$ reduces the GIG update to a 2-D maximization over $(\lambda, \omega)$ with $\omega=\sqrt{\chi\psi}$, $\chi=\omega/R_\lambda(\omega)$, $\psi=\omega R_\lambda(\omega)$, and $R_\lambda(\omega)=K_{\lambda+1}(\omega)/K_\lambda(\omega)$. The observed-data log-likelihood is non-decreasing each iteration (recorded in `lower_bounds_`).