{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-00",
   "metadata": {},
   "source": [
    "# Heavy-tailed & skewed conditional regression\n",
    "\n",
    "Real residuals are often **heavy-tailed** (occasional large errors) and **skewed** (asymmetric). A Gaussian conditional model under-states the tails and cannot represent asymmetry. `cgmm` provides three drop-in conditional regressors that share the same API but differ in the family of the joint mixture they fit and then condition:\n",
    "\n",
    "| Regressor | Family | Captures |\n",
    "|---|---|---|\n",
    "| `ConditionalGMMRegressor` | Gaussian | mean, (co)variance |\n",
    "| `ConditionalStudentTRegressor` | Student-t | + heavy tails |\n",
    "| `ConditionalGHRegressor` | Generalized Hyperbolic | + skewness |\n",
    "\n",
    "See [Distributions & Conditioning](../conventions.md) for the equations and the EM algorithms. Below we fit all three on synthetic data whose noise is both heavy-tailed and right-skewed, and compare their conditional densities and held-out log-likelihood."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-01",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib\n",
    "matplotlib.use('module://matplotlib_inline.backend_inline')  # capture figures inline (must be before pyplot import)\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy.stats import skewnorm, chi2\n",
    "\n",
    "from cgmm import (\n",
    "    ConditionalGMMRegressor,\n",
    "    ConditionalStudentTRegressor,\n",
    "    ConditionalGHRegressor,\n",
    ")\n",
    "\n",
    "rng = np.random.default_rng(7)\n",
    "\n",
    "def make_data(n, seed, df=4, skew=6.0, scale=0.55):\n",
    "    rs = np.random.default_rng(seed)\n",
    "    x = rs.uniform(-3, 3, size=n)\n",
    "    # skew-t noise: skew-normal divided by sqrt(chi2/df) -> heavy-tailed + skewed\n",
    "    z = skewnorm(a=skew).rvs(n, random_state=seed)\n",
    "    v = chi2(df=df).rvs(n, random_state=seed + 1)\n",
    "    noise = scale * z / np.sqrt(v / df)\n",
    "    y = 1.2 * x + noise\n",
    "    return x.reshape(-1, 1), y\n",
    "\n",
    "X_train, y_train = make_data(1200, seed=0)\n",
    "X_test, y_test = make_data(1200, seed=100)\n",
    "print('train:', X_train.shape, 'test:', X_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-02",
   "metadata": {},
   "source": [
    "## Fit the three regressors\n",
    "\n",
    "Identical call pattern - only the class differs. We use a single mixture component so the comparison isolates the *noise family* rather than multi-modality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-03",
   "metadata": {},
   "outputs": [],
   "source": [
    "models = {\n",
    "    'Gaussian': ConditionalGMMRegressor(n_components=1, random_state=0),\n",
    "    'Student-t': ConditionalStudentTRegressor(n_components=1, random_state=0,\n",
    "                                              max_iter=300),\n",
    "    'GH': ConditionalGHRegressor(n_components=1, random_state=0, max_iter=150),\n",
    "}\n",
    "COLORS = {'Gaussian': 'C3', 'Student-t': 'C0', 'GH': 'C2'}\n",
    "for m in models.values():\n",
    "    m.fit(X_train, y_train)\n",
    "\n",
    "print('Held-out mean log-likelihood  (higher is better):')\n",
    "for name, m in models.items():\n",
    "    print(f'  {name:10s}: {m.score(X_test, y_test):+.4f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-04",
   "metadata": {},
   "source": [
    "The Student-t and GH models assign higher likelihood to the held-out data: they do not waste probability mass trying to explain the heavy, asymmetric tails with a Gaussian.\n",
    "\n",
    "## How well does each model fit the noise?\n",
    "\n",
    "Because the noise is i.i.d., the residuals $r = y - 1.2x$ are draws from one skewed, heavy-tailed distribution. Each model's *fitted* conditional density (evaluated at $x=0$, where the mean is just the intercept) should match the shape of those residuals. We overlay them on the residual histogram twice:\n",
    "\n",
    "- **linear scale** - exposes the asymmetry (skew) near the peak;\n",
    "- **log scale** - exposes the tail decay (fat tails)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-05",
   "metadata": {},
   "outputs": [],
   "source": [
    "r = y_test - 1.2 * X_test[:, 0]                      # i.i.d. noise residuals\n",
    "# focus the view on the central 99% (a few extreme outliers would otherwise\n",
    "# stretch the axes and hide the shape -- the heavy tail is still visible here)\n",
    "qlo, qhi = np.percentile(r, [0.5, 99.5])\n",
    "grid = np.linspace(qlo - 0.5, qhi + 1.0, 600)\n",
    "rng_plot = (grid.min(), grid.max())\n",
    "# each model's fitted noise density = conditional density at x = 0\n",
    "dens = {name: np.exp(m.condition(np.array([[0.0]])).logpdf(grid.reshape(-1, 1)))\n",
    "        for name, m in models.items()}\n",
    "\n",
    "fig, (axL, axR) = plt.subplots(1, 2, figsize=(13, 4.5))\n",
    "for ax, yscale in [(axL, 'linear'), (axR, 'log')]:\n",
    "    ax.hist(r, bins=80, range=rng_plot, density=True, color='lightgray',\n",
    "            edgecolor='none', label='residuals')\n",
    "    for name, d in dens.items():\n",
    "        ax.plot(grid, d, color=COLORS[name], lw=2, label=name)\n",
    "    ax.set_xlabel('residual  r = y - 1.2x'); ax.set_yscale(yscale)\n",
    "    ax.set_xlim(*rng_plot)\n",
    "    ax.set_title(f'fitted noise density ({yscale} scale)')\n",
    "axL.set_ylabel('density'); axL.legend()\n",
    "axR.set_ylim(1e-4, None); axR.legend()\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-06",
   "metadata": {},
   "source": [
    "On the **linear** panel the Gaussian (red) is symmetric and too flat - it misses the tall, asymmetric peak. The Student-t (blue) is more peaked and a little heavier-tailed, but still symmetric. The GH (green) leans into the skew and matches the asymmetric peak best. On the **log** panel the Gaussian and Student-t right tails fall away while the GH tracks the long, heavy right tail; and where the noise is skewed the differences are starkest - note the GH's short *left* tail versus the Gaussian's symmetric, data-free left tail.\n",
    "\n",
    "## Predictive mean and 5-95% bands\n",
    "\n",
    "Back in data space: we sample $p(y \\mid x)$ on a grid and take percentiles. Heavy-tailed / skewed models produce wider, asymmetric predictive bands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-07",
   "metadata": {},
   "outputs": [],
   "source": [
    "xg = np.linspace(-3, 3, 60).reshape(-1, 1)\n",
    "fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharey=True)\n",
    "for ax, (name, m) in zip(axes, models.items()):\n",
    "    samples = m.sample(xg, n_samples=4000)        # (n, n_samples, 1)\n",
    "    lo, hi = np.percentile(samples[:, :, 0], [5, 95], axis=1)\n",
    "    ax.scatter(X_train[:, 0], y_train, s=6, alpha=0.2, color='gray')\n",
    "    ax.fill_between(xg[:, 0], lo, hi, alpha=0.30, color=COLORS[name],\n",
    "                    label='5-95%')\n",
    "    ax.plot(xg[:, 0], m.predict(xg), color='k', lw=2, label='mean')\n",
    "    ax.set_title(name); ax.set_xlabel('x'); ax.set_ylim(-6, 9)\n",
    "axes[0].set_ylabel('y'); axes[0].legend(loc='upper left')\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-08",
   "metadata": {},
   "source": [
    "## Takeaways\n",
    "\n",
    "- The three regressors are interchangeable in code (`fit`/`predict`/`sample`/`score`/`condition`); pick the family that matches your residuals.\n",
    "- `ConditionalStudentTRegressor` is the robust choice for heavy tails; `ConditionalGHRegressor` adds skewness.\n",
    "- All return a conditional mixture you can sample from or evaluate, and they integrate with scikit-learn pipelines and `GridSearchCV` like the Gaussian model.\n",
    "\n",
    "For the mathematics - the densities, the closed-form conditioning, and the EM algorithms - see [Distributions & Conditioning](../conventions.md)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}