{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# GPU Acceleration: CuPy Integration for Tracking\n",
    "\n",
    "This notebook demonstrates GPU acceleration techniques for tracking algorithms using CuPy. We cover:\n",
    "\n",
    "1. **CuPy Basics** - GPU array operations and NumPy compatibility\n",
    "2. **Accelerating Matrix Operations** - Core operations for Kalman filters\n",
    "3. **Batch Processing** - Processing multiple tracks in parallel\n",
    "4. **Particle Filter Acceleration** - GPU-based resampling and weight computation\n",
    "5. **Performance Comparison** - Benchmarking CPU vs GPU\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "```bash\n",
    "# GPU-enabled version\n",
    "pip install cupy-cuda12x  # For CUDA 12.x\n",
    "# or pip install cupy-cuda11x for CUDA 11.x\n",
    "\n",
    "pip install nrl-tracker plotly numpy\n",
    "```\n",
    "\n",
    "**Note:** This notebook requires an NVIDIA GPU with CUDA support. If running without a GPU, code examples will demonstrate concepts with NumPy fallbacks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "cell-1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CuPy not available. Running in CPU-only mode.\n",
      "Install with: pip install cupy-cuda12x\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import plotly.graph_objects as go\n",
    "from plotly.subplots import make_subplots\n",
    "import time\n",
    "\n",
    "# Check for GPU availability\n",
    "try:\n",
    "    import cupy as cp\n",
    "    GPU_AVAILABLE = True\n",
    "    print(f\"CuPy version: {cp.__version__}\")\n",
    "    print(f\"CUDA version: {cp.cuda.runtime.runtimeGetVersion()}\")\n",
    "    print(f\"GPU: {cp.cuda.Device().name}\")\n",
    "    print(f\"GPU memory: {cp.cuda.Device().mem_info[1] / 1e9:.1f} GB\")\n",
    "except ImportError:\n",
    "    GPU_AVAILABLE = False\n",
    "    print(\"CuPy not available. Running in CPU-only mode.\")\n",
    "    print(\"Install with: pip install cupy-cuda12x\")\n",
    "\n",
    "# Use appropriate array module\n",
    "xp = cp if GPU_AVAILABLE else np\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "# Plotly dark theme template\n",
    "dark_template = go.layout.Template()\n",
    "dark_template.layout = go.Layout(\n",
    "    paper_bgcolor='#0d1117',\n",
    "    plot_bgcolor='#0d1117',\n",
    "    font=dict(color='#e6edf3'),\n",
    "    xaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),\n",
    "    yaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## 1. CuPy Basics\n",
    "\n",
    "CuPy provides a NumPy-compatible interface for GPU computing. Arrays and operations mirror NumPy exactly, making migration straightforward.\n",
    "\n",
    "### Key Concepts\n",
    "\n",
    "| NumPy | CuPy | Location |\n",
    "|-------|------|----------|\n",
    "| `numpy.array()` | `cupy.array()` | GPU |\n",
    "| `numpy.zeros()` | `cupy.zeros()` | GPU |\n",
    "| `numpy.linalg.inv()` | `cupy.linalg.inv()` | GPU |\n",
    "| Transfer to GPU | `cupy.asarray(np_array)` | CPU→GPU |\n",
    "| Transfer to CPU | `cupy.asnumpy(cp_array)` | GPU→CPU |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "cell-3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a + b = [6 6 6 6 6]\n",
      "a · b = 35\n"
     ]
    }
   ],
   "source": [
    "# Basic array operations\n",
    "if GPU_AVAILABLE:\n",
    "    # Create arrays on GPU\n",
    "    a_gpu = cp.array([1, 2, 3, 4, 5])\n",
    "    b_gpu = cp.array([5, 4, 3, 2, 1])\n",
    "    \n",
    "    # Operations happen on GPU\n",
    "    c_gpu = a_gpu + b_gpu\n",
    "    d_gpu = cp.dot(a_gpu, b_gpu)\n",
    "    \n",
    "    print(f\"a + b = {c_gpu}\")\n",
    "    print(f\"a · b = {d_gpu}\")\n",
    "    print(f\"Array type: {type(c_gpu)}\")\n",
    "    \n",
    "    # Transfer back to CPU when needed\n",
    "    c_cpu = cp.asnumpy(c_gpu)\n",
    "    print(f\"Transferred type: {type(c_cpu)}\")\n",
    "else:\n",
    "    # NumPy fallback\n",
    "    a_cpu = np.array([1, 2, 3, 4, 5])\n",
    "    b_cpu = np.array([5, 4, 3, 2, 1])\n",
    "    print(f\"a + b = {a_cpu + b_cpu}\")\n",
    "    print(f\"a · b = {np.dot(a_cpu, b_cpu)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "cell-4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Memory management\n",
    "if GPU_AVAILABLE:\n",
    "    # Check memory before\n",
    "    mempool = cp.get_default_memory_pool()\n",
    "    print(f\"Memory used before: {mempool.used_bytes() / 1e6:.2f} MB\")\n",
    "    \n",
    "    # Create large array\n",
    "    large_array = cp.random.randn(10000, 10000)\n",
    "    print(f\"Memory after 10000x10000 float64: {mempool.used_bytes() / 1e6:.2f} MB\")\n",
    "    \n",
    "    # Free memory explicitly\n",
    "    del large_array\n",
    "    mempool.free_all_blocks()\n",
    "    print(f\"Memory after free: {mempool.used_bytes() / 1e6:.2f} MB\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-5",
   "metadata": {},
   "source": [
    "## 2. Accelerating Matrix Operations\n",
    "\n",
    "Kalman filters involve repeated matrix operations that benefit from GPU parallelization:\n",
    "\n",
    "- Matrix multiplication: $P \\cdot H^T$\n",
    "- Matrix inversion: $(HPH^T + R)^{-1}$\n",
    "- Cholesky decomposition: $P = LL^T$\n",
    "\n",
    "Let's implement core Kalman filter operations for both CPU and GPU."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "cell-6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Kalman filter functions defined.\n"
     ]
    }
   ],
   "source": [
    "def kalman_predict(x, P, F, Q, xp=np):\n",
    "    \"\"\"\n",
    "    Kalman filter prediction step.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    x : array_like\n",
    "        State estimate [n, 1].\n",
    "    P : array_like\n",
    "        State covariance [n, n].\n",
    "    F : array_like\n",
    "        State transition matrix [n, n].\n",
    "    Q : array_like\n",
    "        Process noise covariance [n, n].\n",
    "    xp : module\n",
    "        Array module (numpy or cupy).\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    x_pred : array\n",
    "        Predicted state.\n",
    "    P_pred : array\n",
    "        Predicted covariance.\n",
    "    \"\"\"\n",
    "    x_pred = xp.dot(F, x)\n",
    "    P_pred = xp.dot(xp.dot(F, P), F.T) + Q\n",
    "    return x_pred, P_pred\n",
    "\n",
    "\n",
    "def kalman_update(x, P, z, H, R, xp=np):\n",
    "    \"\"\"\n",
    "    Kalman filter update step.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    x : array_like\n",
    "        Predicted state [n, 1].\n",
    "    P : array_like\n",
    "        Predicted covariance [n, n].\n",
    "    z : array_like\n",
    "        Measurement [m, 1].\n",
    "    H : array_like\n",
    "        Measurement matrix [m, n].\n",
    "    R : array_like\n",
    "        Measurement noise covariance [m, m].\n",
    "    xp : module\n",
    "        Array module (numpy or cupy).\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    x_upd : array\n",
    "        Updated state.\n",
    "    P_upd : array\n",
    "        Updated covariance.\n",
    "    \"\"\"\n",
    "    # Innovation\n",
    "    y = z - xp.dot(H, x)\n",
    "    \n",
    "    # Innovation covariance\n",
    "    S = xp.dot(xp.dot(H, P), H.T) + R\n",
    "    \n",
    "    # Kalman gain\n",
    "    K = xp.dot(xp.dot(P, H.T), xp.linalg.inv(S))\n",
    "    \n",
    "    # State update\n",
    "    x_upd = x + xp.dot(K, y)\n",
    "    \n",
    "    # Covariance update (Joseph form for numerical stability)\n",
    "    I = xp.eye(P.shape[0])\n",
    "    IKH = I - xp.dot(K, H)\n",
    "    P_upd = xp.dot(xp.dot(IKH, P), IKH.T) + xp.dot(xp.dot(K, R), K.T)\n",
    "    \n",
    "    return x_upd, P_upd\n",
    "\n",
    "\n",
    "print(\"Kalman filter functions defined.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "cell-7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "System matrices defined.\n",
      "State dimension: 4\n",
      "Measurement dimension: 2\n"
     ]
    }
   ],
   "source": [
    "# Test Kalman filter on CPU and GPU\n",
    "state_dim = 4  # [x, vx, y, vy]\n",
    "meas_dim = 2   # [x, y]\n",
    "\n",
    "dt = 1.0\n",
    "\n",
    "# State transition (constant velocity)\n",
    "F = np.array([\n",
    "    [1, dt, 0, 0],\n",
    "    [0, 1, 0, 0],\n",
    "    [0, 0, 1, dt],\n",
    "    [0, 0, 0, 1]\n",
    "])\n",
    "\n",
    "# Measurement matrix (observe position only)\n",
    "H = np.array([\n",
    "    [1, 0, 0, 0],\n",
    "    [0, 0, 1, 0]\n",
    "])\n",
    "\n",
    "# Process noise\n",
    "q = 0.1\n",
    "Q = q * np.array([\n",
    "    [dt**3/3, dt**2/2, 0, 0],\n",
    "    [dt**2/2, dt, 0, 0],\n",
    "    [0, 0, dt**3/3, dt**2/2],\n",
    "    [0, 0, dt**2/2, dt]\n",
    "])\n",
    "\n",
    "# Measurement noise\n",
    "R = 10 * np.eye(meas_dim)\n",
    "\n",
    "# Initial state and covariance\n",
    "x = np.array([0, 1, 0, 1]).reshape(-1, 1)\n",
    "P = 100 * np.eye(state_dim)\n",
    "\n",
    "# Measurement\n",
    "z = np.array([5, 5]).reshape(-1, 1)\n",
    "\n",
    "print(\"System matrices defined.\")\n",
    "print(f\"State dimension: {state_dim}\")\n",
    "print(f\"Measurement dimension: {meas_dim}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "cell-8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU Results:\n",
      "Predicted state: [1. 1. 1. 1.]\n",
      "Updated state: [4.80955404 2.90541184 4.80955404 2.90541184]\n"
     ]
    }
   ],
   "source": [
    "# Run on CPU\n",
    "x_pred_cpu, P_pred_cpu = kalman_predict(x, P, F, Q, xp=np)\n",
    "x_upd_cpu, P_upd_cpu = kalman_update(x_pred_cpu, P_pred_cpu, z, H, R, xp=np)\n",
    "\n",
    "print(\"CPU Results:\")\n",
    "print(f\"Predicted state: {x_pred_cpu.flatten()}\")\n",
    "print(f\"Updated state: {x_upd_cpu.flatten()}\")\n",
    "\n",
    "if GPU_AVAILABLE:\n",
    "    # Transfer to GPU\n",
    "    x_gpu = cp.asarray(x)\n",
    "    P_gpu = cp.asarray(P)\n",
    "    F_gpu = cp.asarray(F)\n",
    "    Q_gpu = cp.asarray(Q)\n",
    "    H_gpu = cp.asarray(H)\n",
    "    R_gpu = cp.asarray(R)\n",
    "    z_gpu = cp.asarray(z)\n",
    "    \n",
    "    # Run on GPU\n",
    "    x_pred_gpu, P_pred_gpu = kalman_predict(x_gpu, P_gpu, F_gpu, Q_gpu, xp=cp)\n",
    "    x_upd_gpu, P_upd_gpu = kalman_update(x_pred_gpu, P_pred_gpu, z_gpu, H_gpu, R_gpu, xp=cp)\n",
    "    \n",
    "    print(\"\\nGPU Results:\")\n",
    "    print(f\"Predicted state: {cp.asnumpy(x_pred_gpu).flatten()}\")\n",
    "    print(f\"Updated state: {cp.asnumpy(x_upd_gpu).flatten()}\")\n",
    "    \n",
    "    # Verify results match\n",
    "    error = np.max(np.abs(cp.asnumpy(x_upd_gpu) - x_upd_cpu))\n",
    "    print(f\"\\nMax difference CPU vs GPU: {error:.2e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "## 3. Batch Processing Multiple Tracks\n",
    "\n",
    "The real power of GPU acceleration comes from processing many tracks simultaneously. Instead of looping over tracks, we can stack them into batched arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "cell-10",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch Kalman functions defined.\n"
     ]
    }
   ],
   "source": [
    "def batch_kalman_predict(x_batch, P_batch, F, Q, xp=np):\n",
    "    \"\"\"\n",
    "    Batch Kalman filter prediction.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    x_batch : array_like\n",
    "        Batch of states [batch_size, n, 1].\n",
    "    P_batch : array_like\n",
    "        Batch of covariances [batch_size, n, n].\n",
    "    F : array_like\n",
    "        State transition matrix [n, n] (shared).\n",
    "    Q : array_like\n",
    "        Process noise [n, n] (shared).\n",
    "    xp : module\n",
    "        Array module.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    x_pred : array\n",
    "        Predicted states [batch_size, n, 1].\n",
    "    P_pred : array\n",
    "        Predicted covariances [batch_size, n, n].\n",
    "    \"\"\"\n",
    "    batch_size = x_batch.shape[0]\n",
    "    \n",
    "    # Vectorized prediction\n",
    "    # x_pred = F @ x for each batch element\n",
    "    x_pred = xp.einsum('ij,bjk->bik', F, x_batch)\n",
    "    \n",
    "    # P_pred = F @ P @ F.T + Q for each batch element\n",
    "    FP = xp.einsum('ij,bjk->bik', F, P_batch)\n",
    "    FPFt = xp.einsum('bij,kj->bik', FP, F)\n",
    "    P_pred = FPFt + Q\n",
    "    \n",
    "    return x_pred, P_pred\n",
    "\n",
    "\n",
    "def batch_kalman_update(x_batch, P_batch, z_batch, H, R, xp=np):\n",
    "    \"\"\"\n",
    "    Batch Kalman filter update.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    x_batch : array_like\n",
    "        Batch of predicted states [batch_size, n, 1].\n",
    "    P_batch : array_like\n",
    "        Batch of predicted covariances [batch_size, n, n].\n",
    "    z_batch : array_like\n",
    "        Batch of measurements [batch_size, m, 1].\n",
    "    H : array_like\n",
    "        Measurement matrix [m, n] (shared).\n",
    "    R : array_like\n",
    "        Measurement noise [m, m] (shared).\n",
    "    xp : module\n",
    "        Array module.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    x_upd : array\n",
    "        Updated states [batch_size, n, 1].\n",
    "    P_upd : array\n",
    "        Updated covariances [batch_size, n, n].\n",
    "    \"\"\"\n",
    "    batch_size = x_batch.shape[0]\n",
    "    n = x_batch.shape[1]\n",
    "    \n",
    "    # Innovation: y = z - H @ x\n",
    "    Hx = xp.einsum('ij,bjk->bik', H, x_batch)\n",
    "    y = z_batch - Hx\n",
    "    \n",
    "    # Innovation covariance: S = H @ P @ H.T + R\n",
    "    HP = xp.einsum('ij,bjk->bik', H, P_batch)\n",
    "    HPHt = xp.einsum('bij,kj->bik', HP, H)\n",
    "    S = HPHt + R\n",
    "    \n",
    "    # Kalman gain: K = P @ H.T @ S^{-1}\n",
    "    PHt = xp.einsum('bij,kj->bik', P_batch, H)\n",
    "    S_inv = xp.linalg.inv(S)\n",
    "    K = xp.einsum('bij,bjk->bik', PHt, S_inv)\n",
    "    \n",
    "    # State update: x_upd = x + K @ y\n",
    "    Ky = xp.einsum('bij,bjk->bik', K, y)\n",
    "    x_upd = x_batch + Ky\n",
    "    \n",
    "    # Covariance update (simplified form)\n",
    "    I = xp.eye(n)\n",
    "    KH = xp.einsum('bij,jk->bik', K, H)\n",
    "    IKH = I - KH\n",
    "    P_upd = xp.einsum('bij,bjk->bik', IKH, P_batch)\n",
    "    \n",
    "    return x_upd, P_upd\n",
    "\n",
    "\n",
    "print(\"Batch Kalman functions defined.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "cell-11",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batch size    10: CPU=  0.043ms\n",
      "Batch size   100: CPU=  0.146ms\n",
      "Batch size  1000: CPU=  1.205ms\n"
     ]
    }
   ],
   "source": [
    "# Benchmark batch processing\n",
    "batch_sizes = [10, 100, 1000, 5000]\n",
    "\n",
    "if not GPU_AVAILABLE:\n",
    "    batch_sizes = [10, 100, 1000]  # Smaller sizes for CPU-only\n",
    "\n",
    "cpu_times = []\n",
    "gpu_times = []\n",
    "\n",
    "for batch_size in batch_sizes:\n",
    "    # Create batch data\n",
    "    x_batch_np = np.random.randn(batch_size, state_dim, 1)\n",
    "    P_batch_np = np.tile(P.reshape(1, state_dim, state_dim), (batch_size, 1, 1))\n",
    "    z_batch_np = np.random.randn(batch_size, meas_dim, 1)\n",
    "    \n",
    "    # CPU timing\n",
    "    n_iterations = 100\n",
    "    start = time.time()\n",
    "    for _ in range(n_iterations):\n",
    "        x_pred, P_pred = batch_kalman_predict(x_batch_np, P_batch_np, F, Q, xp=np)\n",
    "        x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_np, H, R, xp=np)\n",
    "    cpu_time = (time.time() - start) / n_iterations\n",
    "    cpu_times.append(cpu_time)\n",
    "    \n",
    "    if GPU_AVAILABLE:\n",
    "        # Transfer to GPU\n",
    "        x_batch_gpu = cp.asarray(x_batch_np)\n",
    "        P_batch_gpu = cp.asarray(P_batch_np)\n",
    "        z_batch_gpu = cp.asarray(z_batch_np)\n",
    "        \n",
    "        # Warm up\n",
    "        x_pred, P_pred = batch_kalman_predict(x_batch_gpu, P_batch_gpu, F_gpu, Q_gpu, xp=cp)\n",
    "        x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_gpu, H_gpu, R_gpu, xp=cp)\n",
    "        cp.cuda.Stream.null.synchronize()\n",
    "        \n",
    "        # GPU timing\n",
    "        start = time.time()\n",
    "        for _ in range(n_iterations):\n",
    "            x_pred, P_pred = batch_kalman_predict(x_batch_gpu, P_batch_gpu, F_gpu, Q_gpu, xp=cp)\n",
    "            x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_gpu, H_gpu, R_gpu, xp=cp)\n",
    "        cp.cuda.Stream.null.synchronize()\n",
    "        gpu_time = (time.time() - start) / n_iterations\n",
    "        gpu_times.append(gpu_time)\n",
    "        \n",
    "        speedup = cpu_time / gpu_time\n",
    "        print(f\"Batch size {batch_size:5d}: CPU={cpu_time*1e3:7.3f}ms, \"\n",
    "              f\"GPU={gpu_time*1e3:7.3f}ms, Speedup={speedup:.1f}x\")\n",
    "    else:\n",
    "        print(f\"Batch size {batch_size:5d}: CPU={cpu_time*1e3:7.3f}ms\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "cell-12",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "config": {
        "plotlyServerURL": "https://plot.ly"
       },
       "data": [
        {
         "line": {
          "color": "#00d4ff",
          "width": 2
         },
         "marker": {
          "size": 8
         },
         "mode": "lines+markers",
         "name": "CPU (NumPy)",
         "type": "scatter",
         "x": [
          10,
          100,
          1000
         ],
         "y": {
          "bdata": "AAAAACAupj8AAAAAKMDCPwAAAACvR/M/",
          "dtype": "f8"
         }
        }
       ],
       "layout": {
        "height": 400,
        "template": {
         "layout": {
          "font": {
           "color": "#e6edf3"
          },
          "paper_bgcolor": "#0d1117",
          "plot_bgcolor": "#0d1117",
          "xaxis": {
           "gridcolor": "#30363d",
           "zerolinecolor": "#30363d"
          },
          "yaxis": {
           "gridcolor": "#30363d",
           "zerolinecolor": "#30363d"
          }
         }
        },
        "title": {
         "text": "Kalman Filter CPU Performance"
        },
        "xaxis": {
         "title": {
          "text": "Batch Size"
         },
         "type": "log"
        },
        "yaxis": {
         "title": {
          "text": "Time (ms)"
         },
         "type": "log"
        }
       }
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Visualize performance\n",
    "if GPU_AVAILABLE and len(gpu_times) > 0:\n",
    "    fig = make_subplots(rows=1, cols=2, \n",
    "                        subplot_titles=['Kalman Filter Batch Performance', 'GPU Speedup Factor'])\n",
    "    \n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=batch_sizes, y=np.array(cpu_times)*1e3, mode='lines+markers',\n",
    "                   name='CPU (NumPy)', line=dict(color='#00d4ff', width=2),\n",
    "                   marker=dict(size=8)),\n",
    "        row=1, col=1\n",
    "    )\n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=batch_sizes, y=np.array(gpu_times)*1e3, mode='lines+markers',\n",
    "                   name='GPU (CuPy)', line=dict(color='#ff4757', width=2),\n",
    "                   marker=dict(size=8, symbol='square')),\n",
    "        row=1, col=1\n",
    "    )\n",
    "    \n",
    "    speedups = np.array(cpu_times) / np.array(gpu_times)\n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=batch_sizes, y=speedups, mode='lines+markers',\n",
    "                   name='Speedup', line=dict(color='#00ff88', width=2),\n",
    "                   marker=dict(size=10, symbol='triangle-up')),\n",
    "        row=1, col=2\n",
    "    )\n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=[batch_sizes[0], batch_sizes[-1]], y=[1, 1], mode='lines',\n",
    "                   line=dict(color='white', dash='dash', width=1),\n",
    "                   showlegend=False),\n",
    "        row=1, col=2\n",
    "    )\n",
    "    \n",
    "    fig.update_xaxes(type='log', title_text='Batch Size (number of tracks)', row=1, col=1)\n",
    "    fig.update_yaxes(type='log', title_text='Time per iteration (ms)', row=1, col=1)\n",
    "    fig.update_xaxes(type='log', title_text='Batch Size (number of tracks)', row=1, col=2)\n",
    "    fig.update_yaxes(title_text='Speedup (CPU time / GPU time)', row=1, col=2)\n",
    "    \n",
    "    fig.update_layout(template=dark_template, height=400)\n",
    "    fig.show()\n",
    "else:\n",
    "    fig = go.Figure()\n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=batch_sizes, y=np.array(cpu_times)*1e3, mode='lines+markers',\n",
    "                   name='CPU (NumPy)', line=dict(color='#00d4ff', width=2),\n",
    "                   marker=dict(size=8))\n",
    "    )\n",
    "    fig.update_layout(\n",
    "        template=dark_template,\n",
    "        title='Kalman Filter CPU Performance',\n",
    "        xaxis_title='Batch Size',\n",
    "        xaxis_type='log',\n",
    "        yaxis_title='Time (ms)',\n",
    "        yaxis_type='log',\n",
    "        height=400\n",
    "    )\n",
    "    fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-13",
   "metadata": {},
   "source": [
    "## 4. GPU-Accelerated Particle Filter\n",
    "\n",
    "Particle filters involve operations on large numbers of particles that parallelize well:\n",
    "\n",
    "1. **Particle propagation**: Apply dynamics to each particle\n",
    "2. **Weight computation**: Evaluate likelihood for each particle\n",
    "3. **Resampling**: Select particles based on weights"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "cell-14",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Particle filter functions defined.\n"
     ]
    }
   ],
   "source": [
    "def particle_filter_step(particles, weights, z, f_dynamics, h_meas, Q_std, R_std, xp=np):\n",
    "    \"\"\"\n",
    "    GPU-accelerated particle filter step.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    particles : array_like\n",
    "        Particle states [n_particles, state_dim].\n",
    "    weights : array_like\n",
    "        Particle weights [n_particles].\n",
    "    z : array_like\n",
    "        Measurement [meas_dim].\n",
    "    f_dynamics : callable\n",
    "        Dynamics function f(x) -> x_next.\n",
    "    h_meas : callable\n",
    "        Measurement function h(x) -> z_pred.\n",
    "    Q_std : float\n",
    "        Process noise standard deviation.\n",
    "    R_std : float\n",
    "        Measurement noise standard deviation.\n",
    "    xp : module\n",
    "        Array module.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    particles : array\n",
    "        Updated particles.\n",
    "    weights : array\n",
    "        Updated weights.\n",
    "    \"\"\"\n",
    "    n_particles = particles.shape[0]\n",
    "    state_dim = particles.shape[1]\n",
    "    \n",
    "    # Prediction: propagate particles through dynamics + noise\n",
    "    particles_pred = f_dynamics(particles, xp)\n",
    "    particles_pred += Q_std * xp.random.randn(n_particles, state_dim)\n",
    "    \n",
    "    # Update: compute likelihood weights\n",
    "    z_pred = h_meas(particles_pred, xp)\n",
    "    innovation = z - z_pred\n",
    "    \n",
    "    # Gaussian likelihood (vectorized)\n",
    "    log_likelihood = -0.5 * xp.sum(innovation**2, axis=1) / R_std**2\n",
    "    \n",
    "    # Update weights\n",
    "    log_weights = xp.log(weights + 1e-300) + log_likelihood\n",
    "    log_weights = log_weights - xp.max(log_weights)  # Numerical stability\n",
    "    weights = xp.exp(log_weights)\n",
    "    weights = weights / xp.sum(weights)  # Normalize\n",
    "    \n",
    "    # Compute effective sample size\n",
    "    ess = 1.0 / xp.sum(weights**2)\n",
    "    \n",
    "    # Resample if ESS is too low\n",
    "    if float(ess) < n_particles / 2:\n",
    "        # Systematic resampling\n",
    "        indices = systematic_resample(weights, xp)\n",
    "        particles_pred = particles_pred[indices]\n",
    "        weights = xp.ones(n_particles) / n_particles\n",
    "    \n",
    "    return particles_pred, weights, float(ess)\n",
    "\n",
    "\n",
    "def systematic_resample(weights, xp=np):\n",
    "    \"\"\"\n",
    "    Systematic resampling for particle filters.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    weights : array_like\n",
    "        Normalized weights [n_particles].\n",
    "    xp : module\n",
    "        Array module.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    indices : array\n",
    "        Resampled indices.\n",
    "    \"\"\"\n",
    "    n = len(weights)\n",
    "    \n",
    "    # Generate systematic points\n",
    "    positions = (xp.arange(n) + xp.random.rand()) / n\n",
    "    \n",
    "    # Cumulative sum of weights\n",
    "    cumsum = xp.cumsum(weights)\n",
    "    \n",
    "    # Find indices using searchsorted\n",
    "    indices = xp.searchsorted(cumsum, positions)\n",
    "    indices = xp.clip(indices, 0, n - 1)\n",
    "    \n",
    "    return indices\n",
    "\n",
    "\n",
    "print(\"Particle filter functions defined.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "cell-15",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Benchmarking particle filter...\n",
      "N= 1000: CPU=    2.7ms\n",
      "N= 5000: CPU=    9.8ms\n",
      "N=10000: CPU=   18.5ms\n"
     ]
    }
   ],
   "source": [
    "# Define nonlinear dynamics and measurement\n",
    "def f_nonlinear(x, xp=np):\n",
    "    \"\"\"Nonlinear dynamics: x_next = x/2 + 25*x/(1+x^2) + 8*cos(1.2*k)\"\"\"\n",
    "    return x / 2 + 25 * x / (1 + x**2 + 1e-10)\n",
    "\n",
    "def h_nonlinear(x, xp=np):\n",
    "    \"\"\"Nonlinear measurement: z = x^2 / 20\"\"\"\n",
    "    return x**2 / 20\n",
    "\n",
    "# Parameters\n",
    "n_particles_list = [1000, 5000, 10000, 50000]\n",
    "if not GPU_AVAILABLE:\n",
    "    n_particles_list = [1000, 5000, 10000]\n",
    "\n",
    "state_dim_pf = 1\n",
    "Q_std = 3.0\n",
    "R_std = 1.0\n",
    "\n",
    "n_steps = 50\n",
    "n_trials = 5\n",
    "\n",
    "cpu_pf_times = []\n",
    "gpu_pf_times = []\n",
    "\n",
    "print(\"Benchmarking particle filter...\")\n",
    "\n",
    "for n_particles in n_particles_list:\n",
    "    # Generate true trajectory and measurements\n",
    "    np.random.seed(42)\n",
    "    true_state = [0.0]\n",
    "    measurements = []\n",
    "    for k in range(n_steps):\n",
    "        x_new = f_nonlinear(np.array([[true_state[-1]]]))[0, 0] + Q_std * np.random.randn()\n",
    "        true_state.append(x_new)\n",
    "        z = h_nonlinear(np.array([[x_new]]))[0, 0] + R_std * np.random.randn()\n",
    "        measurements.append(z)\n",
    "    \n",
    "    # CPU timing\n",
    "    cpu_times_trial = []\n",
    "    for _ in range(n_trials):\n",
    "        particles = np.random.randn(n_particles, state_dim_pf) * 5\n",
    "        weights = np.ones(n_particles) / n_particles\n",
    "        \n",
    "        start = time.time()\n",
    "        for z in measurements:\n",
    "            z_arr = np.array([[z]])\n",
    "            particles, weights, ess = particle_filter_step(\n",
    "                particles, weights, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=np\n",
    "            )\n",
    "        cpu_times_trial.append(time.time() - start)\n",
    "    cpu_pf_times.append(np.mean(cpu_times_trial))\n",
    "    \n",
    "    if GPU_AVAILABLE:\n",
    "        # GPU timing\n",
    "        gpu_times_trial = []\n",
    "        for _ in range(n_trials):\n",
    "            particles_gpu = cp.random.randn(n_particles, state_dim_pf) * 5\n",
    "            weights_gpu = cp.ones(n_particles) / n_particles\n",
    "            \n",
    "            # Warm up\n",
    "            z_arr = cp.array([[measurements[0]]])\n",
    "            _, _, _ = particle_filter_step(\n",
    "                particles_gpu, weights_gpu, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=cp\n",
    "            )\n",
    "            cp.cuda.Stream.null.synchronize()\n",
    "            \n",
    "            particles_gpu = cp.random.randn(n_particles, state_dim_pf) * 5\n",
    "            weights_gpu = cp.ones(n_particles) / n_particles\n",
    "            \n",
    "            start = time.time()\n",
    "            for z in measurements:\n",
    "                z_arr = cp.array([[z]])\n",
    "                particles_gpu, weights_gpu, ess = particle_filter_step(\n",
    "                    particles_gpu, weights_gpu, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=cp\n",
    "                )\n",
    "            cp.cuda.Stream.null.synchronize()\n",
    "            gpu_times_trial.append(time.time() - start)\n",
    "        gpu_pf_times.append(np.mean(gpu_times_trial))\n",
    "        \n",
    "        speedup = cpu_pf_times[-1] / gpu_pf_times[-1]\n",
    "        print(f\"N={n_particles:5d}: CPU={cpu_pf_times[-1]*1e3:7.1f}ms, \"\n",
    "              f\"GPU={gpu_pf_times[-1]*1e3:7.1f}ms, Speedup={speedup:.1f}x\")\n",
    "    else:\n",
    "        print(f\"N={n_particles:5d}: CPU={cpu_pf_times[-1]*1e3:7.1f}ms\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize particle filter performance\n",
    "if GPU_AVAILABLE and len(gpu_pf_times) > 0:\n",
    "    fig = make_subplots(rows=1, cols=2, \n",
    "                        subplot_titles=[f'Particle Filter ({n_steps} steps)', 'GPU Speedup for Particle Filter'])\n",
    "    \n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=n_particles_list, y=np.array(cpu_pf_times)*1e3, mode='lines+markers',\n",
    "                   name='CPU', line=dict(color='#00d4ff', width=2),\n",
    "                   marker=dict(size=8)),\n",
    "        row=1, col=1\n",
    "    )\n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=n_particles_list, y=np.array(gpu_pf_times)*1e3, mode='lines+markers',\n",
    "                   name='GPU', line=dict(color='#ff4757', width=2),\n",
    "                   marker=dict(size=8, symbol='square')),\n",
    "        row=1, col=1\n",
    "    )\n",
    "    \n",
    "    speedups = np.array(cpu_pf_times) / np.array(gpu_pf_times)\n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=n_particles_list, y=speedups, mode='lines+markers',\n",
    "                   name='Speedup', line=dict(color='#00ff88', width=2),\n",
    "                   marker=dict(size=10, symbol='triangle-up')),\n",
    "        row=1, col=2\n",
    "    )\n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=[n_particles_list[0], n_particles_list[-1]], y=[1, 1], mode='lines',\n",
    "                   line=dict(color='white', dash='dash', width=1),\n",
    "                   showlegend=False),\n",
    "        row=1, col=2\n",
    "    )\n",
    "    \n",
    "    fig.update_xaxes(type='log', title_text='Number of Particles', row=1, col=1)\n",
    "    fig.update_yaxes(type='log', title_text='Time (ms)', row=1, col=1)\n",
    "    fig.update_xaxes(type='log', title_text='Number of Particles', row=1, col=2)\n",
    "    fig.update_yaxes(title_text='Speedup Factor', row=1, col=2)\n",
    "    \n",
    "    fig.update_layout(template=dark_template, height=400)\n",
    "    fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## 5. Memory Management Best Practices\n",
    "\n",
    "Efficient GPU memory management is crucial for large-scale tracking applications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "if GPU_AVAILABLE:\n",
    "    print(\"GPU Memory Management Best Practices\")\n",
    "    print(\"=\" * 50)\n",
    "    \n",
    "    # 1. Memory pools\n",
    "    print(\"\\n1. Memory Pools\")\n",
    "    mempool = cp.get_default_memory_pool()\n",
    "    pinned_mempool = cp.get_default_pinned_memory_pool()\n",
    "    \n",
    "    print(f\"   GPU memory pool used: {mempool.used_bytes() / 1e6:.2f} MB\")\n",
    "    print(f\"   Pinned memory pool used: {pinned_mempool.n_free_blocks()} free blocks\")\n",
    "    \n",
    "    # 2. Pre-allocation\n",
    "    print(\"\\n2. Pre-allocation for Repeated Operations\")\n",
    "    \n",
    "    n = 1000\n",
    "    \n",
    "    # Bad: Allocate new arrays each iteration\n",
    "    start = time.time()\n",
    "    for _ in range(1000):\n",
    "        temp = cp.zeros((n, n))\n",
    "        result = cp.dot(temp, temp)\n",
    "    cp.cuda.Stream.null.synchronize()\n",
    "    time_alloc = time.time() - start\n",
    "    \n",
    "    # Good: Reuse pre-allocated arrays\n",
    "    temp = cp.zeros((n, n))\n",
    "    result = cp.zeros((n, n))\n",
    "    start = time.time()\n",
    "    for _ in range(1000):\n",
    "        cp.dot(temp, temp, out=result)\n",
    "    cp.cuda.Stream.null.synchronize()\n",
    "    time_reuse = time.time() - start\n",
    "    \n",
    "    print(f\"   With allocation each time: {time_alloc*1e3:.1f} ms\")\n",
    "    print(f\"   With pre-allocated arrays: {time_reuse*1e3:.1f} ms\")\n",
    "    print(f\"   Speedup: {time_alloc/time_reuse:.1f}x\")\n",
    "    \n",
    "    # 3. Async transfers\n",
    "    print(\"\\n3. Asynchronous Data Transfers\")\n",
    "    \n",
    "    data_cpu = np.random.randn(10000, 1000)\n",
    "    \n",
    "    # Synchronous\n",
    "    start = time.time()\n",
    "    data_gpu = cp.asarray(data_cpu)\n",
    "    cp.cuda.Stream.null.synchronize()\n",
    "    time_sync = time.time() - start\n",
    "    \n",
    "    # With pinned memory (faster transfers)\n",
    "    data_pinned = cp.cuda.alloc_pinned_memory(data_cpu.nbytes)\n",
    "    data_view = np.frombuffer(data_pinned, dtype=data_cpu.dtype).reshape(data_cpu.shape)\n",
    "    np.copyto(data_view, data_cpu)\n",
    "    \n",
    "    start = time.time()\n",
    "    data_gpu2 = cp.asarray(data_view)\n",
    "    cp.cuda.Stream.null.synchronize()\n",
    "    time_pinned = time.time() - start\n",
    "    \n",
    "    print(f\"   Standard transfer: {time_sync*1e3:.2f} ms\")\n",
    "    print(f\"   Pinned memory transfer: {time_pinned*1e3:.2f} ms\")\n",
    "    \n",
    "    # Clean up\n",
    "    del data_gpu, data_gpu2, temp, result\n",
    "    mempool.free_all_blocks()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-19",
   "metadata": {},
   "source": [
    "## 6. Integration with pyTCL\n",
    "\n",
    "Here's how to integrate GPU acceleration with existing pyTCL code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "cell-20",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GPU-accelerated tracking function defined.\n"
     ]
    }
   ],
   "source": [
    "# Import from pyTCL\n",
    "from pytcl.dynamic_estimation.kalman import kf_predict, kf_update\n",
    "\n",
    "def gpu_accelerated_tracking(measurements, F, H, Q, R, x0, P0, use_gpu=True):\n",
    "    \"\"\"\n",
    "    Run Kalman filter tracking with optional GPU acceleration.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    measurements : array_like\n",
    "        List of measurements [n_timesteps, meas_dim].\n",
    "    F, H, Q, R : array_like\n",
    "        Kalman filter matrices.\n",
    "    x0, P0 : array_like\n",
    "        Initial state and covariance.\n",
    "    use_gpu : bool\n",
    "        Whether to use GPU acceleration.\n",
    "        \n",
    "    Returns\n",
    "    -------\n",
    "    estimates : array\n",
    "        State estimates [n_timesteps, state_dim].\n",
    "    covariances : array\n",
    "        Covariance estimates [n_timesteps, state_dim, state_dim].\n",
    "    \"\"\"\n",
    "    if use_gpu and GPU_AVAILABLE:\n",
    "        xp = cp\n",
    "        F = cp.asarray(F)\n",
    "        H = cp.asarray(H)\n",
    "        Q = cp.asarray(Q)\n",
    "        R = cp.asarray(R)\n",
    "        x = cp.asarray(x0)\n",
    "        P = cp.asarray(P0)\n",
    "        measurements = cp.asarray(measurements)\n",
    "    else:\n",
    "        xp = np\n",
    "        x = x0.copy()\n",
    "        P = P0.copy()\n",
    "    \n",
    "    estimates = []\n",
    "    covariances = []\n",
    "    \n",
    "    for z in measurements:\n",
    "        # Predict\n",
    "        x, P = kalman_predict(x, P, F, Q, xp)\n",
    "        \n",
    "        # Update\n",
    "        z = z.reshape(-1, 1)\n",
    "        x, P = kalman_update(x, P, z, H, R, xp)\n",
    "        \n",
    "        estimates.append(x.flatten())\n",
    "        covariances.append(P.copy())\n",
    "    \n",
    "    if use_gpu and GPU_AVAILABLE:\n",
    "        estimates = cp.asnumpy(cp.array(estimates))\n",
    "        covariances = cp.asnumpy(cp.array(covariances))\n",
    "    else:\n",
    "        estimates = np.array([xp.asnumpy(e) if hasattr(xp, 'asnumpy') else e for e in estimates])\n",
    "        covariances = np.array([xp.asnumpy(c) if hasattr(xp, 'asnumpy') else c for c in covariances])\n",
    "    \n",
    "    return np.array(estimates), np.array(covariances)\n",
    "\n",
    "print(\"GPU-accelerated tracking function defined.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "cell-21",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tracking complete: 100 time steps\n"
     ]
    }
   ],
   "source": [
    "# Run tracking example\n",
    "# Generate trajectory\n",
    "n_timesteps = 100\n",
    "true_state = []\n",
    "measurements_list = []\n",
    "\n",
    "x_true = np.array([0, 1, 0, 0.5])  # Start at origin, moving NE\n",
    "for t in range(n_timesteps):\n",
    "    true_state.append(x_true.copy())\n",
    "    \n",
    "    # Propagate\n",
    "    x_true = F @ x_true + np.random.multivariate_normal(np.zeros(4), Q * 0.1)\n",
    "    \n",
    "    # Generate noisy measurement\n",
    "    z = H @ x_true + np.random.multivariate_normal(np.zeros(2), R)\n",
    "    measurements_list.append(z)\n",
    "\n",
    "true_state = np.array(true_state)\n",
    "measurements_arr = np.array(measurements_list)\n",
    "\n",
    "# Run tracking\n",
    "x0 = np.array([0, 0, 0, 0]).reshape(-1, 1)\n",
    "P0 = 100 * np.eye(4)\n",
    "\n",
    "estimates_cpu, _ = gpu_accelerated_tracking(\n",
    "    measurements_arr, F, H, Q, R, x0, P0, use_gpu=False\n",
    ")\n",
    "\n",
    "if GPU_AVAILABLE:\n",
    "    estimates_gpu, _ = gpu_accelerated_tracking(\n",
    "        measurements_arr, F, H, Q, R, x0, P0, use_gpu=True\n",
    "    )\n",
    "\n",
    "print(f\"Tracking complete: {n_timesteps} time steps\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "cell-22",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.plotly.v1+json": {
       "config": {
        "plotlyServerURL": "https://plot.ly"
       },
       "data": [
        {
         "line": {
          "color": "#00ff88",
          "width": 2
         },
         "mode": "lines",
         "name": "True",
         "type": "scatter",
         "x": {
          "bdata": "AAAAAAAAAACLX4AOJW3uP63OL6zr5fw/2bxHyvw6BUADo//Og2sMQIo5+yUAexFAC7EOAz54FEA/fGzXEDcXQK2t6dLryBpA0so2X2EYH0D7yTNr860hQGKXdo+S6yNATKq2IscoJkCRtyaYjVkoQHNq7YXGdypAmX69UQ+JLECyF4RxN7guQG/1L857lTBAhqd5bbXnMUDlbgWbMTozQDLFUDPNgDRADiR2me3jNUC7rjtJZ0o3QCaD19bQiThA3NWnC0q/OUB2pGvMYfE6QMR0f/2xCzxACDcWiwT6PEAX38XT68E9QKOGVeJufz5Aew8sl+VZP0BxvbcyEiNAQEvabvolnUBAHCRvNVkRQUDw5JvthYRBQKAHkVJ5+UFAgsUjvEtvQkCYwHwf2t9CQHOBIKW2U0NAuIoOoSHJQ0DqosGUpjlEQCG51Qlro0RArcXriZX+RECKmL7lW3FFQC0PT86K+EVAoesSIzqBRkAcoPmKmwxHQPG/k7lCnkdARMPv/78mSEBgFNdNP6NIQM4b3kFuIUlAeHNv9JuaSUCDmTv+vgVKQBJNHwcpd0pAuDS7pYzxSkALnF2XAV9LQC5C9gA+zktAbIgLRQQ7TEC4fvyqT69MQEbYBd41LE1A0NtsS0iPTUACOTAzJuZNQM/louKeNU5AiPm6pBKSTkAFTTjQhvFOQMoDWESBZk9AM+y9OBTST0BA1kUFVxxQQK4czi8WUlBA/wSVk9WIUEAsYA3SLcBQQHerIBgS9VBADld/nrslUUDYGzTVvFBRQKk5M4AqeFFAamRPBmWiUUAMSYuDstBRQOpSU/uBClJAvoERJ6xFUkCer86TR4dSQHccqRBKz1JACWpsgCkbU0B8s0UTZ2tTQDC03SOjwVNAUFFdYJUUVECVD2nksGVUQINbQUbQtFRABOiha/UIVUAnsoNd+mNVQKJVzdzww1VAs7Yj8NkeVkDapa+6Qm9WQCajkp2qvlZAnh4izPcKV0CPi/vno1JXQPDOpc5Ol1dAXFfuyk3kV0Du4TZaCTNYQNx6WEc4flhA0VelPIzJWEA=",
          "dtype": "f8"
         },
         "xaxis": "x",
         "y": {
          "bdata": "AAAAAAAAAAAC0g3pQ07iP1nBSsnQw/E/mSkFj0Op+D8XPGW6gC8BQFWzGPcO5wVAp2X0TXkrC0CetIEWDKAQQAu153xrhxNAZ8Qrefu7FkA893KNyaUaQDONaJpJ8h5Ae34La33PIUBPRyKj1wUkQP4taSkaNCZAg6qa/15oKEDorMWqvMwqQFhHTd3IOS1ATANKSlB+L0DdSQNnC90wQEekmqDY6jFAgWL49p/xMkCbzDk71eYzQNRdSahS0DRAhJxGuRrGNUBI5+ebAMM2QPCszUCi4DdALGmdIen/OEAVCcERcgo6QPKUdrjRGTtAma3xfBEmPECDPC0Orx49QFCddoHtDD5A0wvnfH4EP0DtxKJ/0+4/QD3LfJw6fUBA8v62YOD+QECrYR3T539BQIZPHblx+EFAOI4jd4xzQkCQVE/6pPNCQETNgQYKeENAhd3PSd78Q0AXdEqL+YhEQJ7x+fGUD0VABr3f5l+aRUBTd1TeazFGQPsiRBjR10ZAPlhQezF6R0C1fgUzfBdIQHyygU3ns0hAN7ppQDNKSUBXMnzLZudJQDmpYIxikkpADLGBQrhVS0AE1OQTSRxMQNmiyvDU2kxA3M1+gZySTUCMlmfIREZOQIlDkr1j7E5AZK/JpMaPT0DxnSuS4xhQQGLLYd+AYlBATTALSTunUECtXs43uO5QQKD/rFGtLVFAYIbPvJFqUUC++6hCCadRQCyvDcKD21FAOSJu4lYNUkCEHaV16T9SQBtRbFl0dVJAYFA0QdqxUkCWbglrDepSQFlBE7ryH1NAwuguMk9YU0DmBjCsSoxTQDUq1ljFwFNAGMXL36j8U0Alv3yYsjFUQGcfQuNNYlRAeAmjORqPVECdW6xjk7hUQMfsJoQ551RA7Du0YaMaVUC1ro7HQEtVQHhcbXBqc1VAeYoG9iuaVUBzHiDyaMNVQMYFrrou71VAX39UKjAVVkClXHkOez5WQBzHIkaFY1ZAzhvlQlaGVkAUHZ9QOaJWQJkL8kcExlZAJAX/neXoVkC40H44lwZXQNM6Axu9H1dAkr1duf05V0A=",
          "dtype": "f8"
         },
         "yaxis": "y"
        },
        {
         "marker": {
          "color": "gray",
          "opacity": 0.5,
          "size": 4
         },
         "mode": "markers",
         "name": "Measurements",
         "type": "scatter",
         "x": {
          "bdata": "xyyiUHtpDcAo44dHQk8ZQOCveE30r+w/osioooenFUDe6D2hAggKQLZtSPzr8yVAxg3x0vMzKED+0lzV5rgWQKhJvSDv6BdAXBW3BU1mLUCJpoecvYEkQAl6dHu5lyZAECxr+gmwMEBIz1LILWAgQGiuwVspBDNAwXaQhz9fLUCTMdpZTUAuQO45cBrcVDFACYESakxeMkDODj9QizIvQOjXnmhtiTJATKbLoqhcN0C1MJmv2dc1QJRFb4gBqDxAPmsn8HpYPkDKzeFcDig9QJJ9jxdRtT5AEKs+mEumOkD9vtRqRfQ/QM5swI/qwjxA7jmV7kOtQUBu0NgCMAdAQA0JZM8tjkBAq0t+tNbQQEAsSmVIOQxDQNGWvR10oEJAmjV9QC28QUDSU9lwHuZCQHTg7zAYj0FADWWH7zApRkAGY2FF03FEQMMmzzy0/0JAmlBDXT6/RkDvjKq2ielFQN5IovRz1EVA+IWVFsxPR0DDdYMgBvVFQPcFZj91HUlAOtMPVm4HSkC+4Lb1jTNKQNvxEUgN1kpA+EpWDH32SkCH9Ogl3fFKQGqj+pdoLUlAS6Hm1b5vS0DXt6eKJ9lLQCCeuAs1zktAVk/hb/zsTECCSl3Q1eNOQDUGdsWIUE9ASZaQcB6NTUD2AW1bBv9PQLIRHAjmpk9AYKpTN4+HUEBwt4gLGzBPQOheqqTH/U5Aj2CfBt5HUEBMsiw0oFFQQISC8x8N31BALZfGpL8cUUCC88JxhYFQQLZvEeQZV1FAKoitnzP8UECKYykPsvtQQNpl8QVH3VFAylgPEb6PUEDTinBPVfRSQAiTO5wtMVRAcnkcv6ZxUkDUr9+G0jlSQCGFJqul+1JAb0atNBfTVEBs2tfVzh9TQJ9rglnW8FRAxtqLu5FNVEC99o1PtnFVQIEu9WeWdVZAQqP9JADsVECx6NuWbSpXQCeWNk8F6VRAgR07mPksVkCltax+MrFXQH44JWrEjFhAQq+WLd6zVkBdmIdgcHtYQHGo52F8M1dAsWt21XfxV0DVz50hRCxXQH0U2TCQXFpAm4nFXkeeWEA=",
          "dtype": "f8"
         },
         "xaxis": "x",
         "y": {
          "bdata": "QDLyrQINqL/eCltZk2oJQAXqqX033hpA/KClDLmpFUChyotq0kcSQFzzn8v9HQtASqqvMwpIB0DflrJeNRUUQJzdcktdmxZAT5BhqHuWB0CCEq33vjAgQAQ7UP2OtBVAJKpOwqF0IUC346b0GusqQEDqLZM/uiRAr9jNKg81KkDIK2h/ayMqQBQzT2lvuS5AUAFwcRaELEA/dRbpjB8uQIUKdJqaYTRAVIP53jhIM0Cku+grX6Y2QO4HS7fvyzBAEzooP3q5NUAGgNRgtos5QAPAJcASfzdAWh8p2drTP0BudjoZ/zU2QEkznZthIkBA403In/bIQUCXkBZlAeBCQLw27NRSoTZAKnqyWz0GPkCP732Tt8k9QMO0wtrVaEFAuKSSrGkIQUCdmBFo82ZBQMcPlkF6ST1AqHKEs9byQ0CQtrEQjmhEQH6W0xNu8UFApjn4tfc5Q0CmZm+JQZFDQF3/QMuCikNAF1LptQMsRUAuaeH+N/FGQAvCwFyliEdASB1QWfwwSUAW6QE3CT9KQDa3P3nOSklA4Aj1jg4nSkCm/7JYKNJLQFzrZqqmEUxAoF1/ot7pS0DMX4315JhOQKqmTWHPSExAEJ0GrI6EUED1M51ado1QQMOXR0F0FU9AsogpQsddT0AI+cjG2UpQQJBBWlx7oU9A4F5qpA2vT0AMvRsRYmxSQNM1jN66r1FA1QVAbRIlUUDAYHsTfNNSQKmmyha+kVJAFNvXiF/hUkCME9Sd369SQCbQj8N+EFNAG+iGJOLsUkAPCfYQ+dJSQKlB46W271RALlV8hIpyU0A75145xzJTQPVzVbkCyFNAGCcL9LdsVEDg0hEOeRNVQExnhEIislRAtDEDlss/VEBmvySNKhNUQJMuigCLRlVAJNM/zLBYVUDISw0FrZ1UQKAptdWCeFVAq8++ZZd9VkA6V7G0Y89UQMc6lFdTlFZATQ2sQhLXVUDn54FgyipWQJIRIHMzvVZAckGom6/ZVEDLARTJhMpWQAJgGYTKXFZAJ5D/sCN/V0AorcGpKSFWQIUfndicR1dAHpBwocioV0A=",
          "dtype": "f8"
         },
         "yaxis": "y"
        },
        {
         "line": {
          "color": "#00d4ff",
          "dash": "dash",
          "width": 1.5
         },
         "mode": "lines",
         "name": "CPU Estimate",
         "type": "scatter",
         "x": {
          "bdata": "yGkw6v0CDMC0WNvzuaATQAxgfeAlkglAH5t5JkRDFUC4SiDen74TQOzoudI8USFAM+oUIG6WJkDhImoQPhskQPIVgZ24XCJAy1EhWSKqJ0BY6ti/PwMoQB97JS+gwChAUSYSfb79LECwO0HKn98pQGl4NEbw9S5ASus0ftelL0A1UevTeBwwQHN0IqgKAzFAhLeWOr78MUCS8+9nW6QxQIjXe9VaUTJAvnArWf6BNEDb7BvZmpY1QK2Ys6SwwThA1RIiEQy0O0AxXk3brV49QJG+bENa+j5AiFYkm2SFPkDg1AtVCdU/QGiC7Ulqhz9A/7Czz9W+QEDeHwKhRtpAQAbTJ4WeC0FA1I15FS05QUDH7XPQ/xtCQFRq2z5hmUJA2GgB1cyXQkA8tgrtqfBCQDmfSe/yrEJAkbKkC5sTREBuy/9LK4VEQOH1muzFRkRAHIqmpLleRUBpTgRW2+BFQI7H2LUnLUZApQce66LfRkCqKkgZr91GQFeKsJFY7UdAOr2CP7wHSUCPnNx7p+BJQG0SKACUrEpAawM5Q/s9S0B+38u4kZNLQL8ROHI2GktA8rpK78B1S0Dhw3Cc0NVLQL64RYqOD0xAQFhUd6mWTECRk7l5RqlNQOJQgec2mU5AjgJACYSeTkAmCSuDvW1PQBMSusYU3k9APwPSRLJRUEDGfgd9uD5QQNnhwscvHFBAHRmAhv9BUEBpHTvNKV5QQDNx86cqolBAyDZJvMPoUEDMnKywFuJQQNPpc/8cI1FASKTK8BEwUUDgXctTBjRRQC4D4Vxkg1FALAk8yA5FUUDILPrnmOtRQMmXqzmB3VJAfAepJfjxUkDv0amQqeBSQB1QRI8lDlNAT+qUxb/TU0C8nRdpjslTQB5hcqomXVRAXwIs8HWMVEATiyC0Bg9VQORAdGgcyFVAt1yUO2e+VUBUjPBhgXZWQFPKdJZmKlZAG5x1/9pUVkAb7IVXzfhWQPH5QgaNv1dA/X2zxYSkV0BkunkXRCRYQJjnJYOpBlhA1EVgWlonWECqc+pn1vBXQH6r5ISc5FhAoC8ERZ79WEA=",
          "dtype": "f8"
         },
         "xaxis": "x",
         "y": {
          "bdata": "mVqus93npr/gvEcBnDsGQCTAg88KYRlAkGHZkjrJGkBN4mx5ofEYQNHQ24ZCLBVA+62S0NPVEUCPqRZ2uz8TQMYtxNyxKRVA/Jy/T046EkCacU6+GbUXQHqt0Y4N2RdAIVkeOAqXHECV1SIpWHojQBrAXKnDIiVA7JQPb/8jKECZnRcYkjMqQNybdDkYKy1AKG+IPn1kLkCgqsdy0ZovQEpQumm1DDJAj/2n6LhGM0BupkeXjUY1QP1dlHiFjTRALLhWgUiRNUAIiOtLWJs3QBKJzzWnUzhA57we25K5O0A2E6AGD8I6QMFHAuQnZz1AfB6dScJBQEBTdi7nQsRBQCTCvas2GEBAFkgg1+zzP0BGZZRs8I8/QMMxEFUOfkBADh/Y7HziQEA4hJnG00dBQFgxtnx1jEBALLGAwl/XQUAmPeJ0jABDQO97c/rF90JAgkhaC6dTQ0Dtuo+L0qtDQDQhA3mk30NA03Yxf4SQREBJ6WX5d7FFQIvlJeBlukZAex7uMk0NSEBrwbB/s15JQDqOEQVP8ElAgIxXA5mPSkD4ciEXGYdLQFIc0rs/QkxA1L6wGK+nTECQhbeAGdFNQNkjHcW+yU1AyiTr9tlcT0AujoVWokNQQPg33h2lS1BA8DJ7cmFOUECNBbBnnXtQQFViMl6baFBAm+XQmsFSUECqQ7X0OilRQETqjSguiFFA1QV0RA6WUUDiO2u0GDFSQDA1MYWCiVJAnYOvHmrfUkBtukVwpgRTQB1S+FiFOFNAMipuRKVJU0Doh5Co1UNTQLgI/a4U+lNAfxfz8c34U0DTRxaKQtZTQNszZkzp6FNAJkVFQZktVEDX1wgEsppUQAwclsQCx1RAh5KyknK4VEC4yDYlZZVUQPjxhQFo41RA3R0pKMIjVUBvbVLPsw1VQK0I3z6NRVVAwP5kBHvLVUDPK0gUO5RVQPQuLhN9BFZAKZoreYYTVkAfMVqHfjZWQEoXIlS2gFZA5hCFN6cGVkBk0ZHzsVNWQIsezxjkZlZAsUwE+Q3bVkBfg5OpQLRWQN5To1Hb+VZAvTA/bJ5PV0A=",
          "dtype": "f8"
         },
         "yaxis": "y"
        },
        {
         "line": {
          "color": "#00d4ff",
          "width": 1.5
         },
         "mode": "lines",
         "name": "CPU Error",
         "type": "scatter",
         "xaxis": "x2",
         "y": {
          "bdata": "1AelwJMDDEDuB/jO2h4SQA68GxXqqRVAhynAeRs1F0Af6VqFLEMRQPh8JZr69xNAI5F+jogRGUDUw1Re9DIRQN1PoE1XJQRATJqgBgzZEEB7EQEhngEKQJuXou2O+wdAfUEBQ7y7DkCkgPuhO+TpP695ZErJeAJAvOiGa679+D8fH8Moh+HpP+ljZedzc9s/toIHwybP4T9F1rd0qIX+P+Idk9jDgwFAna8D11XA9j+SAvw8MoABQFBvU6+mxdU/5yNFoZN4/z/vi+3SFo8EQGTulkhYuwdA2u//CEIPCUA4EZWzqo8BQK4ZIJYRLgRAgksTTvNxE0CE2y9s6UgaQPsYh9YwcwJAZxzIEKyK7z+mhUHmbdfzPxObew4P/fM/UUFHaye72D85tXAcpUXdP3jVHD/pBQlADl4/17ug9T957ARGOyfjP5kDw7k0x/M/kC3Z+tlT+D/4hqqAa/X+Px4QIPxZRwNAD36oovWhAUDUdPFWIgnxPxPxUDZPGOU/hQllDvnMAEBQy4Wu630MQAgYePtwow9ABhnS3nWWEEC+g/WOEvsRQDQLfK3p2QxAjlVoD26uBkCTsLqH7koMQIib2UbG9f4/fOtwHgg1DUARQW3NiKYTQIeOk8RZjRFAk6/NFu7fB0B7fgis4mgLQOgyfBMtigpAgW+1twhdEUCzJW1fjM0JQFmgUJ1IUgFAuaiyKTPB+D/TAIF9XR4DQMMlbfY98QdAKAxKhCTeDEB2zOjFc/QIQGkwAK2mDQlARGZBhqAEA0ASYzkZwJD3PxlXP3d7TQtAABgvzzA1B0A48GKDS63zPzYBb2wA2QpA4aFPSZhjBkBSvhFcfzwBQNcN/+Knrf0/cjLfkiSlB0AY6mSOviD5P+zf4sjwcQNAW+kI4E8O/j8uFTyAloUGQBxT7IZ3cRFAmCOw186AB0BHljWs1GgRQBx6deG0Kfo/GyVnWc4D6z+PWe55vTgBQFxoXnyaKBBAffxNLXj2CEAtAe51Q/sLQEyJXLaETgJAdha6dKId8T83pdvilWn6P4uRpGcrS/s/UFnHuRMx7D8=",
          "dtype": "f8"
         },
         "yaxis": "y2"
        }
       ],
       "layout": {
        "annotations": [
         {
          "font": {
           "size": 16
          },
          "showarrow": false,
          "text": "Tracking Results",
          "x": 0.225,
          "xanchor": "center",
          "xref": "paper",
          "y": 1,
          "yanchor": "bottom",
          "yref": "paper"
         },
         {
          "font": {
           "size": 16
          },
          "showarrow": false,
          "text": "Tracking Error",
          "x": 0.775,
          "xanchor": "center",
          "xref": "paper",
          "y": 1,
          "yanchor": "bottom",
          "yref": "paper"
         }
        ],
        "height": 450,
        "template": {
         "layout": {
          "font": {
           "color": "#e6edf3"
          },
          "paper_bgcolor": "#0d1117",
          "plot_bgcolor": "#0d1117",
          "xaxis": {
           "gridcolor": "#30363d",
           "zerolinecolor": "#30363d"
          },
          "yaxis": {
           "gridcolor": "#30363d",
           "zerolinecolor": "#30363d"
          }
         }
        },
        "xaxis": {
         "anchor": "y",
         "domain": [
          0,
          0.45
         ],
         "title": {
          "text": "X Position"
         }
        },
        "xaxis2": {
         "anchor": "y2",
         "domain": [
          0.55,
          1
         ],
         "title": {
          "text": "Time Step"
         }
        },
        "yaxis": {
         "anchor": "x",
         "domain": [
          0,
          1
         ],
         "scaleanchor": "x",
         "scaleratio": 1,
         "title": {
          "text": "Y Position"
         }
        },
        "yaxis2": {
         "anchor": "x2",
         "domain": [
          0,
          1
         ],
         "title": {
          "text": "Position Error"
         }
        }
       }
      }
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU RMSE: 2.9987\n"
     ]
    }
   ],
   "source": [
    "# Visualize results\n",
    "fig = make_subplots(rows=1, cols=2, subplot_titles=['Tracking Results', 'Tracking Error'])\n",
    "\n",
    "# Trajectory plot\n",
    "fig.add_trace(\n",
    "    go.Scatter(x=true_state[:, 0], y=true_state[:, 2], mode='lines',\n",
    "               name='True', line=dict(color='#00ff88', width=2)),\n",
    "    row=1, col=1\n",
    ")\n",
    "fig.add_trace(\n",
    "    go.Scatter(x=measurements_arr[:, 0], y=measurements_arr[:, 1], mode='markers',\n",
    "               name='Measurements', marker=dict(color='gray', size=4, opacity=0.5)),\n",
    "    row=1, col=1\n",
    ")\n",
    "fig.add_trace(\n",
    "    go.Scatter(x=estimates_cpu[:, 0], y=estimates_cpu[:, 2], mode='lines',\n",
    "               name='CPU Estimate', line=dict(color='#00d4ff', width=1.5, dash='dash')),\n",
    "    row=1, col=1\n",
    ")\n",
    "if GPU_AVAILABLE:\n",
    "    fig.add_trace(\n",
    "        go.Scatter(x=estimates_gpu[:, 0], y=estimates_gpu[:, 2], mode='lines',\n",
    "                   name='GPU Estimate', line=dict(color='#ff4757', width=1.5, dash='dot')),\n",
    "        row=1, col=1\n",
    "    )\n",
    "\n",
    "# Error plot\n",
    "error_cpu = np.sqrt((estimates_cpu[:, 0] - true_state[:, 0])**2 + \n",
    "                    (estimates_cpu[:, 2] - true_state[:, 2])**2)\n",
    "fig.add_trace(\n",
    "    go.Scatter(y=error_cpu, mode='lines', name='CPU Error',\n",
    "               line=dict(color='#00d4ff', width=1.5)),\n",
    "    row=1, col=2\n",
    ")\n",
    "if GPU_AVAILABLE:\n",
    "    error_gpu = np.sqrt((estimates_gpu[:, 0] - true_state[:, 0])**2 + \n",
    "                        (estimates_gpu[:, 2] - true_state[:, 2])**2)\n",
    "    fig.add_trace(\n",
    "        go.Scatter(y=error_gpu, mode='lines', name='GPU Error',\n",
    "                   line=dict(color='#ff4757', width=1.5, dash='dash')),\n",
    "        row=1, col=2\n",
    "    )\n",
    "\n",
    "fig.update_xaxes(title_text='X Position', row=1, col=1)\n",
    "fig.update_yaxes(title_text='Y Position', row=1, col=1, scaleanchor='x', scaleratio=1)\n",
    "fig.update_xaxes(title_text='Time Step', row=1, col=2)\n",
    "fig.update_yaxes(title_text='Position Error', row=1, col=2)\n",
    "\n",
    "fig.update_layout(template=dark_template, height=450)\n",
    "fig.show()\n",
    "\n",
    "print(f\"CPU RMSE: {np.sqrt(np.mean(error_cpu**2)):.4f}\")\n",
    "if GPU_AVAILABLE:\n",
    "    print(f\"GPU RMSE: {np.sqrt(np.mean(error_gpu**2)):.4f}\")\n",
    "    print(f\"Max difference: {np.max(np.abs(estimates_cpu - estimates_gpu)):.2e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-23",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "Key takeaways for GPU acceleration:\n",
    "\n",
    "1. **CuPy provides NumPy compatibility** - Easy migration with minimal code changes\n",
    "2. **Batch processing is key** - GPU shines when processing many tracks/particles simultaneously\n",
    "3. **Memory management matters** - Pre-allocate, use pinned memory, free unused memory\n",
    "4. **Transfer overhead** - Minimize CPU↔GPU transfers for best performance\n",
    "5. **Problem size determines speedup** - Small problems may not benefit from GPU\n",
    "\n",
    "### Performance Guidelines\n",
    "\n",
    "| Scenario | Recommendation |\n",
    "|----------|---------------|\n",
    "| < 100 tracks | CPU (transfer overhead dominates) |\n",
    "| 100-1000 tracks | GPU beneficial for complex filters |\n",
    "| > 1000 tracks | GPU strongly recommended |\n",
    "| < 1000 particles | CPU usually faster |\n",
    "| > 10000 particles | GPU provides significant speedup |\n",
    "\n",
    "## Exercises\n",
    "\n",
    "1. Implement GPU-accelerated UKF with sigma point generation on GPU\n",
    "2. Add GPU memory pooling for repeated filter operations\n",
    "3. Benchmark different matrix sizes to find the CPU/GPU crossover point\n",
    "4. Implement GPU-accelerated JPDA likelihood computation\n",
    "\n",
    "## References\n",
    "\n",
    "1. CuPy Documentation: https://docs.cupy.dev/\n",
    "2. CUDA Programming Guide: https://docs.nvidia.com/cuda/\n",
    "3. Gustafsson, F. (2010). *Particle Methods for Tracking*."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}