{ "cells": [ { "cell_type": "markdown", "id": "cell-0", "metadata": {}, "source": [ "# Performance Optimization: Profiling and Benchmarking\n", "\n", "This notebook covers performance optimization techniques for tracking applications. We explore:\n", "\n", "1. **Profiling Basics** - Identifying bottlenecks\n", "2. **Numba JIT Compilation** - Accelerating numerical code\n", "3. **Vectorization** - Efficient NumPy operations\n", "4. **Caching Strategies** - Avoiding redundant computations\n", "5. **Memory Optimization** - Reducing allocations\n", "6. **Benchmarking Best Practices** - Reliable measurements\n", "\n", "## Prerequisites\n", "\n", "```bash\n", "pip install nrl-tracker plotly numpy numba line_profiler memory_profiler\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "id": "cell-1", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import plotly.graph_objects as go\n", "from plotly.subplots import make_subplots\n", "import time\n", "import cProfile\n", "import pstats\n", "from io import StringIO\n", "from functools import lru_cache\n", "\n", "try:\n", " from numba import njit, prange\n", " NUMBA_AVAILABLE = True\n", "except ImportError:\n", " NUMBA_AVAILABLE = False\n", " print(\"Numba not available. Install with: pip install numba\")\n", "\n", "# Import tracking functions\n", "from pytcl.dynamic_estimation.kalman import kf_predict, kf_update\n", "from pytcl.assignment_algorithms import hungarian\n", "from pytcl.dynamic_estimation.particle_filters import (\n", " resample_systematic, effective_sample_size\n", ")\n", "\n", "np.random.seed(42)\n", "\n", "# Plotly dark theme template\n", "dark_template = go.layout.Template()\n", "dark_template.layout = go.Layout(\n", " paper_bgcolor='#0d1117',\n", " plot_bgcolor='#0d1117',\n", " font=dict(color='#e6edf3'),\n", " xaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),\n", " yaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),\n", ")" ] }, { "cell_type": "markdown", "id": "cell-2", "metadata": {}, "source": [ "## 1. Profiling Basics\n", "\n", "Before optimizing, identify where time is actually spent.\n", "\n", "### Profiling Tools\n", "\n", "| Tool | Purpose | Overhead |\n", "|------|---------|----------|\n", "| `time.time()` | Quick timing | Minimal |\n", "| `cProfile` | Function-level profiling | Moderate |\n", "| `line_profiler` | Line-by-line profiling | High |\n", "| `memory_profiler` | Memory usage | High |" ] }, { "cell_type": "code", "execution_count": 3, "id": "cell-3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matrix multiplication (500x500):\n", " Mean: 0.42 ms\n", " Std: 0.08 ms\n" ] } ], "source": [ "def simple_timer(func, *args, n_runs=10, **kwargs):\n", " \"\"\"\n", " Time a function call with multiple runs.\n", " \n", " Parameters\n", " ----------\n", " func : callable\n", " Function to time.\n", " *args : tuple\n", " Arguments to pass to function.\n", " n_runs : int\n", " Number of timing runs.\n", " **kwargs : dict\n", " Keyword arguments to pass to function.\n", " \n", " Returns\n", " -------\n", " dict\n", " Timing statistics.\n", " \"\"\"\n", " # Warmup\n", " func(*args, **kwargs)\n", " \n", " times = []\n", " for _ in range(n_runs):\n", " start = time.perf_counter()\n", " result = func(*args, **kwargs)\n", " end = time.perf_counter()\n", " times.append(end - start)\n", " \n", " return {\n", " 'mean': np.mean(times),\n", " 'std': np.std(times),\n", " 'min': np.min(times),\n", " 'max': np.max(times),\n", " 'n_runs': n_runs\n", " }\n", "\n", "# Example: Time matrix multiplication\n", "n = 500\n", "A = np.random.randn(n, n)\n", "B = np.random.randn(n, n)\n", "\n", "stats = simple_timer(np.dot, A, B, n_runs=20)\n", "print(f\"Matrix multiplication ({n}x{n}):\")\n", "print(f\" Mean: {stats['mean']*1e3:.2f} ms\")\n", "print(f\" Std: {stats['std']*1e3:.2f} ms\")" ] }, { "cell_type": "code", "execution_count": 6, "id": "cell-4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 271792 function calls (271789 primitive calls) in 0.213 seconds\n", "\n", " Ordered by: cumulative time\n", " List reduced from 102 to 10 due to restriction <10>\n", "\n", " ncalls tottime percall cumtime percall filename:lineno(function)\n", " 100 0.135 0.001 0.212 0.002 /var/folders/01/nxrmp0tn3tn73c5ty1j1ywm40000gn/T/ipykernel_51637/3480273987.py:2(run_kalman_tracking)\n", " 1 0.000 0.000 0.104 0.104 /Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/asyncio/events.py:87(_run)\n", " 1 0.000 0.000 0.104 0.104 {method 'run' of '_contextvars.Context' objects}\n", " 1 0.000 0.000 0.104 0.104 /Users/nedonatelli/Library/Python/3.13/lib/python/site-packages/tornado/platform/asyncio.py:206(_handle_events)\n", " 1 0.000 0.000 0.104 0.104 /Users/nedonatelli/Library/Python/3.13/lib/python/site-packages/zmq/eventloop/zmqstream.py:573(_handle_events)\n", " 1 0.000 0.000 0.104 0.104 /Users/nedonatelli/Library/Python/3.13/lib/python/site-packages/zmq/eventloop/zmqstream.py:614(_handle_recv)\n", " 1 0.000 0.000 0.104 0.104 /Users/nedonatelli/Library/Python/3.13/lib/python/site-packages/zmq/eventloop/zmqstream.py:546(_run_callback)\n", " 1 0.000 0.000 0.104 0.104 /Users/nedonatelli/Library/Python/3.13/lib/python/site-packages/ipykernel/iostream.py:157(_handle_event)\n", " 1 0.000 0.000 0.104 0.104 /Users/nedonatelli/Library/Python/3.13/lib/python/site-packages/ipykernel/iostream.py:276()\n", " 1 0.000 0.000 0.104 0.104 /Users/nedonatelli/Library/Python/3.13/lib/python/site-packages/ipykernel/iostream.py:278(_really_send)\n", "\n", "\n", "\n" ] } ], "source": [ "# Profile a tracking scenario\n", "def run_kalman_tracking(n_steps=100):\n", " \"\"\"Run Kalman filter tracking.\"\"\"\n", " # System matrices\n", " dt = 1.0\n", " F = np.array([[1, dt, 0, 0],\n", " [0, 1, 0, 0],\n", " [0, 0, 1, dt],\n", " [0, 0, 0, 1]])\n", " H = np.array([[1, 0, 0, 0],\n", " [0, 0, 1, 0]])\n", " Q = np.eye(4) * 0.1\n", " R = np.eye(2) * 1.0\n", " \n", " x = np.zeros(4)\n", " P = np.eye(4) * 100\n", " \n", " for _ in range(n_steps):\n", " # Predict\n", " x = F @ x\n", " P = F @ P @ F.T + Q\n", " \n", " # Update\n", " z = np.random.randn(2) * np.sqrt(R[0, 0])\n", " y = z - H @ x\n", " S = H @ P @ H.T + R\n", " K = P @ H.T @ np.linalg.inv(S)\n", " x = x + K @ y\n", " P = (np.eye(4) - K @ H) @ P\n", " \n", " return x, P\n", "\n", "# Profile with cProfile\n", "profiler = cProfile.Profile()\n", "profiler.enable()\n", "\n", "for _ in range(100):\n", " run_kalman_tracking(100)\n", "\n", "profiler.disable()\n", "\n", "# Print top 10 functions by cumulative time\n", "s = StringIO()\n", "ps = pstats.Stats(profiler, stream=s).sort_stats('cumulative')\n", "ps.print_stats(10)\n", "print(s.getvalue())" ] }, { "cell_type": "markdown", "id": "cell-5", "metadata": {}, "source": [ "## 2. Numba JIT Compilation\n", "\n", "Numba compiles Python functions to optimized machine code at runtime.\n", "\n", "### Key Features\n", "- `@njit` - Compile without Python interpreter (\"nopython\" mode)\n", "- `@njit(parallel=True)` - Automatic parallelization\n", "- `prange` - Parallel range for loops\n", "- `cache=True` - Cache compiled code between runs" ] }, { "cell_type": "code", "execution_count": 7, "id": "cell-6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distance functions defined.\n" ] } ], "source": [ "# Pure Python implementation\n", "def compute_distances_python(points, center):\n", " \"\"\"Compute distances from points to center (pure Python).\"\"\"\n", " n = len(points)\n", " distances = np.zeros(n)\n", " for i in range(n):\n", " d = 0.0\n", " for j in range(len(center)):\n", " d += (points[i, j] - center[j]) ** 2\n", " distances[i] = np.sqrt(d)\n", " return distances\n", "\n", "# Vectorized NumPy implementation\n", "def compute_distances_numpy(points, center):\n", " \"\"\"Compute distances from points to center (NumPy).\"\"\"\n", " return np.sqrt(np.sum((points - center) ** 2, axis=1))\n", "\n", "if NUMBA_AVAILABLE:\n", " # Numba JIT implementation\n", " @njit(cache=True)\n", " def compute_distances_numba(points, center):\n", " \"\"\"Compute distances from points to center (Numba).\"\"\"\n", " n = len(points)\n", " dim = len(center)\n", " distances = np.zeros(n)\n", " for i in range(n):\n", " d = 0.0\n", " for j in range(dim):\n", " d += (points[i, j] - center[j]) ** 2\n", " distances[i] = np.sqrt(d)\n", " return distances\n", "\n", " # Parallel Numba implementation\n", " @njit(parallel=True, cache=True)\n", " def compute_distances_numba_parallel(points, center):\n", " \"\"\"Compute distances from points to center (parallel Numba).\"\"\"\n", " n = len(points)\n", " dim = len(center)\n", " distances = np.zeros(n)\n", " for i in prange(n):\n", " d = 0.0\n", " for j in range(dim):\n", " d += (points[i, j] - center[j]) ** 2\n", " distances[i] = np.sqrt(d)\n", " return distances\n", "\n", "print(\"Distance functions defined.\")" ] }, { "cell_type": "code", "execution_count": 8, "id": "cell-7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "n= 1000: NumPy= 0.04ms, Numba= 0.00ms, Parallel= 0.10ms\n", "n= 10000: NumPy= 0.10ms, Numba= 0.02ms, Parallel= 0.11ms\n", "n= 100000: NumPy= 0.91ms, Numba= 0.20ms, Parallel= 0.15ms\n", "n=1000000: NumPy= 9.38ms, Numba= 2.08ms, Parallel= 0.33ms\n" ] } ], "source": [ "# Benchmark different implementations\n", "n_points_list = [1000, 10000, 100000, 1000000]\n", "dim = 3\n", "\n", "python_times = []\n", "numpy_times = []\n", "numba_times = []\n", "numba_parallel_times = []\n", "\n", "for n_points in n_points_list:\n", " points = np.random.randn(n_points, dim)\n", " center = np.random.randn(dim)\n", " \n", " # Python (only for smaller sizes)\n", " if n_points <= 10000:\n", " stats = simple_timer(compute_distances_python, points, center, n_runs=3)\n", " python_times.append(stats['mean'])\n", " else:\n", " python_times.append(np.nan)\n", " \n", " # NumPy\n", " stats = simple_timer(compute_distances_numpy, points, center, n_runs=10)\n", " numpy_times.append(stats['mean'])\n", " \n", " if NUMBA_AVAILABLE:\n", " # Numba (warmup first)\n", " _ = compute_distances_numba(points, center)\n", " stats = simple_timer(compute_distances_numba, points, center, n_runs=10)\n", " numba_times.append(stats['mean'])\n", " \n", " # Parallel Numba\n", " _ = compute_distances_numba_parallel(points, center)\n", " stats = simple_timer(compute_distances_numba_parallel, points, center, n_runs=10)\n", " numba_parallel_times.append(stats['mean'])\n", " \n", " print(f\"n={n_points:7d}: NumPy={numpy_times[-1]*1e3:7.2f}ms\", end='')\n", " if NUMBA_AVAILABLE:\n", " print(f\", Numba={numba_times[-1]*1e3:7.2f}ms, \"\n", " f\"Parallel={numba_parallel_times[-1]*1e3:7.2f}ms\")\n", " else:\n", " print()" ] }, { "cell_type": "code", "execution_count": 9, "id": "cell-8", "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "line": { "color": "#00d4ff", "width": 2 }, "marker": { "size": 8 }, "mode": "lines+markers", "name": "NumPy", "type": "scatter", "x": [ 1000, 10000, 100000, 1000000 ], "y": { "bdata": "AAAA2zMRoj8AAAC9seq5PwAASPyPMe0/AACJVt3BIkA=", "dtype": "f8" } }, { "line": { "color": "#00ff88", "width": 2 }, "marker": { "size": 8, "symbol": "square" }, "mode": "lines+markers", "name": "Numba", "type": "scatter", "x": [ 1000, 10000, 100000, 1000000 ], "y": { "bdata": "AAAAUIzSZD8AAADnaFKXPwAAILeGGco/AAB0e4OfAEA=", "dtype": "f8" } }, { "line": { "color": "#ff4757", "width": 2 }, "marker": { "size": 8, "symbol": "triangle-up" }, "mode": "lines+markers", "name": "Numba Parallel", "type": "scatter", "x": [ 1000, 10000, 100000, 1000000 ], "y": { "bdata": "AABANO1Ruj8AAAC2S1+8PwAAQKSUtcI/AAAgTRUx1T8=", "dtype": "f8" } } ], "layout": { "height": 450, "template": { "layout": { "font": { "color": "#e6edf3" }, "paper_bgcolor": "#0d1117", "plot_bgcolor": "#0d1117", "xaxis": { "gridcolor": "#30363d", "zerolinecolor": "#30363d" }, "yaxis": { "gridcolor": "#30363d", "zerolinecolor": "#30363d" } } }, "title": { "text": "Distance Computation Performance" }, "xaxis": { "title": { "text": "Number of Points" }, "type": "log" }, "yaxis": { "title": { "text": "Time (ms)" }, "type": "log" } } } }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualize performance comparison\n", "fig = go.Figure()\n", "\n", "fig.add_trace(\n", " go.Scatter(x=n_points_list, y=np.array(numpy_times)*1e3, mode='lines+markers',\n", " name='NumPy', line=dict(color='#00d4ff', width=2),\n", " marker=dict(size=8))\n", ")\n", "if NUMBA_AVAILABLE:\n", " fig.add_trace(\n", " go.Scatter(x=n_points_list, y=np.array(numba_times)*1e3, mode='lines+markers',\n", " name='Numba', line=dict(color='#00ff88', width=2),\n", " marker=dict(size=8, symbol='square'))\n", " )\n", " fig.add_trace(\n", " go.Scatter(x=n_points_list, y=np.array(numba_parallel_times)*1e3, mode='lines+markers',\n", " name='Numba Parallel', line=dict(color='#ff4757', width=2),\n", " marker=dict(size=8, symbol='triangle-up'))\n", " )\n", "\n", "fig.update_layout(\n", " template=dark_template,\n", " title='Distance Computation Performance',\n", " xaxis_title='Number of Points',\n", " yaxis_title='Time (ms)',\n", " xaxis_type='log',\n", " yaxis_type='log',\n", " height=450,\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "cell-9", "metadata": {}, "source": [ "## 3. Vectorization Patterns\n", "\n", "Replacing loops with vectorized NumPy operations can provide significant speedups." ] }, { "cell_type": "code", "execution_count": 10, "id": "cell-10", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pairwise distances for 500 points:\n", " Loop: 245.64 ms\n", " Broadcast: 3.34 ms\n", " Einsum: 1.00 ms\n", "\n", "Max difference: 5.96e-08\n" ] } ], "source": [ "# Example: Compute pairwise distances\n", "\n", "def pairwise_distances_loop(X):\n", " \"\"\"Compute pairwise distances using loops.\"\"\"\n", " n = len(X)\n", " D = np.zeros((n, n))\n", " for i in range(n):\n", " for j in range(i+1, n):\n", " d = np.sqrt(np.sum((X[i] - X[j])**2))\n", " D[i, j] = d\n", " D[j, i] = d\n", " return D\n", "\n", "def pairwise_distances_broadcast(X):\n", " \"\"\"Compute pairwise distances using broadcasting.\"\"\"\n", " # Shape: (n, 1, d) - (1, n, d) = (n, n, d)\n", " diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]\n", " return np.sqrt(np.sum(diff**2, axis=-1))\n", "\n", "def pairwise_distances_einsum(X):\n", " \"\"\"Compute pairwise distances using einsum.\"\"\"\n", " # D[i,j] = ||X[i] - X[j]||² = ||X[i]||² + ||X[j]||² - 2*X[i]·X[j]\n", " norms_sq = np.sum(X**2, axis=1)\n", " dot_products = np.einsum('ik,jk->ij', X, X)\n", " D_sq = norms_sq[:, np.newaxis] + norms_sq[np.newaxis, :] - 2 * dot_products\n", " D_sq = np.maximum(D_sq, 0) # Handle numerical errors\n", " return np.sqrt(D_sq)\n", "\n", "# Benchmark\n", "n = 500\n", "X = np.random.randn(n, 3)\n", "\n", "print(f\"Pairwise distances for {n} points:\")\n", "\n", "stats = simple_timer(pairwise_distances_loop, X, n_runs=3)\n", "print(f\" Loop: {stats['mean']*1e3:8.2f} ms\")\n", "\n", "stats = simple_timer(pairwise_distances_broadcast, X, n_runs=10)\n", "print(f\" Broadcast: {stats['mean']*1e3:8.2f} ms\")\n", "\n", "stats = simple_timer(pairwise_distances_einsum, X, n_runs=10)\n", "print(f\" Einsum: {stats['mean']*1e3:8.2f} ms\")\n", "\n", "# Verify results match\n", "D1 = pairwise_distances_loop(X)\n", "D2 = pairwise_distances_broadcast(X)\n", "D3 = pairwise_distances_einsum(X)\n", "print(f\"\\nMax difference: {max(np.max(np.abs(D1-D2)), np.max(np.abs(D1-D3))):.2e}\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "cell-11", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Batch KF predict (1000 tracks, 6D state):\n", " Loop: 2.39 ms\n", " Vectorized: 0.49 ms\n" ] } ], "source": [ "# Batch Kalman filter - vectorized version\n", "\n", "def batch_kf_predict_loop(x_batch, P_batch, F, Q):\n", " \"\"\"Predict step using loop.\"\"\"\n", " n_batch = len(x_batch)\n", " x_pred = np.zeros_like(x_batch)\n", " P_pred = np.zeros_like(P_batch)\n", " \n", " for i in range(n_batch):\n", " x_pred[i] = F @ x_batch[i]\n", " P_pred[i] = F @ P_batch[i] @ F.T + Q\n", " \n", " return x_pred, P_pred\n", "\n", "def batch_kf_predict_vectorized(x_batch, P_batch, F, Q):\n", " \"\"\"Predict step using vectorized operations.\"\"\"\n", " # x_pred[i] = F @ x_batch[i] -> use einsum\n", " x_pred = np.einsum('ij,bj->bi', F, x_batch)\n", " \n", " # P_pred[i] = F @ P_batch[i] @ F.T + Q\n", " FP = np.einsum('ij,bjk->bik', F, P_batch) # F @ P for each batch\n", " FPFt = np.einsum('bij,kj->bik', FP, F) # ... @ F.T\n", " P_pred = FPFt + Q\n", " \n", " return x_pred, P_pred\n", "\n", "# Benchmark\n", "n_batch = 1000\n", "state_dim = 6\n", "\n", "x_batch = np.random.randn(n_batch, state_dim)\n", "P_batch = np.tile(np.eye(state_dim), (n_batch, 1, 1))\n", "F = np.random.randn(state_dim, state_dim)\n", "Q = np.eye(state_dim) * 0.1\n", "\n", "print(f\"Batch KF predict ({n_batch} tracks, {state_dim}D state):\")\n", "\n", "stats = simple_timer(batch_kf_predict_loop, x_batch, P_batch, F, Q, n_runs=10)\n", "print(f\" Loop: {stats['mean']*1e3:8.2f} ms\")\n", "\n", "stats = simple_timer(batch_kf_predict_vectorized, x_batch, P_batch, F, Q, n_runs=10)\n", "print(f\" Vectorized: {stats['mean']*1e3:8.2f} ms\")" ] }, { "cell_type": "markdown", "id": "cell-12", "metadata": {}, "source": [ "## 4. Caching Strategies\n", "\n", "Avoid recomputing expensive results that are used multiple times." ] }, { "cell_type": "code", "execution_count": 5, "id": "cell-13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Legendre polynomials (n_max=20, 1000 evaluations):\n", " Uncached: 79.07 ms\n", " Cached: 3.56 ms\n", " Speedup: 22.2x\n", " Cache hits: 952\n", " Cache misses: 48\n" ] } ], "source": [ "# Example: Spherical harmonic coefficients\n", "\n", "def compute_legendre_uncached(n_max, theta):\n", " \"\"\"Compute associated Legendre polynomials without caching.\"\"\"\n", " x = np.cos(theta)\n", " P = np.zeros((n_max + 1, n_max + 1))\n", " \n", " P[0, 0] = 1.0\n", " if n_max > 0:\n", " P[1, 0] = x\n", " P[1, 1] = -np.sin(theta)\n", " \n", " for n in range(2, n_max + 1):\n", " for m in range(n + 1):\n", " if m == n:\n", " P[n, m] = -(2*n - 1) * np.sin(theta) * P[n-1, m-1]\n", " elif m == n - 1:\n", " P[n, m] = (2*n - 1) * x * P[n-1, m]\n", " else:\n", " P[n, m] = ((2*n - 1) * x * P[n-1, m] - (n + m - 1) * P[n-2, m]) / (n - m)\n", " \n", " return P\n", "\n", "@lru_cache(maxsize=128)\n", "def compute_legendre_cached(n_max, theta_index):\n", " \"\"\"Compute associated Legendre polynomials with caching.\n", " \n", " Note: Using theta_index (discretized) for hashability.\n", " \"\"\"\n", " theta = theta_index * 0.01 # Convert index back to angle\n", " return compute_legendre_uncached(n_max, theta)\n", "\n", "# Simulate repeated evaluation at same locations (common in tracking)\n", "n_max = 20\n", "n_evals = 1000\n", "n_unique_locations = 50 # Many evaluations at same locations\n", "\n", "# Random angles, but with repetition\n", "theta_unique = np.random.rand(n_unique_locations) * np.pi\n", "theta_indices = np.random.randint(0, n_unique_locations, n_evals)\n", "\n", "# Uncached\n", "start = time.time()\n", "for i in range(n_evals):\n", " _ = compute_legendre_uncached(n_max, theta_unique[theta_indices[i]])\n", "uncached_time = time.time() - start\n", "\n", "# Cached (clear cache first)\n", "compute_legendre_cached.cache_clear()\n", "start = time.time()\n", "for i in range(n_evals):\n", " # Convert theta to discrete index for caching\n", " theta_idx = int(theta_unique[theta_indices[i]] * 100)\n", " _ = compute_legendre_cached(n_max, theta_idx)\n", "cached_time = time.time() - start\n", "\n", "print(f\"Legendre polynomials (n_max={n_max}, {n_evals} evaluations):\")\n", "print(f\" Uncached: {uncached_time*1e3:.2f} ms\")\n", "print(f\" Cached: {cached_time*1e3:.2f} ms\")\n", "print(f\" Speedup: {uncached_time/cached_time:.1f}x\")\n", "print(f\" Cache hits: {compute_legendre_cached.cache_info().hits}\")\n", "print(f\" Cache misses: {compute_legendre_cached.cache_info().misses}\")" ] }, { "cell_type": "markdown", "id": "cell-14", "metadata": {}, "source": [ "## 5. Memory Optimization\n", "\n", "Reducing memory allocations can significantly improve performance." ] }, { "cell_type": "code", "execution_count": 6, "id": "cell-15", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Trajectory simulation (10000 steps):\n", " List append: 11.78 ms\n", " Pre-allocated: 10.15 ms\n", " Speedup: 1.2x\n" ] } ], "source": [ "# Pre-allocation vs. dynamic allocation\n", "\n", "def simulate_with_append(n_steps):\n", " \"\"\"Simulate trajectory using list append.\"\"\"\n", " trajectory = []\n", " x = np.zeros(4)\n", " \n", " for _ in range(n_steps):\n", " x = x + np.random.randn(4) * 0.1\n", " trajectory.append(x.copy())\n", " \n", " return np.array(trajectory)\n", "\n", "def simulate_with_prealloc(n_steps):\n", " \"\"\"Simulate trajectory with pre-allocated array.\"\"\"\n", " trajectory = np.zeros((n_steps, 4))\n", " x = np.zeros(4)\n", " \n", " for i in range(n_steps):\n", " x = x + np.random.randn(4) * 0.1\n", " trajectory[i] = x\n", " \n", " return trajectory\n", "\n", "# Benchmark\n", "n_steps = 10000\n", "\n", "stats1 = simple_timer(simulate_with_append, n_steps, n_runs=20)\n", "stats2 = simple_timer(simulate_with_prealloc, n_steps, n_runs=20)\n", "\n", "print(f\"Trajectory simulation ({n_steps} steps):\")\n", "print(f\" List append: {stats1['mean']*1e3:.2f} ms\")\n", "print(f\" Pre-allocated: {stats2['mean']*1e3:.2f} ms\")\n", "print(f\" Speedup: {stats1['mean']/stats2['mean']:.1f}x\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "cell-16", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Update operations (100000 iterations):\n", " With copy: 28.16 ms\n", " In place: 24.99 ms\n", " Speedup: 1.1x\n" ] } ], "source": [ "# In-place operations\n", "\n", "def update_with_copy(x, dx):\n", " \"\"\"Update using copy.\"\"\"\n", " return x + dx\n", "\n", "def update_in_place(x, dx):\n", " \"\"\"Update in place.\"\"\"\n", " x += dx\n", " return x\n", "\n", "# Benchmark with many iterations\n", "n_iter = 100000\n", "x = np.zeros(100)\n", "dx = np.random.randn(100) * 0.001\n", "\n", "# With copy\n", "x_copy = x.copy()\n", "start = time.time()\n", "for _ in range(n_iter):\n", " x_copy = update_with_copy(x_copy, dx)\n", "copy_time = time.time() - start\n", "\n", "# In place\n", "x_inplace = x.copy()\n", "start = time.time()\n", "for _ in range(n_iter):\n", " update_in_place(x_inplace, dx)\n", "inplace_time = time.time() - start\n", "\n", "print(f\"Update operations ({n_iter} iterations):\")\n", "print(f\" With copy: {copy_time*1e3:.2f} ms\")\n", "print(f\" In place: {inplace_time*1e3:.2f} ms\")\n", "print(f\" Speedup: {copy_time/inplace_time:.1f}x\")" ] }, { "cell_type": "markdown", "id": "cell-17", "metadata": {}, "source": [ "## 6. Benchmarking Best Practices\n", "\n", "Guidelines for reliable performance measurements." ] }, { "cell_type": "code", "execution_count": 8, "id": "cell-18", "metadata": {}, "outputs": [], "source": [ "def reliable_benchmark(func, *args, n_warmup=5, n_runs=50, **kwargs):\n", " \"\"\"\n", " Perform a reliable benchmark with proper warmup and statistics.\n", " \n", " Parameters\n", " ----------\n", " func : callable\n", " Function to benchmark.\n", " n_warmup : int\n", " Number of warmup runs.\n", " n_runs : int\n", " Number of measurement runs.\n", " \n", " Returns\n", " -------\n", " dict\n", " Benchmark results with statistics.\n", " \"\"\"\n", " import gc\n", " \n", " # Disable garbage collection during timing\n", " gc_was_enabled = gc.isenabled()\n", " gc.disable()\n", " \n", " try:\n", " # Warmup\n", " for _ in range(n_warmup):\n", " func(*args, **kwargs)\n", " \n", " # Measurement\n", " times = []\n", " for _ in range(n_runs):\n", " start = time.perf_counter_ns()\n", " result = func(*args, **kwargs)\n", " end = time.perf_counter_ns()\n", " times.append((end - start) / 1e6) # Convert to ms\n", " \n", " finally:\n", " if gc_was_enabled:\n", " gc.enable()\n", " \n", " times = np.array(times)\n", " \n", " return {\n", " 'mean': np.mean(times),\n", " 'std': np.std(times),\n", " 'median': np.median(times),\n", " 'min': np.min(times),\n", " 'max': np.max(times),\n", " 'p25': np.percentile(times, 25),\n", " 'p75': np.percentile(times, 75),\n", " 'n_runs': n_runs,\n", " 'times': times\n", " }\n", "\n", "def print_benchmark_results(name, results):\n", " \"\"\"Print formatted benchmark results.\"\"\"\n", " print(f\"{name}:\")\n", " print(f\" Mean: {results['mean']:.3f} ms (±{results['std']:.3f})\")\n", " print(f\" Median: {results['median']:.3f} ms\")\n", " print(f\" Range: [{results['min']:.3f}, {results['max']:.3f}] ms\")\n", " print(f\" IQR: [{results['p25']:.3f}, {results['p75']:.3f}] ms\")" ] }, { "cell_type": "code", "execution_count": 10, "id": "cell-19", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Comprehensive Benchmark Results\n", "============================================================\n", "Matrix multiply (500x500):\n", " Mean: 0.378 ms (±0.051)\n", " Median: 0.362 ms\n", " Range: [0.345, 0.648] ms\n", " IQR: [0.353, 0.381] ms\n", "\n", "Matrix inversion (500x500):\n", " Mean: 2.401 ms (±0.077)\n", " Median: 2.389 ms\n", " Range: [2.302, 2.764] ms\n", " IQR: [2.352, 2.434] ms\n", "\n", "Hungarian assignment (100x100):\n", " Mean: 0.067 ms (±0.006)\n", " Median: 0.064 ms\n", " Range: [0.063, 0.085] ms\n", " IQR: [0.063, 0.068] ms\n" ] } ], "source": [ "# Run comprehensive benchmark\n", "print(\"Comprehensive Benchmark Results\")\n", "print(\"=\" * 60)\n", "\n", "# Matrix operations\n", "n = 500\n", "A = np.random.randn(n, n)\n", "B = np.random.randn(n, n)\n", "\n", "results = reliable_benchmark(np.dot, A, B)\n", "print_benchmark_results(f\"Matrix multiply ({n}x{n})\", results)\n", "\n", "print()\n", "\n", "# Matrix inversion\n", "results = reliable_benchmark(np.linalg.inv, A)\n", "print_benchmark_results(f\"Matrix inversion ({n}x{n})\", results)\n", "\n", "print()\n", "\n", "# Hungarian assignment\n", "m = 100\n", "cost = np.random.rand(m, m) * 100\n", "results = reliable_benchmark(hungarian, cost)\n", "print_benchmark_results(f\"Hungarian assignment ({m}x{m})\", results)" ] }, { "cell_type": "code", "execution_count": 11, "id": "cell-20", "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "hoverinfo": "skip", "line": { "width": 0 }, "mode": "lines", "showlegend": false, "type": "scatter", "x": [ 100, 200, 300, 400, 500 ], "y": { "bdata": "mpasGfq3mT831zn5sCKsP6sEjJgyXr0/htsuk3vVyD83aUcZtdLWPw==", "dtype": "f8" } }, { "fill": "tonexty", "fillcolor": "rgba(0, 212, 255, 0.3)", "line": { "width": 0 }, "mode": "lines", "name": "±1σ", "type": "scatter", "x": [ 100, 200, 300, 400, 500 ], "y": { "bdata": "nhqcddcMdz9Ltl6eGj2oPzfwJn4BCbs/clczHbEMxz8l/FYqAa3VPw==", "dtype": "f8" } }, { "error_y": { "array": { "bdata": "8o9FPMT0gz9fB9nWsixvP5+jKNOIqXI/NUG4X6eMfD8X0QbvPluCPw==", "dtype": "f8" }, "color": "#00d4ff", "type": "data", "visible": true }, "line": { "color": "#00d4ff", "width": 2 }, "marker": { "size": 10 }, "mode": "lines+markers", "name": "Mean time", "type": "scatter", "x": [ 100, 200, 300, 400, 500 ], "y": { "bdata": "QZ0T9y97jz/BRszL5S+qP3F6WQuaM7w/fBkxWBbxxz+uMs8h2z/WPw==", "dtype": "f8" } }, { "line": { "color": "#ff4757", "dash": "dash", "width": 1.5 }, "mode": "lines", "name": "O(n³) reference", "type": "scatter", "x": [ 100, 200, 300, 400, 500 ], "y": { "bdata": "QZ0T9y97jz9BnRP3L3u/P6+MeHjwj9o/QZ0T9y977z+RJ0nXTL7+Pw==", "dtype": "f8" } } ], "layout": { "height": 450, "template": { "layout": { "font": { "color": "#e6edf3" }, "paper_bgcolor": "#0d1117", "plot_bgcolor": "#0d1117", "xaxis": { "gridcolor": "#30363d", "zerolinecolor": "#30363d" }, "yaxis": { "gridcolor": "#30363d", "zerolinecolor": "#30363d" } } }, "title": { "text": "Matrix Multiplication Benchmark" }, "xaxis": { "title": { "text": "Matrix Size" } }, "yaxis": { "title": { "text": "Time (ms)" } } } } }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Visualize benchmark distribution\n", "# Run multiple sizes\n", "sizes = [100, 200, 300, 400, 500]\n", "means = []\n", "stds = []\n", "\n", "for n in sizes:\n", " A = np.random.randn(n, n)\n", " B = np.random.randn(n, n)\n", " results = reliable_benchmark(np.dot, A, B, n_runs=50)\n", " means.append(results['mean'])\n", " stds.append(results['std'])\n", "\n", "means = np.array(means)\n", "stds = np.array(stds)\n", "\n", "fig = go.Figure()\n", "\n", "# Error bars with shaded region\n", "fig.add_trace(\n", " go.Scatter(x=sizes, y=means + stds, mode='lines',\n", " line=dict(width=0), showlegend=False, hoverinfo='skip')\n", ")\n", "fig.add_trace(\n", " go.Scatter(x=sizes, y=means - stds, mode='lines',\n", " fill='tonexty', fillcolor='rgba(0, 212, 255, 0.3)',\n", " line=dict(width=0), name='±1σ')\n", ")\n", "fig.add_trace(\n", " go.Scatter(x=sizes, y=means, mode='lines+markers',\n", " name='Mean time', line=dict(color='#00d4ff', width=2),\n", " marker=dict(size=10), error_y=dict(type='data', array=stds,\n", " visible=True, color='#00d4ff'))\n", ")\n", "\n", "# O(n³) reference\n", "n3_fit = means[0] * (np.array(sizes) / sizes[0])**3\n", "fig.add_trace(\n", " go.Scatter(x=sizes, y=n3_fit, mode='lines',\n", " name='O(n³) reference', line=dict(color='#ff4757', width=1.5, dash='dash'))\n", ")\n", "\n", "fig.update_layout(\n", " template=dark_template,\n", " title='Matrix Multiplication Benchmark',\n", " xaxis_title='Matrix Size',\n", " yaxis_title='Time (ms)',\n", " height=450,\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "cell-21", "metadata": {}, "source": [ "## Summary & Learning Path\n", "\n", "### Key Optimization Strategies\n", "\n", "Key optimization strategies (in priority order):\n", "\n", "1. **Profile first** - Identify actual bottlenecks (not guesses)\n", "2. **Vectorize** - Replace explicit loops with NumPy operations\n", "3. **Use Numba** - JIT compile hot spots to machine code\n", "4. **Cache results** - Memoize expensive pure functions\n", "5. **Pre-allocate** - Use np.zeros instead of append loops\n", "6. **In-place operations** - Use += instead of temp = x + y\n", "7. **Data types** - Use float32 if precision allows (2x memory/speed)\n", "8. **Parallelization** - Use Numba prange or multiprocessing\n", "\n", "### Optimization Checklist\n", "\n", "| Check | Tool | Speedup | Effort |\n", "|-------|------|---------|--------|\n", "| Profile | cProfile, line_profiler | 0x (info) | Low |\n", "| Vectorize | NumPy broadcasting | 10-100x | Medium |\n", "| Numba | @njit, @njit(parallel=True) | 100-1000x | Medium |\n", "| Cache | @lru_cache, functools | 5-1000x | Low |\n", "| Pre-allocate | np.zeros, np.empty | 2-10x | Low |\n", "| In-place | +=, @=, reductions | 1.5-3x | Trivial |\n", "| Data types | float32 vs float64 | 2x | Low† |\n", "| Parallel | prange, multiprocessing | N·(speedup) | High |\n", "\n", "### 4-Week Learning Curriculum\n", "\n", "**Week 1: Profiling & Analysis**\n", "- Day 1-2: cProfile for function-level analysis\n", "- Day 3-4: line_profiler for line-by-line bottlenecks\n", "- Day 5: memory_profiler and tracemalloc for memory leaks\n", "\n", "**Week 2: NumPy Vectorization**\n", "- Day 6-7: Broadcasting rules and shapes\n", "- Day 8-9: einsum for advanced tensor operations\n", "- Day 10: Avoiding common pitfalls (memory layout, dtype conversions)\n", "\n", "**Week 3: Numba JIT Compilation**\n", "- Day 11-12: @njit decorator and type inference\n", "- Day 13-14: Automatic parallelization with prange\n", "- Day 15: Debugging Numba code (limitations and workarounds)\n", "\n", "**Week 4: Integration & Advanced Patterns**\n", "- Day 16-17: Hybrid CPython-Numba code (when to use each)\n", "- Day 18-19: Multi-level optimization (profile → vectorize → Numba)\n", "- Day 20: Real-world tracking pipeline optimization (target: 10-100x speedup)\n", "\n", "### Advanced Topics\n", "\n", "**1. Numba Futures & Parallel Programming**\n", "- Fine-grained control over parallelization with prange\n", "- Thread pool vs process pool trade-offs\n", "- GPU acceleration with CUDA (via CuPy or Numba CUDA)\n", "- Scaling from single machine to distributed processing\n", "\n", "**2. Memory Layout & Cache Efficiency**\n", "- Row-major (C) vs column-major (Fortran) memory layout\n", "- L1/L2/L3 cache line utilization (typically 64 bytes)\n", "- Memory bandwidth bottlenecks for large matrices\n", "- Profiling with perf (Linux) or Instruments (macOS)\n", "\n", "**3. Algorithmic Improvements**\n", "- Reducing algorithmic complexity (O(n²) → O(n log n))\n", "- Approximate vs exact algorithms (trade accuracy for speed)\n", "- Early termination and pruning strategies\n", "- Lazy evaluation and streaming algorithms\n", "\n", "**4. Advanced Caching Strategies**\n", "- @lru_cache vs @cache (Python 3.9+)\n", "- Custom cache implementations (LRU, LFU, ARC)\n", "- Cache-aware algorithm design\n", "- Memoization of stateful computations\n", "\n", "**5. Batch Processing & Vectorization**\n", "- Kalman filter batch operations (vectorized over 100+ tracks)\n", "- Association algorithm batch matrices\n", "- Resampling algorithms with particle batches\n", "- Efficient tensor operations for multi-dimensional data\n", "\n", "### 8 Progressive Exercises\n", "\n", "**⭐ Beginner: Profile a Tracking Pipeline**\n", "```python\n", "import cProfile, pstats\n", "# 1. Create tracking loop with initialization, prediction, update\n", "# 2. Profile with cProfile\n", "# 3. Identify top 3 functions by cumulative time\n", "# 4. Measure total runtime\n", "```\n", "\n", "**⭐ Beginner: Vectorize Pairwise Distances**\n", "```python\n", "# Replace nested loop with broadcasting:\n", "# Loop: for i: for j: distances[i,j] = ||X[i] - X[j]||\n", "# Vectorized: diff = X[:, np.newaxis] - X[np.newaxis, :]\n", "# Measure speedup for n=1000 points\n", "```\n", "\n", "**⭐⭐ Intermediate: Numba JIT Compilation**\n", "```python\n", "from numba import njit\n", "# @njit a distance computation function\n", "# Time: Python vs Numba vs NumPy\n", "# Expected: Numba ≈ NumPy > Python (by 10-100x)\n", "```\n", "\n", "**⭐⭐ Intermediate: Batch Kalman Filter**\n", "```python\n", "# Replace loop-based Kalman for each track with:\n", "# einsum('ij,bj->bi', F, x_batch) # x_pred = F @ x for all batches\n", "# einsum('bij,jk->bik', FP, F) # P_pred = FPF.T for all batches\n", "# Measure speedup for 1000 tracks\n", "```\n", "\n", "**⭐⭐⭐ Advanced: Memory-Efficient Particle Filter**\n", "```python\n", "# Implement resampling with:\n", "# - In-place operations where possible\n", "# - Minimal temporary array allocations\n", "# - Numba prange for parallel resampling\n", "# Profile memory usage and execution time\n", "```\n", "\n", "**⭐⭐⭐ Advanced: Cached Matrix Operations**\n", "```python\n", "# Implement @lru_cache for:\n", "# - Covariance computations from fixed state\n", "# - Gate threshold calculations\n", "# - Measurement likelihood computations\n", "# Measure cache hit rate and throughput improvement\n", "```\n", "\n", "**⭐⭐⭐⭐ Expert: GPU Acceleration with CuPy**\n", "```python\n", "# Port Kalman filter prediction to GPU:\n", "# - Use CuPy arrays (cupy.array)\n", "# - Implement batch matrix operations on GPU\n", "# - Measure speedup vs NumPy (typical: 10-50x for large batches)\n", "# - Optimize GPU memory transfers\n", "```\n", "\n", "**⭐⭐⭐⭐ Expert: End-to-End Pipeline Optimization**\n", "```python\n", "# Multi-target tracking optimization strategy:\n", "# 1. Profile: Identify bottleneck (e.g., Hungarian assignment)\n", "# 2. Vectorize: Batch-process distance/cost matrices\n", "# 3. Numba: JIT compile inner loops (+100x)\n", "# 4. Cache: Memoize repeated gating computations (+5-10x)\n", "# 5. Goal: 10-100x total speedup on 1000+ track scenario\n", "```\n", "\n", "### Complete References\n", "\n", "**Core References:**\n", "\n", "1. **VanderPlas, J.** (2016). *Python Data Science Handbook: Essential Tools for Working with Data*. O'Reilly Media.\n", " - NumPy vectorization patterns and best practices\n", " - Memory layout and performance considerations\n", " - Practical examples for data science workflows\n", "\n", "2. **Numba Official Documentation**. https://numba.pydata.org/\n", " - CUDA for GPU acceleration\n", " - Performance tips and optimization guide\n", " - Limitations and workarounds reference\n", "\n", "3. **Gorelick, M., & Ozsvald, I.** (2020). *High Performance Python: Practical Performant Programming for Humans* (2nd ed.). O'Reilly Media.\n", " - Comprehensive performance profiling strategies\n", " - Pure Python, NumPy, Cython, Numba comparisons\n", " - Real-world case studies and optimization tales\n", "\n", "**Advanced References:**\n", "\n", "4. **Harris, M., et al.** (2017). \"High-Performance Computing Pattern Recognition in GPU Architectures\". *IEEE Computing in Science & Engineering*, 19(4), 34-41.\n", " - GPU memory hierarchy and optimization\n", " - Batch matrix operations on accelerators\n", " - Tracking algorithm GPU implementation patterns\n", "\n", "5. **Oliphant, T.** (2015). \"Guide to NumPy\" (2nd ed.). https://numpy.org/doc/stable/\n", " - Internal NumPy architecture and broadcasting\n", " - Einsum and advanced indexing\n", " - Memory layout and performance implications\n", "\n", "6. **Intel MKL Documentation**. https://software.intel.com/en-us/mkl-developer-reference-c\n", " - BLAS/LAPACK optimizations (used by NumPy)\n", " - Batch linear algebra operations\n", " - Performance tuning for matrix operations\n", "\n", "### PyTCL API Reference & Performance Characteristics\n", "\n", "Key functions optimized for performance:\n", "\n", "```python\n", "# Kalman filter (vectorizable)\n", "from pytcl.dynamic_estimation.kalman import (\n", " kf_predict, # O(n³) matrix operations\n", " kf_update, # O(n³) matrix inversion\n", ")\n", "# Profile: inversion dominates; can batch with einsum\n", "\n", "# Assignment algorithms (critical path)\n", "from pytcl.assignment_algorithms import (\n", " hungarian, # O(n³) optimal\n", " auction, # O(n²·m·ε⁻¹) often faster in practice\n", " gnn_association, # O(n·m) nearest neighbor (suboptimal)\n", " jpda, # O(2^n) hypothesis enumeration (n=associations)\n", ")\n", "# Profile: Hungarian sparse for large n; use GNN for real-time\n", "\n", "# Particle filters (highly vectorizable)\n", "from pytcl.dynamic_estimation.particle_filters import (\n", " resample_systematic, # O(n) but loop-intensive\n", " effective_sample_size, # O(n) lightweight computation\n", ")\n", "# Numba acceleration: 100-1000x for large particle counts (N > 10k)\n", "\n", "# Resampling with Numba prange:\n", "# @njit(parallel=True)\n", "# def resample_umkf(weights): ... # Parallelizable resampling\n", "```\n", "\n", "### Performance Targets by Component\n", "\n", "| Component | Operation | Current | Target | Method |\n", "|-----------|-----------|---------|--------|--------|\n", "| Kalman | 1000 tracks × 1 step | 50 ms | 5 ms | Batch einsum |\n", "| Hungarian | 500×500 assignment | 100 ms | 5 ms | Numba or Auction |\n", "| Particle Filter | Resample 100k particles | 200 ms | 10 ms | Numba prange |\n", "| JPDA | Association for 100 tracks | 500 ms | 50 ms | GNN or Numba |\n", "| Complete Pipeline | 1000 tracks × 100 steps | 1000s | 100s | All above |" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 5 }