Performance Optimization Guide ============================== Overview -------- This guide covers CPU-side optimization techniques for the Tracker Component Library. For GPU acceleration, see :doc:`gpu_acceleration`. Key Techniques: 1. **Profiling** - Identify bottlenecks 2. **Vectorization** - Batch operations 3. **Algorithm Selection** - Choose O(n) over O(n²) 4. **Caching** - Reuse computed values 5. **Numba JIT** - Compile hotspots to machine code 6. **Sparse Data Structures** - Reduce memory overhead Before Optimizing ------------------ **Profile First!** Never optimize without profiling. Most time is spent in a few functions: .. code-block:: python import cProfile import pstats from io import StringIO def profile_tracking_algorithm(): # Your tracking code here for z in measurements: x, P = extended_kalman_filter(x, P, z, ...) # Profile profiler = cProfile.Profile() profiler.enable() profile_tracking_algorithm() profiler.disable() # View results stats = pstats.Stats(profiler) stats.sort_stats('cumulative') stats.print_stats(20) # Top 20 functions Alternative: Using **line_profiler** for detailed line-by-line analysis: .. code-block:: bash pip install line_profiler kernprof -l -v your_script.py # Profiles lines with @profile decorator Vectorization & Batching ------------------------- **Problem: Loop Over Measurements** .. code-block:: python # ❌ SLOW: Python loop - slow interpreter overhead for z in measurements: # measurements shape: (10000, 2) x, P = extended_kalman_filter(x, P, z, ...) # ✅ FAST: Batch processing (if your filter supports it) # Process all measurements at once x, P = extended_kalman_filter_batch(x, P, measurements, ...) **Problem: Repeated Coordinate Conversions** .. code-block:: python # ❌ SLOW: Convert one at a time cartesian_coords = [] for spherical in spherical_array: cartesian = sphere2cart(spherical) cartesian_coords.append(cartesian) # ✅ FAST: Vectorized conversion from pytcl.coordinate_systems.conversions import sphere2cart cartesian_coords = sphere2cart(spherical_array) **Problem: Data Association with Many Targets** .. code-block:: python # ❌ SLOW: Compute full cost matrix cost = np.zeros((n_targets, n_measurements)) for i, target in enumerate(targets): for j, measurement in enumerate(measurements): cost[i, j] = compute_distance(target, measurement) # ✅ FAST: Vectorized distance computation from scipy.spatial.distance import cdist cost = cdist(targets, measurements, metric='euclidean') Algorithm Selection ------------------- **Assignment Problems:** =================== ====== ======= =============== Algorithm Time Optimal Best For =================== ====== ======= =============== Greedy O(n²) No Quick estimates Hungarian (Munkres) O(n³) Yes Small: n < 1000 Auction O(n³) Yes Modern libraries Auction + heuristic O(n²) ~Yes Large: n > 1000 =================== ====== ======= =============== .. code-block:: python from pytcl.assignment.optimization import assignment_nd # For < 100 targets: Hungarian is fine # assignments = assignment_nd(cost_matrix, method='hungarian') # For > 1000 targets: Use greedy + auctions # assignments = assignment_nd(cost_matrix, method='auction') # For sparse matrices (mostly infinite costs): from pytcl.assignment.optimization import assignment_nd_sparse assignments = assignment_nd_sparse(cost_matrix) Caching with ``lru_cache`` --------------------------- The library automatically caches expensive functions: .. code-block:: python # These are already cached (you don't need to do anything): from pytcl.sphericalHarmonics import legendre_scaling_factors # Cached from pytcl.special_functions import clenshaw_recursion # Cached # For custom functions in your code: from functools import lru_cache @lru_cache(maxsize=128) def expensive_lookup_table(index): # Computed once, reused on subsequent calls return precomputed_values[index] # Query many times - only computed once for i in range(10000): result = expensive_lookup_table(i % 128) **Caching for Jacobian Computations:** .. code-block:: python @lru_cache(maxsize=256) def cached_jacobian(angle_rounded): """Cache Jacobian with angle quantization""" # Helper function - quantize to 0.01 degree resolution return compute_jacobian(angle_rounded) for angle in angles: # Quantize to 0.01 degree for caching angle_rounded = round(angle / 0.01) * 0.01 J = cached_jacobian(angle_rounded) Numba JIT Compilation --------------------- The library uses Numba for matrix operations. You can use it for custom code: .. code-block:: python from numba import njit import numpy as np # Compile to machine code on first call @njit(cache=True) def compute_range_rate(positions, velocities): """Compute range-rate (dot product) - 5-10x speedup""" n = len(positions) range_rates = np.zeros(n) for i in range(n): # Convert to machine code - no interpreter overhead relative_pos = positions[i] - receiver_pos range_rates[i] = np.dot(relative_pos, velocities[i]) / np.linalg.norm(relative_pos) return range_rates # First call: compilation (slow) range_rates = compute_range_rate(pos, vel) # Subsequent calls: machine code (fast) range_rates = compute_range_rate(pos_new, vel_new) **Numba Tips:** - Cache=True allows reuse across runs - Avoid Python objects - use numpy arrays - Keep functions simple (no complex control flow) - Use ``@njit`` for pure functions, ``@jit`` for methods - Test with small input first (compilation can fail on edge cases) Example: Multi-target Tracking Loop .. code-block:: python @njit(cache=True) def kalman_predict_batch(states, covariances, F, Q): """Predict for all targets at once""" n_targets = len(states) x_pred = np.zeros_like(states) P_pred = np.zeros_like(covariances) for i in range(n_targets): x_pred[i] = F @ states[i] P_pred[i] = F @ covariances[i] @ F.T + Q return x_pred, P_pred Sparse Data Structures ---------------------- For large assignment problems with few valid assignments (sparse cost matrix): .. code-block:: python from pytcl.assignment.optimization import assignment_nd_sparse import numpy as np # Traditional: Full matrix (memory waste on infinite costs) cost_dense = np.full((10000, 10000), np.inf) cost_dense[valid_pairs] = actual_costs # Only 100 valid pairs # Memory: 10000 × 10000 × 8 bytes = 800 MB # Sparse: Only store valid entries assignments = assignment_nd_sparse(cost_dense) # Efficient indexing # Memory: ~100 entries × 8 bytes = 1 KB **Benefits:** - Memory savings: 50-90% reduction - Speed: O(n_valid log n_valid) vs O(n² log n²) - Automatic selection in ``assignment_nd()`` Real-World Example: Multi-Sensor Tracking ------------------------------------------ Optimize a realistic tracking scenario: .. code-block:: python import numpy as np from pytcl.dynamic_estimation.kalman import extended_kalman_filter from pytcl.assignment.optimization import assignment_nd from scipy.spatial.distance import cdist import time class OptimizedTracker: def __init__(self, n_targets): self.states = np.zeros((n_targets, 4)) self.covariances = np.eye(4)[np.newaxis, :, :].repeat(n_targets, axis=0) self.F = np.eye(4) # Simple constant-velocity model self.H = np.eye(2, 4) # Observe position only # Cache computations self.prev_cost_matrix = None def predict(self, Q): """Predict all targets""" # Vectorized prediction self.states = self.states @ self.F.T self.covariances = np.array([ self.F @ P @ self.F.T + Q for P in self.covariances ]) def compute_association_cost(self, measurements): """Fast data association using vectorized operations""" # Vectorized distance: O(n × m) not O(n × m) loop iterations positions = self.states[:, :2] distances = cdist(positions, measurements, metric='mahalanobis') # Gate: only nearby measurements matter cost_matrix = np.where(distances < 5, distances, np.inf) return cost_matrix def update(self, measurements): """Update with measurements""" cost = self.compute_association_cost(measurements) assignments = assignment_nd(cost) for target_idx, meas_idx in assignments: if meas_idx >= 0: # Valid assignment z = measurements[meas_idx] # Update state using measurement innovation = z - self.H @ self.states[target_idx] self.states[target_idx] += 0.5 * innovation # Usage with timing tracker = OptimizedTracker(n_targets=100) start = time.perf_counter() for measurements in measurement_sequence: tracker.predict(Q=np.eye(4) * 0.1) tracker.update(measurements) elapsed = time.perf_counter() - start print(f"Tracking {len(measurement_sequence)} frames: {elapsed:.2f}s") Performance Checklist --------------------- Before shipping - verify these optimizations: ✓ **Profile hotspots** - Know where time is spent ✓ **Vectorize loops** - Use numpy operations, not Python loops ✓ **Choose right algorithm** - O(n) vs O(n²) matters ✓ **Cache values** - Don't recompute constants ✓ **Consider Numba** - For tight numerical loops ✓ **Use sparse structures** - For large matrices with many zeros ✓ **GPU acceleration** - If data is large enough ✓ **Thread pool** - For independent operations Common Mistakes --------------- **Mistake 1: Premature Optimization** .. code-block:: python # ❌ Don't start with this # Complex Numba code, sparse matrices, GPU # ✅ Do this first # Simple, readable code # Profile it # Optimize bottlenecks only **Mistake 2: Micro-optimizations on Non-Critical Code** .. code-block:: python # Profile shows 99% time in parsing input # But you optimize the filter algorithm # ✅ Profile first to find real bottleneck **Mistake 3: Memory Allocation in Loop** .. code-block:: python # ❌ SLOW: Allocates new arrays each iteration for z in measurements: cost = np.zeros((n_targets, n_measurements)) # Allocate cost[:] = compute_distances(...) # Fill # ✅ FAST: Allocate once, reuse cost = np.zeros((n_targets, n_measurements)) for z in measurements: np.fill_diagonal(cost, 0) # Reset instead of reallocate compute_distances(cost, ...) **Mistake 4: Ignoring NumPy Broadcasting** .. code-block:: python # ❌ SLOW: Python loop ranges = [] for pos_target, pos_sensor in zip(target_positions, sensor_positions): r = np.linalg.norm(pos_target - pos_sensor) ranges.append(r) # ✅ FAST: NumPy broadcasting ranges = np.linalg.norm(target_positions - sensor_positions, axis=1) Resources --------- - `NumPy Performance Guide `_ - `Numba Performance Tips `_ - `Python Speed Profile Guide `_ - `SciPy Optimization `_ See Also ~~~~~~~~ - :doc:`gpu_acceleration` - GPU-accelerated operations - :doc:`kalman_filter_tuning` - Filter parameter tuning - Module: ``pytcl.dynamic_estimation.kalman`` - Examples: ``examples/performance_evaluation.py``