Performance Optimization Guide
Overview
This guide covers CPU-side optimization techniques for the Tracker Component Library. For GPU acceleration, see GPU Acceleration Guide.
Key Techniques:
Profiling - Identify bottlenecks
Vectorization - Batch operations
Algorithm Selection - Choose O(n) over O(n²)
Caching - Reuse computed values
Numba JIT - Compile hotspots to machine code
Sparse Data Structures - Reduce memory overhead
Before Optimizing
Profile First!
Never optimize without profiling. Most time is spent in a few functions:
import cProfile
import pstats
from io import StringIO
def profile_tracking_algorithm():
# Your tracking code here
for z in measurements:
x, P = extended_kalman_filter(x, P, z, ...)
# Profile
profiler = cProfile.Profile()
profiler.enable()
profile_tracking_algorithm()
profiler.disable()
# View results
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 functions
Alternative: Using line_profiler for detailed line-by-line analysis:
pip install line_profiler
kernprof -l -v your_script.py # Profiles lines with @profile decorator
Vectorization & Batching
Problem: Loop Over Measurements
# ❌ SLOW: Python loop - slow interpreter overhead
for z in measurements: # measurements shape: (10000, 2)
x, P = extended_kalman_filter(x, P, z, ...)
# ✅ FAST: Batch processing (if your filter supports it)
# Process all measurements at once
x, P = extended_kalman_filter_batch(x, P, measurements, ...)
Problem: Repeated Coordinate Conversions
# ❌ SLOW: Convert one at a time
cartesian_coords = []
for spherical in spherical_array:
cartesian = sphere2cart(spherical)
cartesian_coords.append(cartesian)
# ✅ FAST: Vectorized conversion
from pytcl.coordinate_systems.conversions import sphere2cart
cartesian_coords = sphere2cart(spherical_array)
Problem: Data Association with Many Targets
# ❌ SLOW: Compute full cost matrix
cost = np.zeros((n_targets, n_measurements))
for i, target in enumerate(targets):
for j, measurement in enumerate(measurements):
cost[i, j] = compute_distance(target, measurement)
# ✅ FAST: Vectorized distance computation
from scipy.spatial.distance import cdist
cost = cdist(targets, measurements, metric='euclidean')
Algorithm Selection
Assignment Problems:
Algorithm |
Time |
Optimal |
Best For |
|---|---|---|---|
Greedy |
O(n²) |
No |
Quick estimates |
Hungarian (Munkres) |
O(n³) |
Yes |
Small: n < 1000 |
Auction |
O(n³) |
Yes |
Modern libraries |
Auction + heuristic |
O(n²) |
~Yes |
Large: n > 1000 |
from pytcl.assignment.optimization import assignment_nd
# For < 100 targets: Hungarian is fine
# assignments = assignment_nd(cost_matrix, method='hungarian')
# For > 1000 targets: Use greedy + auctions
# assignments = assignment_nd(cost_matrix, method='auction')
# For sparse matrices (mostly infinite costs):
from pytcl.assignment.optimization import assignment_nd_sparse
assignments = assignment_nd_sparse(cost_matrix)
Caching with lru_cache
The library automatically caches expensive functions:
# These are already cached (you don't need to do anything):
from pytcl.sphericalHarmonics import legendre_scaling_factors # Cached
from pytcl.special_functions import clenshaw_recursion # Cached
# For custom functions in your code:
from functools import lru_cache
@lru_cache(maxsize=128)
def expensive_lookup_table(index):
# Computed once, reused on subsequent calls
return precomputed_values[index]
# Query many times - only computed once
for i in range(10000):
result = expensive_lookup_table(i % 128)
Caching for Jacobian Computations:
@lru_cache(maxsize=256)
def cached_jacobian(angle_rounded):
"""Cache Jacobian with angle quantization"""
# Helper function - quantize to 0.01 degree resolution
return compute_jacobian(angle_rounded)
for angle in angles:
# Quantize to 0.01 degree for caching
angle_rounded = round(angle / 0.01) * 0.01
J = cached_jacobian(angle_rounded)
Numba JIT Compilation
The library uses Numba for matrix operations. You can use it for custom code:
from numba import njit
import numpy as np
# Compile to machine code on first call
@njit(cache=True)
def compute_range_rate(positions, velocities):
"""Compute range-rate (dot product) - 5-10x speedup"""
n = len(positions)
range_rates = np.zeros(n)
for i in range(n):
# Convert to machine code - no interpreter overhead
relative_pos = positions[i] - receiver_pos
range_rates[i] = np.dot(relative_pos, velocities[i]) / np.linalg.norm(relative_pos)
return range_rates
# First call: compilation (slow)
range_rates = compute_range_rate(pos, vel)
# Subsequent calls: machine code (fast)
range_rates = compute_range_rate(pos_new, vel_new)
Numba Tips:
Cache=True allows reuse across runs
Avoid Python objects - use numpy arrays
Keep functions simple (no complex control flow)
Use
@njitfor pure functions,@jitfor methodsTest with small input first (compilation can fail on edge cases)
Example: Multi-target Tracking Loop
@njit(cache=True)
def kalman_predict_batch(states, covariances, F, Q):
"""Predict for all targets at once"""
n_targets = len(states)
x_pred = np.zeros_like(states)
P_pred = np.zeros_like(covariances)
for i in range(n_targets):
x_pred[i] = F @ states[i]
P_pred[i] = F @ covariances[i] @ F.T + Q
return x_pred, P_pred
Sparse Data Structures
For large assignment problems with few valid assignments (sparse cost matrix):
from pytcl.assignment.optimization import assignment_nd_sparse
import numpy as np
# Traditional: Full matrix (memory waste on infinite costs)
cost_dense = np.full((10000, 10000), np.inf)
cost_dense[valid_pairs] = actual_costs # Only 100 valid pairs
# Memory: 10000 × 10000 × 8 bytes = 800 MB
# Sparse: Only store valid entries
assignments = assignment_nd_sparse(cost_dense) # Efficient indexing
# Memory: ~100 entries × 8 bytes = 1 KB
Benefits:
Memory savings: 50-90% reduction
Speed: O(n_valid log n_valid) vs O(n² log n²)
Automatic selection in
assignment_nd()
Real-World Example: Multi-Sensor Tracking
Optimize a realistic tracking scenario:
import numpy as np
from pytcl.dynamic_estimation.kalman import extended_kalman_filter
from pytcl.assignment.optimization import assignment_nd
from scipy.spatial.distance import cdist
import time
class OptimizedTracker:
def __init__(self, n_targets):
self.states = np.zeros((n_targets, 4))
self.covariances = np.eye(4)[np.newaxis, :, :].repeat(n_targets, axis=0)
self.F = np.eye(4) # Simple constant-velocity model
self.H = np.eye(2, 4) # Observe position only
# Cache computations
self.prev_cost_matrix = None
def predict(self, Q):
"""Predict all targets"""
# Vectorized prediction
self.states = self.states @ self.F.T
self.covariances = np.array([
self.F @ P @ self.F.T + Q for P in self.covariances
])
def compute_association_cost(self, measurements):
"""Fast data association using vectorized operations"""
# Vectorized distance: O(n × m) not O(n × m) loop iterations
positions = self.states[:, :2]
distances = cdist(positions, measurements, metric='mahalanobis')
# Gate: only nearby measurements matter
cost_matrix = np.where(distances < 5, distances, np.inf)
return cost_matrix
def update(self, measurements):
"""Update with measurements"""
cost = self.compute_association_cost(measurements)
assignments = assignment_nd(cost)
for target_idx, meas_idx in assignments:
if meas_idx >= 0: # Valid assignment
z = measurements[meas_idx]
# Update state using measurement
innovation = z - self.H @ self.states[target_idx]
self.states[target_idx] += 0.5 * innovation
# Usage with timing
tracker = OptimizedTracker(n_targets=100)
start = time.perf_counter()
for measurements in measurement_sequence:
tracker.predict(Q=np.eye(4) * 0.1)
tracker.update(measurements)
elapsed = time.perf_counter() - start
print(f"Tracking {len(measurement_sequence)} frames: {elapsed:.2f}s")
Performance Checklist
Before shipping - verify these optimizations:
✓ Profile hotspots - Know where time is spent ✓ Vectorize loops - Use numpy operations, not Python loops ✓ Choose right algorithm - O(n) vs O(n²) matters ✓ Cache values - Don’t recompute constants ✓ Consider Numba - For tight numerical loops ✓ Use sparse structures - For large matrices with many zeros ✓ GPU acceleration - If data is large enough ✓ Thread pool - For independent operations
Common Mistakes
Mistake 1: Premature Optimization
# ❌ Don't start with this
# Complex Numba code, sparse matrices, GPU
# ✅ Do this first
# Simple, readable code
# Profile it
# Optimize bottlenecks only
Mistake 2: Micro-optimizations on Non-Critical Code
# Profile shows 99% time in parsing input
# But you optimize the filter algorithm
# ✅ Profile first to find real bottleneck
Mistake 3: Memory Allocation in Loop
# ❌ SLOW: Allocates new arrays each iteration
for z in measurements:
cost = np.zeros((n_targets, n_measurements)) # Allocate
cost[:] = compute_distances(...) # Fill
# ✅ FAST: Allocate once, reuse
cost = np.zeros((n_targets, n_measurements))
for z in measurements:
np.fill_diagonal(cost, 0) # Reset instead of reallocate
compute_distances(cost, ...)
Mistake 4: Ignoring NumPy Broadcasting
# ❌ SLOW: Python loop
ranges = []
for pos_target, pos_sensor in zip(target_positions, sensor_positions):
r = np.linalg.norm(pos_target - pos_sensor)
ranges.append(r)
# ✅ FAST: NumPy broadcasting
ranges = np.linalg.norm(target_positions - sensor_positions, axis=1)
Resources
See Also
GPU Acceleration Guide - GPU-accelerated operations
Kalman Filter Tuning Guide - Filter parameter tuning
Module:
pytcl.dynamic_estimation.kalmanExamples:
examples/performance_evaluation.py