GPU Acceleration Guide
======================

Overview
--------

The Tracker Component Library provides GPU acceleration for compute-intensive
operations through two backends:

- **CuPy** - NVIDIA GPU support (CUDA)
- **MLX** - Apple Silicon GPU support (Metal)

This guide explains how to use GPU acceleration to speed up Kalman filters and
particle filters.

.. note::

   GPU acceleration is optional. The library falls back to CPU automatically
   if GPU libraries aren't installed.

Installation
------------

**NVIDIA GPU (CUDA):**

.. code-block:: bash

   # Pick the wheel that matches your CUDA major version
   pip install cupy-cuda11x  # CUDA 11.x (e.g. 11.2, 11.8)
   # Or, for CUDA 12.x:
   pip install cupy-cuda12x

**Apple Silicon (MLX):**

.. code-block:: bash

   pip install mlx

**Verify Installation:**

.. code-block:: python

   from pytcl.gpu import is_gpu_available

   print(is_gpu_available())  # Returns GPU backend name or None

Quick Start
-----------

Extended Kalman Filter on GPU:

.. code-block:: python

   import numpy as np
   from pytcl.dynamic_estimation.kalman import extended_kalman_filter

   # Your state transition and measurement functions
   def state_transition(x, dt):
       return x  # Placeholder: identity transition; substitute your motion model

   def measurement_fn(x):
       return x[:2]  # Measure position only

   # Initialize on CPU (standard NumPy)
   x0 = np.array([0.0, 0.0, 1.0, 0.0])  # [x, y, vx, vy]
   P0 = np.eye(4)
   z = np.array([[1.0, 2.0]])  # Measurement: [x_meas, y_meas]

   # If CuPy is available, move all inputs to the GPU
   try:
       import cupy as cp
       x0_gpu = cp.asarray(x0)
       P0_gpu = cp.asarray(P0)
       z_gpu = cp.asarray(z)
   except ImportError:
       x0_gpu, P0_gpu, z_gpu = x0, P0, z

   # Run filter - automatically uses GPU if inputs are GPU arrays
   x_new, P_new = extended_kalman_filter(x0_gpu, P0_gpu, z_gpu,
                                         state_transition, measurement_fn)

Performance Considerations
--------------------------

**When GPU Acceleration Helps:**

- Particle filters with 1,000+ particles
- Batch processing of many trajectories
- Large-scale data association problems
- Real-time systems processing multiple targets

**When
GPU May Not Help:**

- Small filters (< 100 state dimensions)
- One-shot filtering operations (transfer overhead dominates)
- CPU-bound operations (coordinate conversions, etc.)

**Performance Best Practices:**

1. **Batch Operations**: Process multiple time steps before CPU transfer

   .. code-block:: python

      import cupy as cp

      # ✅ Good: Batch 100 measurements
      measurements = cp.asarray(measurements_array)  # [100, 2] measurements
      for z in measurements:
          x, P = extended_kalman_filter(x, P, z, ...)
      # Results stay on GPU

      # ❌ Avoid: Constant GPU<->CPU transfers
      for z in measurements:
          x_cpu = cp.asnumpy(x)  # Avoid repeated transfers
          x = cp.asarray(extended_kalman_filter(x_cpu, ...))

2. **Memory Management**: Monitor GPU memory usage

   .. code-block:: python

      import cupy as cp

      # Memory held by CuPy's pool (not the whole device)
      mempool = cp.get_default_memory_pool()
      print(f"Pool in use: {mempool.used_bytes() / 1e9:.2f} GB")
      print(f"Pool reserved: {mempool.total_bytes() / 1e9:.2f} GB")

      # Free and total memory on the current device
      free_bytes, total_bytes = cp.cuda.Device().mem_info
      print(f"Device free: {free_bytes / 1e9:.2f} of {total_bytes / 1e9:.2f} GB")

      # Return unused cached blocks to the device if needed
      mempool.free_all_blocks()

3. **Data Type Selection**: Use float32 for better GPU performance

   .. code-block:: python

      import cupy as cp

      # float32 is faster on most GPUs
      x = cp.asarray([0.0, 0.0, 1.0, 0.0], dtype=cp.float32)
      P = cp.eye(4, dtype=cp.float32)

      # float64 only if numerical precision is critical

Module-Specific GPU Support
----------------------------

**Kalman Filters** (Full Support)

- ``extended_kalman_filter()`` - EKF
- ``unscented_kalman_filter()`` - UKF
- ``cubature_kalman_filter()`` - CKF

All matrix operations (Cholesky, QR, etc.) automatically use GPU.

**Particle Filters** (Full Support)

- ``particle_filter()`` - Standard resampling
- ``sequential_importance_resampling()`` - SIR

GPU accelerates particle propagation and weight computation.
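
To make "particle propagation and weight computation" concrete, the sketch
below writes both steps against a generic array module ``xp``, so the same code
runs on NumPy or, when ``xp=cupy`` is passed, on the GPU. It is an illustration
only: ``propagate_and_weight``, its parameters, the constant-velocity motion
model, and the Gaussian position likelihood are all assumptions made for this
example and are not part of the library's API.

.. code-block:: python

   import numpy as np

   def propagate_and_weight(particles, z, xp=np, dt=1.0,
                            process_std=0.1, meas_std=0.5, seed=0):
       """Hypothetical helper: one predict/weight step for N particles.

       particles: (N, 4) array of [x, y, vx, vy]; z: (2,) position measurement.
       Both must already be arrays of the module passed as ``xp``.
       """
       rng = xp.random.RandomState(seed)

       # Propagation: position += velocity * dt, then add process noise.
       moved = particles.copy()
       moved[:, :2] += dt * moved[:, 2:]
       moved += process_std * rng.standard_normal(moved.shape)

       # Weighting: Gaussian likelihood of z given each particle's position,
       # computed in log space and shifted by the maximum for stability.
       residual = moved[:, :2] - z
       log_w = -0.5 * xp.sum(residual ** 2, axis=1) / meas_std ** 2
       log_w -= log_w.max()
       weights = xp.exp(log_w)
       weights /= weights.sum()
       return moved, weights

   # CPU usage; pass xp=cupy and CuPy arrays to run the same code on a GPU.
   particles = np.zeros((1000, 4))
   particles[:, 2] = 1.0  # every particle starts with vx = 1
   z = np.array([1.0, 0.0])
   moved, weights = propagate_and_weight(particles, z)

Every operation here is element-wise or a reduction over the particle axis,
which is why this part of a particle filter tends to map well onto a GPU once
the particle count is large.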

**Data Association** (Partial Support)

- ``assignment_nd()`` - Greedy and Hungarian algorithms
- Sparse assignment with large cost matrices benefits most

**Coordinate Conversions** (Limited Benefit)

- GPU acceleration not recommended for single conversions
- Batch conversions of 10,000+ points show ~2-5x speedup

Troubleshooting
---------------

**Issue: "ModuleNotFoundError: No module named 'cupy'"**

CuPy is not installed. Install the wheel that matches your CUDA version:

.. code-block:: bash

   pip install cupy-cuda12x

**Issue: CUDA Error "Device insufficient for this operation"**

The GPU doesn't support the required operations. Fall back to CPU:

.. code-block:: python

   import cupy as cp

   x = cp.asnumpy(x)  # Copy the GPU array back to a NumPy array
   # Continue processing on CPU

**Issue: Out of Memory Error**

Reduce batch size or switch to CPU:

.. code-block:: python

   # Process smaller batches
   batch_size = 100
   for i in range(0, len(measurements), batch_size):
       batch = measurements[i:i+batch_size]
       # Process batch...

**Issue: Slower than CPU**

GPU overhead exceeds the speedup. Common causes:

1. Small problem size (< 100 state dimensions)
2. Frequent GPU<->CPU transfers
3. I/O bound (disk reading slower than computation)

Solution: Profile with the ``time`` module to identify the bottleneck:

.. code-block:: python

   import time

   import cupy as cp

   # Time GPU operation
   x_gpu = cp.asarray(x)
   start = time.perf_counter()
   x_result = extended_kalman_filter(x_gpu, P_gpu, z_gpu, ...)
   cp.cuda.Stream.null.synchronize()  # Wait for GPU
   elapsed = time.perf_counter() - start
   print(f"GPU time: {elapsed:.4f}s")

Performance Benchmarks
----------------------

Typical speedups (relative to CPU NumPy):

================================= ============== ==============
Operation                         NVIDIA (CuPy)  Apple (MLX)
================================= ============== ==============
EKF with 20-dim state             5-8x           3-5x
Particle filter (1000 particles)  8-12x          4-7x
Sparse assignment (10k x 10k)     10-15x         5-10x
Coordinate conversion (1M points) 2-3x           1-2x
================================= ============== ==============

.. note::

   Benchmark results depend on GPU model, data size, and problem type.
   Always profile your specific use case.

Advanced Topics
---------------

**Custom GPU Kernels**

For extreme performance, write custom CUDA kernels with CuPy:

.. code-block:: python

   import cupy as cp

   # Define custom CUDA kernel; extern "C" keeps the name unmangled
   # so that RawKernel can look it up
   kernel_code = r'''
   extern "C" __global__
   void my_kernel(float *x, int n) {
       int idx = blockIdx.x * blockDim.x + threadIdx.x;
       if (idx < n) x[idx] *= 2.0f;
   }
   '''
   kernel = cp.RawKernel(kernel_code, 'my_kernel')

   # Launch with one thread per element: (grid, block, kernel arguments)
   x = cp.arange(1024, dtype=cp.float32)
   threads = 256
   blocks = (x.size + threads - 1) // threads
   kernel((blocks,), (threads,), (x, cp.int32(x.size)))

**Multi-GPU Processing**

For systems with multiple GPUs:

.. code-block:: python

   import cupy as cp

   # Distribute work across GPUs
   for gpu_id in range(cp.cuda.runtime.getDeviceCount()):
       with cp.cuda.Device(gpu_id):
           # Process on this GPU
           pass

See Also
~~~~~~~~

- :doc:`performance_optimization` - CPU optimization techniques
- :doc:`kalman_filter_tuning` - Filter tuning and diagnostics
- CuPy Documentation
- MLX Documentation