GPU Acceleration Guide
Overview
The Tracker Component Library provides GPU acceleration for compute-intensive operations through two backends:
CuPy - NVIDIA GPU support (CUDA)
MLX - Apple Silicon GPU support (Metal)
This guide explains how to use GPU acceleration to speed up Kalman filters and particle filters.
Note
GPU acceleration is optional. The library falls back to CPU automatically if GPU libraries aren’t installed.
Installation
NVIDIA GPU (CUDA):
pip install cupy-cuda11x # For CUDA 11.x; the suffix names the CUDA major version
# Or, for CUDA 12.x:
pip install cupy-cuda12x
Apple Silicon (MLX):
pip install mlx
Verify Installation:
from pytcl.gpu import is_gpu_available
print(is_gpu_available()) # Returns GPU backend name or None
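For a rough idea of what such a check involves, the sketch below probes for the two backend packages using only the standard library. It is not pytcl's implementation of is_gpu_available(); the name detect_gpu_backend and the preference order are assumptions made for illustration, and finding a package does not prove that a usable GPU or driver is present.

```python
from importlib.util import find_spec


def detect_gpu_backend(candidates=("cupy", "mlx")):
    """Return the first candidate package that is installed, else None.

    Hypothetical helper: it only checks that the package can be found,
    not that a working GPU or driver is present.
    """
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


print(detect_gpu_backend())  # "cupy", "mlx", or None, depending on the machine
```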
Quick Start
Extended Kalman Filter on GPU:
import numpy as np
from pytcl.dynamic_estimation.kalman import extended_kalman_filter
# Your state transition and measurement functions
def state_transition(x, dt):
    return x # Placeholder: identity transition (replace with your motion model)
def measurement_fn(x):
    return x[:2] # Measure position only
# Initialize on CPU (standard NumPy)
x0 = np.array([0.0, 0.0, 1.0, 0.0]) # [x, y, vx, vy]
P0 = np.eye(4)
# If GPU available, convert to GPU arrays
try:
    import cupy as cp
    x0_gpu = cp.asarray(x0)
    P0_gpu = cp.asarray(P0)
except ImportError:
    x0_gpu = x0
    P0_gpu = P0
# Run filter - automatically uses GPU if inputs are GPU arrays
z = np.array([[1.0, 2.0]]) # Measurement: [x_meas, y_meas]
x_new, P_new = extended_kalman_filter(x0_gpu, P0_gpu, z, state_transition, measurement_fn)
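The state_transition function in the example above returns the state unchanged. A constant-velocity model for the [x, y, vx, vy] layout would instead advance each position by velocity times dt. The sketch below assumes that layout and uses only .copy() and item assignment, so it should behave the same for NumPy arrays, CuPy arrays, and plain lists; the function name is illustrative.

```python
def constant_velocity_transition(x, dt):
    """Advance a [x, y, vx, vy] state by dt under constant velocity."""
    x_new = x.copy()          # do not modify the caller's state
    x_new[0] += x[2] * dt     # x position advances by vx * dt
    x_new[1] += x[3] * dt     # y position advances by vy * dt
    return x_new              # velocities are unchanged


state = [0.0, 0.0, 1.0, 0.0]  # same layout as x0 above
print(constant_velocity_transition(state, 0.5))  # [0.5, 0.0, 1.0, 0.0]
```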
Performance Considerations
When GPU Acceleration Helps:
Particle filters with 1,000+ particles
Batch processing of many trajectories
Large-scale data association problems
Real-time systems processing multiple targets
When GPU May Not Help:
Small filters (< 100 state dimensions)
One-shot filtering operations (transfer overhead dominates)
CPU-bound operations (coordinate conversions, etc.)
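The transfer-overhead point can be made concrete with a back-of-envelope model: if each filter step costs cpu_step on the CPU and gpu_step on the GPU, and moving data costs a one-off transfer, the GPU wins only after enough steps to amortize the transfer. The function below is an illustrative sketch; the timings in the usage line are made-up numbers in arbitrary time units, not measurements.

```python
import math


def break_even_steps(cpu_step, gpu_step, transfer):
    """Smallest step count n with n * cpu_step > transfer + n * gpu_step.

    Returns None when the GPU step is not faster than the CPU step,
    in which case no number of steps amortizes the transfer.
    """
    saving = cpu_step - gpu_step
    if saving <= 0:
        return None
    return math.floor(transfer / saving) + 1


# Hypothetical timings: 4.0 per CPU step, 1.0 per GPU step, 6.0 to transfer
print(break_even_steps(4.0, 1.0, 6.0))  # 3
```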
Performance Best Practices:
Batch Operations: Process multiple time steps before CPU transfer
import cupy as cp
# ✅ Good: Batch 100 measurements
measurements = cp.asarray(measurements_array) # [100, 2] measurements
for z in measurements:
    x, P = extended_kalman_filter(x, P, z, ...) # Results stay on GPU
# ❌ Avoid: Constant GPU<->CPU transfers
for z in measurements:
    x_cpu = cp.asnumpy(x) # Avoid repeated transfers
    x = cp.asarray(extended_kalman_filter(x_cpu, ...))
Memory Management: Monitor GPU memory usage
import cupy as cp
# Inspect CuPy's memory pool (this is pool usage, not the device's total memory)
mempool = cp.get_default_memory_pool()
print(f"Used: {mempool.used_bytes() / 1e9:.2f} GB")
print(f"Held by pool: {mempool.total_bytes() / 1e9:.2f} GB")
# Release unused cached blocks if needed
mempool.free_all_blocks()
Data Type Selection: Use float32 for better GPU performance
import cupy as cp
# float32 is faster on most GPUs
x = cp.asarray([0.0, 0.0, 1.0, 0.0], dtype=cp.float32)
P = cp.eye(4, dtype=cp.float32)
# float64 only if numerical precision is critical
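To see what "numerical precision is critical" can mean in practice, the sketch below rounds values to single precision with the standard library's array module (typecode 'f' is a C float, 32 bits on common platforms) and shows a small update vanishing. The helper name to_float32 is illustrative; with NumPy or CuPy the same rounding happens whenever dtype=float32 is used.

```python
from array import array


def to_float32(value):
    """Round a Python float (float64) to the nearest single-precision value."""
    return array("f", [value])[0]


p = 1.0          # e.g. a variance on the covariance diagonal
update = 1e-8    # a small correction

print(p + update == p)               # False: float64 keeps the update
print(to_float32(p + update) == p)   # True: float32 rounds it away
```

If a filter accumulates many such small corrections, float64 may be the safer choice despite the speed cost.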
Module-Specific GPU Support
Kalman Filters (Full Support)
extended_kalman_filter() - EKF
unscented_kalman_filter() - UKF
cubature_kalman_filter() - CKF
All matrix operations (Cholesky, QR, etc.) automatically use GPU.
Particle Filters (Full Support)
particle_filter() - Standard resampling
sequential_importance_resampling() - SIR
GPU accelerates particle propagation and weight computation.
Data Association (Partial Support)
assignment_nd() - Greedy and Hungarian algorithms
Sparse assignment with large cost matrices benefits most
Coordinate Conversions (Limited Benefit)
GPU acceleration not recommended for single conversions
Batch conversions of 10,000+ points show ~2-5x speedup
Troubleshooting
Issue: “ModuleNotFoundError: No module named ‘cupy’”
CuPy is not installed. Install the package that matches your CUDA version:
pip install cupy-cuda12x # Or cupy-cuda11x for CUDA 11.x
Issue: CUDA Error “Device insufficient for this operation”
GPU doesn’t support required operations. Fall back to CPU:
import cupy as cp
x = cp.asnumpy(x) # Convert GPU arrays back to CPU (np.asarray raises TypeError on CuPy arrays)
# Continue processing on CPU
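When the same code path may receive either NumPy or CuPy arrays, a small guard avoids touching values that are already on the CPU. The helper below is a sketch; to_cpu is a hypothetical name, not part of pytcl, and it relies only on cupy.ndarray and cupy.asnumpy.

```python
def to_cpu(x):
    """Return a CPU (NumPy) copy of x if it is a CuPy array, else x unchanged."""
    try:
        import cupy as cp
    except ImportError:
        return x  # CuPy is not installed, so x cannot be a CuPy array
    if isinstance(x, cp.ndarray):
        return cp.asnumpy(x)  # explicit device-to-host copy
    return x


print(to_cpu([1.0, 2.0]))  # [1.0, 2.0]; non-CuPy inputs pass through untouched
```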
Issue: Out of Memory Error
Reduce batch size or switch to CPU:
# Process smaller batches
batch_size = 100
for i in range(0, len(measurements), batch_size):
    batch = measurements[i:i+batch_size]
    # Process batch...
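If a safe batch size is not known in advance, one option is to halve it whenever an allocation fails. The sketch below assumes the out-of-memory condition surfaces as Python's built-in MemoryError (CuPy's OutOfMemoryError derives from it in current releases, but verify this for your version). process_with_backoff and fake_process are hypothetical names, and the failure is simulated so the example runs without a GPU.

```python
def process_with_backoff(items, process_batch, batch_size=100, min_batch_size=1):
    """Process items in batches, halving the batch size after each MemoryError."""
    results = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.append(process_batch(batch))
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; let the caller decide
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        i += len(batch)
    return results


def fake_process(batch):
    """Simulated workload: pretend batches longer than 2 exhaust GPU memory."""
    if len(batch) > 2:
        raise MemoryError("simulated out-of-memory")
    return sum(batch)


print(process_with_backoff(list(range(7)), fake_process, batch_size=8))  # [1, 5, 9, 6]
```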
Issue: Slower than CPU
GPU overhead exceeds the speedup. Common causes:
Small problem size (< 100 state dimensions)
Frequent GPU<->CPU transfers
I/O bound (disk reading slower than computation)
Solution: Profile with the time module to identify the bottleneck:
import time
import cupy as cp
# Move inputs to the GPU before starting the timer
x_gpu, P_gpu, z_gpu = cp.asarray(x), cp.asarray(P), cp.asarray(z)
# Time GPU operation
start = time.perf_counter()
x_result = extended_kalman_filter(x_gpu, P_gpu, z_gpu, ...)
cp.cuda.Stream.null.synchronize() # Wait for GPU
elapsed = time.perf_counter() - start
print(f"GPU time: {elapsed:.4f}s")
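A single timed call can mislead, because the first GPU call often includes one-time costs such as kernel compilation. The sketch below wraps the same perf_counter pattern in a reusable helper with warm-up runs and an optional synchronization hook. time_call is a hypothetical helper, not part of pytcl; the usage line times a CPU function so it runs anywhere.

```python
import time


def time_call(fn, *args, sync=None, warmup=1, repeats=5):
    """Return the best wall-clock time (seconds) of fn(*args) over several runs.

    sync: optional zero-argument callable that blocks until queued GPU work
    is done, e.g. cp.cuda.Stream.null.synchronize when using CuPy.
    """
    def run_once():
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()  # include queued GPU work in the measurement
        return time.perf_counter() - start

    for _ in range(warmup):
        run_once()  # discarded: first calls may include one-time setup costs
    return min(run_once() for _ in range(repeats))


# CPU-only usage; for CuPy, pass sync=cp.cuda.Stream.null.synchronize
print(f"best of 5: {time_call(sum, range(100_000)):.6f}s")
```

Timing the CPU and GPU variants with the same helper gives a like-for-like comparison.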
Performance Benchmarks
Typical speedups (relative to CPU NumPy):
| Operation | NVIDIA (CuPy) | Apple (MLX) |
|---|---|---|
| EKF with 20-dim state | 5-8x | 3-5x |
| Particle filter (1000 particles) | 8-12x | 4-7x |
| Sparse assignment (10k x 10k) | 10-15x | 5-10x |
| Coordinate conversion (1M points) | 2-3x | 1-2x |
Note
Benchmark results depend on GPU model, data size, and problem type. Always profile your specific use case.
Advanced Topics
Custom GPU Kernels
For extreme performance, write custom CUDA kernels with CuPy:
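import cupy as cp
# Define custom CUDA kernel (extern "C" prevents C++ name mangling so the lookup by name succeeds)
kernel_code = '''
extern "C" __global__ void my_kernel(float *x, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= 2.0f;
}
'''
kernel = cp.RawKernel(kernel_code, 'my_kernel')
# Launch with enough 256-thread blocks to cover every element
x = cp.arange(1024, dtype=cp.float32)
threads = 256
blocks = (x.size + threads - 1) // threads
kernel((blocks,), (threads,), (x, cp.int32(x.size))) # Doubles x in place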
Multi-GPU Processing
For systems with multiple GPUs:
import cupy as cp
# Distribute work across GPUs
for gpu_id in range(cp.cuda.runtime.getDeviceCount()):
    with cp.cuda.Device(gpu_id):
        # Process on this GPU
        pass
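The loop above leaves open how the work is divided among devices. One simple option, sketched below in plain Python, is to split the inputs into contiguous chunks of near-equal size, one per GPU. split_evenly is a hypothetical helper, not part of pytcl or CuPy.

```python
def split_evenly(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most 1."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)  # first `extra` chunks get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks


# With CuPy, n_parts would be cp.cuda.runtime.getDeviceCount() and chunk k
# would be processed inside `with cp.cuda.Device(k):`
print(split_evenly(list(range(7)), 3))  # [[0, 1, 2], [3, 4], [5, 6]]
```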
See Also
Performance Optimization Guide - CPU optimization techniques
Kalman Filter Tuning Guide - Filter tuning and diagnostics