GPU Acceleration: CuPy Integration for Tracking
This notebook demonstrates GPU acceleration techniques for tracking algorithms using CuPy. We cover:
CuPy Basics - GPU array operations and NumPy compatibility
Accelerating Matrix Operations - Core operations for Kalman filters
Batch Processing - Processing multiple tracks in parallel
Particle Filter Acceleration - GPU-based resampling and weight computation
Performance Comparison - Benchmarking CPU vs GPU
Prerequisites
# GPU-enabled version
pip install cupy-cuda12x # For CUDA 12.x
# or pip install cupy-cuda11x for CUDA 11.x
pip install nrl-tracker plotly numpy
Note: This notebook requires an NVIDIA GPU with CUDA support. If running without a GPU, code examples will demonstrate concepts with NumPy fallbacks.
[16]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import time
# Check for GPU availability
try:
import cupy as cp
GPU_AVAILABLE = True
print(f"CuPy version: {cp.__version__}")
print(f"CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
print(f"GPU: {cp.cuda.Device().name}")
print(f"GPU memory: {cp.cuda.Device().mem_info[1] / 1e9:.1f} GB")
except ImportError:
GPU_AVAILABLE = False
print("CuPy not available. Running in CPU-only mode.")
print("Install with: pip install cupy-cuda12x")
# Use appropriate array module
xp = cp if GPU_AVAILABLE else np
np.random.seed(42)
# Plotly dark theme template
dark_template = go.layout.Template()
dark_template.layout = go.Layout(
paper_bgcolor='#0d1117',
plot_bgcolor='#0d1117',
font=dict(color='#e6edf3'),
xaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),
yaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),
)
CuPy not available. Running in CPU-only mode.
Install with: pip install cupy-cuda12x
1. CuPy Basics
CuPy provides a NumPy-compatible interface for GPU computing. Arrays and operations mirror NumPy exactly, making migration straightforward.
Key Concepts
NumPy |
CuPy |
Location |
|---|---|---|
|
|
GPU |
|
|
GPU |
|
|
GPU |
Transfer to GPU |
|
CPU→GPU |
Transfer to CPU |
|
GPU→CPU |
[17]:
# Basic array operations
if GPU_AVAILABLE:
# Create arrays on GPU
a_gpu = cp.array([1, 2, 3, 4, 5])
b_gpu = cp.array([5, 4, 3, 2, 1])
# Operations happen on GPU
c_gpu = a_gpu + b_gpu
d_gpu = cp.dot(a_gpu, b_gpu)
print(f"a + b = {c_gpu}")
print(f"a · b = {d_gpu}")
print(f"Array type: {type(c_gpu)}")
# Transfer back to CPU when needed
c_cpu = cp.asnumpy(c_gpu)
print(f"Transferred type: {type(c_cpu)}")
else:
# NumPy fallback
a_cpu = np.array([1, 2, 3, 4, 5])
b_cpu = np.array([5, 4, 3, 2, 1])
print(f"a + b = {a_cpu + b_cpu}")
print(f"a · b = {np.dot(a_cpu, b_cpu)}")
a + b = [6 6 6 6 6]
a · b = 35
[18]:
# Memory management
if GPU_AVAILABLE:
# Check memory before
mempool = cp.get_default_memory_pool()
print(f"Memory used before: {mempool.used_bytes() / 1e6:.2f} MB")
# Create large array
large_array = cp.random.randn(10000, 10000)
print(f"Memory after 10000x10000 float64: {mempool.used_bytes() / 1e6:.2f} MB")
# Free memory explicitly
del large_array
mempool.free_all_blocks()
print(f"Memory after free: {mempool.used_bytes() / 1e6:.2f} MB")
2. Accelerating Matrix Operations
Kalman filters involve repeated matrix operations that benefit from GPU parallelization:
Matrix multiplication: \(P \cdot H^T\)
Matrix inversion: \((HPH^T + R)^{-1}\)
Cholesky decomposition: \(P = LL^T\)
Let’s implement core Kalman filter operations for both CPU and GPU.
[19]:
def kalman_predict(x, P, F, Q, xp=np):
"""
Kalman filter prediction step.
Parameters
----------
x : array_like
State estimate [n, 1].
P : array_like
State covariance [n, n].
F : array_like
State transition matrix [n, n].
Q : array_like
Process noise covariance [n, n].
xp : module
Array module (numpy or cupy).
Returns
-------
x_pred : array
Predicted state.
P_pred : array
Predicted covariance.
"""
x_pred = xp.dot(F, x)
P_pred = xp.dot(xp.dot(F, P), F.T) + Q
return x_pred, P_pred
def kalman_update(x, P, z, H, R, xp=np):
"""
Kalman filter update step.
Parameters
----------
x : array_like
Predicted state [n, 1].
P : array_like
Predicted covariance [n, n].
z : array_like
Measurement [m, 1].
H : array_like
Measurement matrix [m, n].
R : array_like
Measurement noise covariance [m, m].
xp : module
Array module (numpy or cupy).
Returns
-------
x_upd : array
Updated state.
P_upd : array
Updated covariance.
"""
# Innovation
y = z - xp.dot(H, x)
# Innovation covariance
S = xp.dot(xp.dot(H, P), H.T) + R
# Kalman gain
K = xp.dot(xp.dot(P, H.T), xp.linalg.inv(S))
# State update
x_upd = x + xp.dot(K, y)
# Covariance update (Joseph form for numerical stability)
I = xp.eye(P.shape[0])
IKH = I - xp.dot(K, H)
P_upd = xp.dot(xp.dot(IKH, P), IKH.T) + xp.dot(xp.dot(K, R), K.T)
return x_upd, P_upd
print("Kalman filter functions defined.")
Kalman filter functions defined.
[20]:
# Test Kalman filter on CPU and GPU
state_dim = 4 # [x, vx, y, vy]
meas_dim = 2 # [x, y]
dt = 1.0
# State transition (constant velocity)
F = np.array([
[1, dt, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, dt],
[0, 0, 0, 1]
])
# Measurement matrix (observe position only)
H = np.array([
[1, 0, 0, 0],
[0, 0, 1, 0]
])
# Process noise
q = 0.1
Q = q * np.array([
[dt**3/3, dt**2/2, 0, 0],
[dt**2/2, dt, 0, 0],
[0, 0, dt**3/3, dt**2/2],
[0, 0, dt**2/2, dt]
])
# Measurement noise
R = 10 * np.eye(meas_dim)
# Initial state and covariance
x = np.array([0, 1, 0, 1]).reshape(-1, 1)
P = 100 * np.eye(state_dim)
# Measurement
z = np.array([5, 5]).reshape(-1, 1)
print("System matrices defined.")
print(f"State dimension: {state_dim}")
print(f"Measurement dimension: {meas_dim}")
System matrices defined.
State dimension: 4
Measurement dimension: 2
[21]:
# Run on CPU
x_pred_cpu, P_pred_cpu = kalman_predict(x, P, F, Q, xp=np)
x_upd_cpu, P_upd_cpu = kalman_update(x_pred_cpu, P_pred_cpu, z, H, R, xp=np)
print("CPU Results:")
print(f"Predicted state: {x_pred_cpu.flatten()}")
print(f"Updated state: {x_upd_cpu.flatten()}")
if GPU_AVAILABLE:
# Transfer to GPU
x_gpu = cp.asarray(x)
P_gpu = cp.asarray(P)
F_gpu = cp.asarray(F)
Q_gpu = cp.asarray(Q)
H_gpu = cp.asarray(H)
R_gpu = cp.asarray(R)
z_gpu = cp.asarray(z)
# Run on GPU
x_pred_gpu, P_pred_gpu = kalman_predict(x_gpu, P_gpu, F_gpu, Q_gpu, xp=cp)
x_upd_gpu, P_upd_gpu = kalman_update(x_pred_gpu, P_pred_gpu, z_gpu, H_gpu, R_gpu, xp=cp)
print("\nGPU Results:")
print(f"Predicted state: {cp.asnumpy(x_pred_gpu).flatten()}")
print(f"Updated state: {cp.asnumpy(x_upd_gpu).flatten()}")
# Verify results match
error = np.max(np.abs(cp.asnumpy(x_upd_gpu) - x_upd_cpu))
print(f"\nMax difference CPU vs GPU: {error:.2e}")
CPU Results:
Predicted state: [1. 1. 1. 1.]
Updated state: [4.80955404 2.90541184 4.80955404 2.90541184]
3. Batch Processing Multiple Tracks
The real power of GPU acceleration comes from processing many tracks simultaneously. Instead of looping over tracks, we can stack them into batched arrays.
[22]:
def batch_kalman_predict(x_batch, P_batch, F, Q, xp=np):
"""
Batch Kalman filter prediction.
Parameters
----------
x_batch : array_like
Batch of states [batch_size, n, 1].
P_batch : array_like
Batch of covariances [batch_size, n, n].
F : array_like
State transition matrix [n, n] (shared).
Q : array_like
Process noise [n, n] (shared).
xp : module
Array module.
Returns
-------
x_pred : array
Predicted states [batch_size, n, 1].
P_pred : array
Predicted covariances [batch_size, n, n].
"""
batch_size = x_batch.shape[0]
# Vectorized prediction
# x_pred = F @ x for each batch element
x_pred = xp.einsum('ij,bjk->bik', F, x_batch)
# P_pred = F @ P @ F.T + Q for each batch element
FP = xp.einsum('ij,bjk->bik', F, P_batch)
FPFt = xp.einsum('bij,kj->bik', FP, F)
P_pred = FPFt + Q
return x_pred, P_pred
def batch_kalman_update(x_batch, P_batch, z_batch, H, R, xp=np):
"""
Batch Kalman filter update.
Parameters
----------
x_batch : array_like
Batch of predicted states [batch_size, n, 1].
P_batch : array_like
Batch of predicted covariances [batch_size, n, n].
z_batch : array_like
Batch of measurements [batch_size, m, 1].
H : array_like
Measurement matrix [m, n] (shared).
R : array_like
Measurement noise [m, m] (shared).
xp : module
Array module.
Returns
-------
x_upd : array
Updated states [batch_size, n, 1].
P_upd : array
Updated covariances [batch_size, n, n].
"""
batch_size = x_batch.shape[0]
n = x_batch.shape[1]
# Innovation: y = z - H @ x
Hx = xp.einsum('ij,bjk->bik', H, x_batch)
y = z_batch - Hx
# Innovation covariance: S = H @ P @ H.T + R
HP = xp.einsum('ij,bjk->bik', H, P_batch)
HPHt = xp.einsum('bij,kj->bik', HP, H)
S = HPHt + R
# Kalman gain: K = P @ H.T @ S^{-1}
PHt = xp.einsum('bij,kj->bik', P_batch, H)
S_inv = xp.linalg.inv(S)
K = xp.einsum('bij,bjk->bik', PHt, S_inv)
# State update: x_upd = x + K @ y
Ky = xp.einsum('bij,bjk->bik', K, y)
x_upd = x_batch + Ky
# Covariance update (simplified form)
I = xp.eye(n)
KH = xp.einsum('bij,jk->bik', K, H)
IKH = I - KH
P_upd = xp.einsum('bij,bjk->bik', IKH, P_batch)
return x_upd, P_upd
print("Batch Kalman functions defined.")
Batch Kalman functions defined.
[23]:
# Benchmark batch processing
batch_sizes = [10, 100, 1000, 5000]
if not GPU_AVAILABLE:
batch_sizes = [10, 100, 1000] # Smaller sizes for CPU-only
cpu_times = []
gpu_times = []
for batch_size in batch_sizes:
# Create batch data
x_batch_np = np.random.randn(batch_size, state_dim, 1)
P_batch_np = np.tile(P.reshape(1, state_dim, state_dim), (batch_size, 1, 1))
z_batch_np = np.random.randn(batch_size, meas_dim, 1)
# CPU timing
n_iterations = 100
start = time.time()
for _ in range(n_iterations):
x_pred, P_pred = batch_kalman_predict(x_batch_np, P_batch_np, F, Q, xp=np)
x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_np, H, R, xp=np)
cpu_time = (time.time() - start) / n_iterations
cpu_times.append(cpu_time)
if GPU_AVAILABLE:
# Transfer to GPU
x_batch_gpu = cp.asarray(x_batch_np)
P_batch_gpu = cp.asarray(P_batch_np)
z_batch_gpu = cp.asarray(z_batch_np)
# Warm up
x_pred, P_pred = batch_kalman_predict(x_batch_gpu, P_batch_gpu, F_gpu, Q_gpu, xp=cp)
x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_gpu, H_gpu, R_gpu, xp=cp)
cp.cuda.Stream.null.synchronize()
# GPU timing
start = time.time()
for _ in range(n_iterations):
x_pred, P_pred = batch_kalman_predict(x_batch_gpu, P_batch_gpu, F_gpu, Q_gpu, xp=cp)
x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_gpu, H_gpu, R_gpu, xp=cp)
cp.cuda.Stream.null.synchronize()
gpu_time = (time.time() - start) / n_iterations
gpu_times.append(gpu_time)
speedup = cpu_time / gpu_time
print(f"Batch size {batch_size:5d}: CPU={cpu_time*1e3:7.3f}ms, "
f"GPU={gpu_time*1e3:7.3f}ms, Speedup={speedup:.1f}x")
else:
print(f"Batch size {batch_size:5d}: CPU={cpu_time*1e3:7.3f}ms")
Batch size 10: CPU= 0.043ms
Batch size 100: CPU= 0.146ms
Batch size 1000: CPU= 1.205ms
[24]:
# Visualize performance
if GPU_AVAILABLE and len(gpu_times) > 0:
fig = make_subplots(rows=1, cols=2,
subplot_titles=['Kalman Filter Batch Performance', 'GPU Speedup Factor'])
fig.add_trace(
go.Scatter(x=batch_sizes, y=np.array(cpu_times)*1e3, mode='lines+markers',
name='CPU (NumPy)', line=dict(color='#00d4ff', width=2),
marker=dict(size=8)),
row=1, col=1
)
fig.add_trace(
go.Scatter(x=batch_sizes, y=np.array(gpu_times)*1e3, mode='lines+markers',
name='GPU (CuPy)', line=dict(color='#ff4757', width=2),
marker=dict(size=8, symbol='square')),
row=1, col=1
)
speedups = np.array(cpu_times) / np.array(gpu_times)
fig.add_trace(
go.Scatter(x=batch_sizes, y=speedups, mode='lines+markers',
name='Speedup', line=dict(color='#00ff88', width=2),
marker=dict(size=10, symbol='triangle-up')),
row=1, col=2
)
fig.add_trace(
go.Scatter(x=[batch_sizes[0], batch_sizes[-1]], y=[1, 1], mode='lines',
line=dict(color='white', dash='dash', width=1),
showlegend=False),
row=1, col=2
)
fig.update_xaxes(type='log', title_text='Batch Size (number of tracks)', row=1, col=1)
fig.update_yaxes(type='log', title_text='Time per iteration (ms)', row=1, col=1)
fig.update_xaxes(type='log', title_text='Batch Size (number of tracks)', row=1, col=2)
fig.update_yaxes(title_text='Speedup (CPU time / GPU time)', row=1, col=2)
fig.update_layout(template=dark_template, height=400)
fig.show()
else:
fig = go.Figure()
fig.add_trace(
go.Scatter(x=batch_sizes, y=np.array(cpu_times)*1e3, mode='lines+markers',
name='CPU (NumPy)', line=dict(color='#00d4ff', width=2),
marker=dict(size=8))
)
fig.update_layout(
template=dark_template,
title='Kalman Filter CPU Performance',
xaxis_title='Batch Size',
xaxis_type='log',
yaxis_title='Time (ms)',
yaxis_type='log',
height=400
)
fig.show()
Data type cannot be displayed: application/vnd.plotly.v1+json
4. GPU-Accelerated Particle Filter
Particle filters involve operations on large numbers of particles that parallelize well:
Particle propagation: Apply dynamics to each particle
Weight computation: Evaluate likelihood for each particle
Resampling: Select particles based on weights
[25]:
def particle_filter_step(particles, weights, z, f_dynamics, h_meas, Q_std, R_std, xp=np):
"""
GPU-accelerated particle filter step.
Parameters
----------
particles : array_like
Particle states [n_particles, state_dim].
weights : array_like
Particle weights [n_particles].
z : array_like
Measurement [meas_dim].
f_dynamics : callable
Dynamics function f(x) -> x_next.
h_meas : callable
Measurement function h(x) -> z_pred.
Q_std : float
Process noise standard deviation.
R_std : float
Measurement noise standard deviation.
xp : module
Array module.
Returns
-------
particles : array
Updated particles.
weights : array
Updated weights.
"""
n_particles = particles.shape[0]
state_dim = particles.shape[1]
# Prediction: propagate particles through dynamics + noise
particles_pred = f_dynamics(particles, xp)
particles_pred += Q_std * xp.random.randn(n_particles, state_dim)
# Update: compute likelihood weights
z_pred = h_meas(particles_pred, xp)
innovation = z - z_pred
# Gaussian likelihood (vectorized)
log_likelihood = -0.5 * xp.sum(innovation**2, axis=1) / R_std**2
# Update weights
log_weights = xp.log(weights + 1e-300) + log_likelihood
log_weights = log_weights - xp.max(log_weights) # Numerical stability
weights = xp.exp(log_weights)
weights = weights / xp.sum(weights) # Normalize
# Compute effective sample size
ess = 1.0 / xp.sum(weights**2)
# Resample if ESS is too low
if float(ess) < n_particles / 2:
# Systematic resampling
indices = systematic_resample(weights, xp)
particles_pred = particles_pred[indices]
weights = xp.ones(n_particles) / n_particles
return particles_pred, weights, float(ess)
def systematic_resample(weights, xp=np):
"""
Systematic resampling for particle filters.
Parameters
----------
weights : array_like
Normalized weights [n_particles].
xp : module
Array module.
Returns
-------
indices : array
Resampled indices.
"""
n = len(weights)
# Generate systematic points
positions = (xp.arange(n) + xp.random.rand()) / n
# Cumulative sum of weights
cumsum = xp.cumsum(weights)
# Find indices using searchsorted
indices = xp.searchsorted(cumsum, positions)
indices = xp.clip(indices, 0, n - 1)
return indices
print("Particle filter functions defined.")
Particle filter functions defined.
[26]:
# Define nonlinear dynamics and measurement
def f_nonlinear(x, xp=np):
"""Nonlinear dynamics: x_next = x/2 + 25*x/(1+x^2) + 8*cos(1.2*k)"""
return x / 2 + 25 * x / (1 + x**2 + 1e-10)
def h_nonlinear(x, xp=np):
"""Nonlinear measurement: z = x^2 / 20"""
return x**2 / 20
# Parameters
n_particles_list = [1000, 5000, 10000, 50000]
if not GPU_AVAILABLE:
n_particles_list = [1000, 5000, 10000]
state_dim_pf = 1
Q_std = 3.0
R_std = 1.0
n_steps = 50
n_trials = 5
cpu_pf_times = []
gpu_pf_times = []
print("Benchmarking particle filter...")
for n_particles in n_particles_list:
# Generate true trajectory and measurements
np.random.seed(42)
true_state = [0.0]
measurements = []
for k in range(n_steps):
x_new = f_nonlinear(np.array([[true_state[-1]]]))[0, 0] + Q_std * np.random.randn()
true_state.append(x_new)
z = h_nonlinear(np.array([[x_new]]))[0, 0] + R_std * np.random.randn()
measurements.append(z)
# CPU timing
cpu_times_trial = []
for _ in range(n_trials):
particles = np.random.randn(n_particles, state_dim_pf) * 5
weights = np.ones(n_particles) / n_particles
start = time.time()
for z in measurements:
z_arr = np.array([[z]])
particles, weights, ess = particle_filter_step(
particles, weights, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=np
)
cpu_times_trial.append(time.time() - start)
cpu_pf_times.append(np.mean(cpu_times_trial))
if GPU_AVAILABLE:
# GPU timing
gpu_times_trial = []
for _ in range(n_trials):
particles_gpu = cp.random.randn(n_particles, state_dim_pf) * 5
weights_gpu = cp.ones(n_particles) / n_particles
# Warm up
z_arr = cp.array([[measurements[0]]])
_, _, _ = particle_filter_step(
particles_gpu, weights_gpu, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=cp
)
cp.cuda.Stream.null.synchronize()
particles_gpu = cp.random.randn(n_particles, state_dim_pf) * 5
weights_gpu = cp.ones(n_particles) / n_particles
start = time.time()
for z in measurements:
z_arr = cp.array([[z]])
particles_gpu, weights_gpu, ess = particle_filter_step(
particles_gpu, weights_gpu, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=cp
)
cp.cuda.Stream.null.synchronize()
gpu_times_trial.append(time.time() - start)
gpu_pf_times.append(np.mean(gpu_times_trial))
speedup = cpu_pf_times[-1] / gpu_pf_times[-1]
print(f"N={n_particles:5d}: CPU={cpu_pf_times[-1]*1e3:7.1f}ms, "
f"GPU={gpu_pf_times[-1]*1e3:7.1f}ms, Speedup={speedup:.1f}x")
else:
print(f"N={n_particles:5d}: CPU={cpu_pf_times[-1]*1e3:7.1f}ms")
Benchmarking particle filter...
N= 1000: CPU= 2.7ms
N= 5000: CPU= 9.8ms
N=10000: CPU= 18.5ms
[27]:
# Visualize particle filter performance
if GPU_AVAILABLE and len(gpu_pf_times) > 0:
fig = make_subplots(rows=1, cols=2,
subplot_titles=[f'Particle Filter ({n_steps} steps)', 'GPU Speedup for Particle Filter'])
fig.add_trace(
go.Scatter(x=n_particles_list, y=np.array(cpu_pf_times)*1e3, mode='lines+markers',
name='CPU', line=dict(color='#00d4ff', width=2),
marker=dict(size=8)),
row=1, col=1
)
fig.add_trace(
go.Scatter(x=n_particles_list, y=np.array(gpu_pf_times)*1e3, mode='lines+markers',
name='GPU', line=dict(color='#ff4757', width=2),
marker=dict(size=8, symbol='square')),
row=1, col=1
)
speedups = np.array(cpu_pf_times) / np.array(gpu_pf_times)
fig.add_trace(
go.Scatter(x=n_particles_list, y=speedups, mode='lines+markers',
name='Speedup', line=dict(color='#00ff88', width=2),
marker=dict(size=10, symbol='triangle-up')),
row=1, col=2
)
fig.add_trace(
go.Scatter(x=[n_particles_list[0], n_particles_list[-1]], y=[1, 1], mode='lines',
line=dict(color='white', dash='dash', width=1),
showlegend=False),
row=1, col=2
)
fig.update_xaxes(type='log', title_text='Number of Particles', row=1, col=1)
fig.update_yaxes(type='log', title_text='Time (ms)', row=1, col=1)
fig.update_xaxes(type='log', title_text='Number of Particles', row=1, col=2)
fig.update_yaxes(title_text='Speedup Factor', row=1, col=2)
fig.update_layout(template=dark_template, height=400)
fig.show()
5. Memory Management Best Practices
Efficient GPU memory management is crucial for large-scale tracking applications.
[28]:
if GPU_AVAILABLE:
print("GPU Memory Management Best Practices")
print("=" * 50)
# 1. Memory pools
print("\n1. Memory Pools")
mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
print(f" GPU memory pool used: {mempool.used_bytes() / 1e6:.2f} MB")
print(f" Pinned memory pool used: {pinned_mempool.n_free_blocks()} free blocks")
# 2. Pre-allocation
print("\n2. Pre-allocation for Repeated Operations")
n = 1000
# Bad: Allocate new arrays each iteration
start = time.time()
for _ in range(1000):
temp = cp.zeros((n, n))
result = cp.dot(temp, temp)
cp.cuda.Stream.null.synchronize()
time_alloc = time.time() - start
# Good: Reuse pre-allocated arrays
temp = cp.zeros((n, n))
result = cp.zeros((n, n))
start = time.time()
for _ in range(1000):
cp.dot(temp, temp, out=result)
cp.cuda.Stream.null.synchronize()
time_reuse = time.time() - start
print(f" With allocation each time: {time_alloc*1e3:.1f} ms")
print(f" With pre-allocated arrays: {time_reuse*1e3:.1f} ms")
print(f" Speedup: {time_alloc/time_reuse:.1f}x")
# 3. Async transfers
print("\n3. Asynchronous Data Transfers")
data_cpu = np.random.randn(10000, 1000)
# Synchronous
start = time.time()
data_gpu = cp.asarray(data_cpu)
cp.cuda.Stream.null.synchronize()
time_sync = time.time() - start
# With pinned memory (faster transfers)
data_pinned = cp.cuda.alloc_pinned_memory(data_cpu.nbytes)
data_view = np.frombuffer(data_pinned, dtype=data_cpu.dtype).reshape(data_cpu.shape)
np.copyto(data_view, data_cpu)
start = time.time()
data_gpu2 = cp.asarray(data_view)
cp.cuda.Stream.null.synchronize()
time_pinned = time.time() - start
print(f" Standard transfer: {time_sync*1e3:.2f} ms")
print(f" Pinned memory transfer: {time_pinned*1e3:.2f} ms")
# Clean up
del data_gpu, data_gpu2, temp, result
mempool.free_all_blocks()
6. Integration with pyTCL
Here’s how to integrate GPU acceleration with existing pyTCL code.
[29]:
# Import from pyTCL
from pytcl.dynamic_estimation.kalman import kf_predict, kf_update
def gpu_accelerated_tracking(measurements, F, H, Q, R, x0, P0, use_gpu=True):
"""
Run Kalman filter tracking with optional GPU acceleration.
Parameters
----------
measurements : array_like
List of measurements [n_timesteps, meas_dim].
F, H, Q, R : array_like
Kalman filter matrices.
x0, P0 : array_like
Initial state and covariance.
use_gpu : bool
Whether to use GPU acceleration.
Returns
-------
estimates : array
State estimates [n_timesteps, state_dim].
covariances : array
Covariance estimates [n_timesteps, state_dim, state_dim].
"""
if use_gpu and GPU_AVAILABLE:
xp = cp
F = cp.asarray(F)
H = cp.asarray(H)
Q = cp.asarray(Q)
R = cp.asarray(R)
x = cp.asarray(x0)
P = cp.asarray(P0)
measurements = cp.asarray(measurements)
else:
xp = np
x = x0.copy()
P = P0.copy()
estimates = []
covariances = []
for z in measurements:
# Predict
x, P = kalman_predict(x, P, F, Q, xp)
# Update
z = z.reshape(-1, 1)
x, P = kalman_update(x, P, z, H, R, xp)
estimates.append(x.flatten())
covariances.append(P.copy())
if use_gpu and GPU_AVAILABLE:
estimates = cp.asnumpy(cp.array(estimates))
covariances = cp.asnumpy(cp.array(covariances))
else:
estimates = np.array([xp.asnumpy(e) if hasattr(xp, 'asnumpy') else e for e in estimates])
covariances = np.array([xp.asnumpy(c) if hasattr(xp, 'asnumpy') else c for c in covariances])
return np.array(estimates), np.array(covariances)
print("GPU-accelerated tracking function defined.")
GPU-accelerated tracking function defined.
[30]:
# Run tracking example
# Generate trajectory
n_timesteps = 100
true_state = []
measurements_list = []
x_true = np.array([0, 1, 0, 0.5]) # Start at origin, moving NE
for t in range(n_timesteps):
true_state.append(x_true.copy())
# Propagate
x_true = F @ x_true + np.random.multivariate_normal(np.zeros(4), Q * 0.1)
# Generate noisy measurement
z = H @ x_true + np.random.multivariate_normal(np.zeros(2), R)
measurements_list.append(z)
true_state = np.array(true_state)
measurements_arr = np.array(measurements_list)
# Run tracking
x0 = np.array([0, 0, 0, 0]).reshape(-1, 1)
P0 = 100 * np.eye(4)
estimates_cpu, _ = gpu_accelerated_tracking(
measurements_arr, F, H, Q, R, x0, P0, use_gpu=False
)
if GPU_AVAILABLE:
estimates_gpu, _ = gpu_accelerated_tracking(
measurements_arr, F, H, Q, R, x0, P0, use_gpu=True
)
print(f"Tracking complete: {n_timesteps} time steps")
Tracking complete: 100 time steps
[31]:
# Visualize results
fig = make_subplots(rows=1, cols=2, subplot_titles=['Tracking Results', 'Tracking Error'])
# Trajectory plot
fig.add_trace(
go.Scatter(x=true_state[:, 0], y=true_state[:, 2], mode='lines',
name='True', line=dict(color='#00ff88', width=2)),
row=1, col=1
)
fig.add_trace(
go.Scatter(x=measurements_arr[:, 0], y=measurements_arr[:, 1], mode='markers',
name='Measurements', marker=dict(color='gray', size=4, opacity=0.5)),
row=1, col=1
)
fig.add_trace(
go.Scatter(x=estimates_cpu[:, 0], y=estimates_cpu[:, 2], mode='lines',
name='CPU Estimate', line=dict(color='#00d4ff', width=1.5, dash='dash')),
row=1, col=1
)
if GPU_AVAILABLE:
fig.add_trace(
go.Scatter(x=estimates_gpu[:, 0], y=estimates_gpu[:, 2], mode='lines',
name='GPU Estimate', line=dict(color='#ff4757', width=1.5, dash='dot')),
row=1, col=1
)
# Error plot
error_cpu = np.sqrt((estimates_cpu[:, 0] - true_state[:, 0])**2 +
(estimates_cpu[:, 2] - true_state[:, 2])**2)
fig.add_trace(
go.Scatter(y=error_cpu, mode='lines', name='CPU Error',
line=dict(color='#00d4ff', width=1.5)),
row=1, col=2
)
if GPU_AVAILABLE:
error_gpu = np.sqrt((estimates_gpu[:, 0] - true_state[:, 0])**2 +
(estimates_gpu[:, 2] - true_state[:, 2])**2)
fig.add_trace(
go.Scatter(y=error_gpu, mode='lines', name='GPU Error',
line=dict(color='#ff4757', width=1.5, dash='dash')),
row=1, col=2
)
fig.update_xaxes(title_text='X Position', row=1, col=1)
fig.update_yaxes(title_text='Y Position', row=1, col=1, scaleanchor='x', scaleratio=1)
fig.update_xaxes(title_text='Time Step', row=1, col=2)
fig.update_yaxes(title_text='Position Error', row=1, col=2)
fig.update_layout(template=dark_template, height=450)
fig.show()
print(f"CPU RMSE: {np.sqrt(np.mean(error_cpu**2)):.4f}")
if GPU_AVAILABLE:
print(f"GPU RMSE: {np.sqrt(np.mean(error_gpu**2)):.4f}")
print(f"Max difference: {np.max(np.abs(estimates_cpu - estimates_gpu)):.2e}")
Data type cannot be displayed: application/vnd.plotly.v1+json
CPU RMSE: 2.9987
Summary
Key takeaways for GPU acceleration:
CuPy provides NumPy compatibility - Easy migration with minimal code changes
Batch processing is key - GPU shines when processing many tracks/particles simultaneously
Memory management matters - Pre-allocate, use pinned memory, free unused memory
Transfer overhead - Minimize CPU↔GPU transfers for best performance
Problem size determines speedup - Small problems may not benefit from GPU
Performance Guidelines
Scenario |
Recommendation |
|---|---|
< 100 tracks |
CPU (transfer overhead dominates) |
100-1000 tracks |
GPU beneficial for complex filters |
> 1000 tracks |
GPU strongly recommended |
< 1000 particles |
CPU usually faster |
> 10000 particles |
GPU provides significant speedup |
Exercises
Implement GPU-accelerated UKF with sigma point generation on GPU
Add GPU memory pooling for repeated filter operations
Benchmark different matrix sizes to find the CPU/GPU crossover point
Implement GPU-accelerated JPDA likelihood computation
References
CuPy Documentation: https://docs.cupy.dev/
CUDA Programming Guide: https://docs.nvidia.com/cuda/
Gustafsson, F. (2010). Particle Methods for Tracking.