GPU Acceleration: CuPy Integration for Tracking

This notebook demonstrates GPU acceleration techniques for tracking algorithms using CuPy. We cover:

  1. CuPy Basics - GPU array operations and NumPy compatibility

  2. Accelerating Matrix Operations - Core operations for Kalman filters

  3. Batch Processing - Processing multiple tracks in parallel

  4. Particle Filter Acceleration - GPU-based resampling and weight computation

  5. Performance Comparison - Benchmarking CPU vs GPU

Prerequisites

# GPU-enabled version
pip install cupy-cuda12x  # For CUDA 12.x
# or pip install cupy-cuda11x for CUDA 11.x

pip install nrl-tracker plotly numpy

Note: This notebook requires an NVIDIA GPU with CUDA support. If running without a GPU, code examples will demonstrate concepts with NumPy fallbacks.

[16]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import time

# Check for GPU availability
try:
    import cupy as cp
    GPU_AVAILABLE = True
    print(f"CuPy version: {cp.__version__}")
    print(f"CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
    print(f"GPU: {cp.cuda.Device().name}")
    print(f"GPU memory: {cp.cuda.Device().mem_info[1] / 1e9:.1f} GB")
except ImportError:
    GPU_AVAILABLE = False
    print("CuPy not available. Running in CPU-only mode.")
    print("Install with: pip install cupy-cuda12x")

# Use appropriate array module
xp = cp if GPU_AVAILABLE else np

np.random.seed(42)

# Plotly dark theme template
dark_template = go.layout.Template()
dark_template.layout = go.Layout(
    paper_bgcolor='#0d1117',
    plot_bgcolor='#0d1117',
    font=dict(color='#e6edf3'),
    xaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),
    yaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),
)
CuPy not available. Running in CPU-only mode.
Install with: pip install cupy-cuda12x

1. CuPy Basics

CuPy provides a NumPy-compatible interface for GPU computing. Arrays and operations mirror NumPy exactly, making migration straightforward.

Key Concepts

NumPy

CuPy

Location

numpy.array()

cupy.array()

GPU

numpy.zeros()

cupy.zeros()

GPU

numpy.linalg.inv()

cupy.linalg.inv()

GPU

Transfer to GPU

cupy.asarray(np_array)

CPU→GPU

Transfer to CPU

cupy.asnumpy(cp_array)

GPU→CPU

[17]:
# Basic array operations
if GPU_AVAILABLE:
    # Create arrays on GPU
    a_gpu = cp.array([1, 2, 3, 4, 5])
    b_gpu = cp.array([5, 4, 3, 2, 1])

    # Operations happen on GPU
    c_gpu = a_gpu + b_gpu
    d_gpu = cp.dot(a_gpu, b_gpu)

    print(f"a + b = {c_gpu}")
    print(f"a · b = {d_gpu}")
    print(f"Array type: {type(c_gpu)}")

    # Transfer back to CPU when needed
    c_cpu = cp.asnumpy(c_gpu)
    print(f"Transferred type: {type(c_cpu)}")
else:
    # NumPy fallback
    a_cpu = np.array([1, 2, 3, 4, 5])
    b_cpu = np.array([5, 4, 3, 2, 1])
    print(f"a + b = {a_cpu + b_cpu}")
    print(f"a · b = {np.dot(a_cpu, b_cpu)}")
a + b = [6 6 6 6 6]
a · b = 35
[18]:
# Memory management
if GPU_AVAILABLE:
    # Check memory before
    mempool = cp.get_default_memory_pool()
    print(f"Memory used before: {mempool.used_bytes() / 1e6:.2f} MB")

    # Create large array
    large_array = cp.random.randn(10000, 10000)
    print(f"Memory after 10000x10000 float64: {mempool.used_bytes() / 1e6:.2f} MB")

    # Free memory explicitly
    del large_array
    mempool.free_all_blocks()
    print(f"Memory after free: {mempool.used_bytes() / 1e6:.2f} MB")

2. Accelerating Matrix Operations

Kalman filters involve repeated matrix operations that benefit from GPU parallelization:

  • Matrix multiplication: \(P \cdot H^T\)

  • Matrix inversion: \((HPH^T + R)^{-1}\)

  • Cholesky decomposition: \(P = LL^T\)

Let’s implement core Kalman filter operations for both CPU and GPU.

[19]:
def kalman_predict(x, P, F, Q, xp=np):
    """
    Kalman filter prediction step.

    Parameters
    ----------
    x : array_like
        State estimate [n, 1].
    P : array_like
        State covariance [n, n].
    F : array_like
        State transition matrix [n, n].
    Q : array_like
        Process noise covariance [n, n].
    xp : module
        Array module (numpy or cupy).

    Returns
    -------
    x_pred : array
        Predicted state.
    P_pred : array
        Predicted covariance.
    """
    x_pred = xp.dot(F, x)
    P_pred = xp.dot(xp.dot(F, P), F.T) + Q
    return x_pred, P_pred


def kalman_update(x, P, z, H, R, xp=np):
    """
    Kalman filter update step.

    Parameters
    ----------
    x : array_like
        Predicted state [n, 1].
    P : array_like
        Predicted covariance [n, n].
    z : array_like
        Measurement [m, 1].
    H : array_like
        Measurement matrix [m, n].
    R : array_like
        Measurement noise covariance [m, m].
    xp : module
        Array module (numpy or cupy).

    Returns
    -------
    x_upd : array
        Updated state.
    P_upd : array
        Updated covariance.
    """
    # Innovation
    y = z - xp.dot(H, x)

    # Innovation covariance
    S = xp.dot(xp.dot(H, P), H.T) + R

    # Kalman gain
    K = xp.dot(xp.dot(P, H.T), xp.linalg.inv(S))

    # State update
    x_upd = x + xp.dot(K, y)

    # Covariance update (Joseph form for numerical stability)
    I = xp.eye(P.shape[0])
    IKH = I - xp.dot(K, H)
    P_upd = xp.dot(xp.dot(IKH, P), IKH.T) + xp.dot(xp.dot(K, R), K.T)

    return x_upd, P_upd


print("Kalman filter functions defined.")
Kalman filter functions defined.
[20]:
# Test Kalman filter on CPU and GPU
state_dim = 4  # [x, vx, y, vy]
meas_dim = 2   # [x, y]

dt = 1.0

# State transition (constant velocity)
F = np.array([
    [1, dt, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, dt],
    [0, 0, 0, 1]
])

# Measurement matrix (observe position only)
H = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0]
])

# Process noise
q = 0.1
Q = q * np.array([
    [dt**3/3, dt**2/2, 0, 0],
    [dt**2/2, dt, 0, 0],
    [0, 0, dt**3/3, dt**2/2],
    [0, 0, dt**2/2, dt]
])

# Measurement noise
R = 10 * np.eye(meas_dim)

# Initial state and covariance
x = np.array([0, 1, 0, 1]).reshape(-1, 1)
P = 100 * np.eye(state_dim)

# Measurement
z = np.array([5, 5]).reshape(-1, 1)

print("System matrices defined.")
print(f"State dimension: {state_dim}")
print(f"Measurement dimension: {meas_dim}")
System matrices defined.
State dimension: 4
Measurement dimension: 2
[21]:
# Run on CPU
x_pred_cpu, P_pred_cpu = kalman_predict(x, P, F, Q, xp=np)
x_upd_cpu, P_upd_cpu = kalman_update(x_pred_cpu, P_pred_cpu, z, H, R, xp=np)

print("CPU Results:")
print(f"Predicted state: {x_pred_cpu.flatten()}")
print(f"Updated state: {x_upd_cpu.flatten()}")

if GPU_AVAILABLE:
    # Transfer to GPU
    x_gpu = cp.asarray(x)
    P_gpu = cp.asarray(P)
    F_gpu = cp.asarray(F)
    Q_gpu = cp.asarray(Q)
    H_gpu = cp.asarray(H)
    R_gpu = cp.asarray(R)
    z_gpu = cp.asarray(z)

    # Run on GPU
    x_pred_gpu, P_pred_gpu = kalman_predict(x_gpu, P_gpu, F_gpu, Q_gpu, xp=cp)
    x_upd_gpu, P_upd_gpu = kalman_update(x_pred_gpu, P_pred_gpu, z_gpu, H_gpu, R_gpu, xp=cp)

    print("\nGPU Results:")
    print(f"Predicted state: {cp.asnumpy(x_pred_gpu).flatten()}")
    print(f"Updated state: {cp.asnumpy(x_upd_gpu).flatten()}")

    # Verify results match
    error = np.max(np.abs(cp.asnumpy(x_upd_gpu) - x_upd_cpu))
    print(f"\nMax difference CPU vs GPU: {error:.2e}")
CPU Results:
Predicted state: [1. 1. 1. 1.]
Updated state: [4.80955404 2.90541184 4.80955404 2.90541184]

3. Batch Processing Multiple Tracks

The real power of GPU acceleration comes from processing many tracks simultaneously. Instead of looping over tracks, we can stack them into batched arrays.

[22]:
def batch_kalman_predict(x_batch, P_batch, F, Q, xp=np):
    """
    Batch Kalman filter prediction.

    Parameters
    ----------
    x_batch : array_like
        Batch of states [batch_size, n, 1].
    P_batch : array_like
        Batch of covariances [batch_size, n, n].
    F : array_like
        State transition matrix [n, n] (shared).
    Q : array_like
        Process noise [n, n] (shared).
    xp : module
        Array module.

    Returns
    -------
    x_pred : array
        Predicted states [batch_size, n, 1].
    P_pred : array
        Predicted covariances [batch_size, n, n].
    """
    batch_size = x_batch.shape[0]

    # Vectorized prediction
    # x_pred = F @ x for each batch element
    x_pred = xp.einsum('ij,bjk->bik', F, x_batch)

    # P_pred = F @ P @ F.T + Q for each batch element
    FP = xp.einsum('ij,bjk->bik', F, P_batch)
    FPFt = xp.einsum('bij,kj->bik', FP, F)
    P_pred = FPFt + Q

    return x_pred, P_pred


def batch_kalman_update(x_batch, P_batch, z_batch, H, R, xp=np):
    """
    Batch Kalman filter update.

    Parameters
    ----------
    x_batch : array_like
        Batch of predicted states [batch_size, n, 1].
    P_batch : array_like
        Batch of predicted covariances [batch_size, n, n].
    z_batch : array_like
        Batch of measurements [batch_size, m, 1].
    H : array_like
        Measurement matrix [m, n] (shared).
    R : array_like
        Measurement noise [m, m] (shared).
    xp : module
        Array module.

    Returns
    -------
    x_upd : array
        Updated states [batch_size, n, 1].
    P_upd : array
        Updated covariances [batch_size, n, n].
    """
    batch_size = x_batch.shape[0]
    n = x_batch.shape[1]

    # Innovation: y = z - H @ x
    Hx = xp.einsum('ij,bjk->bik', H, x_batch)
    y = z_batch - Hx

    # Innovation covariance: S = H @ P @ H.T + R
    HP = xp.einsum('ij,bjk->bik', H, P_batch)
    HPHt = xp.einsum('bij,kj->bik', HP, H)
    S = HPHt + R

    # Kalman gain: K = P @ H.T @ S^{-1}
    PHt = xp.einsum('bij,kj->bik', P_batch, H)
    S_inv = xp.linalg.inv(S)
    K = xp.einsum('bij,bjk->bik', PHt, S_inv)

    # State update: x_upd = x + K @ y
    Ky = xp.einsum('bij,bjk->bik', K, y)
    x_upd = x_batch + Ky

    # Covariance update (simplified form)
    I = xp.eye(n)
    KH = xp.einsum('bij,jk->bik', K, H)
    IKH = I - KH
    P_upd = xp.einsum('bij,bjk->bik', IKH, P_batch)

    return x_upd, P_upd


print("Batch Kalman functions defined.")
Batch Kalman functions defined.
[23]:
# Benchmark batch processing
batch_sizes = [10, 100, 1000, 5000]

if not GPU_AVAILABLE:
    batch_sizes = [10, 100, 1000]  # Smaller sizes for CPU-only

cpu_times = []
gpu_times = []

for batch_size in batch_sizes:
    # Create batch data
    x_batch_np = np.random.randn(batch_size, state_dim, 1)
    P_batch_np = np.tile(P.reshape(1, state_dim, state_dim), (batch_size, 1, 1))
    z_batch_np = np.random.randn(batch_size, meas_dim, 1)

    # CPU timing
    n_iterations = 100
    start = time.time()
    for _ in range(n_iterations):
        x_pred, P_pred = batch_kalman_predict(x_batch_np, P_batch_np, F, Q, xp=np)
        x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_np, H, R, xp=np)
    cpu_time = (time.time() - start) / n_iterations
    cpu_times.append(cpu_time)

    if GPU_AVAILABLE:
        # Transfer to GPU
        x_batch_gpu = cp.asarray(x_batch_np)
        P_batch_gpu = cp.asarray(P_batch_np)
        z_batch_gpu = cp.asarray(z_batch_np)

        # Warm up
        x_pred, P_pred = batch_kalman_predict(x_batch_gpu, P_batch_gpu, F_gpu, Q_gpu, xp=cp)
        x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_gpu, H_gpu, R_gpu, xp=cp)
        cp.cuda.Stream.null.synchronize()

        # GPU timing
        start = time.time()
        for _ in range(n_iterations):
            x_pred, P_pred = batch_kalman_predict(x_batch_gpu, P_batch_gpu, F_gpu, Q_gpu, xp=cp)
            x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_gpu, H_gpu, R_gpu, xp=cp)
        cp.cuda.Stream.null.synchronize()
        gpu_time = (time.time() - start) / n_iterations
        gpu_times.append(gpu_time)

        speedup = cpu_time / gpu_time
        print(f"Batch size {batch_size:5d}: CPU={cpu_time*1e3:7.3f}ms, "
              f"GPU={gpu_time*1e3:7.3f}ms, Speedup={speedup:.1f}x")
    else:
        print(f"Batch size {batch_size:5d}: CPU={cpu_time*1e3:7.3f}ms")
Batch size    10: CPU=  0.043ms
Batch size   100: CPU=  0.146ms
Batch size  1000: CPU=  1.205ms
[24]:
# Visualize performance
if GPU_AVAILABLE and len(gpu_times) > 0:
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=['Kalman Filter Batch Performance', 'GPU Speedup Factor'])

    fig.add_trace(
        go.Scatter(x=batch_sizes, y=np.array(cpu_times)*1e3, mode='lines+markers',
                   name='CPU (NumPy)', line=dict(color='#00d4ff', width=2),
                   marker=dict(size=8)),
        row=1, col=1
    )
    fig.add_trace(
        go.Scatter(x=batch_sizes, y=np.array(gpu_times)*1e3, mode='lines+markers',
                   name='GPU (CuPy)', line=dict(color='#ff4757', width=2),
                   marker=dict(size=8, symbol='square')),
        row=1, col=1
    )

    speedups = np.array(cpu_times) / np.array(gpu_times)
    fig.add_trace(
        go.Scatter(x=batch_sizes, y=speedups, mode='lines+markers',
                   name='Speedup', line=dict(color='#00ff88', width=2),
                   marker=dict(size=10, symbol='triangle-up')),
        row=1, col=2
    )
    fig.add_trace(
        go.Scatter(x=[batch_sizes[0], batch_sizes[-1]], y=[1, 1], mode='lines',
                   line=dict(color='white', dash='dash', width=1),
                   showlegend=False),
        row=1, col=2
    )

    fig.update_xaxes(type='log', title_text='Batch Size (number of tracks)', row=1, col=1)
    fig.update_yaxes(type='log', title_text='Time per iteration (ms)', row=1, col=1)
    fig.update_xaxes(type='log', title_text='Batch Size (number of tracks)', row=1, col=2)
    fig.update_yaxes(title_text='Speedup (CPU time / GPU time)', row=1, col=2)

    fig.update_layout(template=dark_template, height=400)
    fig.show()
else:
    fig = go.Figure()
    fig.add_trace(
        go.Scatter(x=batch_sizes, y=np.array(cpu_times)*1e3, mode='lines+markers',
                   name='CPU (NumPy)', line=dict(color='#00d4ff', width=2),
                   marker=dict(size=8))
    )
    fig.update_layout(
        template=dark_template,
        title='Kalman Filter CPU Performance',
        xaxis_title='Batch Size',
        xaxis_type='log',
        yaxis_title='Time (ms)',
        yaxis_type='log',
        height=400
    )
    fig.show()

Data type cannot be displayed: application/vnd.plotly.v1+json

4. GPU-Accelerated Particle Filter

Particle filters involve operations on large numbers of particles that parallelize well:

  1. Particle propagation: Apply dynamics to each particle

  2. Weight computation: Evaluate likelihood for each particle

  3. Resampling: Select particles based on weights

[25]:
def particle_filter_step(particles, weights, z, f_dynamics, h_meas, Q_std, R_std, xp=np):
    """
    GPU-accelerated particle filter step.

    Parameters
    ----------
    particles : array_like
        Particle states [n_particles, state_dim].
    weights : array_like
        Particle weights [n_particles].
    z : array_like
        Measurement [meas_dim].
    f_dynamics : callable
        Dynamics function f(x) -> x_next.
    h_meas : callable
        Measurement function h(x) -> z_pred.
    Q_std : float
        Process noise standard deviation.
    R_std : float
        Measurement noise standard deviation.
    xp : module
        Array module.

    Returns
    -------
    particles : array
        Updated particles.
    weights : array
        Updated weights.
    """
    n_particles = particles.shape[0]
    state_dim = particles.shape[1]

    # Prediction: propagate particles through dynamics + noise
    particles_pred = f_dynamics(particles, xp)
    particles_pred += Q_std * xp.random.randn(n_particles, state_dim)

    # Update: compute likelihood weights
    z_pred = h_meas(particles_pred, xp)
    innovation = z - z_pred

    # Gaussian likelihood (vectorized)
    log_likelihood = -0.5 * xp.sum(innovation**2, axis=1) / R_std**2

    # Update weights
    log_weights = xp.log(weights + 1e-300) + log_likelihood
    log_weights = log_weights - xp.max(log_weights)  # Numerical stability
    weights = xp.exp(log_weights)
    weights = weights / xp.sum(weights)  # Normalize

    # Compute effective sample size
    ess = 1.0 / xp.sum(weights**2)

    # Resample if ESS is too low
    if float(ess) < n_particles / 2:
        # Systematic resampling
        indices = systematic_resample(weights, xp)
        particles_pred = particles_pred[indices]
        weights = xp.ones(n_particles) / n_particles

    return particles_pred, weights, float(ess)


def systematic_resample(weights, xp=np):
    """
    Systematic resampling for particle filters.

    Parameters
    ----------
    weights : array_like
        Normalized weights [n_particles].
    xp : module
        Array module.

    Returns
    -------
    indices : array
        Resampled indices.
    """
    n = len(weights)

    # Generate systematic points
    positions = (xp.arange(n) + xp.random.rand()) / n

    # Cumulative sum of weights
    cumsum = xp.cumsum(weights)

    # Find indices using searchsorted
    indices = xp.searchsorted(cumsum, positions)
    indices = xp.clip(indices, 0, n - 1)

    return indices


print("Particle filter functions defined.")
Particle filter functions defined.
[26]:
# Define nonlinear dynamics and measurement
def f_nonlinear(x, xp=np):
    """Nonlinear dynamics: x_next = x/2 + 25*x/(1+x^2) + 8*cos(1.2*k)"""
    return x / 2 + 25 * x / (1 + x**2 + 1e-10)

def h_nonlinear(x, xp=np):
    """Nonlinear measurement: z = x^2 / 20"""
    return x**2 / 20

# Parameters
n_particles_list = [1000, 5000, 10000, 50000]
if not GPU_AVAILABLE:
    n_particles_list = [1000, 5000, 10000]

state_dim_pf = 1
Q_std = 3.0
R_std = 1.0

n_steps = 50
n_trials = 5

cpu_pf_times = []
gpu_pf_times = []

print("Benchmarking particle filter...")

for n_particles in n_particles_list:
    # Generate true trajectory and measurements
    np.random.seed(42)
    true_state = [0.0]
    measurements = []
    for k in range(n_steps):
        x_new = f_nonlinear(np.array([[true_state[-1]]]))[0, 0] + Q_std * np.random.randn()
        true_state.append(x_new)
        z = h_nonlinear(np.array([[x_new]]))[0, 0] + R_std * np.random.randn()
        measurements.append(z)

    # CPU timing
    cpu_times_trial = []
    for _ in range(n_trials):
        particles = np.random.randn(n_particles, state_dim_pf) * 5
        weights = np.ones(n_particles) / n_particles

        start = time.time()
        for z in measurements:
            z_arr = np.array([[z]])
            particles, weights, ess = particle_filter_step(
                particles, weights, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=np
            )
        cpu_times_trial.append(time.time() - start)
    cpu_pf_times.append(np.mean(cpu_times_trial))

    if GPU_AVAILABLE:
        # GPU timing
        gpu_times_trial = []
        for _ in range(n_trials):
            particles_gpu = cp.random.randn(n_particles, state_dim_pf) * 5
            weights_gpu = cp.ones(n_particles) / n_particles

            # Warm up
            z_arr = cp.array([[measurements[0]]])
            _, _, _ = particle_filter_step(
                particles_gpu, weights_gpu, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=cp
            )
            cp.cuda.Stream.null.synchronize()

            particles_gpu = cp.random.randn(n_particles, state_dim_pf) * 5
            weights_gpu = cp.ones(n_particles) / n_particles

            start = time.time()
            for z in measurements:
                z_arr = cp.array([[z]])
                particles_gpu, weights_gpu, ess = particle_filter_step(
                    particles_gpu, weights_gpu, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=cp
                )
            cp.cuda.Stream.null.synchronize()
            gpu_times_trial.append(time.time() - start)
        gpu_pf_times.append(np.mean(gpu_times_trial))

        speedup = cpu_pf_times[-1] / gpu_pf_times[-1]
        print(f"N={n_particles:5d}: CPU={cpu_pf_times[-1]*1e3:7.1f}ms, "
              f"GPU={gpu_pf_times[-1]*1e3:7.1f}ms, Speedup={speedup:.1f}x")
    else:
        print(f"N={n_particles:5d}: CPU={cpu_pf_times[-1]*1e3:7.1f}ms")
Benchmarking particle filter...
N= 1000: CPU=    2.7ms
N= 5000: CPU=    9.8ms
N=10000: CPU=   18.5ms
[27]:
# Visualize particle filter performance
if GPU_AVAILABLE and len(gpu_pf_times) > 0:
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=[f'Particle Filter ({n_steps} steps)', 'GPU Speedup for Particle Filter'])

    fig.add_trace(
        go.Scatter(x=n_particles_list, y=np.array(cpu_pf_times)*1e3, mode='lines+markers',
                   name='CPU', line=dict(color='#00d4ff', width=2),
                   marker=dict(size=8)),
        row=1, col=1
    )
    fig.add_trace(
        go.Scatter(x=n_particles_list, y=np.array(gpu_pf_times)*1e3, mode='lines+markers',
                   name='GPU', line=dict(color='#ff4757', width=2),
                   marker=dict(size=8, symbol='square')),
        row=1, col=1
    )

    speedups = np.array(cpu_pf_times) / np.array(gpu_pf_times)
    fig.add_trace(
        go.Scatter(x=n_particles_list, y=speedups, mode='lines+markers',
                   name='Speedup', line=dict(color='#00ff88', width=2),
                   marker=dict(size=10, symbol='triangle-up')),
        row=1, col=2
    )
    fig.add_trace(
        go.Scatter(x=[n_particles_list[0], n_particles_list[-1]], y=[1, 1], mode='lines',
                   line=dict(color='white', dash='dash', width=1),
                   showlegend=False),
        row=1, col=2
    )

    fig.update_xaxes(type='log', title_text='Number of Particles', row=1, col=1)
    fig.update_yaxes(type='log', title_text='Time (ms)', row=1, col=1)
    fig.update_xaxes(type='log', title_text='Number of Particles', row=1, col=2)
    fig.update_yaxes(title_text='Speedup Factor', row=1, col=2)

    fig.update_layout(template=dark_template, height=400)
    fig.show()

5. Memory Management Best Practices

Efficient GPU memory management is crucial for large-scale tracking applications.

[28]:
if GPU_AVAILABLE:
    print("GPU Memory Management Best Practices")
    print("=" * 50)

    # 1. Memory pools
    print("\n1. Memory Pools")
    mempool = cp.get_default_memory_pool()
    pinned_mempool = cp.get_default_pinned_memory_pool()

    print(f"   GPU memory pool used: {mempool.used_bytes() / 1e6:.2f} MB")
    print(f"   Pinned memory pool used: {pinned_mempool.n_free_blocks()} free blocks")

    # 2. Pre-allocation
    print("\n2. Pre-allocation for Repeated Operations")

    n = 1000

    # Bad: Allocate new arrays each iteration
    start = time.time()
    for _ in range(1000):
        temp = cp.zeros((n, n))
        result = cp.dot(temp, temp)
    cp.cuda.Stream.null.synchronize()
    time_alloc = time.time() - start

    # Good: Reuse pre-allocated arrays
    temp = cp.zeros((n, n))
    result = cp.zeros((n, n))
    start = time.time()
    for _ in range(1000):
        cp.dot(temp, temp, out=result)
    cp.cuda.Stream.null.synchronize()
    time_reuse = time.time() - start

    print(f"   With allocation each time: {time_alloc*1e3:.1f} ms")
    print(f"   With pre-allocated arrays: {time_reuse*1e3:.1f} ms")
    print(f"   Speedup: {time_alloc/time_reuse:.1f}x")

    # 3. Async transfers
    print("\n3. Asynchronous Data Transfers")

    data_cpu = np.random.randn(10000, 1000)

    # Synchronous
    start = time.time()
    data_gpu = cp.asarray(data_cpu)
    cp.cuda.Stream.null.synchronize()
    time_sync = time.time() - start

    # With pinned memory (faster transfers)
    data_pinned = cp.cuda.alloc_pinned_memory(data_cpu.nbytes)
    data_view = np.frombuffer(data_pinned, dtype=data_cpu.dtype).reshape(data_cpu.shape)
    np.copyto(data_view, data_cpu)

    start = time.time()
    data_gpu2 = cp.asarray(data_view)
    cp.cuda.Stream.null.synchronize()
    time_pinned = time.time() - start

    print(f"   Standard transfer: {time_sync*1e3:.2f} ms")
    print(f"   Pinned memory transfer: {time_pinned*1e3:.2f} ms")

    # Clean up
    del data_gpu, data_gpu2, temp, result
    mempool.free_all_blocks()

6. Integration with pyTCL

Here’s how to integrate GPU acceleration with existing pyTCL code.

[29]:
# Import from pyTCL
from pytcl.dynamic_estimation.kalman import kf_predict, kf_update

def gpu_accelerated_tracking(measurements, F, H, Q, R, x0, P0, use_gpu=True):
    """
    Run Kalman filter tracking with optional GPU acceleration.

    Parameters
    ----------
    measurements : array_like
        List of measurements [n_timesteps, meas_dim].
    F, H, Q, R : array_like
        Kalman filter matrices.
    x0, P0 : array_like
        Initial state and covariance.
    use_gpu : bool
        Whether to use GPU acceleration.

    Returns
    -------
    estimates : array
        State estimates [n_timesteps, state_dim].
    covariances : array
        Covariance estimates [n_timesteps, state_dim, state_dim].
    """
    if use_gpu and GPU_AVAILABLE:
        xp = cp
        F = cp.asarray(F)
        H = cp.asarray(H)
        Q = cp.asarray(Q)
        R = cp.asarray(R)
        x = cp.asarray(x0)
        P = cp.asarray(P0)
        measurements = cp.asarray(measurements)
    else:
        xp = np
        x = x0.copy()
        P = P0.copy()

    estimates = []
    covariances = []

    for z in measurements:
        # Predict
        x, P = kalman_predict(x, P, F, Q, xp)

        # Update
        z = z.reshape(-1, 1)
        x, P = kalman_update(x, P, z, H, R, xp)

        estimates.append(x.flatten())
        covariances.append(P.copy())

    if use_gpu and GPU_AVAILABLE:
        estimates = cp.asnumpy(cp.array(estimates))
        covariances = cp.asnumpy(cp.array(covariances))
    else:
        estimates = np.array([xp.asnumpy(e) if hasattr(xp, 'asnumpy') else e for e in estimates])
        covariances = np.array([xp.asnumpy(c) if hasattr(xp, 'asnumpy') else c for c in covariances])

    return np.array(estimates), np.array(covariances)

print("GPU-accelerated tracking function defined.")
GPU-accelerated tracking function defined.
[30]:
# Run tracking example
# Generate trajectory
n_timesteps = 100
true_state = []
measurements_list = []

x_true = np.array([0, 1, 0, 0.5])  # Start at origin, moving NE
for t in range(n_timesteps):
    true_state.append(x_true.copy())

    # Propagate
    x_true = F @ x_true + np.random.multivariate_normal(np.zeros(4), Q * 0.1)

    # Generate noisy measurement
    z = H @ x_true + np.random.multivariate_normal(np.zeros(2), R)
    measurements_list.append(z)

true_state = np.array(true_state)
measurements_arr = np.array(measurements_list)

# Run tracking
x0 = np.array([0, 0, 0, 0]).reshape(-1, 1)
P0 = 100 * np.eye(4)

estimates_cpu, _ = gpu_accelerated_tracking(
    measurements_arr, F, H, Q, R, x0, P0, use_gpu=False
)

if GPU_AVAILABLE:
    estimates_gpu, _ = gpu_accelerated_tracking(
        measurements_arr, F, H, Q, R, x0, P0, use_gpu=True
    )

print(f"Tracking complete: {n_timesteps} time steps")
Tracking complete: 100 time steps
[31]:
# Visualize results
fig = make_subplots(rows=1, cols=2, subplot_titles=['Tracking Results', 'Tracking Error'])

# Trajectory plot
fig.add_trace(
    go.Scatter(x=true_state[:, 0], y=true_state[:, 2], mode='lines',
               name='True', line=dict(color='#00ff88', width=2)),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=measurements_arr[:, 0], y=measurements_arr[:, 1], mode='markers',
               name='Measurements', marker=dict(color='gray', size=4, opacity=0.5)),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=estimates_cpu[:, 0], y=estimates_cpu[:, 2], mode='lines',
               name='CPU Estimate', line=dict(color='#00d4ff', width=1.5, dash='dash')),
    row=1, col=1
)
if GPU_AVAILABLE:
    fig.add_trace(
        go.Scatter(x=estimates_gpu[:, 0], y=estimates_gpu[:, 2], mode='lines',
                   name='GPU Estimate', line=dict(color='#ff4757', width=1.5, dash='dot')),
        row=1, col=1
    )

# Error plot
error_cpu = np.sqrt((estimates_cpu[:, 0] - true_state[:, 0])**2 +
                    (estimates_cpu[:, 2] - true_state[:, 2])**2)
fig.add_trace(
    go.Scatter(y=error_cpu, mode='lines', name='CPU Error',
               line=dict(color='#00d4ff', width=1.5)),
    row=1, col=2
)
if GPU_AVAILABLE:
    error_gpu = np.sqrt((estimates_gpu[:, 0] - true_state[:, 0])**2 +
                        (estimates_gpu[:, 2] - true_state[:, 2])**2)
    fig.add_trace(
        go.Scatter(y=error_gpu, mode='lines', name='GPU Error',
                   line=dict(color='#ff4757', width=1.5, dash='dash')),
        row=1, col=2
    )

fig.update_xaxes(title_text='X Position', row=1, col=1)
fig.update_yaxes(title_text='Y Position', row=1, col=1, scaleanchor='x', scaleratio=1)
fig.update_xaxes(title_text='Time Step', row=1, col=2)
fig.update_yaxes(title_text='Position Error', row=1, col=2)

fig.update_layout(template=dark_template, height=450)
fig.show()

print(f"CPU RMSE: {np.sqrt(np.mean(error_cpu**2)):.4f}")
if GPU_AVAILABLE:
    print(f"GPU RMSE: {np.sqrt(np.mean(error_gpu**2)):.4f}")
    print(f"Max difference: {np.max(np.abs(estimates_cpu - estimates_gpu)):.2e}")

Data type cannot be displayed: application/vnd.plotly.v1+json

CPU RMSE: 2.9987

Summary

Key takeaways for GPU acceleration:

  1. CuPy provides NumPy compatibility - Easy migration with minimal code changes

  2. Batch processing is key - GPU shines when processing many tracks/particles simultaneously

  3. Memory management matters - Pre-allocate, use pinned memory, free unused memory

  4. Transfer overhead - Minimize CPU↔GPU transfers for best performance

  5. Problem size determines speedup - Small problems may not benefit from GPU

Performance Guidelines

Scenario

Recommendation

< 100 tracks

CPU (transfer overhead dominates)

100-1000 tracks

GPU beneficial for complex filters

> 1000 tracks

GPU strongly recommended

< 1000 particles

CPU usually faster

> 10000 particles

GPU provides significant speedup

Exercises

  1. Implement GPU-accelerated UKF with sigma point generation on GPU

  2. Add GPU memory pooling for repeated filter operations

  3. Benchmark different matrix sizes to find the CPU/GPU crossover point

  4. Implement GPU-accelerated JPDA likelihood computation

References

  1. CuPy Documentation: https://docs.cupy.dev/

  2. CUDA Programming Guide: https://docs.nvidia.com/cuda/

  3. Gustafsson, F. (2010). Particle Methods for Tracking.