GPU Acceleration: CuPy Integration for Tracking

This notebook demonstrates GPU acceleration techniques for tracking algorithms using CuPy. We cover:

CuPy Basics - GPU array operations and NumPy compatibility
Accelerating Matrix Operations - Core operations for Kalman filters
Batch Processing - Processing multiple tracks in parallel
Particle Filter Acceleration - GPU-based resampling and weight computation
Performance Comparison - Benchmarking CPU vs GPU

Prerequisites

# GPU-enabled version
pip install cupy-cuda12x  # For CUDA 12.x
# or pip install cupy-cuda11x for CUDA 11.x

pip install nrl-tracker plotly numpy

Note: This notebook requires an NVIDIA GPU with CUDA support. If running without a GPU, code examples will demonstrate concepts with NumPy fallbacks.

[16]:

import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import time

# Check for GPU availability
try:
    import cupy as cp
    GPU_AVAILABLE = True
    print(f"CuPy version: {cp.__version__}")
    print(f"CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
    print(f"GPU: {cp.cuda.Device().name}")
    print(f"GPU memory: {cp.cuda.Device().mem_info[1] / 1e9:.1f} GB")
except ImportError:
    GPU_AVAILABLE = False
    print("CuPy not available. Running in CPU-only mode.")
    print("Install with: pip install cupy-cuda12x")

# Use appropriate array module
xp = cp if GPU_AVAILABLE else np

np.random.seed(42)

# Plotly dark theme template
dark_template = go.layout.Template()
dark_template.layout = go.Layout(
    paper_bgcolor='#0d1117',
    plot_bgcolor='#0d1117',
    font=dict(color='#e6edf3'),
    xaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),
    yaxis=dict(gridcolor='#30363d', zerolinecolor='#30363d'),
)

CuPy not available. Running in CPU-only mode.
Install with: pip install cupy-cuda12x

1. CuPy Basics

CuPy provides a NumPy-compatible interface for GPU computing. Arrays and operations mirror NumPy exactly, making migration straightforward.

Key Concepts

NumPy	CuPy	Location
`numpy.array()`	`cupy.array()`	GPU
`numpy.zeros()`	`cupy.zeros()`	GPU
`numpy.linalg.inv()`	`cupy.linalg.inv()`	GPU
Transfer to GPU	`cupy.asarray(np_array)`	CPU→GPU
Transfer to CPU	`cupy.asnumpy(cp_array)`	GPU→CPU

[17]:

# Basic array operations
if GPU_AVAILABLE:
    # Create arrays on GPU
    a_gpu = cp.array([1, 2, 3, 4, 5])
    b_gpu = cp.array([5, 4, 3, 2, 1])

    # Operations happen on GPU
    c_gpu = a_gpu + b_gpu
    d_gpu = cp.dot(a_gpu, b_gpu)

    print(f"a + b = {c_gpu}")
    print(f"a · b = {d_gpu}")
    print(f"Array type: {type(c_gpu)}")

    # Transfer back to CPU when needed
    c_cpu = cp.asnumpy(c_gpu)
    print(f"Transferred type: {type(c_cpu)}")
else:
    # NumPy fallback
    a_cpu = np.array([1, 2, 3, 4, 5])
    b_cpu = np.array([5, 4, 3, 2, 1])
    print(f"a + b = {a_cpu + b_cpu}")
    print(f"a · b = {np.dot(a_cpu, b_cpu)}")

a + b = [6 6 6 6 6]
a · b = 35

[18]:

# Memory management
if GPU_AVAILABLE:
    # Check memory before
    mempool = cp.get_default_memory_pool()
    print(f"Memory used before: {mempool.used_bytes() / 1e6:.2f} MB")

    # Create large array
    large_array = cp.random.randn(10000, 10000)
    print(f"Memory after 10000x10000 float64: {mempool.used_bytes() / 1e6:.2f} MB")

    # Free memory explicitly
    del large_array
    mempool.free_all_blocks()
    print(f"Memory after free: {mempool.used_bytes() / 1e6:.2f} MB")

2. Accelerating Matrix Operations

Kalman filters involve repeated matrix operations that benefit from GPU parallelization:

Matrix multiplication: \(P \cdot H^T\)
Matrix inversion: \((HPH^T + R)^{-1}\)
Cholesky decomposition: \(P = LL^T\)

Let’s implement core Kalman filter operations for both CPU and GPU.

[19]:

def kalman_predict(x, P, F, Q, xp=np):
    """
    Kalman filter prediction step.

    Parameters
    ----------
    x : array_like
        State estimate [n, 1].
    P : array_like
        State covariance [n, n].
    F : array_like
        State transition matrix [n, n].
    Q : array_like
        Process noise covariance [n, n].
    xp : module
        Array module (numpy or cupy).

    Returns
    -------
    x_pred : array
        Predicted state.
    P_pred : array
        Predicted covariance.
    """
    x_pred = xp.dot(F, x)
    P_pred = xp.dot(xp.dot(F, P), F.T) + Q
    return x_pred, P_pred


def kalman_update(x, P, z, H, R, xp=np):
    """
    Kalman filter update step.

    Parameters
    ----------
    x : array_like
        Predicted state [n, 1].
    P : array_like
        Predicted covariance [n, n].
    z : array_like
        Measurement [m, 1].
    H : array_like
        Measurement matrix [m, n].
    R : array_like
        Measurement noise covariance [m, m].
    xp : module
        Array module (numpy or cupy).

    Returns
    -------
    x_upd : array
        Updated state.
    P_upd : array
        Updated covariance.
    """
    # Innovation
    y = z - xp.dot(H, x)

    # Innovation covariance
    S = xp.dot(xp.dot(H, P), H.T) + R

    # Kalman gain
    K = xp.dot(xp.dot(P, H.T), xp.linalg.inv(S))

    # State update
    x_upd = x + xp.dot(K, y)

    # Covariance update (Joseph form for numerical stability)
    I = xp.eye(P.shape[0])
    IKH = I - xp.dot(K, H)
    P_upd = xp.dot(xp.dot(IKH, P), IKH.T) + xp.dot(xp.dot(K, R), K.T)

    return x_upd, P_upd


print("Kalman filter functions defined.")

Kalman filter functions defined.

[20]:

# Test Kalman filter on CPU and GPU
state_dim = 4  # [x, vx, y, vy]
meas_dim = 2   # [x, y]

dt = 1.0

# State transition (constant velocity)
F = np.array([
    [1, dt, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, dt],
    [0, 0, 0, 1]
])

# Measurement matrix (observe position only)
H = np.array([
    [1, 0, 0, 0],
    [0, 0, 1, 0]
])

# Process noise
q = 0.1
Q = q * np.array([
    [dt**3/3, dt**2/2, 0, 0],
    [dt**2/2, dt, 0, 0],
    [0, 0, dt**3/3, dt**2/2],
    [0, 0, dt**2/2, dt]
])

# Measurement noise
R = 10 * np.eye(meas_dim)

# Initial state and covariance
x = np.array([0, 1, 0, 1]).reshape(-1, 1)
P = 100 * np.eye(state_dim)

# Measurement
z = np.array([5, 5]).reshape(-1, 1)

print("System matrices defined.")
print(f"State dimension: {state_dim}")
print(f"Measurement dimension: {meas_dim}")

System matrices defined.
State dimension: 4
Measurement dimension: 2

[21]:

# Run on CPU
x_pred_cpu, P_pred_cpu = kalman_predict(x, P, F, Q, xp=np)
x_upd_cpu, P_upd_cpu = kalman_update(x_pred_cpu, P_pred_cpu, z, H, R, xp=np)

print("CPU Results:")
print(f"Predicted state: {x_pred_cpu.flatten()}")
print(f"Updated state: {x_upd_cpu.flatten()}")

if GPU_AVAILABLE:
    # Transfer to GPU
    x_gpu = cp.asarray(x)
    P_gpu = cp.asarray(P)
    F_gpu = cp.asarray(F)
    Q_gpu = cp.asarray(Q)
    H_gpu = cp.asarray(H)
    R_gpu = cp.asarray(R)
    z_gpu = cp.asarray(z)

    # Run on GPU
    x_pred_gpu, P_pred_gpu = kalman_predict(x_gpu, P_gpu, F_gpu, Q_gpu, xp=cp)
    x_upd_gpu, P_upd_gpu = kalman_update(x_pred_gpu, P_pred_gpu, z_gpu, H_gpu, R_gpu, xp=cp)

    print("\nGPU Results:")
    print(f"Predicted state: {cp.asnumpy(x_pred_gpu).flatten()}")
    print(f"Updated state: {cp.asnumpy(x_upd_gpu).flatten()}")

    # Verify results match
    error = np.max(np.abs(cp.asnumpy(x_upd_gpu) - x_upd_cpu))
    print(f"\nMax difference CPU vs GPU: {error:.2e}")

CPU Results:
Predicted state: [1. 1. 1. 1.]
Updated state: [4.80955404 2.90541184 4.80955404 2.90541184]

3. Batch Processing Multiple Tracks

The real power of GPU acceleration comes from processing many tracks simultaneously. Instead of looping over tracks, we can stack them into batched arrays.

[22]:

def batch_kalman_predict(x_batch, P_batch, F, Q, xp=np):
    """
    Batch Kalman filter prediction.

    Parameters
    ----------
    x_batch : array_like
        Batch of states [batch_size, n, 1].
    P_batch : array_like
        Batch of covariances [batch_size, n, n].
    F : array_like
        State transition matrix [n, n] (shared).
    Q : array_like
        Process noise [n, n] (shared).
    xp : module
        Array module.

    Returns
    -------
    x_pred : array
        Predicted states [batch_size, n, 1].
    P_pred : array
        Predicted covariances [batch_size, n, n].
    """
    batch_size = x_batch.shape[0]

    # Vectorized prediction
    # x_pred = F @ x for each batch element
    x_pred = xp.einsum('ij,bjk->bik', F, x_batch)

    # P_pred = F @ P @ F.T + Q for each batch element
    FP = xp.einsum('ij,bjk->bik', F, P_batch)
    FPFt = xp.einsum('bij,kj->bik', FP, F)
    P_pred = FPFt + Q

    return x_pred, P_pred


def batch_kalman_update(x_batch, P_batch, z_batch, H, R, xp=np):
    """
    Batch Kalman filter update.

    Parameters
    ----------
    x_batch : array_like
        Batch of predicted states [batch_size, n, 1].
    P_batch : array_like
        Batch of predicted covariances [batch_size, n, n].
    z_batch : array_like
        Batch of measurements [batch_size, m, 1].
    H : array_like
        Measurement matrix [m, n] (shared).
    R : array_like
        Measurement noise [m, m] (shared).
    xp : module
        Array module.

    Returns
    -------
    x_upd : array
        Updated states [batch_size, n, 1].
    P_upd : array
        Updated covariances [batch_size, n, n].
    """
    batch_size = x_batch.shape[0]
    n = x_batch.shape[1]

    # Innovation: y = z - H @ x
    Hx = xp.einsum('ij,bjk->bik', H, x_batch)
    y = z_batch - Hx

    # Innovation covariance: S = H @ P @ H.T + R
    HP = xp.einsum('ij,bjk->bik', H, P_batch)
    HPHt = xp.einsum('bij,kj->bik', HP, H)
    S = HPHt + R

    # Kalman gain: K = P @ H.T @ S^{-1}
    PHt = xp.einsum('bij,kj->bik', P_batch, H)
    S_inv = xp.linalg.inv(S)
    K = xp.einsum('bij,bjk->bik', PHt, S_inv)

    # State update: x_upd = x + K @ y
    Ky = xp.einsum('bij,bjk->bik', K, y)
    x_upd = x_batch + Ky

    # Covariance update (simplified form)
    I = xp.eye(n)
    KH = xp.einsum('bij,jk->bik', K, H)
    IKH = I - KH
    P_upd = xp.einsum('bij,bjk->bik', IKH, P_batch)

    return x_upd, P_upd


print("Batch Kalman functions defined.")

Batch Kalman functions defined.

[23]:

# Benchmark batch processing
batch_sizes = [10, 100, 1000, 5000]

if not GPU_AVAILABLE:
    batch_sizes = [10, 100, 1000]  # Smaller sizes for CPU-only

cpu_times = []
gpu_times = []

for batch_size in batch_sizes:
    # Create batch data
    x_batch_np = np.random.randn(batch_size, state_dim, 1)
    P_batch_np = np.tile(P.reshape(1, state_dim, state_dim), (batch_size, 1, 1))
    z_batch_np = np.random.randn(batch_size, meas_dim, 1)

    # CPU timing
    n_iterations = 100
    start = time.time()
    for _ in range(n_iterations):
        x_pred, P_pred = batch_kalman_predict(x_batch_np, P_batch_np, F, Q, xp=np)
        x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_np, H, R, xp=np)
    cpu_time = (time.time() - start) / n_iterations
    cpu_times.append(cpu_time)

    if GPU_AVAILABLE:
        # Transfer to GPU
        x_batch_gpu = cp.asarray(x_batch_np)
        P_batch_gpu = cp.asarray(P_batch_np)
        z_batch_gpu = cp.asarray(z_batch_np)

        # Warm up
        x_pred, P_pred = batch_kalman_predict(x_batch_gpu, P_batch_gpu, F_gpu, Q_gpu, xp=cp)
        x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_gpu, H_gpu, R_gpu, xp=cp)
        cp.cuda.Stream.null.synchronize()

        # GPU timing
        start = time.time()
        for _ in range(n_iterations):
            x_pred, P_pred = batch_kalman_predict(x_batch_gpu, P_batch_gpu, F_gpu, Q_gpu, xp=cp)
            x_upd, P_upd = batch_kalman_update(x_pred, P_pred, z_batch_gpu, H_gpu, R_gpu, xp=cp)
        cp.cuda.Stream.null.synchronize()
        gpu_time = (time.time() - start) / n_iterations
        gpu_times.append(gpu_time)

        speedup = cpu_time / gpu_time
        print(f"Batch size {batch_size:5d}: CPU={cpu_time*1e3:7.3f}ms, "
              f"GPU={gpu_time*1e3:7.3f}ms, Speedup={speedup:.1f}x")
    else:
        print(f"Batch size {batch_size:5d}: CPU={cpu_time*1e3:7.3f}ms")

Batch size    10: CPU=  0.043ms
Batch size   100: CPU=  0.146ms
Batch size  1000: CPU=  1.205ms

[24]:

# Visualize performance
if GPU_AVAILABLE and len(gpu_times) > 0:
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=['Kalman Filter Batch Performance', 'GPU Speedup Factor'])

    fig.add_trace(
        go.Scatter(x=batch_sizes, y=np.array(cpu_times)*1e3, mode='lines+markers',
                   name='CPU (NumPy)', line=dict(color='#00d4ff', width=2),
                   marker=dict(size=8)),
        row=1, col=1
    )
    fig.add_trace(
        go.Scatter(x=batch_sizes, y=np.array(gpu_times)*1e3, mode='lines+markers',
                   name='GPU (CuPy)', line=dict(color='#ff4757', width=2),
                   marker=dict(size=8, symbol='square')),
        row=1, col=1
    )

    speedups = np.array(cpu_times) / np.array(gpu_times)
    fig.add_trace(
        go.Scatter(x=batch_sizes, y=speedups, mode='lines+markers',
                   name='Speedup', line=dict(color='#00ff88', width=2),
                   marker=dict(size=10, symbol='triangle-up')),
        row=1, col=2
    )
    fig.add_trace(
        go.Scatter(x=[batch_sizes[0], batch_sizes[-1]], y=[1, 1], mode='lines',
                   line=dict(color='white', dash='dash', width=1),
                   showlegend=False),
        row=1, col=2
    )

    fig.update_xaxes(type='log', title_text='Batch Size (number of tracks)', row=1, col=1)
    fig.update_yaxes(type='log', title_text='Time per iteration (ms)', row=1, col=1)
    fig.update_xaxes(type='log', title_text='Batch Size (number of tracks)', row=1, col=2)
    fig.update_yaxes(title_text='Speedup (CPU time / GPU time)', row=1, col=2)

    fig.update_layout(template=dark_template, height=400)
    fig.show()
else:
    fig = go.Figure()
    fig.add_trace(
        go.Scatter(x=batch_sizes, y=np.array(cpu_times)*1e3, mode='lines+markers',
                   name='CPU (NumPy)', line=dict(color='#00d4ff', width=2),
                   marker=dict(size=8))
    )
    fig.update_layout(
        template=dark_template,
        title='Kalman Filter CPU Performance',
        xaxis_title='Batch Size',
        xaxis_type='log',
        yaxis_title='Time (ms)',
        yaxis_type='log',
        height=400
    )
    fig.show()

Data type cannot be displayed: application/vnd.plotly.v1+json

4. GPU-Accelerated Particle Filter

Particle filters involve operations on large numbers of particles that parallelize well:

Particle propagation: Apply dynamics to each particle
Weight computation: Evaluate likelihood for each particle
Resampling: Select particles based on weights

[25]:

def particle_filter_step(particles, weights, z, f_dynamics, h_meas, Q_std, R_std, xp=np):
    """
    GPU-accelerated particle filter step.

    Parameters
    ----------
    particles : array_like
        Particle states [n_particles, state_dim].
    weights : array_like
        Particle weights [n_particles].
    z : array_like
        Measurement [meas_dim].
    f_dynamics : callable
        Dynamics function f(x) -> x_next.
    h_meas : callable
        Measurement function h(x) -> z_pred.
    Q_std : float
        Process noise standard deviation.
    R_std : float
        Measurement noise standard deviation.
    xp : module
        Array module.

    Returns
    -------
    particles : array
        Updated particles.
    weights : array
        Updated weights.
    """
    n_particles = particles.shape[0]
    state_dim = particles.shape[1]

    # Prediction: propagate particles through dynamics + noise
    particles_pred = f_dynamics(particles, xp)
    particles_pred += Q_std * xp.random.randn(n_particles, state_dim)

    # Update: compute likelihood weights
    z_pred = h_meas(particles_pred, xp)
    innovation = z - z_pred

    # Gaussian likelihood (vectorized)
    log_likelihood = -0.5 * xp.sum(innovation**2, axis=1) / R_std**2

    # Update weights
    log_weights = xp.log(weights + 1e-300) + log_likelihood
    log_weights = log_weights - xp.max(log_weights)  # Numerical stability
    weights = xp.exp(log_weights)
    weights = weights / xp.sum(weights)  # Normalize

    # Compute effective sample size
    ess = 1.0 / xp.sum(weights**2)

    # Resample if ESS is too low
    if float(ess) < n_particles / 2:
        # Systematic resampling
        indices = systematic_resample(weights, xp)
        particles_pred = particles_pred[indices]
        weights = xp.ones(n_particles) / n_particles

    return particles_pred, weights, float(ess)


def systematic_resample(weights, xp=np):
    """
    Systematic resampling for particle filters.

    Parameters
    ----------
    weights : array_like
        Normalized weights [n_particles].
    xp : module
        Array module.

    Returns
    -------
    indices : array
        Resampled indices.
    """
    n = len(weights)

    # Generate systematic points
    positions = (xp.arange(n) + xp.random.rand()) / n

    # Cumulative sum of weights
    cumsum = xp.cumsum(weights)

    # Find indices using searchsorted
    indices = xp.searchsorted(cumsum, positions)
    indices = xp.clip(indices, 0, n - 1)

    return indices


print("Particle filter functions defined.")

Particle filter functions defined.

[26]:

# Define nonlinear dynamics and measurement
def f_nonlinear(x, xp=np):
    """Nonlinear dynamics: x_next = x/2 + 25*x/(1+x^2) + 8*cos(1.2*k)"""
    return x / 2 + 25 * x / (1 + x**2 + 1e-10)

def h_nonlinear(x, xp=np):
    """Nonlinear measurement: z = x^2 / 20"""
    return x**2 / 20

# Parameters
n_particles_list = [1000, 5000, 10000, 50000]
if not GPU_AVAILABLE:
    n_particles_list = [1000, 5000, 10000]

state_dim_pf = 1
Q_std = 3.0
R_std = 1.0

n_steps = 50
n_trials = 5

cpu_pf_times = []
gpu_pf_times = []

print("Benchmarking particle filter...")

for n_particles in n_particles_list:
    # Generate true trajectory and measurements
    np.random.seed(42)
    true_state = [0.0]
    measurements = []
    for k in range(n_steps):
        x_new = f_nonlinear(np.array([[true_state[-1]]]))[0, 0] + Q_std * np.random.randn()
        true_state.append(x_new)
        z = h_nonlinear(np.array([[x_new]]))[0, 0] + R_std * np.random.randn()
        measurements.append(z)

    # CPU timing
    cpu_times_trial = []
    for _ in range(n_trials):
        particles = np.random.randn(n_particles, state_dim_pf) * 5
        weights = np.ones(n_particles) / n_particles

        start = time.time()
        for z in measurements:
            z_arr = np.array([[z]])
            particles, weights, ess = particle_filter_step(
                particles, weights, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=np
            )
        cpu_times_trial.append(time.time() - start)
    cpu_pf_times.append(np.mean(cpu_times_trial))

    if GPU_AVAILABLE:
        # GPU timing
        gpu_times_trial = []
        for _ in range(n_trials):
            particles_gpu = cp.random.randn(n_particles, state_dim_pf) * 5
            weights_gpu = cp.ones(n_particles) / n_particles

            # Warm up
            z_arr = cp.array([[measurements[0]]])
            _, _, _ = particle_filter_step(
                particles_gpu, weights_gpu, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=cp
            )
            cp.cuda.Stream.null.synchronize()

            particles_gpu = cp.random.randn(n_particles, state_dim_pf) * 5
            weights_gpu = cp.ones(n_particles) / n_particles

            start = time.time()
            for z in measurements:
                z_arr = cp.array([[z]])
                particles_gpu, weights_gpu, ess = particle_filter_step(
                    particles_gpu, weights_gpu, z_arr, f_nonlinear, h_nonlinear, Q_std, R_std, xp=cp
                )
            cp.cuda.Stream.null.synchronize()
            gpu_times_trial.append(time.time() - start)
        gpu_pf_times.append(np.mean(gpu_times_trial))

        speedup = cpu_pf_times[-1] / gpu_pf_times[-1]
        print(f"N={n_particles:5d}: CPU={cpu_pf_times[-1]*1e3:7.1f}ms, "
              f"GPU={gpu_pf_times[-1]*1e3:7.1f}ms, Speedup={speedup:.1f}x")
    else:
        print(f"N={n_particles:5d}: CPU={cpu_pf_times[-1]*1e3:7.1f}ms")

Benchmarking particle filter...
N= 1000: CPU=    2.7ms
N= 5000: CPU=    9.8ms
N=10000: CPU=   18.5ms

[27]:

# Visualize particle filter performance
if GPU_AVAILABLE and len(gpu_pf_times) > 0:
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=[f'Particle Filter ({n_steps} steps)', 'GPU Speedup for Particle Filter'])

    fig.add_trace(
        go.Scatter(x=n_particles_list, y=np.array(cpu_pf_times)*1e3, mode='lines+markers',
                   name='CPU', line=dict(color='#00d4ff', width=2),
                   marker=dict(size=8)),
        row=1, col=1
    )
    fig.add_trace(
        go.Scatter(x=n_particles_list, y=np.array(gpu_pf_times)*1e3, mode='lines+markers',
                   name='GPU', line=dict(color='#ff4757', width=2),
                   marker=dict(size=8, symbol='square')),
        row=1, col=1
    )

    speedups = np.array(cpu_pf_times) / np.array(gpu_pf_times)
    fig.add_trace(
        go.Scatter(x=n_particles_list, y=speedups, mode='lines+markers',
                   name='Speedup', line=dict(color='#00ff88', width=2),
                   marker=dict(size=10, symbol='triangle-up')),
        row=1, col=2
    )
    fig.add_trace(
        go.Scatter(x=[n_particles_list[0], n_particles_list[-1]], y=[1, 1], mode='lines',
                   line=dict(color='white', dash='dash', width=1),
                   showlegend=False),
        row=1, col=2
    )

    fig.update_xaxes(type='log', title_text='Number of Particles', row=1, col=1)
    fig.update_yaxes(type='log', title_text='Time (ms)', row=1, col=1)
    fig.update_xaxes(type='log', title_text='Number of Particles', row=1, col=2)
    fig.update_yaxes(title_text='Speedup Factor', row=1, col=2)

    fig.update_layout(template=dark_template, height=400)
    fig.show()

5. Memory Management Best Practices

Efficient GPU memory management is crucial for large-scale tracking applications.

[28]:

if GPU_AVAILABLE:
    print("GPU Memory Management Best Practices")
    print("=" * 50)

    # 1. Memory pools
    print("\n1. Memory Pools")
    mempool = cp.get_default_memory_pool()
    pinned_mempool = cp.get_default_pinned_memory_pool()

    print(f"   GPU memory pool used: {mempool.used_bytes() / 1e6:.2f} MB")
    print(f"   Pinned memory pool used: {pinned_mempool.n_free_blocks()} free blocks")

    # 2. Pre-allocation
    print("\n2. Pre-allocation for Repeated Operations")

    n = 1000

    # Bad: Allocate new arrays each iteration
    start = time.time()
    for _ in range(1000):
        temp = cp.zeros((n, n))
        result = cp.dot(temp, temp)
    cp.cuda.Stream.null.synchronize()
    time_alloc = time.time() - start

    # Good: Reuse pre-allocated arrays
    temp = cp.zeros((n, n))
    result = cp.zeros((n, n))
    start = time.time()
    for _ in range(1000):
        cp.dot(temp, temp, out=result)
    cp.cuda.Stream.null.synchronize()
    time_reuse = time.time() - start

    print(f"   With allocation each time: {time_alloc*1e3:.1f} ms")
    print(f"   With pre-allocated arrays: {time_reuse*1e3:.1f} ms")
    print(f"   Speedup: {time_alloc/time_reuse:.1f}x")

    # 3. Async transfers
    print("\n3. Asynchronous Data Transfers")

    data_cpu = np.random.randn(10000, 1000)

    # Synchronous
    start = time.time()
    data_gpu = cp.asarray(data_cpu)
    cp.cuda.Stream.null.synchronize()
    time_sync = time.time() - start

    # With pinned memory (faster transfers)
    data_pinned = cp.cuda.alloc_pinned_memory(data_cpu.nbytes)
    data_view = np.frombuffer(data_pinned, dtype=data_cpu.dtype).reshape(data_cpu.shape)
    np.copyto(data_view, data_cpu)

    start = time.time()
    data_gpu2 = cp.asarray(data_view)
    cp.cuda.Stream.null.synchronize()
    time_pinned = time.time() - start

    print(f"   Standard transfer: {time_sync*1e3:.2f} ms")
    print(f"   Pinned memory transfer: {time_pinned*1e3:.2f} ms")

    # Clean up
    del data_gpu, data_gpu2, temp, result
    mempool.free_all_blocks()

6. Integration with pyTCL

Here’s how to integrate GPU acceleration with existing pyTCL code.

[29]:

# Import from pyTCL
from pytcl.dynamic_estimation.kalman import kf_predict, kf_update

def gpu_accelerated_tracking(measurements, F, H, Q, R, x0, P0, use_gpu=True):
    """
    Run Kalman filter tracking with optional GPU acceleration.

    Parameters
    ----------
    measurements : array_like
        List of measurements [n_timesteps, meas_dim].
    F, H, Q, R : array_like
        Kalman filter matrices.
    x0, P0 : array_like
        Initial state and covariance.
    use_gpu : bool
        Whether to use GPU acceleration.

    Returns
    -------
    estimates : array
        State estimates [n_timesteps, state_dim].
    covariances : array
        Covariance estimates [n_timesteps, state_dim, state_dim].
    """
    if use_gpu and GPU_AVAILABLE:
        xp = cp
        F = cp.asarray(F)
        H = cp.asarray(H)
        Q = cp.asarray(Q)
        R = cp.asarray(R)
        x = cp.asarray(x0)
        P = cp.asarray(P0)
        measurements = cp.asarray(measurements)
    else:
        xp = np
        x = x0.copy()
        P = P0.copy()

    estimates = []
    covariances = []

    for z in measurements:
        # Predict
        x, P = kalman_predict(x, P, F, Q, xp)

        # Update
        z = z.reshape(-1, 1)
        x, P = kalman_update(x, P, z, H, R, xp)

        estimates.append(x.flatten())
        covariances.append(P.copy())

    if use_gpu and GPU_AVAILABLE:
        estimates = cp.asnumpy(cp.array(estimates))
        covariances = cp.asnumpy(cp.array(covariances))
    else:
        estimates = np.array([xp.asnumpy(e) if hasattr(xp, 'asnumpy') else e for e in estimates])
        covariances = np.array([xp.asnumpy(c) if hasattr(xp, 'asnumpy') else c for c in covariances])

    return np.array(estimates), np.array(covariances)

print("GPU-accelerated tracking function defined.")

GPU-accelerated tracking function defined.

[30]:

# Run tracking example
# Generate trajectory
n_timesteps = 100
true_state = []
measurements_list = []

x_true = np.array([0, 1, 0, 0.5])  # Start at origin, moving NE
for t in range(n_timesteps):
    true_state.append(x_true.copy())

    # Propagate
    x_true = F @ x_true + np.random.multivariate_normal(np.zeros(4), Q * 0.1)

    # Generate noisy measurement
    z = H @ x_true + np.random.multivariate_normal(np.zeros(2), R)
    measurements_list.append(z)

true_state = np.array(true_state)
measurements_arr = np.array(measurements_list)

# Run tracking
x0 = np.array([0, 0, 0, 0]).reshape(-1, 1)
P0 = 100 * np.eye(4)

estimates_cpu, _ = gpu_accelerated_tracking(
    measurements_arr, F, H, Q, R, x0, P0, use_gpu=False
)

if GPU_AVAILABLE:
    estimates_gpu, _ = gpu_accelerated_tracking(
        measurements_arr, F, H, Q, R, x0, P0, use_gpu=True
    )

print(f"Tracking complete: {n_timesteps} time steps")

Tracking complete: 100 time steps

[31]:

# Visualize results
fig = make_subplots(rows=1, cols=2, subplot_titles=['Tracking Results', 'Tracking Error'])

# Trajectory plot
fig.add_trace(
    go.Scatter(x=true_state[:, 0], y=true_state[:, 2], mode='lines',
               name='True', line=dict(color='#00ff88', width=2)),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=measurements_arr[:, 0], y=measurements_arr[:, 1], mode='markers',
               name='Measurements', marker=dict(color='gray', size=4, opacity=0.5)),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=estimates_cpu[:, 0], y=estimates_cpu[:, 2], mode='lines',
               name='CPU Estimate', line=dict(color='#00d4ff', width=1.5, dash='dash')),
    row=1, col=1
)
if GPU_AVAILABLE:
    fig.add_trace(
        go.Scatter(x=estimates_gpu[:, 0], y=estimates_gpu[:, 2], mode='lines',
                   name='GPU Estimate', line=dict(color='#ff4757', width=1.5, dash='dot')),
        row=1, col=1
    )

# Error plot
error_cpu = np.sqrt((estimates_cpu[:, 0] - true_state[:, 0])**2 +
                    (estimates_cpu[:, 2] - true_state[:, 2])**2)
fig.add_trace(
    go.Scatter(y=error_cpu, mode='lines', name='CPU Error',
               line=dict(color='#00d4ff', width=1.5)),
    row=1, col=2
)
if GPU_AVAILABLE:
    error_gpu = np.sqrt((estimates_gpu[:, 0] - true_state[:, 0])**2 +
                        (estimates_gpu[:, 2] - true_state[:, 2])**2)
    fig.add_trace(
        go.Scatter(y=error_gpu, mode='lines', name='GPU Error',
                   line=dict(color='#ff4757', width=1.5, dash='dash')),
        row=1, col=2
    )

fig.update_xaxes(title_text='X Position', row=1, col=1)
fig.update_yaxes(title_text='Y Position', row=1, col=1, scaleanchor='x', scaleratio=1)
fig.update_xaxes(title_text='Time Step', row=1, col=2)
fig.update_yaxes(title_text='Position Error', row=1, col=2)

fig.update_layout(template=dark_template, height=450)
fig.show()

print(f"CPU RMSE: {np.sqrt(np.mean(error_cpu**2)):.4f}")
if GPU_AVAILABLE:
    print(f"GPU RMSE: {np.sqrt(np.mean(error_gpu**2)):.4f}")
    print(f"Max difference: {np.max(np.abs(estimates_cpu - estimates_gpu)):.2e}")

Data type cannot be displayed: application/vnd.plotly.v1+json

CPU RMSE: 2.9987

Summary

Key takeaways for GPU acceleration:

CuPy provides NumPy compatibility - Easy migration with minimal code changes
Batch processing is key - GPU shines when processing many tracks/particles simultaneously
Memory management matters - Pre-allocate, use pinned memory, free unused memory
Transfer overhead - Minimize CPU↔GPU transfers for best performance
Problem size determines speedup - Small problems may not benefit from GPU

Performance Guidelines

Scenario	Recommendation
< 100 tracks	CPU (transfer overhead dominates)
100-1000 tracks	GPU beneficial for complex filters
> 1000 tracks	GPU strongly recommended
< 1000 particles	CPU usually faster
> 10000 particles	GPU provides significant speedup

Exercises

Implement GPU-accelerated UKF with sigma point generation on GPU
Add GPU memory pooling for repeated filter operations
Benchmark different matrix sizes to find the CPU/GPU crossover point
Implement GPU-accelerated JPDA likelihood computation

References

CuPy Documentation: https://docs.cupy.dev/
CUDA Programming Guide: https://docs.nvidia.com/cuda/
Gustafsson, F. (2010). Particle Methods for Tracking.