GPU Acceleration
================

.. module:: pytcl.gpu

The GPU module provides hardware-accelerated implementations of key tracking
algorithms using CuPy (NVIDIA CUDA) or MLX (Apple Silicon). These implementations
offer significant speedups (5-15x) for batch processing of multiple tracks.

The module automatically selects the best available backend:

- On Apple Silicon (M1/M2/M3): uses MLX if installed
- On systems with NVIDIA GPUs: uses CuPy if installed
- If no GPU backend is available: falls back to the CPU (NumPy)
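
The priority order above can be sketched in plain Python. This is only an
illustration of the rule, not the library's implementation: ``pick_backend``
is a hypothetical name, and the returned strings are assumptions rather than
the documented return values of ``get_backend()``::

    import importlib.util
    import platform


    def pick_backend():
        """Hypothetical helper: return 'mlx', 'cupy', or 'numpy'."""
        on_apple_silicon = (
            platform.system() == "Darwin" and platform.machine() == "arm64"
        )
        # find_spec only checks that the package is importable; it does not
        # verify that a usable GPU device is actually present.
        if on_apple_silicon and importlib.util.find_spec("mlx") is not None:
            return "mlx"
        if importlib.util.find_spec("cupy") is not None:
            return "cupy"
        return "numpy"


    print(pick_backend())

A real check would likely also probe the device itself, since a package can
be installed on a machine without a matching GPU.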

Installation
------------

For NVIDIA CUDA GPUs::

    pip install nrl-tracker[gpu]
    # or directly:
    pip install cupy-cuda12x

For Apple Silicon (M1/M2/M3)::

    pip install nrl-tracker[gpu-apple]
    # or directly:
    pip install mlx

Quick Start
-----------

Check GPU availability and backend::

    from pytcl.gpu import is_gpu_available, get_backend, is_apple_silicon

    if is_gpu_available():
        print(f"GPU available, using {get_backend()} backend")

    if is_apple_silicon():
        print("Running on Apple Silicon")

Transfer arrays between CPU and GPU::

    from pytcl.gpu import to_gpu, to_cpu
    import numpy as np

    # CPU array
    x = np.random.randn(100, 4)

    # Transfer to GPU (uses best available backend)
    x_gpu = to_gpu(x)

    # Transfer back to CPU
    x_cpu = to_cpu(x_gpu)

Platform Detection
------------------

.. autofunction:: pytcl.gpu.utils.is_apple_silicon

.. autofunction:: pytcl.gpu.utils.is_mlx_available

.. autofunction:: pytcl.gpu.utils.is_cupy_available

.. autofunction:: pytcl.gpu.utils.get_backend

.. autofunction:: pytcl.gpu.utils.is_gpu_available

Array Operations
----------------

.. autofunction:: pytcl.gpu.utils.to_gpu

.. autofunction:: pytcl.gpu.utils.to_cpu

.. autofunction:: pytcl.gpu.utils.get_array_module

.. autofunction:: pytcl.gpu.utils.ensure_gpu_array

Memory Management
-----------------

.. autofunction:: pytcl.gpu.utils.sync_gpu

.. autofunction:: pytcl.gpu.utils.get_gpu_memory_info

.. autofunction:: pytcl.gpu.utils.clear_gpu_memory

Batch Kalman Filter
-------------------

GPU-accelerated batch Kalman filter operations for processing multiple tracks
in parallel. These functions provide a 5-10x speedup compared to sequential
CPU processing.
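
To make "batch" concrete, the following NumPy-only sketch applies the standard
Kalman predict and update equations to every track at once. It assumes that
``F``, ``Q``, ``H`` and ``R`` are shared by all tracks, as in the example at
the end of this page. The ``*_reference`` names are hypothetical, and the
functions return plain tuples rather than the result objects of
``batch_kf_predict`` and ``batch_kf_update``::

    import numpy as np


    def batch_predict_reference(x, P, F, Q):
        """x: (n, d), P: (n, d, d); F and Q are (d, d), shared by all tracks."""
        x_pred = x @ F.T              # row i is F @ x[i]
        P_pred = F @ P @ F.T + Q      # matmul broadcasts over the track axis
        return x_pred, P_pred


    def batch_update_reference(x, P, z, H, R):
        """z: (n, m); H is (m, d) and R is (m, m), shared by all tracks."""
        y = z - x @ H.T               # innovations, (n, m)
        S = H @ P @ H.T + R           # innovation covariances, (n, m, m)
        PHt = P @ H.T                 # (n, d, m)
        # Gain K = P H^T S^-1, from S K^T = (P H^T)^T because S is symmetric
        K = np.linalg.solve(S, PHt.transpose(0, 2, 1)).transpose(0, 2, 1)
        x_upd = x + np.einsum("nij,nj->ni", K, y)
        P_upd = P - K @ H @ P
        return x_upd, P_upd

Solving against ``S`` instead of forming its inverse is a common choice for
numerical stability. Whether the GPU functions do the same is not stated here.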

.. autofunction:: pytcl.gpu.kalman.batch_kf_predict

.. autofunction:: pytcl.gpu.kalman.batch_kf_update

.. autoclass:: pytcl.gpu.kalman.CuPyKalmanFilter
   :members:
   :undoc-members:

Batch Extended Kalman Filter
----------------------------

GPU-accelerated Extended Kalman Filter for nonlinear dynamics.
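
Unlike the linear case, an EKF needs one Jacobian per track, so the batched
covariance step multiplies a stack of matrices. The NumPy-only sketch below
illustrates this. The motion model (1D motion with quadratic drag) and all
function names are hypothetical and unrelated to the signatures of
``batch_ekf_predict``::

    import numpy as np


    def f(x, dt=0.1, c=0.05):
        """Hypothetical dynamics; x has shape (n, 2) holding (pos, vel)."""
        pos, vel = x[:, 0], x[:, 1]
        return np.stack(
            [pos + vel * dt, vel - c * vel * np.abs(vel) * dt], axis=1
        )


    def jacobian_f(x, dt=0.1, c=0.05):
        """Analytic Jacobian of f for every track, shape (n, 2, 2)."""
        vel = x[:, 1]
        J = np.zeros((x.shape[0], 2, 2))
        J[:, 0, 0] = 1.0
        J[:, 0, 1] = dt
        J[:, 1, 1] = 1.0 - 2.0 * c * np.abs(vel) * dt   # d(v|v|)/dv = 2|v|
        return J


    def batch_ekf_predict_reference(x, P, Q):
        """Propagate states through f and covariances through its Jacobian."""
        J = jacobian_f(x)
        return f(x), J @ P @ J.transpose(0, 2, 1) + Q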

.. autofunction:: pytcl.gpu.ekf.batch_ekf_predict

.. autofunction:: pytcl.gpu.ekf.batch_ekf_update

.. autoclass:: pytcl.gpu.ekf.CuPyExtendedKalmanFilter
   :members:
   :undoc-members:

Batch Unscented Kalman Filter
-----------------------------

GPU-accelerated Unscented Kalman Filter for highly nonlinear systems.
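
The core of a UKF is sigma-point generation, which batches well because it
reduces to one Cholesky factorization per track. The sketch below uses the
scaled unscented transform in NumPy only. The function name and the default
scaling parameters are illustrative assumptions, and the parameterization used
by ``batch_ukf_predict`` is not described on this page::

    import numpy as np


    def batch_sigma_points(x, P, alpha=0.1, beta=2.0, kappa=0.0):
        """x: (n, d), P: (n, d, d) -> points (n, 2d+1, d), weights wm, wc."""
        n, d = x.shape
        lam = alpha**2 * (d + kappa) - d
        L = np.linalg.cholesky((d + lam) * P)     # batched, (n, d, d)
        offsets = L.transpose(0, 2, 1)            # row i is column i of L
        center = x[:, None, :]
        pts = np.concatenate(
            [center, center + offsets, center - offsets], axis=1
        )
        wm = np.full(2 * d + 1, 1.0 / (2.0 * (d + lam)))   # mean weights
        wc = wm.copy()                                     # covariance weights
        wm[0] = lam / (d + lam)
        wc[0] = lam / (d + lam) + (1.0 - alpha**2 + beta)
        return pts, wm, wc

By construction, the weighted mean and covariance of the points reproduce
``x`` and ``P``, which is a convenient sanity check for any implementation.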

.. autofunction:: pytcl.gpu.ukf.batch_ukf_predict

.. autofunction:: pytcl.gpu.ukf.batch_ukf_update

.. autoclass:: pytcl.gpu.ukf.CuPyUnscentedKalmanFilter
   :members:
   :undoc-members:

GPU Particle Filter
-------------------

GPU-accelerated particle filtering with efficient resampling algorithms.
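
As a CPU reference for what the resampling helpers compute, here is a
NumPy-only sketch of the effective sample size and of systematic resampling.
The function names and signatures are hypothetical and are not those of
``gpu_effective_sample_size`` or ``gpu_resample_systematic``::

    import numpy as np


    def effective_sample_size(weights):
        """ESS = 1 / sum(w_i^2) for normalized weights w."""
        w = weights / weights.sum()
        return 1.0 / np.sum(w**2)


    def systematic_resample(weights, rng):
        """Return particle indices drawn with a single uniform offset."""
        n = weights.size
        w = weights / weights.sum()
        # n evenly spaced positions in [0, 1), shifted by one random offset
        positions = (rng.random() + np.arange(n)) / n
        cumsum = np.cumsum(w)
        cumsum[-1] = 1.0                      # guard against round-off
        idx = np.searchsorted(cumsum, positions, side="right")
        return np.minimum(idx, n - 1)         # guard against position == 1.0


    rng = np.random.default_rng(0)
    weights = rng.random(1000)
    if effective_sample_size(weights) < 0.9 * weights.size:
        indices = systematic_resample(weights, rng)

The resampling threshold (here 90% of the particle count) is an arbitrary
illustrative value; a suitable threshold depends on the application.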

.. autofunction:: pytcl.gpu.particle_filter.gpu_resample_systematic

.. autofunction:: pytcl.gpu.particle_filter.gpu_resample_multinomial

.. autofunction:: pytcl.gpu.particle_filter.gpu_resample_stratified

.. autofunction:: pytcl.gpu.particle_filter.gpu_effective_sample_size

.. autofunction:: pytcl.gpu.particle_filter.gpu_normalize_weights

.. autoclass:: pytcl.gpu.particle_filter.CuPyParticleFilter
   :members:
   :undoc-members:

GPU Matrix Utilities
--------------------

GPU-accelerated matrix operations commonly used in tracking algorithms.
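
Covariance matrices in tracking often become numerically indefinite, which
makes a plain Cholesky factorization fail. One common remedy is to retry with
a growing diagonal "jitter", sketched below in NumPy. This is an assumption
about what a robust factorization might do. The strategy actually used by
``gpu_cholesky_safe`` is not described on this page, and
``cholesky_with_jitter`` is a hypothetical name::

    import numpy as np


    def cholesky_with_jitter(A, initial_jitter=1e-10, max_tries=8):
        """Cholesky factor of A, adding diagonal jitter if A is not PD."""
        A = 0.5 * (A + A.T)                   # enforce exact symmetry
        eye = np.eye(A.shape[0])
        jitter = 0.0
        for _ in range(max_tries):
            try:
                return np.linalg.cholesky(A + jitter * eye)
            except np.linalg.LinAlgError:
                # First failure starts the jitter; later ones grow it tenfold
                jitter = initial_jitter if jitter == 0.0 else jitter * 10.0
        raise np.linalg.LinAlgError(
            "matrix is not positive definite even with jitter"
        )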

.. autofunction:: pytcl.gpu.matrix_utils.gpu_cholesky

.. autofunction:: pytcl.gpu.matrix_utils.gpu_cholesky_safe

.. autofunction:: pytcl.gpu.matrix_utils.gpu_qr

.. autofunction:: pytcl.gpu.matrix_utils.gpu_solve

.. autofunction:: pytcl.gpu.matrix_utils.gpu_inv

.. autofunction:: pytcl.gpu.matrix_utils.gpu_eigh

.. autofunction:: pytcl.gpu.matrix_utils.gpu_matrix_sqrt

.. autoclass:: pytcl.gpu.matrix_utils.MemoryPool
   :members:
   :undoc-members:

Example: Batch Track Processing
-------------------------------

Process multiple tracks in parallel using GPU acceleration::

    import numpy as np
    from pytcl.gpu import (
        is_gpu_available,
        to_gpu,
        to_cpu,
        batch_kf_predict,
        batch_kf_update,
    )

    if not is_gpu_available():
        raise RuntimeError("GPU not available")

    # Simulate 1000 tracks with 4D state (x, vx, y, vy)
    n_tracks = 1000
    state_dim = 4
    meas_dim = 2

    # Initial states and covariances
    x = np.random.randn(n_tracks, state_dim)
    P = np.tile(np.eye(state_dim), (n_tracks, 1, 1))

    # System matrices
    dt = 0.1
    F = np.array([
        [1, dt, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, dt],
        [0, 0, 0, 1]
    ])
    Q = np.eye(state_dim) * 0.1
    H = np.array([[1, 0, 0, 0], [0, 0, 1, 0]])
    R = np.eye(meas_dim) * 0.5

    # Transfer to GPU
    x_gpu = to_gpu(x)
    P_gpu = to_gpu(P)

    # Batch predict (all 1000 tracks at once!)
    pred_result = batch_kf_predict(x_gpu, P_gpu, F, Q)

    # Generate measurements
    z = np.random.randn(n_tracks, meas_dim)

    # Batch update
    upd_result = batch_kf_update(
        pred_result.x, pred_result.P, z, H, R
    )

    # Transfer results back to CPU
    x_updated = to_cpu(upd_result.x)
    P_updated = to_cpu(upd_result.P)

    print(f"Processed {n_tracks} tracks in batch")

Performance Notes
-----------------

The GPU implementations achieve significant speedups for:

- **Large batch sizes**: Processing 100+ tracks simultaneously
- **Large particle counts**: Particle filters with 1000+ particles
- **Matrix operations**: Cholesky, QR, and eigendecompositions

For small batch sizes (< 10 tracks), CPU implementations may be faster due to
GPU transfer overhead.
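
The break-even batch size depends on the hardware, so it is worth measuring.
The sketch below is a small, hypothetical timing helper, shown here comparing
a per-track loop with a batched NumPy expression. When timing GPU code, pass
a synchronization callable as ``sync`` so that queued work finishes before
the clock stops. ``sync_gpu`` may serve that purpose, but its exact signature
is not shown on this page::

    import time

    import numpy as np


    def best_time(fn, repeats=5, sync=None):
        """Best wall-clock time of fn() over several runs, in seconds."""
        fn()                                  # warm-up run, not timed
        if sync is not None:
            sync()
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            if sync is not None:
                sync()                        # wait for asynchronous work
            best = min(best, time.perf_counter() - start)
        return best


    F = np.eye(4)
    for n_tracks in (10, 1000):
        P = np.tile(np.eye(4), (n_tracks, 1, 1))
        t_loop = best_time(lambda: [F @ P[i] @ F.T for i in range(n_tracks)])
        t_batch = best_time(lambda: F @ P @ F.T)
        print(f"{n_tracks} tracks: loop {t_loop:.2e} s, batch {t_batch:.2e} s")

For a fair GPU comparison, the timed callable should include the ``to_gpu``
and ``to_cpu`` transfers if they occur on every step in the real pipeline.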

Backend Differences
~~~~~~~~~~~~~~~~~~~

**CuPy (NVIDIA CUDA)**:

- Full float64 (double precision) support
- Explicit memory pool management
- CUDA stream synchronization

**MLX (Apple Silicon)**:

- Optimized for float32 (single precision)
- Automatic memory management
- Lazy evaluation with explicit sync via ``mx.eval()``
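
The precision difference matters in practice. The NumPy-only sketch below uses
a hypothetical helper, ``lost_increment``, to show that a small additive term
that survives in float64 can be rounded away entirely in float32::

    import numpy as np


    def lost_increment(dtype, increment=1e-8):
        """True if adding `increment` to 1.0 is rounded away in `dtype`."""
        one = dtype(1.0)
        return bool(one + dtype(increment) == one)


    for dtype in (np.float64, np.float32):
        print(dtype.__name__, np.finfo(dtype).eps, lost_increment(dtype))

Process noise or regularization terms that are tiny relative to the covariance
scale may therefore have no effect in single precision, so results from the
two backends should be compared with tolerances appropriate to float32.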