Author: S. Loisel
Distributed sparse matrix and vector operations using MPI for Julia. This package provides efficient parallel linear algebra operations across multiple MPI ranks.
- Distributed sparse matrices (`HPCSparseMatrix{T,Ti,AV}`) with row partitioning across MPI ranks
- Distributed dense vectors (`HPCVector{T,AV}`) with flexible partitioning
- Matrix-matrix multiplication (`A * B`) with memoized communication plans
- Matrix-vector multiplication (`A * x`, `mul!(y, A, x)`)
- Sparse direct solvers: LU and LDLT factorization using MUMPS
- Lazy transpose with optimized multiplication rules
- Matrix addition/subtraction (`A + B`, `A - B`)
- Vector operations: norms, reductions, and arithmetic with automatic partition alignment (see the sketch after this list)
- Support for both `Float64` and `ComplexF64` element types
- GPU acceleration via Metal.jl (macOS) or CUDA.jl (Linux/Windows) with automatic CPU staging for MPI
- Multi-GPU sparse direct solver via cuDSS with NCCL communication (CUDA only)
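
A minimal sketch of the vector-level API, assuming `HPCVector` and `HPCSparseMatrix` hook into the standard `Base`/`LinearAlgebra` generics (`norm`, `dot`, `mul!`) as the list above indicates; exact method coverage may differ:

```julia
using MPI
MPI.Init()
using HPCSparseArrays, SparseArrays, LinearAlgebra

A = HPCSparseMatrix{Float64}(sprand(1000, 1000, 0.01))  # same global matrix on all ranks
v = HPCVector(rand(1000))
w = HPCVector(rand(1000))

s  = v + 2.0 * w    # arithmetic with automatic partition alignment
nv = norm(v)        # global 2-norm, reduced across ranks
d  = dot(v, w)      # global dot product (a reduction)

y = HPCVector(zeros(1000))
mul!(y, A, v)       # in-place distributed matrix-vector product
```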
Install with:

```julia
using Pkg
Pkg.add("HPCSparseArrays")
```

Basic usage:

```julia
using MPI
MPI.Init()
using HPCSparseArrays
using SparseArrays
# Create a sparse matrix (must be identical on all ranks)
A_global = sprand(1000, 1000, 0.01)
A = HPCSparseMatrix{Float64}(A_global)
# Create a vector
x_global = rand(1000)
x = HPCVector(x_global)
# Matrix-vector multiplication
y = A * x
# Matrix-matrix multiplication
B_global = sprand(1000, 500, 0.01)
B = HPCSparseMatrix{Float64}(B_global)
C = A * B
# Transpose operations
At = transpose(A)
D = At * B # Materializes transpose as needed
# Solve linear systems
using LinearAlgebra
A_sym = A_global + transpose(A_global) + 10I  # Make symmetric positive definite
A_sym_dist = HPCSparseMatrix{Float64}(A_sym)
F = ldlt(A_sym_dist) # LDLT factorization
x_sol = solve(F, y)   # Solve A_sym * x_sol = y
```

HPCSparseArrays supports GPU acceleration via Metal.jl (macOS) or CUDA.jl (Linux/Windows). GPU support is optional; the extensions are loaded as weak dependencies.

With Metal (macOS):

```julia
using Metal # Load Metal BEFORE MPI for GPU detection
using MPI
MPI.Init()
using HPCSparseArrays
using SparseArrays  # for sprand
# Define backends
cpu_backend = HPCBackend(DeviceCPU(), CommMPI(), SolverMUMPS())
metal_backend = HPCBackend(DeviceMetal(), CommMPI(), SolverMUMPS())
# Create vectors/matrices directly with GPU backend
x_gpu = HPCVector(Float32.(rand(1000)), metal_backend)
A = HPCSparseMatrix(sprand(Float32, 1000, 1000, 0.01), metal_backend)
# GPU operations work transparently
y_gpu = A * x_gpu # Sparse A*x with GPU vector (CPU staging for computation)
z_gpu = x_gpu + x_gpu # Vector addition on GPU
# Convert to CPU if needed
y_cpu = to_backend(y_gpu, cpu_backend)
```

With CUDA (Linux/Windows):

```julia
using CUDA  # Load CUDA BEFORE MPI
using MPI
MPI.Init()
using HPCSparseArrays
using SparseArrays  # for sprand
# Define backends
cpu_backend = HPCBackend(DeviceCPU(), CommMPI(), SolverMUMPS())
cuda_backend = HPCBackend(DeviceCUDA(), CommMPI(), SolverMUMPS())
# Create directly with GPU backend
x_gpu = HPCVector(rand(1000), cuda_backend)
A = HPCSparseMatrix(sprand(1000, 1000, 0.01), cuda_backend)
# GPU operations work transparently
y_gpu = A * x_gpu
z_gpu = x_gpu + x_gpu
# Convert to CPU if needed
y_cpu = to_backend(y_gpu, cpu_backend)
```

For multi-GPU distributed sparse direct solves, use `CuDSSFactorizationMPI`:

```julia
using CUDA, MPI
MPI.Init()
using HPCSparseArrays
# Each rank uses one GPU
CUDA.device!(MPI.Comm_rank(MPI.COMM_WORLD) % length(CUDA.devices()))
# Create distributed sparse matrix
A = HPCSparseMatrix{Float64}(make_spd_matrix(1000))
b = HPCVector(rand(1000))
# Multi-GPU factorization using cuDSS + NCCL
F = cudss_ldlt(A) # or cudss_lu(A)
x = F \ b
finalize!(F)  # Clean up cuDSS resources
```

Requirements: cuDSS 0.4+ with MGMN (Multi-GPU Multi-Node) support, NCCL for inter-GPU communication.
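
Before calling the cuDSS path it can help to confirm that every rank actually has a usable GPU. A small hedged check using standard CUDA.jl and MPI.jl calls (not part of HPCSparseArrays itself):

```julia
using CUDA, MPI
MPI.Init()

rank  = MPI.Comm_rank(MPI.COMM_WORLD)
nrank = MPI.Comm_size(MPI.COMM_WORLD)

# Fail early if CUDA is unavailable on this node
@assert CUDA.functional() "Rank $rank: CUDA is not functional"

ngpu = length(CUDA.devices())
CUDA.device!(rank % ngpu)   # same one-GPU-per-rank mapping as the example above
println("Rank $rank/$nrank using GPU $(rank % ngpu) of $ngpu")
```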
Backend storage and communication details:

- Vectors: `HPCVector{T,AV}` where `AV` is `Vector{T}` (CPU), `MtlVector{T}` (Metal), or `CuVector{T}` (CUDA)
- Sparse matrices: `HPCSparseMatrix{T,Ti,AV}` where `AV` determines storage for nonzero values
- Dense matrices: `HPCMatrix{T,AM}` where `AM` is `Matrix{T}`, `MtlMatrix{T}`, or `CuMatrix{T}`
- MPI communication: always uses CPU buffers, staged automatically (see the sketch after this list)
- Element types: Metal requires `Float32`; CUDA supports `Float32` and `Float64`
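
A rough sketch of what the automatic CPU staging looks like conceptually; the function below is purely illustrative (a hypothetical helper, not the package's internal API):

```julia
using MPI
MPI.Init()
# using CUDA   # or `using Metal`, depending on the device backend

# Hypothetical illustration only: GPU data is copied into a host Array before
# any MPI call, because the MPI buffers are always CPU memory in this design.
function staged_allreduce_sum(local_vec)
    host  = Array(local_vec)                              # GPU -> CPU staging copy
    total = MPI.Allreduce(sum(host), +, MPI.COMM_WORLD)   # MPI only ever sees CPU data
    return total
end
```

The table below summarizes which operations run natively on the device and which go through this CPU staging path:
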
| Operation | Metal | CUDA |
|---|---|---|
| `v + w`, `v - w` | Native | Native |
| `α * v` (scalar) | Native | Native |
| `A * x` (sparse) | CPU staging | CPU staging |
| `A * x` (dense) | CPU staging | CPU staging |
| `transpose(A) * x` | CPU staging | CPU staging |
| Broadcasting (`abs.(v)`) | Native | Native |
| `cudss_lu(A)`, `cudss_ldlt(A)` | N/A | Multi-GPU native |
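
Putting it together, a minimal self-contained `your_script.jl` sketch assembled from the quick-start example and feature list above (the residual check assumes `norm` and vector subtraction work on `HPCVector` as listed); save it and launch it with the command below:

```julia
# your_script.jl
using MPI
MPI.Init()
using HPCSparseArrays, SparseArrays, LinearAlgebra, Random

rank = MPI.Comm_rank(MPI.COMM_WORLD)

# Global problem data; seed the RNG so every rank builds identical data
Random.seed!(1234)
n = 1000
A_global = sprand(n, n, 0.01)
A_sym    = A_global + transpose(A_global) + 10I   # symmetric positive definite
b_global = rand(n)

# Distribute and solve with the MUMPS-backed LDLT factorization
A = HPCSparseMatrix{Float64}(A_sym)
b = HPCVector(b_global)
F = ldlt(A)
x = solve(F, b)

rank == 0 && println("Residual norm: ", norm(A * x - b))
```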
Run it with:

```bash
mpiexec -n 4 julia your_script.jl
```

For detailed documentation, see the stable docs or dev docs.