GPU Execution
DEER's prefix-scan and per-step batch evaluations are array-type-agnostic, so the same ParallelMALASampler runs on either CPU or GPU. This page covers when GPU pays off, the current limitations, and a worked example.
When to use GPU
GPU pays off when the per-Newton-step work i.e., evaluating logdensity_batch and grad_logdensity_batch across the T-wide trajectory and the Hutchinson probes is large enough to saturate the device.
| Regime | Reach for |
|---|---|
D ≲ a few hundred, lightweight logp | CPU. Per-step cost is dominated by overhead; GPU launch latency wins out. |
Large D, or logp is itself a batched linear-algebra / kernel workload | GPU. T × probes Jacobian work fills the device. |
Many independent chains via MCMCThreads() | CPU, scaled with threads. GPU chains do not currently share a device pool. |
A useful heuristic: if your CPU grad_logdensity_batch already saturates BLAS at the batch sizes DEER passes it (T columns, plus probes HVP probes), GPU is likely to help. If not, profile before switching.
Limitations
1. DynamicPPL / Turing models are CPU-only
The one-argument DensityModel(::DynamicPPL.Model) constructor (and the LogDensityProblems path that backs it) executes the model with Vector{Float64} parameters on CPU. There is no path today that runs a @model against CuArray parameters.
To run on GPU, hand-write logdensity / grad_logdensity (and the batched variants) directly over CuArrays and pass them to DensityModel. This is shown in the example below.
2. Enzyme on GPU currently needs pmcmc_matmul / pmcmc_dot / pmcmc_dotsum
ParallelMCMC exports three thin wrappers:
pmcmc_matmul(A, B) = A * B
pmcmc_dot(a, b) = dot(a, b)
pmcmc_dotsum(A, B) = sum(A .* B)These exist purely so the package can register Enzyme rules against them without committing type piracy on Base.*, Base.dot, Base.sum. The rules compute primal and cotangents with plain operators outside Enzyme's IR rewriter, which keeps them off the reverse-mode path that aborts on GPU.
unsupported tag gc-transition for
call i32 @cuMemcpyDtoHAsync_v2(...) [ "jl_roots"(...), "gc-transition"() ]
UNREACHABLE executed at .../Enzyme/GradientUtils.cpp:309 (signal 6)When you should use them:
- You are on GPU and using
AutoEnzyme()as the AD backend.
When you do not need them:
- CPU code: plain
*,dot,sumare faster and clearer. - GPU code with
AutoMooncake(; config=nothing): Mooncake's own CUDA extension handles these operations natively. - GPU code with
AutoZygote(): Zygote uses ChainRules adjoints, which have cuBLAS-backed rules for*/dot/sumonCuArray.
These wrappers are a workaround pending upstream Enzyme support for GPU matmul rules. Once that lands, plain A * B on CuArrays will work and the wrappers will be deprecated.
3. Gradients must be written as single-op broadcasts, not fused expressions
Enzyme on GPU can lower a single broadcast kernel at a time but aborts when a kernel fuses multiple operations. This means the fused form
# BROKEN: aborts during Enzyme compile on GPU
gradlogp(β) = -β .- pmcmc_matmul(transpose(X), pmcmc_matmul(X, β)) ./ Float32(N)must be written as a sequence of single-op stages:
# WORKS: each line is one CUDA kernel Enzyme can differentiate
function gradlogp(β)
Xβ = pmcmc_matmul(X, β)
Y = pmcmc_matmul(transpose(X), Xβ)
Y = Y ./ Float32(N)
Y = Y .+ β
return -Y
endMooncake does not have this restriction.
Worked example: Bayesian logistic regression on GPU
A standard Bayesian logistic regression with a Gaussian prior; an example model where GPU acceleration actually pays off as N and D grow. Posterior:
\[\log p(\beta \mid y, X) = -\tfrac{1}{2}\|\beta\|^2 + \sum_{i=1}^{N} \bigl[ y_i \, x_i^\top \beta - \log(1 + e^{x_i^\top \beta}) \bigr],\]
with gradient $\nabla_\beta \log p = -\beta + X^\top (y - \sigma(X\beta))$, where $\sigma$ is the logistic sigmoid.
We use a numerically stable softplus on GPU:
\[\log(1 + e^{z}) = \max(z, 0) + \log(1 + e^{-|z|}),\]
so neither the positive nor negative tail overflows in Float32.
Synthetic data
using Random, CUDA
D = 50; N = 1000
rng = MersenneTwister(0)
β_true = randn(rng, Float32, D)
X_cpu = randn(rng, Float32, N, D)
probs = 1f0 ./ (1f0 .+ exp.(-(X_cpu * β_true)))
y_cpu = Float32.(rand(rng, N) .< probs)
X_gpu = CUDA.CuMatrix(X_cpu)
y_gpu = CUDA.CuVector(y_cpu)Mooncake backend (plain operators)
using ParallelMCMC, MCMCChains
using ADTypes, Mooncake
softplus(z) = log1p(exp(-abs(z))) + max(z, zero(z))
logp(β) = -0.5f0 * sum(abs2, β) + dot(y_gpu, X_gpu * β) -
sum(softplus.(X_gpu * β))
gradlogp(β) = let z = X_gpu * β
-β .+ transpose(X_gpu) * (y_gpu .- 1f0 ./ (1f0 .+ exp.(-z)))
end
function logp_batch(B)
Z = X_gpu * B
prior = -0.5f0 .* vec(sum(abs2, B; dims=1))
ll = vec(sum(y_gpu .* Z; dims=1)) .- vec(sum(softplus.(Z); dims=1))
return prior .+ ll
end
function gradlogp_batch(B)
Z = X_gpu * B
P = 1f0 ./ (1f0 .+ exp.(-Z))
return -B .+ transpose(X_gpu) * (y_gpu .- P)
end
model = DensityModel(logp, gradlogp, D;
logdensity_batch=logp_batch,
grad_logdensity_batch=gradlogp_batch)
sampler = ParallelMALASampler(0.005f0;
T=16, damping=0.5f0,
backend=ADTypes.AutoMooncake(; config=nothing))
chain = sample(model, sampler, 1_600;
initial_params=CUDA.zeros(Float32, D),
chain_type=MCMCChains.Chains)Posterior mean recovery error ‖β_post − β_true‖ / ‖β_true‖ should land in the 0.1–0.2 range after a few hundred post-warmup samples.
Enzyme backend (requires `pmcmc_matmul)
Same model, with the GPU-Enzyme restrictions applied: every * becomes pmcmc_matmul, every dot becomes pmcmc_dot, and every gradient broadcast is expanded into single-op stages:
using ParallelMCMC, MCMCChains
using ADTypes, Enzyme
function logp(β)
z = pmcmc_matmul(X_gpu, β)
az = abs.(z)
ez = exp.(.-az)
sp1 = log1p.(ez)
sp2 = max.(z, 0f0)
sp = sp1 .+ sp2
ll = pmcmc_dot(y_gpu, z) - sum(sp)
prior = -0.5f0 * sum(abs2, β)
return prior + ll
end
function gradlogp(β)
z = pmcmc_matmul(X_gpu, β)
ez = exp.(.-z)
den = 1f0 .+ ez
p = 1f0 ./ den
resid = y_gpu .- p
g_ll = pmcmc_matmul(transpose(X_gpu), resid)
g = .-β
g = g .+ g_ll
return g
end
function logp_batch(B)
Z = pmcmc_matmul(X_gpu, B)
aZ = abs.(Z)
eZ = exp.(.-aZ)
sp1 = log1p.(eZ)
sp2 = max.(Z, 0f0)
SP = sp1 .+ sp2
yZ = y_gpu .* Z
ll = vec(sum(yZ; dims=1)) .- vec(sum(SP; dims=1))
prior = -0.5f0 .* vec(sum(abs2, B; dims=1))
return prior .+ ll
end
function gradlogp_batch(B)
Z = pmcmc_matmul(X_gpu, B)
eZ = exp.(.-Z)
denZ = 1f0 .+ eZ
P = 1f0 ./ denZ
Resid = y_gpu .- P
G_ll = pmcmc_matmul(transpose(X_gpu), Resid)
G = .-B
G = G .+ G_ll
return G
end
model = DensityModel(logp, gradlogp, D;
logdensity_batch=logp_batch,
grad_logdensity_batch=gradlogp_batch)
sampler = ParallelMALASampler(0.005f0;
T=16, damping=0.5f0,
backend=ADTypes.AutoEnzyme())
chain = sample(model, sampler, 1_600;
initial_params=CUDA.zeros(Float32, D),
chain_type=MCMCChains.Chains)The two snippets sample the same posterior; the difference is purely in what the AD backend can chew on.
The AD-HVP fallback (and when to write your own)
DEER needs a Hessian–vector product $H v$ at every Newton step. DensityModel accepts both gradlogp and (optionally) hvp / hvp_batch. Depending on what you supply, the sampler does one of two things:
- You supply
hvp/hvp_batch. These run as plain kernels. The AD backend is never invoked for HVPs. - You only supply
gradlogp/grad_logdensity_batch. The sampler builds the HVP by differentiating your gradient — either a forward-mode pushforward ofgradlogp(ForwardOnGrad, the default for most backends) or a reverse-mode gradient ofx -> dot(gradlogp(x), v)(ReverseOnGrad, used forAutoMooncakeandAutoZygote). This is the AD-HVP fallback, and it is what the logistic-regression example above uses.
When the fallback is the right call
- Complex or composed models. Bayesian neural nets, hierarchical models with many transformations, mixtures, or anything where the Hessian has no convenient closed form. Deriving and maintaining
hvpby hand for these is error-prone; AD removes a whole class of bugs. - Prototyping. You want a correct sampler running before you optimize. Drop in
gradlogp, lean on AD, profile later. - Operations that AD libraries handle for free but are tedious to differentiate by hand. Special functions, link functions, log-sum-exp, normalizing constants of standard distributions, etc.
- You want backend flexibility. With only
gradlogpyou can swapAutoMooncake↔AutoEnzyme↔AutoZygoteto compare without rewriting model code.
When to write your own analytical HVP
- The HVP has a clean closed form. Quadratic priors, Gaussian likelihoods, GLMs (logistic, Poisson, probit) — the second derivative is a known function of intermediate quantities you already compute in
gradlogp. A few extra lines and you skip the AD pipeline entirely. - Performance matters and the AD compile is heavy. Enzyme and Mooncake both pay a one-shot compilation cost on the user's gradient. For long-running chains this amortizes, but for many short runs the analytical HVP wins.
- You're hitting AD-backend-specific GPU restrictions. The Enzyme limitations above (
pmcmc_*wrappers, staged broadcasts) only matter when the AD backend is invoked. Supplying analytical HVP sidesteps them — yourgradlogpandhvpcan use plain*,dot,sumeven withbackend=AutoEnzyme(), because the backend is held but never run. - You can reuse intermediates between gradient and HVP. When
hvpshares $X\beta$, $\sigma(X\beta)$, or similar with the gradient computation, an analytical version can be both faster and shorter than what AD produces.
Same example with analytical HVP
For Bayesian logistic regression with a Gaussian prior, the Hessian is
\[\nabla^2 \log p = -I - X^\top \, \mathrm{diag}\bigl(\sigma(X\beta) \odot (1 - \sigma(X\beta))\bigr) \, X,\]
so
\[H v = -v - X^\top \bigl[\sigma(X\beta) \odot (1 - \sigma(X\beta)) \odot (X v)\bigr].\]
Plug it in alongside gradlogp and the sampler stops invoking the AD path:
function hvp(β, v)
z = X_gpu * β
σz = 1f0 ./ (1f0 .+ exp.(-z))
w = σz .* (1f0 .- σz)
-v .- transpose(X_gpu) * (w .* (X_gpu * v))
end
function hvp_batch(B, V)
Z = X_gpu * B
Σ = 1f0 ./ (1f0 .+ exp.(-Z))
W = Σ .* (1f0 .- Σ)
-V .- transpose(X_gpu) * (W .* (X_gpu * V))
end
model = DensityModel(logp, gradlogp, D;
logdensity_batch=logp_batch,
grad_logdensity_batch=gradlogp_batch,
hvp=hvp, hvp_batch=hvp_batch)
# `backend` is still required by the API, but is never invoked.
sampler = ParallelMALASampler(0.005f0;
T=16, damping=0.5f0,
backend=ADTypes.AutoMooncake(; config=nothing))This recovers the same posterior as the fallback version — the only difference is that HVPs come from a plain matmul instead of a reverse-mode pass over gradlogp.
Picking an AD backend on GPU
| Backend | GPU support | Restrictions | Recommended when |
|---|---|---|---|
AutoEnzyme() | Via pmcmc_matmul / pmcmc_dot / pmcmc_dotsum and single-op-broadcast gradients | The restrictions above | You want Enzyme's reverse-mode performance and are willing to follow the rules above |
AutoMooncake(; config=nothing) | Native (no wrappers) | Slower compile | You want plain Julia operators with no rewriting |
AutoZygote() | Native (no wrappers) | ChainRules-based; reverse-only, plus no in-place mutation in gradlogp | You want plain Julia operators and are already on the ChainRules ecosystem |
AutoForwardDiff() | Native; works with * / dot / sum / broadcasts on CuArray{Dual} | Cost scales as O(D) | Low-D models, prototyping, validating Mooncake/Enzyme results |
If you supply hvp / hvp_batch analytically, none of the above matters — those run as plain kernels and the sampler does not invoke a backend.