Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually need tens of times more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD, for instance, demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer towards Adam-Scale Speed. On the one hand, FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step-sizes based on the standard deviation of batch losses. On the other hand, it accelerates per-batch computation through the use of Rademacher random vector (±1) perturbations coupled with CUDA's parallel processing capabilities. Extensive experiments on diverse models (including RoBERTa-large, the OPT family (350M-66B), Phi-2, and Llama3) across 11 varied downstream tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by +3% in accuracy while requiring 3× fewer forward passes. Notably, for the RoBERTa-large model, FZOO achieves average improvements of +5.6% in accuracy and 18× reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO's formal equivalence to a normalized-SGD update rule and establishing its convergence guarantees. Beyond full-parameter tuning, FZOO plugs smoothly into PEFT techniques, unlocking even larger memory savings. Taken together, our results make single-GPU, high-speed, full-parameter fine-tuning realistic today and point toward future work on memory-efficient pre-training.
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}|}$ denote the labeled dataset, and $\mathcal{B} \subset \mathcal{D}$ be a mini-batch of size $B$. The trainable model parameters are denoted as $\theta \in \mathbb{R}^d$, and the empirical loss on $\mathcal{B}$ is $L(\theta; \mathcal{B})$.
Classical Zeroth-Order (ZO) Gradient Estimation:
$$
\hat{\nabla} L(\theta; \mathcal{B}) = \frac{L(\theta+\epsilon z; \mathcal{B}) - L(\theta-\epsilon z; \mathcal{B})}{2\epsilon} z
$$
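where $z \sim \mathcal{N}(0, I_d)$ is a random direction and $\epsilon > 0$ is the perturbation scale. Averaging $N$ i.i.d. estimates $\hat{\nabla}_i L$, each computed with an independent draw $z_i$, gives the $N$-sample ZO estimator: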
$$
\hat{\nabla}_N L = \frac{1}{N} \sum_{i=1}^N \hat{\nabla}_i L
$$
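As a rough illustration of the two-sided estimator above, the following NumPy sketch averages $N$ estimates on a toy quadratic loss. The names `zo_gradient` and `quadratic_loss` are hypothetical and not taken from the paper's code; the mini-batch is assumed to be fixed inside `loss_fn`.

```python
import numpy as np

def zo_gradient(loss_fn, theta, eps=1e-3, n_samples=1, rng=None):
    """Average n_samples two-sided ZO estimates with Gaussian directions."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        z = rng.standard_normal(theta.shape)      # z ~ N(0, I_d)
        # Two forward passes per direction: L(theta + eps z) and L(theta - eps z)
        delta = loss_fn(theta + eps * z) - loss_fn(theta - eps * z)
        grad += (delta / (2.0 * eps)) * z
    return grad / n_samples

def quadratic_loss(theta):
    # Toy stand-in for L(theta; B); its true gradient is theta itself.
    return 0.5 * float(theta @ theta)

theta = np.array([1.0, -2.0, 0.5])
estimate = zo_gradient(quadratic_loss, theta, n_samples=20000,
                       rng=np.random.default_rng(0))
```

With many samples the estimate approaches the true gradient of the toy loss; a single sample is much noisier, and each sample costs two forward passes.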
FZOO One-sided Estimation and Adaptive Step Size:
$$
g_t = \frac{1}{\epsilon N} \sum_{i=1}^N (l_i - l_0) u_i
$$
$$
\sigma_t^2 = \frac{1}{N-1} \sum_{i=1}^N \left(l_i - \frac{1}{N} \sum_{j=1}^N l_j \right)^2
$$
$$
\theta_{t+1} = \theta_t - \eta_t \frac{g_t}{\sigma_t}
$$
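where $u_i \in \{\pm 1\}^d$ are Rademacher random vectors, $l_0 = L(\theta_t; \mathcal{B}_t)$ is the unperturbed loss, $l_i = L(\theta_t + \epsilon u_i; \mathcal{B}_t)$ are the perturbed losses, and $\eta_t$ is the learning rate. $\sigma_t$ is the sample standard deviation of the $N$ perturbed losses, so each step is normalized by how much the loss varies across perturbations. One update therefore uses $N+1$ forward passes, compared with $2N$ for the two-sided estimator above.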
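The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy quadratic, not the authors' implementation: `fzoo_step` and `quadratic_loss` are hypothetical names, the perturbed losses are evaluated one at a time rather than in a batched forward pass, and the guard for $\sigma_t = 0$ is an assumption added to keep the example safe.

```python
import numpy as np

def fzoo_step(loss_fn, theta, lr, eps, n_perturb, rng):
    """One FZOO-style update: one-sided estimate scaled by 1 / sigma_t."""
    l0 = loss_fn(theta)                                   # l_0, unperturbed loss
    u = rng.choice(np.array([-1.0, 1.0]), size=(n_perturb, theta.size))
    losses = np.array([loss_fn(theta + eps * u_i) for u_i in u])  # l_1..l_N
    g = ((losses - l0) @ u) / (eps * n_perturb)           # g_t
    sigma = losses.std(ddof=1)                            # sigma_t, 1/(N-1) form
    if sigma == 0.0:
        # Assumption: skip the step when all perturbed losses coincide.
        return theta
    return theta - lr * g / sigma

def quadratic_loss(theta):
    # Toy stand-in for L(theta; B_t).
    return 0.5 * float(theta @ theta)

theta = np.array([1.0, -2.0, 0.5])
rng = np.random.default_rng(0)
loss_before = quadratic_loss(theta)
for _ in range(100):
    theta = fzoo_step(quadratic_loss, theta, lr=1e-5, eps=1e-3,
                      n_perturb=8, rng=rng)
loss_after = quadratic_loss(theta)
```

To first order $\sigma_t$ is itself proportional to $\epsilon$, so with the formulas as written the step length is on the order of $\eta_t / \epsilon$; the example therefore uses a learning rate well below $\epsilon$.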
The original ZO pipeline runs separate forward passes for each perturbation, limiting computational efficiency. FZOO leverages Rademacher vectors and CUDA kernel fusion for faster computation. For the first layer:
$$
F^{(1)} = W^{(1)}X, \quad Y^{(1)}_i = F^{(1)} + \epsilon (u_i \odot X), \quad i=1,\dots,N
$$
For subsequent layers ($j \geq 2$):
$$
F^{(j)} = W^{(j)} Y^{(j-1)}, \quad P^{(j)} = \epsilon (U \odot Y^{(j-1)}), \quad Y^{(j)} = F^{(j)} + P^{(j)}
$$
where $U = \mathrm{diag}(u_1, \dots, u_N)$ is the block-diagonal sign matrix. By fusing the $N$ matrix multiplications into a single CUDA kernel, the wall-clock time is reduced by a factor $s$.
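The text does not spell out tensor shapes or the exact form of the weight perturbation. One reading under which the first-layer formula holds exactly is a square weight matrix perturbed by $\epsilon\,\mathrm{diag}(u_i)$; that reading is an assumption made here for illustration. Under it, the NumPy sketch below checks that $N$ separate perturbed matrix products equal one shared product plus $N$ elementwise sign flips. It shows only the algebraic sharing, not CUDA kernel fusion.

```python
import numpy as np

rng = np.random.default_rng(0)
d, batch, n_perturb, eps = 4, 5, 3, 1e-3

W = rng.standard_normal((d, d))        # square weight matrix (assumption)
X = rng.standard_normal((d, batch))    # layer input, one column per example
u = rng.choice(np.array([-1.0, 1.0]), size=(n_perturb, d))  # Rademacher signs

# Naive pipeline: one full matrix product per perturbation.
naive = np.stack([(W + eps * np.diag(u_i)) @ X for u_i in u])

# Shared pipeline: a single product F = W X, reused by every perturbation,
# plus a cheap elementwise term eps * (u_i ⊙ X) for each i.
F = W @ X
batched = F[None, :, :] + eps * u[:, :, None] * X[None, :, :]
```

Both arrays have shape `(n_perturb, d, batch)` and agree up to floating-point error, since $\mathrm{diag}(u_i)\,X = u_i \odot X$ when $u_i$ is broadcast across columns.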
The overall speed-up achieved by FZOO is:
$$
\boxed{\,f \times \min(s, r)\,}
$$
where $f$ is the gain from one-sided estimation, $r$ is the number of parallel perturbations, and $s$ is the CUDA kernel fusion speedup. On OPT-125M with $N=8$, our batched scheme delivers a $1.92 \times$ speed-up over the standard "8 perturbations + 8 forward passes" baseline.
@article{dang2025fzoo,
title={FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed},
author={Sizhe Dang and Yangyang Guo and Yanjun Zhao and Haishan Ye and Xiaodong Zheng and Guang Dai and Ivor Tsang},
journal={arXiv preprint arXiv:2506.09034},
year={2025}
}