FZOO: Fast Zeroth-Order Optimizer for Fine‑Tuning Large Language Models towards Adam‑Scale Speed

Sizhe Dang1*, Yangyang Guo1*, Yanjun Zhao1*, Haishan Ye1,2*✉, Xiaodong Zheng1,
Guang Dai2, Ivor Tsang3,
1Xi'an Jiaotong University,  2SGIT AI Lab,  3Centre for Frontier Artificial Intelligence Research, A*STAR
* Equal contribution,  ✉ Corresponding author

Structure of FZOO. The bottom half depicts a toy example of the efficient implementation of batched forward passes.

Performance of MeZO, Adam, and FZOO on different tasks when fine-tuning the RoBERTa-large model. For a uniform comparison, we count each of Adam's backward passes as 3 forward passes. FZOO achieves an 18× speedup over MeZO, approaching the convergence speed of Adam.

GPU memory consumption of different OPT models and tuning methods on MultiRC. Adam requires about 10× the memory of FZOO.

Abstract

Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually need tens of times more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD, for instance, demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer towards Adam-Scale Speed. On the one hand, FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step-sizes based on the standard deviation of batch losses. On the other hand, it accelerates per-batch computation through the use of Rademacher random vector (±1) perturbations coupled with CUDA's parallel processing capabilities. Extensive experiments on diverse models (including RoBERTa-large, the OPT family (350M-66B), Phi-2, and Llama3) across 11 varied downstream tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by +3% in accuracy while requiring 3× fewer forward passes. Notably, for the RoBERTa-large model, FZOO achieves average improvements of +5.6% in accuracy and 18× reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO's formal equivalence to a normalized-SGD update rule and establishing its convergence guarantees. Beyond full-parameter tuning, FZOO plugs smoothly into PEFT techniques, unlocking even larger memory savings. Taken together, our results make single-GPU, high-speed, full-parameter fine-tuning realistic today and point toward future work on memory-efficient pre-training.

Method

Show Method Details (Click to Expand)

Notation

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}|}$ denote the labeled dataset, and $\mathcal{B} \subset \mathcal{D}$ be a mini-batch of size $B$. The trainable model parameters are denoted as $\theta \in \mathbb{R}^d$, and the empirical loss on $\mathcal{B}$ is $L(\theta; \mathcal{B})$.

Gradient Estimation and Parameter Update

Classical Zeroth-Order (ZO) Gradient Estimation:
$$ \hat{\nabla} L(\theta; \mathcal{B}) = \frac{L(\theta+\epsilon z; \mathcal{B}) - L(\theta-\epsilon z; \mathcal{B})}{2\epsilon} z $$ where $z \sim \mathcal{N}(0, I_d)$. Averaging over $N$ i.i.d. samples gives the $N$-ZO estimator: $$ \hat{\nabla}_N L = \frac{1}{N} \sum_{i=1}^N \hat{\nabla}_i L $$ where $\hat{\nabla}_i L$ denotes the $i$-th single-sample estimate.
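The classical two-sided estimator above can be sketched in a few lines of PyTorch. This is an illustrative implementation, not the paper's code; `loss_fn` stands for $L(\cdot\,; \mathcal{B})$ with the mini-batch fixed:

```python
import torch

def zo_gradient_estimate(loss_fn, theta, eps=1e-3, n_samples=1):
    """Classical two-sided ZO gradient estimate, averaged over
    n_samples i.i.d. Gaussian directions z ~ N(0, I_d)."""
    grad = torch.zeros_like(theta)
    for _ in range(n_samples):
        z = torch.randn_like(theta)
        l_plus = loss_fn(theta + eps * z)
        l_minus = loss_fn(theta - eps * z)
        # finite-difference coefficient projected back onto the direction z
        grad += (l_plus - l_minus) / (2 * eps) * z
    return grad / n_samples
```

Note that each sample costs two forward passes, which is exactly the overhead FZOO's one-sided estimator removes.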
FZOO One-sided Estimation and Adaptive Step Size:
$$ g_t = \frac{1}{\epsilon N} \sum_{i=1}^N (l_i - l_0) u_i $$ $$ \sigma_t^2 = \frac{1}{N-1} \sum_{i=1}^N \left(l_i - \frac{1}{N} \sum_{j=1}^N l_j \right)^2 $$ $$ \theta_{t+1} = \theta_t - \eta_t \frac{g_t}{\sigma_t} $$ where $u_i \in \{\pm 1\}^d$ are Rademacher random vectors, $l_0 = L(\theta_t; \mathcal{B}_t)$, and $l_i = L(\theta_t + \epsilon u_i; \mathcal{B}_t)$.
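A minimal PyTorch sketch of one FZOO update, directly transcribing the three equations above (function name and hyperparameter values are illustrative, not taken from the paper's code):

```python
import torch

def fzoo_step(loss_fn, theta, eps=1e-3, n_pert=8, lr=1e-4):
    """One FZOO step: one-sided estimates along Rademacher directions,
    with the step size scaled by the std of the perturbed losses."""
    l0 = loss_fn(theta)                      # clean loss l_0 = L(theta_t)
    us, ls = [], []
    for _ in range(n_pert):
        # Rademacher direction u_i with i.i.d. +-1 entries
        u = (torch.randint(0, 2, theta.shape) * 2 - 1).to(theta.dtype)
        us.append(u)
        ls.append(loss_fn(theta + eps * u))  # l_i = L(theta_t + eps u_i)
    g = sum((l - l0) * u for l, u in zip(ls, us)) / (eps * n_pert)
    sigma = torch.stack(ls).std()            # unbiased 1/(N-1) variance by default
    return theta - lr * g / (sigma + 1e-12)  # normalized-SGD-style update
```

Only $N+1$ forward passes are needed per step, versus $2N$ for the two-sided estimator; the `1e-12` guard is a standard safeguard against a degenerate $\sigma_t$.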

Efficient Batched Forward Implementation

The original ZO pipeline runs separate forward passes for each perturbation, limiting computational efficiency. FZOO leverages Rademacher vectors and CUDA kernel fusion for faster computation. For the first layer: $$ F^{(1)} = W^{(1)}X, \quad Y^{(1)}_i = F^{(1)} + \epsilon (u_i \odot X), \quad i=1,\dots,N $$ For subsequent layers ($j \geq 2$): $$ F^{(j)} = W^{(j)} Y^{(j-1)}, \quad P^{(j)} = \epsilon (U \odot Y^{(j-1)}), \quad Y^{(j)} = F^{(j)} + P^{(j)} $$ where $U = \mathrm{diag}(u_1, \dots, u_N)$ is the block-diagonal sign matrix. By fusing the $N$ matrix multiplications into a single CUDA kernel, the wall-clock time is reduced by a factor $s$.
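The batching idea can be illustrated on a single linear layer. This is a simplified sketch, assuming Rademacher sign perturbations applied to the weight matrix; the paper's implementation propagates perturbations layer by layer through the whole network, but the principle of computing the clean term once and fusing the $N$ perturbation products into one batched matmul is the same:

```python
import torch

def rademacher(shape, generator=None):
    """Rademacher (+-1) sign tensor of the given shape."""
    return (torch.randint(0, 2, shape, generator=generator) * 2 - 1).float()

def batched_perturbed_forward(W, X, S, eps=1e-3):
    """Batched forward for one linear layer under N weight perturbations.

    Instead of N separate passes computing (W + eps * S_i) @ X, compute the
    clean output F = W @ X once and add all N perturbation terms
    eps * (S_i @ X) with a single batched matmul (one fused kernel on GPU).
    S has shape (N, out, in) and holds Rademacher sign matrices."""
    F = W @ X                           # shared clean forward, computed once
    P = eps * torch.matmul(S, X)        # (N, out, batch): all N products at once
    return F, F.unsqueeze(0) + P        # clean output and the N perturbed outputs
```

By distributivity, `F + eps * (S[i] @ X)` equals the naive `(W + eps * S[i]) @ X`, so the batched scheme is numerically equivalent to running $N$ separate forward passes.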
The overall speed-up achieved by FZOO is: $$ \boxed{\,f \times \min(s, r)\,} $$ where $f$ is the gain from one-sided estimation, $r$ is the number of parallel perturbations, and $s$ is the CUDA kernel fusion speedup. On OPT-125M with $N=8$, our batched scheme delivers a $1.92 \times$ speed-up over the standard "8 perturbations + 8 forward passes" baseline.
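As a worked illustration of the formula, with hypothetical values not taken from the paper: if one-sided estimation halves the forward passes ($f = 2$), $r = 8$ perturbations run in parallel, and kernel fusion gives $s = 4$, the overall gain is $2 \times \min(4, 8) = 8\times$:

```python
def fzoo_speedup(f, s, r):
    """Overall FZOO speed-up: f * min(s, r).
    f: gain from one-sided estimation, s: kernel-fusion speedup,
    r: number of parallel perturbations."""
    return f * min(s, r)

# Hypothetical values for illustration only.
print(fzoo_speedup(2, 4, 8))  # -> 8
```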


Results

Convergence speed comparison

Performance Comparison

BibTeX

@article{dang2025fzoo,
  title={FZOO: Fast Zeroth-Order Optimizer for Fine-Tuning Large Language Models towards Adam-Scale Speed},
  author={Sizhe Dang and Yangyang Guo and Yanjun Zhao and Haishan Ye and Xiaodong Zheng and Guang Dai and Ivor Tsang},
  journal={arXiv preprint arXiv:2506.09034},
  year={2025}
}