LLM-based web agents have recently made significant progress, but much of it has
occurred in closed-source systems—widening the gap with open-source alternatives.
Progress has been held back by two key challenges—first, a narrow focus on single-
step tasks that overlooks the complexity of multi-step web interactions, and second,
the high compute costs required to post-train LLM-based web agents. To address
these challenges, we present the first statistically grounded study on compute allocation for
LLM web-agent post-training. Our approach uses a two-stage pipeline, training
a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT),
followed by on-policy reinforcement learning (RL). We find this process to be highly sensitive to
hyperparameter choices, making exhaustive sweeps impractical. To spare others from
expensive trial-and-error, we sample 1,370 configurations and use bootstrapping
to estimate effective hyperparameters. Our results show that combining SFT with
on-policy RL consistently outperforms either approach alone on both WorkArena
and MiniWoB++. Further, this strategy requires only 55% of the compute to match
the peak performance of pure SFT on MiniWoB++, pushing the compute–performance Pareto
frontier, and it is the only strategy that can close the gap with closed-source models.
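
To make the bootstrapping step concrete, the following is a minimal sketch, not the paper's actual procedure or search space: it resamples a set of (hyperparameter configuration, score) records with replacement and estimates, with confidence intervals, how often runs using a given hyperparameter value land in the top decile. The configuration fields, values, and scores below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for sampled configurations: each record pairs a
# hyperparameter setting with the success rate its run achieved.
configs = [
    {"lr": float(lr), "score": float(s)}
    for lr, s in zip(rng.choice([1e-5, 3e-5, 1e-4], size=200),
                     rng.uniform(0.0, 1.0, size=200))
]

def top_decile_rate(sample, key, value, q=0.9):
    # Fraction of runs using `value` for `key` that reach the top-q score quantile.
    scores = np.array([c["score"] for c in sample])
    cutoff = np.quantile(scores, q)
    uses = [c for c in sample if c[key] == value]
    hits = [c for c in uses if c["score"] >= cutoff]
    return len(hits) / max(len(uses), 1)

def bootstrap_ci(sample, key, value, n_boot=1000, alpha=0.05):
    # Resample configurations with replacement to get a confidence interval.
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(sample), size=len(sample))
        stats.append(top_decile_rate([sample[i] for i in idx], key, value))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(stats)), (float(lo), float(hi))

for lr in [1e-5, 3e-5, 1e-4]:
    mean, (lo, hi) = bootstrap_ci(configs, "lr", lr)
    print(f"lr={lr:g}: P(top-10% run) = {mean:.2f}  (95% CI [{lo:.2f}, {hi:.2f}])")

The same resampling applies to any hyperparameter field in the records; comparing the resulting intervals across values is one simple way to read off which settings are reliably effective without an exhaustive sweep.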