The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
Reward extrapolation can lift a student past its teacher, but on near-deterministic structured outputs, extrapolating too far crosses a computable boundary beyond which the output contract collapses.
ListOPD turns a brittle lambda sweep into a measurable safety predicate for structured-output OPD.
When lambda sharpens a near-deterministic structural token too far, the fixed point leaves the IS-clip-safe tail-mass region.
The cliff is predicted from measurable quantities: teacher modal probability, warm-start mass, and clipping strength.
Choose lambda below the measured deployment-budget boundary; above it, format-preserving training becomes format-collapsing.
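As a rough illustration of the mechanism and the boundary choice described above, the toy sketch below treats one structural token as a two-outcome distribution (the modal token versus everything else). It assumes the extrapolated fixed point moves the warm-start log-odds toward the teacher's by a factor of lambda. That update rule, the function names, and the `min_tail_mass` threshold (a stand-in for whatever tail mass a given clipping strength requires) are assumptions made for illustration only; they are not the predicate derived in the preprint.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p) - math.log1p(-p)


def tail_mass(p_teacher: float, p_warm: float, lam: float) -> float:
    """Non-modal mass of the ASSUMED extrapolated fixed point.

    Assumption (not taken from the preprint): the fixed-point log-odds are
    logit(p_warm) + lam * (logit(p_teacher) - logit(p_warm)),
    so lam = 1 recovers the teacher and lam > 1 sharpens past it.
    """
    z = logit(p_warm) + lam * (logit(p_teacher) - logit(p_warm))
    # Tail mass is sigmoid(-z); branch on the sign of z to avoid overflow.
    if z > 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def max_safe_lambda(p_teacher: float, p_warm: float, min_tail_mass: float) -> float:
    """Largest lam whose tail mass stays at or above min_tail_mass.

    min_tail_mass is a hypothetical threshold standing in for the
    clipping-dependent safe region; it must be measured, not guessed.
    """
    if not 0.0 < p_warm < p_teacher < 1.0:
        raise ValueError("expected 0 < p_warm < p_teacher < 1")
    if not 0.0 < min_tail_mass < 1.0:
        raise ValueError("expected 0 < min_tail_mass < 1")
    gap = logit(p_teacher) - logit(p_warm)
    return (logit(1.0 - min_tail_mass) - logit(p_warm)) / gap


if __name__ == "__main__":
    # Illustrative values only: a 0.999 teacher, a 0.9 warm start.
    boundary = max_safe_lambda(p_teacher=0.999, p_warm=0.9, min_tail_mass=1e-4)
    for lam in (1.0, boundary, 2.0):
        print(f"lambda={lam:.3f}  tail_mass={tail_mass(0.999, 0.9, lam):.2e}")
```

In this toy model the tail mass shrinks geometrically as lambda grows, so a token that already sits near probability 1 under the teacher leaves little room between "reproduce the teacher" (lambda = 1) and the boundary. A real check would replace the assumed update rule and threshold with the measured quantities named above.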
The plots below show the predicted boundary, the observed parse-rate collapse, and the finite-budget shifts.
Preprint and verification materials for reproducing the reported aggregate numbers.
Preprint PDF with appendix, figures, and bibliography.
Scripts, configs, and aggregate metrics for checking the reported numbers.
BibTeX entry for the current preprint.
@misc{li2026listopd,
title = {The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs},
author = {Li, Xin and Jiang, Hao and Wang, Annan and Zhang, Yichi and Yuen, Chau},
year = {2026},
note = {Preprint}
}