On-policy distillation / structured outputs

ListOPD: The Extrapolation Cliff in OPD

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

Reward extrapolation can lift a student past its teacher, but on near-deterministic structured outputs it crosses a computable boundary where the output contract collapses.

Xin Li, Hao Jiang, Annan Wang, Yichi Zhang, Chau Yuen
Nanyang Technological University, Singapore

Preprint version. Project page: lixin.ai/ListOPD
Clip-safe geometry and Fashion validation showing a sharp parse-rate cliff around λ*.
The main diagnostic: the extrapolated fixed point exits the clip-safe region at the same scale where strict JSON parse validity collapses.

A boundary, not just a hyperparameter.

ListOPD turns a brittle lambda sweep into a measurable safety predicate for structured-output OPD.

λ* = 1.22: Closed-form base-neutral cliff marker from teacher modal probability and IS clip strength.
[1.204, 1.228]: Five-seed fine-grid onset interval on Fashion K=8 JSON listwise ranking.
33.5% → 94.8%: Strict parse rate moves from SFT to sub-threshold ListOPD at the operating point.
1.7B ≈ 8B: A 1.7B Qwen3 ListOPD student reaches in-domain parity with an 8B-SFT baseline.
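
As a rough illustration of how a strict parse rate like the one quoted above could be measured, the sketch below scores model outputs against an assumed contract: the whole output must be a JSON array containing a permutation of the integers 1..K. That contract, and the function names `is_strict_ranking` and `strict_parse_rate`, are assumptions made for this example; the exact schema used in the paper is not stated on this page.

```python
import json


def is_strict_ranking(text: str, k: int = 8) -> bool:
    """True if `text` is, in full, a JSON array that permutes 1..k.

    ASSUMPTION: this contract is illustrative; the paper's schema may differ.
    """
    try:
        value = json.loads(text)
    except (ValueError, TypeError):
        # Truncated output, surrounding prose, or a non-string input all fail.
        return False
    if not isinstance(value, list) or len(value) != k:
        return False
    # bool is a subclass of int in Python, so reject it explicitly.
    if any(isinstance(v, bool) or not isinstance(v, int) for v in value):
        return False
    return sorted(value) == list(range(1, k + 1))


def strict_parse_rate(outputs, k: int = 8) -> float:
    """Fraction of outputs that satisfy the assumed strict contract."""
    if not outputs:
        return 0.0
    return sum(is_strict_ranking(o, k) for o in outputs) / len(outputs)
```

Under this check, an output that wraps a valid list in extra prose, repeats an item, or stops before the closing bracket counts as a failure, which is what makes the metric sensitive to small structural-token errors.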

Mechanism

When lambda sharpens a near-deterministic structural token too far, the fixed point leaves the IS-clip-safe tail-mass region.
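
To make the sharpening effect concrete, here is a minimal numerical sketch. It assumes a geometric extrapolation between a base (warm-start) distribution and the teacher, restricted to two outcomes at a single structural position: the modal token and everything else. That functional form, the function name `extrapolated_tail_mass`, and the example probabilities are assumptions for illustration only; the paper's fixed point and its closed-form λ* are not reproduced here.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p) - math.log1p(-p)


def extrapolated_tail_mass(p_teacher: float, p_base: float, lam: float) -> float:
    """Tail mass (1 - modal probability) of the extrapolated target.

    ASSUMPTION: target q_lam is proportional to
    p_base**(1 - lam) * p_teacher**lam over two outcomes (modal vs. tail),
    so its log-odds are linear in lam.
    """
    z = logit(p_base) + lam * (logit(p_teacher) - logit(p_base))
    return 1.0 / (1.0 + math.exp(z))  # equals 1 - sigmoid(z)


# Illustrative values only: a near-deterministic teacher and a softer base.
for lam in (0.5, 1.0, 1.2, 1.5):
    print(lam, extrapolated_tail_mass(p_teacher=0.999, p_base=0.9, lam=lam))
```

In this toy model, lam = 1 recovers the teacher's tail mass and lam > 1 pushes it lower still, so the target's tail mass can fall below whatever floor the clipped update can still represent. Where exactly that floor sits is what the paper's analysis determines.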

Prediction

The cliff is predicted from measurable quantities: teacher modal probability, warm-start mass, and clipping strength.

Operating Rule

Choose lambda below the measured deployment-budget boundary; above it, format-preserving training becomes format-collapsing.
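
One way to turn this rule into a procedure is sketched below: given a multi-seed sweep of strict parse rates over a lambda grid, bracket the onset of collapse and pick the largest swept lambda that stays below it by a margin. The helper names `onset_interval` and `choose_lambda`, the 0.9 parse-rate threshold, and the margin are hypothetical choices for this example, not values from the paper.

```python
def onset_interval(sweep, min_parse_rate=0.9):
    """Bracket the cliff in a sweep of {lambda: [per-seed strict parse rates]}.

    Returns (last_safe, first_collapsed): the largest lambda below the first
    collapse at which every seed meets the threshold, and the smallest lambda
    at which any seed falls below it. Either entry may be None.
    ASSUMPTION: min_parse_rate is an illustrative threshold.
    """
    last_safe = None
    for lam in sorted(sweep):
        if min(sweep[lam]) < min_parse_rate:
            return last_safe, lam
        last_safe = lam
    return last_safe, None


def choose_lambda(sweep, min_parse_rate=0.9, margin=0.05):
    """Largest swept lambda that is safe and at least `margin` below the cliff."""
    last_safe, first_collapsed = onset_interval(sweep, min_parse_rate)
    if last_safe is None:
        return None  # nothing in the sweep is safe
    if first_collapsed is None:
        return last_safe  # no collapse observed inside the swept range
    ceiling = min(last_safe, first_collapsed - margin)
    candidates = [lam for lam in sweep if lam <= ceiling]
    return max(candidates) if candidates else None
```

Taking the worst seed at each grid point rather than the mean is a deliberately conservative choice here: near the cliff, a single collapsed seed is the signal of interest. The sweep should be run at the deployment budget, since the boundary observed at a different budget may not transfer.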

The evidence is visual.

The plots below show the predicted boundary, the observed parse-rate collapse, and the finite-budget shifts.

Cliff and size axis

Fashion cliff validation and size-axis results for ListOPD.
Strict parse validity changes sharply at the predicted lambda band; below the cliff, ListOPD flattens the size axis for the structured-output contract.

ListOPD pipeline

Diagram of the ListOPD training pipeline.
Student rollouts are trained against teacher token probabilities under reward-extrapolated OPD on listwise JSON outputs.

Finite-budget dynamics

Finite-budget dynamics of the parse-rate cliff.
At finite budget the observed boundary shifts, which motivates measuring the cliff at the deployment budget.

Budget shift

Budget extension test for the cliff location.
A budget-extension test lands inside its locked prediction window.

Clip mechanism

Multi-lambda mechanism plot for the IS-clipped extrapolation boundary.
The IS-clipped mechanism exposes where extrapolated targets move outside the safe region.

Paper and materials.

Preprint and verification materials for reproducing the reported aggregate numbers.

Paper PDF

Preprint PDF with appendix, figures, and bibliography.

Verification bundle

Scripts, configs, and aggregate metrics for checking the reported numbers.

Cite ListOPD.

BibTeX entry for the current preprint.

@misc{li2026listopd,
  title  = {The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs},
  author = {Li, Xin and Jiang, Hao and Wang, Annan and Zhang, Yichi and Yuen, Chau},
  year   = {2026},
  note   = {Preprint}
}