Self-distillation benefits from refined, synchronized context states: once earlier ambiguity in the context has been resolved, the model can treat its own sharper predictions as soft targets and recursively improve its outputs.
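As a minimal sketch of this idea, the snippet below distills the model's next-token distributions on the original, ambiguous context toward the distributions the same model produces on a refined, disambiguated version of that context. The function name, the assumption that `model(ids)` returns per-position logits of shape `(batch, seq, vocab)`, and the position-aligned KL objective are illustrative assumptions, not the source's exact method.

```python
import torch
import torch.nn.functional as F

def self_distill_loss(model, ambiguous_ids, refined_ids, temperature=2.0):
    """Hypothetical self-distillation step (names are illustrative): the
    model's own distributions over a refined, disambiguated context act as
    soft targets for the distributions it produces from the original
    ambiguous context at the same positions."""
    # Teacher pass on the refined context; detached so only the student learns.
    with torch.no_grad():
        teacher_logits = model(refined_ids)           # assumed (batch, seq, vocab)
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Student pass on the ambiguous context (same length, aligned positions).
    student_logits = model(ambiguous_ids)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)

    # Token-level KL divergence, averaged over batch and positions.
    kl = F.kl_div(student_logp, teacher_probs, reduction="none").sum(-1)
    return kl.mean() * temperature ** 2               # standard distillation scaling
```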
Synchronizing predicted representations with the ones already remembered in the cache makes it possible to identify weaker intermediate states and selectively overwrite them, keeping the KV cache compact and reducing inference cost.
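The sketch below illustrates one possible reading of this, under stated assumptions: "agreement" between a freshly predicted key and the remembered one is measured by cosine similarity, and cached positions whose agreement falls below a threshold are treated as weaker states and overwritten by the predicted representation. Both the scoring rule and the threshold are assumptions for illustration, not the source's rule.

```python
import torch
import torch.nn.functional as F

def selectively_overwrite_kv(cached_k, cached_v, pred_k, pred_v, agree_thresh=0.9):
    """Hypothetical selective KV-cache overwrite. All tensors are assumed to
    have shape (batch, heads, seq, head_dim); the agreement metric and
    threshold are illustrative assumptions."""
    # Agreement between predicted and remembered keys, per cached position.
    agreement = F.cosine_similarity(pred_k, cached_k, dim=-1)   # (batch, heads, seq)
    weak = (agreement < agree_thresh).unsqueeze(-1)             # broadcast over head_dim

    # Overwrite only the weak entries; strong entries keep their cached values.
    new_k = torch.where(weak, pred_k, cached_k)
    new_v = torch.where(weak, pred_v, cached_v)
    return new_k, new_v
```

Because only disagreeing positions are rewritten in place, the cache keeps its original size and layout; the saving comes from reusing already-allocated entries rather than appending or recomputing states for those positions.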