Diffusion-based models can be applied directly to a transformer's KV cache to smooth and denoise the stored representations. By treating the stacked key and value tensors as a high-dimensional latent state, a forward diffusion process gradually adds controlled noise, and a learned reverse process iteratively restores structure. The effect is a refinement of attention memory: spurious correlations are attenuated while consistent context signals are reinforced across layers and heads. The result is a more robust context snapshot for subsequent decoding, with less token-to-token instability and better coherence over long sequences.
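To make the framing concrete, below is a minimal sketch of the forward (noising) half of such a process, assuming a standard DDPM-style variance schedule and a per-layer KV latent stacked to shape (2, B, H, T, D); the function name, schedule, and shapes are illustrative assumptions, not part of any existing library.

```python
import torch

def forward_diffuse_kv(kv: torch.Tensor, t: int, betas: torch.Tensor):
    """Add Gaussian noise to a stacked KV latent at diffusion step t.

    kv    : (2, B, H, T, D) stacked key/value tensor treated as one latent state
    t     : integer diffusion step in [0, len(betas))
    betas : 1-D variance schedule, e.g. torch.linspace(1e-4, 2e-2, 1000)
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]   # cumulative signal retention at step t
    noise = torch.randn_like(kv)
    # Standard DDPM forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    noisy_kv = alpha_bar.sqrt() * kv + (1.0 - alpha_bar).sqrt() * noise
    return noisy_kv, noise                        # the returned noise is the training target
```

A denoising network trained on pairs produced this way learns to predict the added noise at each level, which is what the reverse process relies on at inference.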
In practice, snapshot the per-layer KV cache tensor (shape 2×B×H×T×D) from the transformer and feed it through a diffusion network, typically a U-Net or Transformer-Diffuser, trained to reverse Gaussian noise at a range of noise levels. At inference, apply a small number of reverse steps (e.g., 10–20) to each cache snapshot. The denoised KV states are then reinserted into the transformer's past_key_values before generating the next token. This inline diffusion pass attenuates residual noise in the attention states, re-aligns drifted head embeddings, and consolidates redundant context entries. Early experiments show improvements in factual consistency and fewer hallucinations, especially on tasks with noisy or partially corrupted context.
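A hedged sketch of what that inference-time pass might look like, assuming a trained `denoiser` that predicts noise from a KV latent and a step index, the same `betas` schedule used at training time, and the legacy tuple-of-(key, value) cache layout; only the final, low-noise reverse steps are applied, and all names here are hypothetical.

```python
import torch

@torch.no_grad()
def denoise_past_key_values(past_key_values, denoiser, betas, num_steps=10):
    """Run a short reverse-diffusion pass over each layer's KV snapshot.

    past_key_values : iterable of (key, value) pairs, each (B, H, T, D);
                      the legacy tuple layout of decoder caches is assumed
    denoiser        : trained model, denoiser(kv_latent, t) -> predicted noise
    betas           : the variance schedule the denoiser was trained with
    num_steps       : how many of the final reverse steps to run (e.g. 10-20)
    """
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    refined = []
    for key, value in past_key_values:
        kv = torch.stack([key, value], dim=0)   # (2, B, H, T, D), treated as a lightly noised latent
        for t in reversed(range(num_steps)):    # only the low-noise tail of the schedule
            eps = denoiser(kv, t)
            coef = betas[t] / (1.0 - alpha_bar[t]).sqrt()
            # DDPM posterior mean; the stochastic term is dropped for a deterministic refinement
            kv = (kv - coef * eps) / alphas[t].sqrt()
        refined.append((kv[0], kv[1]))
    return tuple(refined)
```

The returned tuple would then be passed back as past_key_values on the next decoding call, in place of the raw snapshot.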