tranSymbolics

KV Cache — Inference State Snapshot

| Name | What | Where | Type / Shape | Notes |
| --- | --- | --- | --- | --- |
| KV cache | Stored keys and values | `outputs.past_key_values` | List of (key, value) tuples | Per layer, per head |
| Token count | Number of tokens seen | `input_ids.shape[1]` | Integer | Used for position and mask |
| Position index | Positional embedding offset | Same as token count (usually) | Integer | May shift with rotary / learned positions |
| Attention mask | Visibility mask | Generated or implicit | Tensor (1 × token count) | Often all ones; may be automatic |
| Input buffer | Original processed tokens | `input_ids` | Tensor (1 × seq_len) | Used for replay / speculative generation |
| Model config | Architecture and geometry | `model.config.to_dict()` | Dict | Must match when restoring |
Saving a Transformer model's inference state requires a minimum of six items. First is the KV cache, which holds the stored keys and values for each layer and attention head. Second is the token count: how many tokens have been processed so far, which determines positional behavior. Third is the position index, usually equal to the token count unless a rotary or learned offset shifts it. Fourth is the attention mask, which prevents tokens from attending to future positions; it may be implicit or given explicitly as a tensor of ones. Fifth is the input token buffer, a copy of the tokens already processed, used for speculative decoding or context reconstruction. Sixth is the model configuration, including layer count and head geometry, which must match when the cache is restored.

To capture these during inference with Gemma 3, run the model with `use_cache=True`. The outputs then include `past_key_values`, which is the KV cache. Record `input_ids.shape[1]` to obtain the token count, and derive a position index from it if needed. If an explicit attention mask is required, create a tensor of ones with shape (1 × token count). Preserve the original `input_ids` as the input buffer, and extract the model configuration with `model.config.to_dict()`.

To resume, supply new tokens to the model along with the stored `past_key_values` and the other state elements. This allows seamless continuation of generation.
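The bookkeeping above can be sketched without loading a model. The snippet below is a minimal, framework-free illustration: `InferenceSnapshot`, `capture_snapshot`, and `check_restorable` are hypothetical names introduced here, and plain lists stand in for the tensors that a real forward pass with `use_cache=True` would return via `outputs.past_key_values`.

```python
from dataclasses import dataclass

@dataclass
class InferenceSnapshot:
    """The six items needed to checkpoint and later resume decoding."""
    kv_cache: list        # per-layer (key, value) pairs; outputs.past_key_values in practice
    token_count: int      # input_ids.shape[1] in practice
    position_index: int   # usually equals token_count; adjust for rotary/learned offsets
    attention_mask: list  # 1 x token_count, often all ones
    input_buffer: list    # copy of the processed token ids, for replay
    model_config: dict    # model.config.to_dict() in practice

def capture_snapshot(input_ids, past_key_values, config):
    """Build a snapshot from the values a cached forward pass produces."""
    n = len(input_ids)
    return InferenceSnapshot(
        kv_cache=past_key_values,
        token_count=n,
        position_index=n,              # the common case: position tracks token count
        attention_mask=[1] * n,        # explicit all-ones visibility mask
        input_buffer=list(input_ids),  # defensive copy of the input buffer
        model_config=dict(config),
    )

def check_restorable(snap, config):
    """Refuse to restore a cache into a model with mismatched geometry."""
    for key in ("num_hidden_layers", "num_attention_heads", "head_dim"):
        if snap.model_config.get(key) != config.get(key):
            raise ValueError(f"config mismatch on {key!r}")
    return True
```

The compatibility check mirrors the table's note that the model config "must match when restoring": a KV cache captured under one layer/head geometry cannot be fed back into a different one.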