KV Cache — Inference State Snapshot
| Name | What | Where | Type / Shape | Notes |
|---|---|---|---|---|
| KV cache | Stored keys and values | outputs.past_key_values | List of (key, value) tuples | Per layer, per head |
| Token count | Number of tokens seen | input_ids.shape[1] | Integer | Used for position and mask |
| Position index | Positional embedding offset | Same as token count (usually) | Integer | May shift with rotary / learned pos |
| Attention mask | Visibility mask | Generated or implicit | Tensor (1 × token count) | Often all ones; may be auto-generated |
| Input buffer | Original processed tokens | input_ids | Tensor (1 × seq_len) | Used for replay / speculative gen |
| Model config | Architecture and geometry | model.config.to_dict() | Dict | Must match when restoring |
The minimum required to save a Transformer model's inference state includes six items. First is the KV cache, which holds the stored keys and values per layer and attention head. Second is the token count, the number of tokens processed so far, which determines positional behavior. Third is the position index, usually equal to the token count unless a rotary or learned offset shifts it. Fourth is the attention mask, which marks which cached positions are visible to attention; for a single unpadded sequence it is often implicit or simply a tensor of ones, while the causal restriction on future positions is applied internally by the model. Fifth is the input token buffer, a copy of the tokens already processed, used for speculative decoding or context reconstruction. Sixth is the model configuration, including layer count and head geometry, which must match when restoring the cache.
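As a sketch of how these six items might travel together, the following packs them into a plain dictionary that can be serialized with torch.save. The function name and field layout are illustrative, not part of any library API; note also that recent transformers versions return the KV cache as a Cache object rather than a tuple of tuples, though both hold per-layer key/value tensors.

```python
import torch

def snapshot_inference_state(model, input_ids, past_key_values, attention_mask=None):
    """Bundle the six state items into one serializable dict (illustrative layout)."""
    token_count = input_ids.shape[1]
    if attention_mask is None:
        # Often all ones for a single unpadded sequence
        attention_mask = torch.ones(1, token_count, dtype=torch.long)
    return {
        "past_key_values": past_key_values,   # KV cache, per layer and head
        "token_count": token_count,           # number of tokens seen
        "position_index": token_count,        # usually equals the token count
        "attention_mask": attention_mask,     # visibility mask
        "input_ids": input_ids,               # input buffer for replay
        "config": model.config.to_dict(),     # must match when restoring
    }

# Example: state = snapshot_inference_state(model, input_ids, outputs.past_key_values)
#          torch.save(state, "inference_state.pt")
```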
To capture these during inference with Gemma 3, run the model with use_cache=True. The outputs then include past_key_values, which is the KV cache. Record input_ids.shape[1] to get the token count, and derive the position index from it if needed. If an explicit attention mask is required, create a tensor of ones of shape (1 × token count). Preserve the original input_ids as the input buffer, and extract the model configuration with model.config.to_dict(). To resume, feed only the new tokens to the model together with the stored past_key_values, extending the attention mask so it covers the cached tokens plus the new ones; position indices then continue from the stored token count, and generation picks up seamlessly.
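A minimal end-to-end sketch with the Hugging Face transformers API follows. The checkpoint name google/gemma-3-1b-it is an assumption (substitute whichever Gemma 3 variant you use), and greedy next-token selection stands in for a real decoding loop:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed checkpoint; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# --- Capture the state ---
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    outputs = model(input_ids, use_cache=True)

past_key_values = outputs.past_key_values                       # 1. KV cache
token_count = input_ids.shape[1]                                # 2. token count
position_index = token_count                                    # 3. position index
attention_mask = torch.ones(1, token_count, dtype=torch.long)   # 4. attention mask
input_buffer = input_ids                                        # 5. input buffer
config = model.config.to_dict()                                 # 6. model config

# --- Resume generation from the stored state ---
# Feed only the new token; the cache already covers everything before it.
next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
resume_mask = torch.ones(1, token_count + 1, dtype=torch.long)  # cached + new
with torch.no_grad():
    outputs = model(
        input_ids=next_token,
        past_key_values=past_key_values,
        attention_mask=resume_mask,
        use_cache=True,
    )
```

On restore from disk, comparing the saved config against model.config.to_dict() is a cheap guard against loading a cache into a model with a different layer count or head geometry.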