KV Cache — Inference State Snapshot
| Name | What | Where | Type / Shape | Notes |
|---|---|---|---|---|
| KV cache | Stored keys and values | outputs.past_key_values | List of (key, value) tuples | Per layer, per head |
| Token count | Number of tokens seen | input_ids.shape[1] | Integer | Used for position and mask |
| Position index | Positional embedding offset | Same as token count (usually) | Integer | May shift with rotary / learned pos |
| Attention mask | Visibility mask | Generated or implicit | Tensor (1 × token count) | Often all ones; may be auto-generated |
| Input buffer | Original processed tokens | input_ids | Tensor (1 × seq_len) | Used for replay / speculative gen |
| Model config | Architecture and geometry | model.config.to_dict() | Dict | Must match when restoring |
The minimum required to save a Transformer model's inference state includes six items. First is the KV cache, which holds the stored keys and values per layer and attention head. Second is the token count, the number of tokens processed so far, which determines positional behavior. Third is the position index, usually equal to the token count unless a rotary or learned offset shifts it. Fourth is the attention mask, which marks which cached positions are visible to attention; for a single unpadded sequence it is often implicit or simply a tensor of ones, while the causal restriction on future positions is applied internally by the model. Fifth is the input token buffer, a copy of the tokens already processed, used for speculative decoding or context reconstruction. Sixth is the model configuration, including layer count and head geometry, which must match when restoring the cache.
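As a sketch of how these six items might travel together, the following packs them into a plain dictionary that can be serialized with torch.save. The function name and field layout are illustrative, not part of any library API; note also that recent transformers versions return the KV cache as a Cache object rather than a tuple of tuples, though both hold per-layer key/value tensors.

```python
import torch

def snapshot_inference_state(model, input_ids, past_key_values, attention_mask=None):
    """Bundle the six state items into one serializable dict (illustrative layout)."""
    token_count = input_ids.shape[1]
    if attention_mask is None:
        # Often all ones for a single unpadded sequence
        attention_mask = torch.ones(1, token_count, dtype=torch.long)
    return {
        "past_key_values": past_key_values,   # KV cache, per layer and head
        "token_count": token_count,           # number of tokens seen
        "position_index": token_count,        # usually equals the token count
        "attention_mask": attention_mask,     # visibility mask
        "input_ids": input_ids,               # input buffer for replay
        "config": model.config.to_dict(),     # must match when restoring
    }

# Example: state = snapshot_inference_state(model, input_ids, outputs.past_key_values)
#          torch.save(state, "inference_state.pt")
```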
To capture these during inference with Gemma 3, run the model with use_cache=True. The outputs then include past_key_values, which is the KV cache. Record input_ids.shape[1] to get the token count, and derive the position index from it if needed. If an explicit attention mask is required, create a tensor of ones of shape (1 × token count). Preserve the original input_ids as the input buffer, and extract the model configuration with model.config.to_dict(). To resume, feed only the new tokens to the model together with the stored past_key_values, extending the attention mask so it covers the cached tokens plus the new ones; position indices then continue from the stored token count, and generation picks up seamlessly.
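A minimal end-to-end sketch with the Hugging Face transformers API follows. The checkpoint name google/gemma-3-1b-it is an assumption (substitute whichever Gemma 3 variant you use), and greedy next-token selection stands in for a real decoding loop:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed checkpoint; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# --- Capture the state ---
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    outputs = model(input_ids, use_cache=True)

past_key_values = outputs.past_key_values                       # 1. KV cache
token_count = input_ids.shape[1]                                # 2. token count
position_index = token_count                                    # 3. position index
attention_mask = torch.ones(1, token_count, dtype=torch.long)   # 4. attention mask
input_buffer = input_ids                                        # 5. input buffer
config = model.config.to_dict()                                 # 6. model config

# --- Resume generation from the stored state ---
# Feed only the new token; the cache already covers everything before it.
next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
resume_mask = torch.ones(1, token_count + 1, dtype=torch.long)  # cached + new
with torch.no_grad():
    outputs = model(
        input_ids=next_token,
        past_key_values=past_key_values,
        attention_mask=resume_mask,
        use_cache=True,
    )
```

On restore from disk, comparing the saved config against model.config.to_dict() is a cheap guard against loading a cache into a model with a different layer count or head geometry.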