var | description | insight # |
---|---|---|
mod | Transformer model object returned by `AutoModelForCausalLM`. It encapsulates the entire causal language model: token embeddings, transformer blocks, and the final LM head. When loaded with `from_pretrained()`, it carries the learned weights and configuration of the selected variant (e.g. Gemma 3 1B or 4B). All forward passes, whether the dummy initialization pass or actual inference, are routed through this object. Central to runtime behavior, especially for capturing and restoring past key/values (`pkv`); a loading sketch follows the table. | 1 |
tok | AutoTokenizer instance used to encode string-based chat input into token IDs and decode token predictions back into text. Applies HuggingFace chat templates, manages special tokens such as `<bos>`, `<eos>`, and `<end_of_turn>`, and provides token ID mappings such as `pad_token_id` and `eos_token_id`. The tokenizer must match the model family to ensure compatibility of token vocabulary and prompt formatting. | 3 |
maxlen | Maximum token sequence length used during dummy initialization (1024 tokens). Although the model accepts variable-length input, this fixes the shape of the padded input that triggers KV cache allocation. It is the length of `dummyids` and directly determines the memory shape of the attention layers on first use. | 3 |
padid | Token ID used for padding during dummy input creation. Initially taken from `tok.pad_token_id`. If undefined, it falls back to `tok.eos_token_id`. Used to construct filler token arrays that prime the model’s KV cache during `torch.no_grad()` initialization, helping to allocate all necessary memory buffers for subsequent context reuse. | 1 |
pkv | Past key/values captured from a dummy model forward pass. This tuple of tensors holds intermediate attention values for every layer of the transformer and is what enables fast continuation of text generation. Each `forward()` call consumes and updates `pkv`. Correct cache management is essential for inference speed, token alignment, and multi-turn continuity: if `pkv` is out of sync with the current prompt position (`cp`), the model generates incoherent or broken output. Key component of cache reuse behavior (see the cache-priming sketch after the table). | 1,4 |
chat | List of alternating user and model messages, each represented as a dictionary with `role` and `content`. This structure feeds into `tok.apply_chat_template()`, enabling prompt formatting consistent with chat models. It grows over time, preserving all dialog state needed for prompt regeneration and progressive context extension. | 3 |
turns | Static list of predefined user prompts used for simulating a back-and-forth chat. These test context accumulation and model memory, including cross-reference prompts (e.g. asking for earlier variables). Appended to `chat` turn by turn. | 3 |
cp | Global token cursor that tracks how many input tokens have been sent to the model. Used to slice off new prompt tokens and to generate the absolute position tensor (`pos`) that aligns with attention memory. Without a correct `cp`, the model's `cache_position` desynchronizes, leading to incoherent output, misaligned responses, or KV reuse errors, so every token sent or sampled must advance `cp` accordingly. It defines the left edge of `newids`, sets the range for `pos`, and is incremented at each generation step (see the per-turn sketch after the table). | 2 |
dev | String representing compute target (e.g. `"cpu"` or `"cuda"`). Used when loading the model and when transferring token tensors and position indices to match the execution context. Affects speed and float precision availability. | — |
dtype | Selected PyTorch floating-point precision: `torch.float32`, `torch.float16`, or `torch.bfloat16`. Determines numerical fidelity and memory use during model inference. Passed during model load via `torch_dtype=dtype`. | — |
modpath | Resolved identifier for model to be loaded, either a snapshot directory or remote hub string (e.g. `google/gemma-3-4b-it`). Selected based on hostname to allow per-machine configuration. | — |
mp | Dictionary mapping model size keys (like `"1b"` or `"4b"`) to specific snapshot paths or identifiers. Provides a clean indirection layer for selecting the desired model variant. | — |
dp | Maps friendly device labels (`"cpu"`/`"gpu"`) to strings PyTorch expects (`"cpu"`/`"cuda"`). Simplifies platform branching logic. | — |
tp | Maps float type labels (`"f32"`, `"f16"`, etc.) to actual PyTorch `torch.float*` types. Used for model loading. | — |
rng | Tuple representing shape of dummy input token tensor: `(1, maxlen)`. Defines single-batch input of length 1024. | 4 |
dummyids | Tensor of shape `[1, maxlen]` filled with the `padid` token. This dummy input is sent to the model during setup to force initialization of internal attention layers and preallocate `past_key_values`. Required to bootstrap cache structure. | 4 |
r | Model output from the dummy forward call, used only to extract `past_key_values` once. Contains `logits`, `past_key_values`, and other optional outputs depending on model config. | 4 |
promptids | Token tensor returned by `apply_chat_template()`, containing full chat history as a flat input sequence. Used to feed the model on each turn. | 3 |
newids | Subset of `promptids`, starting from current position `cp`. Represents newly added input tokens for this turn. This slice is passed into `mod()` for generation. | 3 |
newlen | Width (number of tokens) in the current `newids` slice. Used to build `pos` and update `cp` after generation. | 2 |
pos | Tensor of absolute token positions from `cp` up to (but not including) `cp + newlen`. Passed to the model as `cache_position` to preserve positional encoding across cached inputs. Critical for correct reuse of the KV cache across turns. | 2 |
out | Output object returned from model forward pass, typically containing `logits`, `past_key_values`, and potentially other fields like `attentions`. Used to extract prediction logits for next token sampling. | 4 |
outstr | String buffer that accumulates generated output token-by-token during the autoregressive sampling loop, appended one decoded token at a time. | 3 |
i | Loop index variable in generation loop. Used to limit token count to 60 per reply to avoid runaway generation. | 4 |
t | Multipurpose temp variable reused for both `time.time()` and `out.logits`; context determines which. Reusing one name for unrelated values is fragile, but tolerated here. | — |
nxt | Token ID with the highest probability in the model's output logits, computed by greedy argmax. It is decoded and fed back into the model loop (see the decoding-loop sketch after the table). | 4 |
tokstr | String representation of `nxt`, derived using `tok.decode()`. If it matches `<end_of_turn>`, generation halts early. | 3 |
v | Single-token tensor reshaped to 2D `[1,1]` for model compatibility. Used in loop for single-token stepwise decoding. | 4 |
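The sketches below tie the variables together. First, loading: a minimal sketch of how `mod`, `tok`, `dev`, `dtype`, and the `mp`/`dp`/`tp` lookup maps plausibly fit together. The map contents beyond `google/gemma-3-4b-it`, and the hostname-based selection, are assumptions, not the original values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

dp = {"cpu": "cpu", "gpu": "cuda"}           # friendly device label -> PyTorch device string
tp = {"f32": torch.float32, "f16": torch.float16, "bf16": torch.bfloat16}
mp = {"4b": "google/gemma-3-4b-it"}          # "1b" entry and local snapshot paths elided

dev = dp["cpu"]                              # hostname-based branching elided
dtype = tp["f32"]
modpath = mp["4b"]

tok = AutoTokenizer.from_pretrained(modpath)
mod = AutoModelForCausalLM.from_pretrained(modpath, torch_dtype=dtype).to(dev)
mod.eval()                                   # inference only; disables dropout
```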
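Next, the dummy forward pass that primes `pkv`, mirroring the rows for `maxlen`, `padid`, `rng`, `dummyids`, and `r`. One caveat: with the default dynamic cache the pad-token entries would simply accumulate, so the overwrite-in-place behavior implied by `cache_position` assumes a preallocated, static-style cache.

```python
maxlen = 1024
padid = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id

rng = (1, maxlen)                                 # single batch of maxlen tokens
dummyids = torch.full(rng, padid, dtype=torch.long, device=dev)

with torch.no_grad():
    r = mod(input_ids=dummyids, use_cache=True)   # forces allocation of attention buffers
pkv = r.past_key_values                           # reused as the cache on later turns
```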
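Then one conversational turn, continuing from the sketches above: re-template the whole `chat`, slice off the tokens the model has already seen (tracked by `cp`), and pass absolute positions as `cache_position`. The prompts in `turns` here are illustrative stand-ins, not the original test prompts.

```python
chat = []
turns = ["Let x = 7.", "What did I set x to?"]    # illustrative stand-ins
cp = 0                                            # global token cursor

for user_msg in turns:
    chat.append({"role": "user", "content": user_msg})
    promptids = tok.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(dev)

    newids = promptids[:, cp:]                    # only the unseen suffix
    newlen = newids.shape[1]
    pos = torch.arange(cp, cp + newlen, device=dev)

    with torch.no_grad():
        out = mod(input_ids=newids, past_key_values=pkv,
                  cache_position=pos, use_cache=True)
    pkv = out.past_key_values
    cp += newlen
    # ...decoding loop (next sketch) runs here and appends the reply to `chat`...
```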
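Finally, the greedy decoding loop elided above, continuing from `out`, `pkv`, `cp`, and `dev`. The 60-token cap comes from the table; the `<end_of_turn>` stop string is an assumption based on Gemma's chat format.

```python
import time

t = time.time()                                   # `t` is reused below for out.logits
outstr = ""
for i in range(60):                               # hard cap per reply
    t = out.logits                                # shape [1, seq_len, vocab]
    nxt = int(t[0, -1].argmax())                  # greedy argmax over the last position
    tokstr = tok.decode(nxt)
    if tokstr == "<end_of_turn>":                 # assumed Gemma stop marker
        break
    outstr += tokstr

    v = torch.tensor([[nxt]], device=dev)         # reshape to [1, 1] for the model
    pos = torch.arange(cp, cp + 1, device=dev)
    with torch.no_grad():
        out = mod(input_ids=v, past_key_values=pkv,
                  cache_position=pos, use_cache=True)
    pkv = out.past_key_values
    cp += 1

chat.append({"role": "model", "content": outstr}) # preserve dialog state for the next turn
```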