This document captures the complete development progression of the context save/load system. The problem was defined as: make transformer conversational memory persist across runs. It has now been solved.
`g123.py` is the final working program. It fully saves the past_key_values (PKV) cache, chat history, token positions, and logits, and restores them robustly without hidden assumptions. Restoration works under both `uc=False` and `uc=True`.
Main structure:

- `initmodel()`: loads tokenizer and model, captures the blank PKV type.
- `initsession()`: fresh session with primed cache and reset chat, if needed.
- `contextsave()`: stores cache tensors, prompt, new token slices, logits, chat.
- `contextload()`: loads all data, rehydrates PKV via dummy forward pass.
- `runturn()`: tokenizes, updates `cp`, generates with position-aware cache.

The cache is fixed in size and cannot be dynamically extended beyond its preset length. Attempting to overrun it results in failure or truncation. This principle forces discipline around preallocation and input slicing.
Before use, the cache must be created at the maximum anticipated length. A dummy pass at startup allocates this capacity, so later appends do not break alignment or size assumptions.
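A minimal sketch of that startup dummy pass, with a toy model standing in for the real one so it runs without weights (function and model names are illustrative, not g123.py's exact API):

```python
import torch

def prime_cache(model, pad_id, max_len):
    # One forward pass over max_len pad tokens forces the model to build
    # a PKV cache at full anticipated capacity.
    dummy = torch.full((1, max_len), pad_id, dtype=torch.long)
    with torch.no_grad():
        out = model(input_ids=dummy, use_cache=True)
    return out.past_key_values

class ToyLM(torch.nn.Module):
    """Stand-in causal LM: returns a PKV-shaped cache without real weights."""
    def forward(self, input_ids, use_cache=True):
        b, t = input_ids.shape
        k = torch.zeros(b, 1, t, 4)          # (batch, heads, seq, head_dim)
        out = type("Out", (), {})()          # minimal output container
        out.past_key_values = [(k, k.clone())]
        return out

pkv = prime_cache(ToyLM(), pad_id=0, max_len=16)
```

With a real Hugging Face model the same call shape applies; the returned cache then has `max_len` slots per layer.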
All conversational inputs must be passed through the tokenizer's `apply_chat_template` method. This preserves the expected token structure and speaker roles, ensuring model attention aligns with the dialogue format.
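The real call belongs to the tokenizer; the toy formatter below only illustrates what such a template preserves (role markers around each turn, plus a generation prompt), using made-up `<|...|>` tokens:

```python
def toy_chat_template(messages, add_generation_prompt=True):
    # Wrap each turn in role markers so the model can separate speakers;
    # optionally open an assistant turn for the model to complete.
    text = "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)
    if add_generation_prompt:
        text += "<|assistant|>"
    return text

prompt = toy_chat_template([{"role": "user", "content": "Hello"}])
```

Real code should call `tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")` so the markers match the checkpoint's training format rather than any hand-rolled scheme.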
The cache alone is not enough. A full session includes the chat log, the token position `cp`, and the cache tensors. If any part is mismatched (e.g. the cache is loaded but `cp` is reset), the system enters a logically invalid state. The solution was a strict separation of `initmodel()` and `initsession()`.
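One way to keep the three pieces in lockstep is to save and check them as a unit. A sketch (field and function names are illustrative, not g123.py's): the consistency check catches exactly the loaded-cache-but-reset-`cp` state.

```python
import torch
from dataclasses import dataclass

@dataclass
class Session:
    chat: list   # full message log
    cp: int      # current token position
    kv: list     # list of (key, value) tensor pairs, one per layer

    def check(self):
        # cp must equal the sequence length actually held in the cache;
        # a mismatch means generation would attend to stale or missing positions.
        seq = self.kv[0][0].shape[2] if self.kv else 0
        if self.cp != seq:
            raise ValueError(f"cp={self.cp} but cache holds {seq} positions")

def save_session(s, path):
    s.check()
    torch.save({"chat": s.chat, "cp": s.cp, "kv": s.kv}, path)

def load_session(path):
    d = torch.load(path)
    s = Session(d["chat"], d["cp"], d["kv"])
    s.check()                  # refuse to resume an inconsistent session
    return s
```

Checking on both save and load means an invalid state is rejected at the boundary rather than surfacing later as garbled generation.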
Only the model can define a valid empty cache object. Do not impose previously saved structure directly; instead, generate a new valid cache via a dummy pass, then inject the saved tensors into that shell. This is what `g123.py` does, and it is why it succeeds.
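A sketch of that rehydration pattern, again with a toy model standing in for the real one (illustrative of what `contextload()` does, not its exact code):

```python
import torch

def rehydrate(model, pad_id, saved_kv):
    # Let the model build a valid cache shell via a dummy pass sized to the
    # saved sequence length, then copy the saved tensors into that shell.
    saved_len = saved_kv[0][0].shape[2]
    dummy = torch.full((1, saved_len), pad_id, dtype=torch.long)
    with torch.no_grad():
        shell = model(input_ids=dummy, use_cache=True).past_key_values
    for (k_dst, v_dst), (k_src, v_src) in zip(shell, saved_kv):
        k_dst.copy_(k_src)   # in-place copy preserves the model-built object
        v_dst.copy_(v_src)
    return shell

class ToyLM(torch.nn.Module):
    """Stand-in causal LM returning a PKV-shaped cache."""
    def forward(self, input_ids, use_cache=True):
        b, t = input_ids.shape
        k = torch.zeros(b, 1, t, 4)
        out = type("Out", (), {})()
        out.past_key_values = [(k, k.clone())]
        return out

saved = [(torch.full((1, 1, 6, 4), 2.0), torch.full((1, 1, 6, 4), 3.0))]
pkv = rehydrate(ToyLM(), pad_id=0, saved_kv=saved)
```

The in-place `copy_` is the key move: the object the model built stays the cache, and only its contents are replaced by the saved state.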
With `g123.py`, restoration works under all tested conditions, confirming all five pillars above. The system can now perform real chat with persistent memory.