tranSymbolics

Using the Gemma 3 KV Cache for Multi-turn Chat

Introduction

The pursuit of local, private, and performant large language models (LLMs) is a significant area of development. Moving from simple text generation to interactive, stateful conversation raises one primary challenge: efficiency. A naive chat implementation, which re-evaluates the entire conversation history for every new message, becomes unusably slow as the history grows. The key to performant chat is state management. This paper documents the iterative process of building a robust chat script around the model's internal Key-Value (KV) cache, and the cache-management challenges that had to be solved to reach a functional, efficient result.

A transformer model's performance relies on its ability to calculate attention scores between tokens. In a conversation, most tokens from previous turns remain constant, so re-calculating their attention states on every turn is computationally wasteful. The KV cache prevents this by storing each layer's keys and values in a cache object (past_key_values, or pkv). On subsequent inputs, the cache is passed back in alongside only the new tokens, letting the model reuse the stored state for old tokens and compute state only for the new ones, which sharply reduces processing time.
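
A minimal sketch of this mechanism, assuming mod and tok are a Hugging Face causal LM and its tokenizer already loaded (as in the configuration step below) and using an arbitrary example sentence:

import torch

with torch.no_grad():
  # First pass: run the existing context once and keep its per-layer keys/values.
  ctx_ids = tok("The capital of France is Paris.", return_tensors="pt").input_ids.to(mod.device)
  pkv = mod(input_ids=ctx_ids, use_cache=True).past_key_values
  # Later pass: feed only the new tokens and hand the cache back in; the model
  # reuses the stored state for the old tokens instead of recomputing it.
  new_ids = tok(" And the capital of Italy?", add_special_tokens=False, return_tensors="pt").input_ids.to(mod.device)
  out = mod(input_ids=new_ids, use_cache=True, past_key_values=pkv)

Keeping the positions of those new tokens aligned with the cache is exactly the bookkeeping the rest of this paper is about.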

The primary technical hurdle was managing the StaticCache, which requires the cache tensor to be allocated to its maximum size initially. Failure to do so results in an IndexError when the conversation length exceeds the initial allocation. A one-time dummy forward pass with an input tensor of the desired maximum length initializes the cache correctly. Additionally, instruction-tuned models require a strict chat template to ensure conversational behavior.

Final Program Implementation (g99.py)

The final script, g99.py, is a concise implementation of the caching strategy, incorporating the three pillars: respecting the Static Cache, executing a Cache Priming Pass, and adhering to the Mandate of the Chat Template.

Configuration and Loading

tok = AutoTokenizer.from_pretrained(modpath)
mod = AutoModelForCausalLM.from_pretrained(modpath, torch_dtype=dtype, device_map={"": dev})

Cache Pre-allocation

A dummy tensor of max_len (e.g., 1024) filled with the padding token ID is created. A single forward pass with use_cache=True populates the pkv object with correctly-sized, neutral cache tensors.

max_len = 1024
pad_id = tok.pad_token_id if tok.pad_token_id is not None else tok.eos_token_id
with torch.no_grad():
  dummy_ids = torch.full((1, max_len), pad_id, dtype=torch.long, device=dev)
  pkv = mod(input_ids=dummy_ids, use_cache=True).past_key_values

Conversation Loop

The script iterates through a list of turns, managing state and processing incrementally.

State Management

A chat list holds the conversation history in the required role/content format. cp (cache position) tracks the current context length.
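
A minimal sketch of that state, using the same names as the script (the example turn is taken from the script's turns list):

chat = []   # conversation history as {"role": ..., "content": ...} dicts, in order
cp = 0      # cache position: total number of tokens already written into the KV cache

chat.append({"role": "user", "content": "Let's start by defining a variable k equal to 5. What is k?"})
# after this turn's forward pass, cp advances by the number of newly processed tokens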

Incremental Processing

chat.append({"role": "user", "content": t})prompt_ids = tok.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(dev)new_ids = prompt_ids[:, cp:]new_len = new_ids.shape[1]pos = torch.arange(cp, cp + new_len, device=dev)with torch.no_grad():  out = mod(input_ids=new_ids, use_cache=True, past_key_values=pkv, cache_position=pos)cp += new_len

Token-by-Token Generation

for i in range(30):
  nxt = torch.argmax(out.logits[:, -1, :], dim=-1)   # greedy pick of the next token from the last logits
  pos = torch.tensor([cp], device=dev)               # absolute position of this single token
  out = mod(input_ids=nxt.view(1, 1), use_cache=True, past_key_values=pkv, cache_position=pos)
  cp += 1                                            # every sampled token advances the cursor

History Update

The complete model response is appended to the chat list, preparing the state for the next turn.
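
In the script this is a single append that mirrors the user turns, with outstr holding the decoded reply accumulated during sampling:

chat.append({"role": "model", "content": outstr})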

Conclusion

This implementation demonstrates an efficient, stateful conversational AI system. By leveraging the KV cache, priming it correctly, and adhering to the chat template, the script achieves performance improvements over naive re-processing. The performance difference between 1B and 4B model variants highlights the relationship between parameter count and reasoning capabilities.

Reflective Analysis: Why This Was Hard

The development of a performant, local, and stateful conversational AI faced non-obvious pitfalls. While the KV cache concept is well-known, its practical implementation for instruction-tuned models like Gemma 3 presented challenges that misled even experienced developers.

The Allure and the Abyss

The promise of a local chatbot—offering privacy, no API costs, and fast responses—was compelling. However, the naive approach of appending each new user input to the history and re-feeding the entire history to the model became computationally untenable as the conversation grew.

Chronicle of Failures

The recurring insight across these failures was that the errors stemmed from misunderstanding the model's architecture, not from bugs in the surrounding code. Success required working with the model's own mechanisms (its cache, its positions, and its template) rather than imposing external solutions on top of it.

The Three Pillars of Success

  1. Respect the Static Cache: Gemma 3’s StaticCache allocates memory on the first call. Subsequent inputs exceeding this size fail.
  2. Cache Priming Pass: A dummy input tensor of the desired maximum length (e.g., 1024) filled with pad_token_id initializes a large, neutral cache.
  3. Mandate of the Chat Template: Instruction-tuned models require structured input via apply_chat_template to interpret conversational roles correctly.
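
As a concrete illustration of the third pillar, the sketch below pushes a short history through apply_chat_template; the roles follow g99.py, and the marker tokens named in the comment are those used by Gemma-style instruction templates.

history = [
  {"role": "user", "content": "Define k = 5. What is k?"},
  {"role": "model", "content": "k is 5."},
  {"role": "user", "content": "Now set m = k * 11. What is m?"},
]
prompt_ids = tok.apply_chat_template(history, add_generation_prompt=True, return_tensors="pt")
# The template, not manual string concatenation, inserts the turn markers the
# instruction-tuned model expects (<start_of_turn>user ... <end_of_turn>, then an
# open <start_of_turn>model turn for the reply).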

Conclusion

The journey from failure to fluidity revealed that efficiency requires precise alignment with the model’s internals. The three pillars—respecting the cache, priming it correctly, and using the chat template—transformed a broken script into a robust solution, highlighting the importance of understanding model architecture over fighting it.

Big 8 + 4: Transformer Elements and Insights

Big 8 Alignment Table

Big 8 Element | Required Behavior | Gyrator Insight Ref(s)
KV Cache | Must be captured, updated, and aligned to every token and position. Immutable if reused. | 1, 4, 7, 11, 21, 30, 40, 44
Token Embeddings | Must remain in sync with tokenizer; out-of-sync embeddings cause hallucinations. | 2, 5, 42, 43, 50
Positional State | Absolute position must be correct and continuous. Matches history and cache shape. | 2, 3, 6, 16, 28, 33, 34, 35, 44
Input Token Buffer | Reflects every submitted token in correct order for context replay. | 12, 17, 22, 23, 27
Attention Mask | Matches token shape and preserves causal behavior. Affected by pad tokens. | 13, 14, 24, 49
Model Config Snapshot | Ensures invariant head counts, layer structure, and generation options. | 10, 20, 45, 52
LayerNorm State | Static, but adaptive variants require snapshot to avoid logit drift. | 10, 46
Attention Module Weights | Must not be tuned mid-inference to avoid state mismatch. | 41, 44, 53

+4 Snapshot Extensions

+4 Element | Why It Matters | Insight Tie-ins
Logits | Validates replay; divergence indicates incorrect context restoration. | 41, 46, 47
Tokenizer State | Must match training vocab and special token rules to preserve token identity. | 9, 12, 36, 42
KV Format Version | Format mismatches between model versions invalidate reuse attempts. | 1, 19, 35, 58
Prompt Injection Meta | Prefix tokens shape behavior; loss or duplication skews intent and position. | 25, 36, 48
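
Read together, the two tables describe what a complete context snapshot would have to carry. The sketch below is one hypothetical way to bundle those elements using the variables of g99.py; the field names, and the idea of collecting them in a single record, are illustrative rather than anything the script defines.

snapshot = {
  # Big 8
  "kv_cache": pkv,                         # past_key_values from the most recent forward pass
  "embedding_checkpoint": modpath,         # ties the token embedding table to the exact weights loaded
  "positional_state": cp,                  # absolute token cursor that feeds cache_position
  "input_token_buffer": promptids,         # every submitted token, in order, for context replay
  "attention_mask": None,                  # mask matching the token buffer, if one is tracked separately
  "model_config": mod.config.to_dict(),    # head counts, layer structure, generation options
  "layernorm_state": None,                 # only needed for adaptive LayerNorm variants
  "attention_weights_ref": None,           # e.g. a weights hash, to guard against mid-inference tuning
  # +4 extensions
  "last_logits": out.logits[:, -1, :],     # replay validation: divergence flags bad restoration
  "tokenizer_state": tok.name_or_path,     # vocabulary and special-token identity
  "kv_format_version": None,               # cache layout version for this model family
  "prompt_injection_meta": None,           # any prefix tokens that shape behavior
}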

60 Gyrator Insights

Each insight below captures a critical behavior, failure mode, or requirement in transformer-based context restoration. These are drawn from real-world debugging of the g99.py system and its live use of KV cache in multi-turn inference.

  1. "KV cache is transient and model-owned." You do not control its lifecycle. Every forward pass rewrites it. Holding pkv outside the model without explicit capture means you’re using outdated memory.
  2. "cache_position must match absolute token position." This is not relative to the input slice, but to the full, cumulative position since cache start. Feeding mismatched cache_position leads to deeply broken attention behavior without error.
  3. "Prompt length is not fixed; cp must track tokens, not turns." Using turn count or line number is a fallacy—token count expands unpredictably, especially with template wrapping. Misestimating this creates irreversible misalignment.
  4. "Every forward pass updates the internal cache." Even 1-token sampling alters past_key_values. If not reassigned, continued use reflects past state—not the updated context.
  5. "Variable names are not remembered—only patterns." There's no symbol table. Variables like k=5 aren't stored semantically; the model only reuses textual patterns that look like memory.
  6. "Sampling must also advance the context position." Each generated token mutates state and increases position. Failing to increment cp after sampling yields divergent cache behavior.
  7. "Dummy pkv from pad tokens is misleading." Initializing with uniform tokens like [PAD]*1024 doesn’t simulate real attention. Rotary or ALiBi buffers may misconfigure based on dummy inputs.
  8. "Context errors are silent and look valid." Transformers will not raise exceptions on misaligned pkv or position—output continues, but semantic coherence is lost subtly and irreversibly.
  9. "Tokenizer inflation breaks assumptions." Turn-by-turn prompts grow larger than expected due to token expansion, emoji encoding, and role prefixes. Visual turn count ≠ actual token count.
  10. "Precision mismatches cause drift." Using float16 on CPU coerces to float32, breaking numerical parity between init and sampling. These effects compound across turns.
  1. "pkv is not deep-copied by default." Standard Python assignment (pkv2 = pkv) keeps tensor references live—so any mutation during forward pass affects the original unexpectedly.
  2. "Chat templates hide growing prompt cost." Templates like apply_chat_template() inject formatting tokens that expand with each turn. Re-tokenizing history causes unpredictable sequence inflation.
  3. "Input tokens affect model cache shape." Even simple inputs may trigger different cache structures. [PAD], <bos>, or unknown tokens interact with model internals in complex ways.
  4. "Model hallucination mimics memory." A coherent answer may look like recall, but is often just pattern-matching from training—not actual memory reconstruction.
  5. "Hardcoded cp offsets are brittle." Using slicing tricks like prompt_ids[:,cp:] assumes token alignment from prompt regeneration—but this breaks unless token count is manually confirmed.
  6. "Sampling step lags output if cp isn't advanced in-place." Generated tokens need their own cache_position. If cp is not incremented per sample, positional encoding and rotary drift silently.
  7. "Initial pkv must reflect first real input, not dummy." Cache primed with dummy data leads to structurally invalid restoration later. First real prompt should establish cache structure.
  8. "Chat history has non-linear token growth." Short-looking turns may tokenize into long sequences due to formatting, encoding artifacts, or template-induced overhead.
  9. "Cache drift takes time to surface." You may see apparent success for a few turns, but each turn compounds any misalignment. Failures are delayed, not immediate.
  10. "No warning exists for invalid cache_position." Model APIs accept incorrect position values silently. There's no validation layer—only broken inference downstream.
  11. "Sampling loop must reassign pkv every step." Each new token modifies the full cache state. If you don’t rebind the return value of mod(...), you're discarding continuity.
  12. "Prompt truncation is not symmetric." When reaching model limits, the truncation logic may drop context unevenly—e.g., biasing removal toward earlier turns.
  13. "Regenerating prompt overwrites inference state." If you rebuild the prompt each time and ignore prior cache, the model always starts from scratch—no internal continuity persists.
  14. "Device mismatch causes invisible failures." If your tokens are on cuda but cache on cpu, PyTorch may silently cast or copy behind the scenes—breaking timing and tensor integrity.
  15. "Chat templates are lossy to intent." While visually organizing turns, templates distort raw text flow, introduce noise tokens, and affect model pattern recognition—especially for logic-heavy queries.
  16. "Model's sense of time is positional, not semantic." It has no concept of clock or conversation time—only relative position. Removing or reordering tokens breaks continuity.
  17. "Re-tokenization isn't idempotent under prompt growth." Tokenizing a growing prompt may shift all downstream positions. Even identical turns will have different token offsets across calls.
  18. "Rotary embeddings are position-coupled and not recoverable." Once position IDs have moved, you can’t start mid-cache unless all prior state—including rotary phase—is known.
  19. "KV length != attention window." Even if past_key_values are long, the model’s attention window (e.g., 2048 tokens) limits what it can see—older cache entries may be masked out.
  20. "Text coherence does not imply context coherence." The model can produce fluent output with no memory. Language generation skill masks alignment or cache integrity failures.
  1. "Token reuse does not imply token equivalence." Two identical strings may tokenize differently if formatting or surrounding context changes—breaking cp alignment.
  2. "Model internal memory is bounded by its sliding window." Beyond the model’s max context length, no prior KV survives—even if you try to replay it.
  3. "Each sampled token is a context event." Treat each token generation as a state mutation, not just output—it affects everything downstream.
  4. "Positional encoding is cumulative, not absolute." There's no concept of time reset. Each token adds to a growing position tensor—resetting cp=0 mid-session is invalid.
  5. "KV eviction is silent and model-specific." Some models start evicting old cache entries before the max token limit—resulting in partial memory retention.
  6. "Hidden token insertion by tokenizer corrupts alignment." Special tokens (like <bos>) may appear implicitly. These are not obvious in text but change token count.
  7. "Chat formatting conventions are not universal." Some models (e.g. ChatGLM vs Gemma) expect different dialogue formats. Incorrect format can silently invalidate context.
  8. "Autoregressive models have no selective recall." They cannot ‘jump to’ prior facts—you must repeat context or they forget. Cache doesn’t help without position alignment.
  9. "Response length affects internal memory pressure." Longer generations consume more cache. Generated tokens reduce how far back the model can ‘see’.
  10. "Each turn is memory dilation." Chat formatting increases the token:turn ratio. After a few turns, you're at token saturation.
  11. "Replay ≠ Restore." Replaying prompt history does not restore internal attention dynamics—rotary phase and head states are not recoverable.
  12. "Embeddings don't encode memory—they encode position." Re-embedding prior tokens doesn't recover meaning. The context is positional, not semantic.
  13. "Tokens are not symbols." "m" means nothing unless the transformer has seen it used consistently. It’s not a variable until inferred.
  14. "KV cache is not full state." Other internal buffers—like attention masks, rotary phases, and head routes—are essential for full restoration.
  15. "Initial prompt length determines rotary clock start." Changing the seed prompt changes all rotary offsets. You can’t hot-swap prompts and expect continuation.
  16. "Inference is not idempotent." Re-running the same tokens won’t produce the same logits unless all hidden buffers are identical.
  17. "Cache corruption is progressive." Small misalignments early cause larger incoherences in later turns. There's no early warning.
  18. "Semantic compression during generation masks memory loss." The model can appear coherent even when earlier context has been truncated—because it's learned how to ‘fake it.’
  19. "Pad tokens are not context-neutral." They interact with attention masks and positional encodings—filling a prompt with pads is not a safe noop.
  20. "Long prompt + long generation = KV loss." You can’t have both indefinitely. Past turns are evicted silently as generation proceeds.
  21. "Prompt-injected facts decay non-linearly." Facts mentioned early are forgotten faster than recent ones—not by time, but by position and token count.
  22. "Model tuning affects memory behavior." Chat-tuned models like -it variants learn to recall symbolically. Non-instruction-tuned models don’t.
  23. "Session restart with same prompt ≠ same internal state." Even if token-for-token identical, cache is not shared unless explicitly captured and restored.
  24. "Attention bias accumulates over generation." As tokens accumulate, softmax distribution shifts subtly—context is seen differently over time.
  25. "Token position determines memory priority." Tokens near the end of the prompt are better remembered. Important facts should be placed late.
  26. "Multiple-choice prompts distort rotary space." A/B/C structures induce unnatural attention splits across heads—causing semantic diffusion.
  27. "Positional overflows produce looping attention." If position IDs exceed expected length, models may recycle or reset attention—leading to nonsense output.
  28. "Models trained with ALiBi cannot be safely restored mid-cache." ALiBi biases are position-dependent and cumulative—partial restoration produces irreversible distortion.
  29. "Head dropout during generation affects cache evolution." If attention heads are dropped (e.g. via dropout), cache grows asymmetrically—producing subtle inconsistencies.
  30. "Batch inference hides cache alignment issues." Inference with batch_size > 1 can suppress positional errors due to averaging—but will break when scaled down.

Appendix: Glossary

Context and Cache Mechanics

Tokenizer and Prompt Variables

Model and Sampling Variables

Positional and Temporal Concepts

Chat and Flow Structures

Failures, Pitfalls, and Edge States

var | description | insight #
mod | Loaded transformer model (e.g., Gemma) with all weights, layers, and attention parameters. Provides the forward pass, generates logits, and mutates the KV cache. | 1, 4, 10, 21
tok | Tokenizer tied to the model. Applies chat template, encodes inputs, decodes outputs. Misalignment with model causes token drift and hallucination. | 9, 12, 36, 42
maxlen | Maximum length used to prefill dummy input, controlling StaticCache size. Must match intended conversation capacity. | 7, 32
padid | Used to fill dummy input. Non-neutral: affects attention mask and rotary even if unused semantically. | 7, 49
pkv | Captured past_key_values from dummy or live forward pass. Required for continuation. Rewritten each forward call. Not deep-copied. | 1, 4, 11, 21, 30, 40, 44
chat | Ordered list of role/content turns. Input to chat template. Expands nonlinearly. Used for prompt regeneration. | 12, 23, 25
turns | List of scripted prompts simulating multi-turn dialog. Source of user queries for the test loop. | 3
cp | Global token cursor. Tracks total token count across prompt and generation. Defines alignment for cache_position. | 2, 3, 6, 15, 27, 34
dev | Device assignment string ('cpu' or 'cuda'). Mismatch with tensors causes silent transfer or failure. | 24
dtype | Precision for model weights. Impacts memory, speed, and reproducibility. float16 on CPU coerces to float32. | 10
modpath | Filesystem or hub ID used to load the correct model and tokenizer. Must be consistent across components. | 52
mp | Dictionary mapping model nicknames to local or hub paths. Used to select model based on host or config. |
dp | Device alias map ('cpu' → 'cpu', 'gpu' → 'cuda'). Abstracts platform logic. |
tp | Map from string keys ('f16', etc.) to torch data types. Used for loading weights. |
rng | Shape tuple (1, maxlen) defining the dummy input tensor size. Matches cache allocation intent. | 7, 17
dummyids | Filled with padid, this tensor primes the cache during the initialization pass. | 7, 17
r | Result of the dummy forward. Contains logits and past_key_values. Used only to extract the cache. | 4
promptids | Full tokenized chat after applying chat template. Grows with each turn. Token count ≠ turn count. | 9, 12, 15, 27
newids | Slice of promptids from cp:. Only the new user message. Required for incremental cache extension. | 15, 27
newlen | Length of the current input slice. Used to update cursor and position encoding range. | 6
pos | Absolute position tensor aligned to cp. Needed for cache_position. Controls rotary phase. | 2, 6, 16, 28, 34
out | Result from model forward. Contains logits and updated pkv. Must be reassigned each call to preserve state. | 4, 21
outstr | Accumulated string of model output for this turn. Built token-by-token from decoded samples. | 14, 30
i | Loop counter for sampling. Limits number of generated tokens per turn to prevent runaway generation. | 4
t | Reused temp var: holds both the current turn text and the last-position logits. Naming collision risk, but harmless here. |
nxt | Next token ID from argmax(logits). Fed back into model for next-step generation. | 6, 21
tokstr | Decoded text of nxt. Checked against end-of-turn marker to stop sampling early. | 30
v | 2D reshaped tensor of nxt. Required shape (1, 1) for model input compatibility. | 6

Appendix: Annotated Listing (g99.py)

#!/media/krusty/gm/gm120/anaconda3/envs/apy/bin/python
import os,sys,time,socket  # Import core Python modules for file operations, system access, timing, and networking
sys.path.insert(0,"/webroot/lib")  # Prepend custom path for module resolution; allows loading local libs like 'plib' below
import plib  # Custom library, assumed project-specific, loaded from /webroot/lib
from transformers import AutoTokenizer,AutoModelForCausalLM  # HuggingFace interface: AutoTokenizer = text to token IDs; AutoModelForCausalLM = transformer for text generation
import torch  # PyTorch library provides tensor ops, model loading, GPU support, and KV cache infrastructure

def init():  # Initializes the transformer pipeline: loads tokenizer, model, sets device, precision, and prepares dummy cache
  global mod, tok, maxlen, padid, pkv, chat, turns, cp, dev, dtype, modpath  # Expose these as globals for use in forward, decoding, and chat state management
  mp={  # Model path dictionary. Keys are string sizes (1b, 4b, etc), values are either local snapshot paths or remote model hub identifiers
    "1b":"/home/krusty/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752",
    "4b":"/home/krusty/.cache/huggingface/hub/models--google--gemma-3-4b-it/snapshots/093f9f388b31de276ce2de164bdc2081324b9767",
    "9b":"google/gemma-3-9b-it",  # remote: this entry uses huggingface's repo format
    "27b":"google/gemma-3-27b-it"
  }
  dp={"cpu":"cpu","gpu":"cuda"}  # Device map for abstraction; simplifies logic later
  tp={"bf":torch.bfloat16,"f16":torch.float16,"f32":torch.float32}  # Abbreviation map from string dtype labels to PyTorch precision types
  # Machine-specific model loading: uses hostname to select model size, device type, and precision mode
  if socket.gethostname()=="machf":
    modpath=mp["1b"]  # Load smallest model for fast testing
    dev=dp["cpu"]  # Use CPU for local deterministic runs
    dtype=tp["f16"]  # Use float16 even on CPU (nonstandard, experimental)
  elif socket.gethostname()=="machh":
    modpath=mp["4b"]  # Load mid-size 4b model on GPU for higher throughput
    dev=dp["gpu"]  # Activate CUDA execution
    dtype=tp["f32"]  # Use full float precision (more accurate, slower)
  tok=AutoTokenizer.from_pretrained(modpath)  # Load the tokenizer. This defines vocabulary, special tokens, and chat templates
  mod=AutoModelForCausalLM.from_pretrained(modpath,torch_dtype=dtype,device_map={"":dev})  # Load transformer model with given precision and device override. This allows GPU memory control.
  maxlen=1024  # Define maximum context size for input tokens. Most LLMs have 1024 or 2048 as limit. Used for dummy init.
  padid=tok.pad_token_id  # Try to extract pad token from tokenizer config. This is used to fill dummy sequences or pad real ones.
  if padid is None:
    padid=tok.eos_token_id  # If tokenizer doesn't define padding token, fallback to EOS as a proxy. This affects dummy KV structure.
  with torch.no_grad():  # Disable gradients for this block. We are only warming up model state with dummy inputs
    rng=(1,maxlen)  # Create dummy input shape: batch of 1, 1024 tokens
    dummyids=torch.full(rng,padid,dtype=torch.long,device=dev)  # Fill dummy tensor with pad tokens. Model will treat this as a no-op input for memory allocation.
    r=mod(input_ids=dummyids,use_cache=True)  # Feed dummy input to model. This triggers KV cache allocation. No content, just side effect.
    pkv=r.past_key_values  # Capture the initialized KV cache. This state object is reused across forward calls for continued generation.
  chat=[]  # Empty list to store full conversation history as alternating role="user" and role="model"
  turns=[  # Hardcoded multi-turn simulated user dialog. Purpose: test long-term memory and reasoning across multiple instructions.
    "Let's start by defining a variable k equal to 5. What is k?",
    "Now set a new variable m equal to k multiplied by 11. What is m?",
    "If we increment k by 1, what is the new value of m?",
    "Okay, forget k and m. Let x = 100 and y = 25. What is x divided by y?",
    "Now, a new variable z is the product of x and y. Calculate z.",
    "If we subtract 500 from z, what is the result?",
    "What was the value of m from our earlier conversation?",
    "Final question: what was the first variable we defined in this entire chat?"
  ]
  cp=0  # Cache position counter. Tracks how many tokens have been sent to model so far. This value is critical for position alignment.

def atc():  # Returns input tensor from current chat history, formatted using the tokenizer's template logic
  r=tok.apply_chat_template(chat,add_generation_prompt=True,return_tensors="pt")  # Wraps conversation in system/user/model markers and converts to token tensor
  return r.to(dev)  # Sends tensor to correct compute device to match model weights

def dumoda():  # Perform model forward pass for multiple tokens at once (batch slice from prompt)
  return mod(input_ids=newids,use_cache=True,past_key_values=pkv,cache_position=pos)  # Provides position tensor and cache for accurate generation across turns

def dumodb():  # Perform model forward for a single decoded token
  v=nxt.view(1,1)  # Reshape scalar token ID into 2D batch form for model input
  return mod(input_ids=v,use_cache=True,past_key_values=pkv,cache_position=pos)  # Forward next token while preserving continuity via cached attention

init()  # Execute full setup sequence: model loading, tokenizer, dummy KV init, and conversation state reset
for t in turns:  # Iterate through user prompts, simulating turn-by-turn interaction
  t0=time.time()  # Start timing this turn for performance stats
  chat.append({"role":"user","content":t})  # Add user message to chat history. Used by tokenizer template logic.
  promptids=atc()  # Convert current chat history into model-ready token tensor using HuggingFace chat templates
  newids=promptids[:,cp:]  # Slice token tensor to include only newly added user prompt tokens
  newlen=newids.shape[1]  # Count number of new tokens generated in this turn
  pos=torch.arange(cp,cp+newlen,device=dev)  # Build position tensor to pass to model. Must align with full cumulative token index (not just turn-local)
  with torch.no_grad():
    out=dumoda()  # Run model forward pass on new token segment. Output includes logits for next prediction and updated KV
  cp+=newlen  # Update global cursor to reflect how many total tokens were sent. Critical for maintaining cache alignment.
  outstr=""  # Start building model's reply string, token by token
  for i in range(60):  # Generate up to 60 output tokens, one at a time
    t=out.logits[:,-1,:]  # Extract logits for final position in sequence. These represent model's belief over next token.
    nxt=torch.argmax(t,dim=-1)  # Select the most likely token (greedy decode)
    tokstr=tok.decode(nxt)  # Convert token ID back into string
    if tokstr=="<end_of_turn>": break  # Stop generation when the end-of-turn marker token is reached
    outstr+=tokstr  # Accumulate generated text
    pos=torch.tensor([cp],device=dev)  # Update position to reflect the single-token continuation point
    with torch.no_grad():
      out=dumodb()  # Forward next token using updated KV state and pos
    cp+=1  # Increment cursor. Each sampled token must advance the absolute position
  chat.append({"role":"model","content":outstr})  # Save model reply in chat history to be used for next prompt
  print(outstr)  # Display model's full response for the turn
  print("Response time:",round(time.time()-t0,3),"sec")  # Show how long generation took for diagnostic purposes