LLM Backends
Both backends expose the same OpenAI-compatible /v1/chat/completions endpoint. The LLM::Backend abstract class defines four methods: chat, start, stop, and healthy?.
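A minimal sketch of what that abstract class might look like, assuming the four method names from the text; the module layout and NotImplementedError convention are illustrative, not the project's actual source:

```ruby
module LLM
  # Abstract interface; concrete backends (Ollama, llama-server) override these.
  class Backend
    # Send messages plus tool definitions to /v1/chat/completions.
    def chat(messages, tools)
      raise NotImplementedError
    end

    # Lifecycle hooks: start/stop a managed server process.
    def start
      raise NotImplementedError
    end

    def stop
      raise NotImplementedError
    end

    # Probe whether the backing server is reachable.
    def healthy?
      raise NotImplementedError
    end
  end
end
```

Sharing one interface means the agent loop never cares which backend is active; swapping Ollama for llama-server is a construction-time decision.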
Ollama (default)
Detects a running instance or starts one. Best for local dev on constrained VRAM — MoE-aware offloading keeps hot experts on GPU.
llama-server
Managed subprocess. Better for CI/deployment with dedicated hardware. Pinned to b8665 for Gemma 4 support.
Tool System
Tools implement an abstract interface: name, description, parameters (JSON Schema), execute. A Registry generates the OpenAI function-calling definitions.
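A sketch of that tool interface and a Registry that emits OpenAI function-calling definitions. The four members (name, description, parameters, execute) come from the text; the EchoTool and the exact hash shapes are hypothetical:

```ruby
# Abstract tool: concrete tools override all four members.
class Tool
  def name;          raise NotImplementedError; end
  def description;   raise NotImplementedError; end
  def parameters;    raise NotImplementedError; end  # JSON Schema hash
  def execute(args); raise NotImplementedError; end
end

# Illustrative concrete tool used only to exercise the Registry.
class EchoTool < Tool
  def name;        "echo"; end
  def description; "Echo the input back."; end

  def parameters
    { type: "object",
      properties: { text: { type: "string" } },
      required: ["text"] }
  end

  def execute(args)
    args["text"]
  end
end

class Registry
  def initialize(tools)
    @tools = tools.to_h { |t| [t.name, t] }
  end

  # OpenAI function-calling definitions, ready to pass in the chat request.
  def definitions
    @tools.values.map do |t|
      { type: "function",
        function: { name: t.name, description: t.description, parameters: t.parameters } }
    end
  end

  # Dispatch a tool call by name.
  def execute(name, args)
    @tools.fetch(name).execute(args)
  end
end
```

Generating the definitions from the tools themselves keeps the schema and the implementation in one place, so they cannot drift apart.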
bash
Executes shell commands via Process.run, with a 120s timeout, 30KB output truncation, and semantic exit codes.
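The limits above can be sketched like this. The 120s timeout and 30KB truncation come from the text; the Open3/Timeout plumbing and exit code 124 for timeouts are assumptions (and Timeout on its own does not reap the child process, which a real implementation would handle):

```ruby
require "open3"
require "timeout"

MAX_OUTPUT = 30_000  # 30KB output truncation, per the text
TIMEOUT_S  = 120     # 120s timeout, per the text

# Run a shell command, capturing stdout+stderr, with timeout and truncation.
def run_bash(command, timeout: TIMEOUT_S)
  output = status = nil
  Timeout.timeout(timeout) do
    output, status = Open3.capture2e(command)
  end
  if output.bytesize > MAX_OUTPUT
    output = output[0, MAX_OUTPUT] + "\n[output truncated]"
  end
  { output: output, exit_code: status.exitstatus }
rescue Timeout::Error
  # Semantic exit code: 124 conventionally means "timed out" (as in GNU timeout).
  { output: "[timed out after #{timeout}s]", exit_code: 124 }
end
```

Truncation matters as much as the timeout: one chatty command can otherwise flood the context window the next section works so hard to protect.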
read_file
Smart file reader with structural awareness. Markdown files yield heading outlines with line numbers; HTML files yield styles, structure, and CSS class patterns. Reduces context waste by 50-80% versus raw cat.
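The Markdown path might look like the sketch below: instead of returning the file body, return only the headings, each tagged with its line number so a follow-up read can target the right region. The function name and output format are illustrative:

```ruby
# Build a heading outline ("L<line> <#...> <title>") from Markdown text.
def markdown_outline(text)
  text.lines.each_with_index.filter_map do |line, idx|
    if (m = line.match(/\A(\#{1,6})\s+(.+)/))
      format("L%d %s %s", idx + 1, m[1], m[2].strip)
    end
  end.join("\n")
end
```

For a long document this is a few dozen tokens instead of thousands, which is where the 50-80% savings comes from.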
Agent Loop
loop do
  turn += 1
  compact_history if near_context_limit?  # check context pressure
  response = llm.chat(messages, tools)
  if response.has_tool_calls?
    response.tool_calls.each { |call| messages << run_tool(call) }
  else
    puts response.content  # final answer; done
    break
  end
  break if turn >= max_turns
end
Context Management
Tracks token usage from the API response. When prompt tokens exceed 75% of the context window:
- Ask the model to summarize progress (what's done, what remains, current state)
- Replace full history with: system prompt + original task + summary + "continue"
- If still over threshold after compaction: bail with "task too complex for context"
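The steps above can be sketched as a single compaction function. The 75% threshold and the replacement-history shape (system prompt + original task + summary + "continue") come from the text; the function signature and the summarizer passed in as a block are assumptions:

```ruby
COMPACT_THRESHOLD = 0.75  # compact when prompt tokens pass 75% of the window

# Return the message history to use next turn: unchanged if under the
# threshold, otherwise a compacted replacement. The block is expected to
# summarize progress (what's done, what remains, current state).
def maybe_compact(messages, prompt_tokens:, context_window:, system_prompt:, task:)
  return messages if prompt_tokens < COMPACT_THRESHOLD * context_window

  summary = yield(messages)
  [system_prompt,
   task,
   { role: "user", content: "Progress summary:\n#{summary}\n\nContinue the task." }]
end
```

Prompt token counts come straight from the API response's usage field, so no client-side tokenizer is needed. The "bail if still over threshold" check would wrap this call: if the compacted history's next turn still exceeds the threshold, the task is declared too complex for the context.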
System Prompt
Load-bearing infrastructure, not a nicety. At Q4 quantization, Gemma 4 defaults to chatbot behavior without aggressive framing:
- ✓ Role as "task executor" — not "assistant"
- ✓ Negative constraints: "NEVER review, NEVER suggest"
- ✓ Positive action verbs: "write immediately", "modify immediately"
- ✗ "Helpful assistant" framing → model reviews files instead of creating them
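An illustrative prompt assembled from the constraints above; the project's actual wording may differ, this only shows the executor framing, negative constraints, and immediate-action verbs combined:

```ruby
# Hypothetical system prompt following the checklist: executor role,
# NEVER-constraints, and "immediately" action verbs.
SYSTEM_PROMPT = <<~PROMPT
  You are a task executor, not an assistant.
  NEVER review files, NEVER suggest changes, NEVER ask for confirmation.
  When a file needs to exist, write it immediately.
  When a file needs to change, modify it immediately.
PROMPT
```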
Key Findings
From our experiment log:
- Ollama beats llama.cpp on 12GB VRAM due to MoE-aware per-expert offloading.
- Context management is essential for multi-step tasks.
- Smart file reading reduces context waste by 50-80%.
- System prompt wording is the primary control surface for Q4 model behavior: "executor" framing plus negative constraints are required to prevent review-mode drift.
- sed is unreliable for HTML modification by LLMs; atomic file rewrites are safer than incremental edits for structured documents.