LLM Backends
Both backends expose the same OpenAI-compatible /v1/chat/completions endpoint. The LLM::Backend abstract class defines four methods: chat, start, stop, and healthy?.
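A minimal sketch of what that abstract class might look like, assuming the four method names from the text; the module layout and NotImplementedError convention are illustrative, not the project's actual source:

```ruby
module LLM
  # Abstract interface; concrete backends (Ollama, llama-server) override these.
  class Backend
    # Send messages plus tool definitions to /v1/chat/completions.
    def chat(messages, tools)
      raise NotImplementedError
    end

    # Lifecycle hooks: start/stop a managed server process.
    def start
      raise NotImplementedError
    end

    def stop
      raise NotImplementedError
    end

    # Probe whether the backing server is reachable.
    def healthy?
      raise NotImplementedError
    end
  end
end
```

Sharing one interface means the agent loop never cares which backend is active; swapping Ollama for llama-server is a construction-time decision.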
Ollama (default)
Detects a running instance or starts one. Best for local dev on constrained VRAM — MoE-aware offloading keeps hot experts on GPU.
llama-server
Managed subprocess. Better for CI/deployment with dedicated hardware. Pinned to b8665 for Gemma 4 support.
Tool System
Tools implement an abstract interface: name, description, parameters (JSON Schema), execute. A Registry generates the OpenAI function-calling definitions.
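A sketch of that tool interface and a Registry that emits OpenAI function-calling definitions. The four members (name, description, parameters, execute) come from the text; the EchoTool and the exact hash shapes are hypothetical:

```ruby
# Abstract tool: concrete tools override all four members.
class Tool
  def name;          raise NotImplementedError; end
  def description;   raise NotImplementedError; end
  def parameters;    raise NotImplementedError; end  # JSON Schema hash
  def execute(args); raise NotImplementedError; end
end

# Illustrative concrete tool used only to exercise the Registry.
class EchoTool < Tool
  def name;        "echo"; end
  def description; "Echo the input back."; end

  def parameters
    { type: "object",
      properties: { text: { type: "string" } },
      required: ["text"] }
  end

  def execute(args)
    args["text"]
  end
end

class Registry
  def initialize(tools)
    @tools = tools.to_h { |t| [t.name, t] }
  end

  # OpenAI function-calling definitions, ready to pass in the chat request.
  def definitions
    @tools.values.map do |t|
      { type: "function",
        function: { name: t.name, description: t.description, parameters: t.parameters } }
    end
  end

  # Dispatch a tool call by name.
  def execute(name, args)
    @tools.fetch(name).execute(args)
  end
end
```

Generating the definitions from the tools themselves keeps the schema and the implementation in one place, so they cannot drift apart.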
bash
Executes shell commands via Process.run, with a 120s timeout, 30KB output truncation, and semantic exit codes.
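The limits above can be sketched like this. The 120s timeout and 30KB truncation come from the text; the Open3/Timeout plumbing and exit code 124 for timeouts are assumptions (and Timeout on its own does not reap the child process, which a real implementation would handle):

```ruby
require "open3"
require "timeout"

MAX_OUTPUT = 30_000  # 30KB output truncation, per the text
TIMEOUT_S  = 120     # 120s timeout, per the text

# Run a shell command, capturing stdout+stderr, with timeout and truncation.
def run_bash(command, timeout: TIMEOUT_S)
  output = status = nil
  Timeout.timeout(timeout) do
    output, status = Open3.capture2e(command)
  end
  if output.bytesize > MAX_OUTPUT
    output = output[0, MAX_OUTPUT] + "\n[output truncated]"
  end
  { output: output, exit_code: status.exitstatus }
rescue Timeout::Error
  # Semantic exit code: 124 conventionally means "timed out" (as in GNU timeout).
  { output: "[timed out after #{timeout}s]", exit_code: 124 }
end
```

Truncation matters as much as the timeout: one chatty command can otherwise flood the context window the next section works so hard to protect.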
read_file
Smart file reader with structural awareness. Markdown files yield heading outlines with line numbers; HTML files yield styles, structure, and CSS class patterns. Reduces context waste by 50-80% versus raw cat.
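The Markdown path might look like the sketch below: instead of returning the file body, return only the headings, each tagged with its line number so a follow-up read can target the right region. The function name and output format are illustrative:

```ruby
# Build a heading outline ("L<line> <#...> <title>") from Markdown text.
def markdown_outline(text)
  text.lines.each_with_index.filter_map do |line, idx|
    if (m = line.match(/\A(\#{1,6})\s+(.+)/))
      format("L%d %s %s", idx + 1, m[1], m[2].strip)
    end
  end.join("\n")
end
```

For a long document this is a few dozen tokens instead of thousands, which is where the 50-80% savings comes from.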
Agent Loop
loop do
  turn += 1
  compact_history if near_context_limit?  # check context pressure
  response = llm.chat(messages, tools)
  if response.has_tool_calls?
    response.tool_calls.each { |call| messages << run_tool(call) }
  else
    puts response.content  # final answer; done
    break
  end
  break if turn >= max_turns
end
Context Management
Tracks token usage from the API response. When prompt tokens exceed 75% of the context window:
- Ask the model to summarize progress (what's done, what remains, current state)
- Replace full history with: system prompt + original task + summary + "continue"
- If still over threshold after compaction: bail with "task too complex for context"
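The steps above can be sketched as a single compaction function. The 75% threshold and the replacement-history shape (system prompt + original task + summary + "continue") come from the text; the function signature and the summarizer passed in as a block are assumptions:

```ruby
COMPACT_THRESHOLD = 0.75  # compact when prompt tokens pass 75% of the window

# Return the message history to use next turn: unchanged if under the
# threshold, otherwise a compacted replacement. The block is expected to
# summarize progress (what's done, what remains, current state).
def maybe_compact(messages, prompt_tokens:, context_window:, system_prompt:, task:)
  return messages if prompt_tokens < COMPACT_THRESHOLD * context_window

  summary = yield(messages)
  [system_prompt,
   task,
   { role: "user", content: "Progress summary:\n#{summary}\n\nContinue the task." }]
end
```

Prompt token counts come straight from the API response's usage field, so no client-side tokenizer is needed. The "bail if still over threshold" check would wrap this call: if the compacted history's next turn still exceeds the threshold, the task is declared too complex for the context.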
System Prompt
Load-bearing infrastructure, not a nicety. At Q4 quantization, Gemma 4 defaults to chatbot behavior without aggressive framing:
- ✓ Role as "task executor" — not "assistant"
- ✓ Negative constraints: "NEVER review, NEVER suggest"
- ✓ Positive action verbs: "write immediately", "modify immediately"
- ✗ "Helpful assistant" framing → model reviews files instead of creating them
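An illustrative prompt assembled from the constraints above; the project's actual wording may differ, this only shows the executor framing, negative constraints, and immediate-action verbs combined:

```ruby
# Hypothetical system prompt following the checklist: executor role,
# NEVER-constraints, and "immediately" action verbs.
SYSTEM_PROMPT = <<~PROMPT
  You are a task executor, not an assistant.
  NEVER review files, NEVER suggest changes, NEVER ask for confirmation.
  When a file needs to exist, write it immediately.
  When a file needs to change, modify it immediately.
PROMPT
```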
Key Findings
From our experiment log:
- Ollama beats llama.cpp on 12GB VRAM due to MoE-aware per-expert offloading.
- Context management is essential for multi-step tasks.
- Smart file reading reduces context waste by 50-80%.
- System prompt wording is the primary control surface for Q4 model behavior: "executor" framing plus negative constraints are required to prevent review-mode drift.
- sed is unreliable for HTML modification by LLMs; atomic file rewrites are safer than incremental edits for structured documents.