Why use four different small models instead of one model with multiple prompts?

Heterogeneous models trained by different labs on different data with different post-training produce genuinely distinct agent behaviors. According to Hugging Face, this variation (the owl hoards where the fox speculates) makes the market an emergent argument rather than a scripted simulation.

What was the biggest infrastructure bottleneck when serving multiple small models?

vLLM 0.22.1 JIT-compiles kernels at load time, so it needs the CUDA toolkit (nvcc) present at runtime. Switching to a CUDA devel base image unblocked all four models at once, which showed the friction was at the serving layer, not the model layer.

How does the game prevent players from gaining unfair information advantages?

A firewall mechanism separates agent cognition from player visibility. Players can whisper tips to creatures, but the tip's truth or falsehood is tracked separately from what the agent knows, preventing players from exploiting information asymmetry beyond the simulation's design.

Thousand Token Wood v2: Multi-Model Finance Sim Shows How Heterogeneous Small Models Enable Complex Emergent Behavior

Heterogeneous Agents, Tractable Infrastructure

According to Hugging Face, the second iteration of Thousand Token Wood reframes its v1 sandbox as an asymmetric financial game in which the human player assumes the role of a “Patron of the Wood”: a shadow financier who lends at interest, plants tips (true or false), shorts markets, and bribes creatures while evading a magistrate hunting for insider trading. The defining change is under the hood: each of the five woodland creature agents runs on a different lab’s small model, creating a council of genuinely distinct economic actors rather than a homogeneous swarm.

The four models are gpt-oss-20b (OpenAI), MiniCPM3-4B (OpenBMB), Nemotron-Mini-4B (NVIDIA), and a fine-tuned Qwen 0.5B. According to the Hugging Face report, the heterogeneity is the product, not an incidental detail. Models trained on different data with different post-training habits produce divergent agent behaviors (one hoards, another speculates), yielding emergent market dynamics that a single-model architecture would suppress.

The Real Friction Lives Below the Model Layer

The critical engineering insight emerged only after standing up four distinct models on one platform. According to Hugging Face, “the friction is almost entirely at the serving layer, not the modeling layer.” The first blocker was universal: vLLM 0.22.1 JIT-compiles kernels at load time and requires the CUDA toolkit (nvcc) to be present. Switching from a lean base image to a CUDA development image unblocked all four models at once, proving the issue was infrastructure, not model-specific.
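A startup check along the following lines can surface this class of failure before any model is loaded. It is a minimal sketch, not the project's code: the function names `find_nvcc` and `preflight` are hypothetical, and the only assumption about the environment is that a CUDA devel image puts `nvcc` on the `PATH`.

```python
import shutil


def find_nvcc(which=shutil.which):
    """Return the path to nvcc, or None if the CUDA toolkit is not on PATH.

    `which` is injectable so the check can be exercised without a GPU host.
    """
    return which("nvcc")


def preflight(which=shutil.which):
    """Report whether the CUDA compiler is available (hypothetical helper)."""
    path = find_nvcc(which)
    if path is None:
        return "nvcc not found: use a CUDA devel base image or install the toolkit"
    return f"nvcc found at {path}"


if __name__ == "__main__":
    print(preflight())
```

Running this as the first step of the container entrypoint turns a load-time compilation error into an explicit message about the base image.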

Per-model quirks remained, but they were shallow: gpt-oss-20b ships in native MXFP4 quantization and fits a 24 GB L4 GPU with spare capacity, but wraps answers in an analysis preamble that requires extraction; MiniCPM3 needs the trust_remote_code flag; Nemotron loaded without special handling. Each fix was a one-line configuration change.
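One way to keep such quirks shallow is a per-model configuration table plus a small post-processing step. The sketch below is illustrative only: the dictionary keys, the `strip_preamble` flag, and especially the `FINAL_MARKER` string are assumptions, since the report does not say how the analysis preamble is delimited in the served output.

```python
# Hypothetical per-model settings; only the trust_remote_code requirement for
# MiniCPM3 and the preamble on gpt-oss-20b come from the report.
MODEL_CONFIGS = {
    "gpt-oss-20b": {"trust_remote_code": False, "strip_preamble": True},
    "MiniCPM3-4B": {"trust_remote_code": True, "strip_preamble": False},
    "Nemotron-Mini-4B": {"trust_remote_code": False, "strip_preamble": False},
}

# ASSUMPTION: the text that separates the analysis preamble from the answer.
# Adjust this to whatever the serving stack actually emits.
FINAL_MARKER = "assistantfinal"


def extract_answer(model_name, raw_text, marker=FINAL_MARKER):
    """Return the model's answer, dropping any analysis preamble if configured."""
    cfg = MODEL_CONFIGS[model_name]
    if cfg["strip_preamble"] and marker in raw_text:
        # Keep only what follows the last occurrence of the marker.
        return raw_text.rsplit(marker, 1)[1].strip()
    return raw_text.strip()


if __name__ == "__main__":
    raw = 'analysisThe owl should hoard.assistantfinal{"action": "hold"}'
    print(extract_answer("gpt-oss-20b", raw))
```

With this shape, supporting another model means adding one dictionary entry rather than touching the calling code.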

A Tolerant JSON Repair Layer as Universal Abstraction

The design pattern that made multi-model orchestration tractable was a tolerant JSON parse-and-repair layer that every model’s output flows through before the simulation consumes it. According to Hugging Face, different tokenizers and formatting habits produce different malformations—some outputs drop fields, others nest incorrectly. The parser drops what it cannot salvage; the simulation never crashes. Once this layer is built, adding a new model becomes a configuration entry, not a refactor.
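A tolerant parse-and-repair step of this kind might look like the sketch below. It is not the project's parser: the action schema (`action`, `amount`), the `ALLOWED_ACTIONS` set, and the specific repairs (stripping code fences, removing trailing commas, unwrapping one level of nesting) are assumptions chosen to illustrate the pattern.

```python
import json
import re

ALLOWED_ACTIONS = {"buy", "sell", "hold"}  # hypothetical action vocabulary


def parse_action(raw):
    """Return a validated action dict, or None if nothing can be salvaged."""
    # Drop markdown code-fence markers that some models wrap around JSON.
    text = re.sub(r"```(?:json)?", "", raw)
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    candidate = text[start:end + 1]
    # Remove trailing commas before a closing brace or bracket.
    # (Naive: this would also alter such sequences inside string values.)
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    # Unwrap one level of incorrect nesting, e.g. {"response": {...}}.
    if "action" not in data:
        nested = [v for v in data.values() if isinstance(v, dict) and "action" in v]
        if not nested:
            return None
        data = nested[0]
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    amount = data.get("amount", 0)  # a dropped field falls back to a default
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        amount = 0
    return {"action": action, "amount": amount}


if __name__ == "__main__":
    print(parse_action('Sure! ```json\n{"action": "buy", "amount": 3,}\n```'))
```

The key design choice is that the function never raises on model output: the caller treats `None` as "skip this turn", so a malformed response costs one action instead of crashing the simulation.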

Information Asymmetry and Firewall Mechanics

The dramatic core of v2 is the player’s ability to whisper tips to creatures: claims that may be true forecasts or planted lies. The game tracks whether each tip is true separately from what the agent knows, creating a firewall between player-supplied information and agent belief. This prevents players from gaming creatures by exploiting the gap between what agents know and what they can infer, which maintains the simulation’s internal consistency and keeps the magistrate’s threat of prosecution for insider trading credible.
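One reading of this firewall is a data model in which the truth value of a tip lives only in a game-side ledger and never reaches the agent's prompt. The sketch below follows that reading; the class names `Tip`, `AgentView`, and `TipLedger` are hypothetical, and the report does not describe the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tip:
    """A whispered tip as the game records it (hypothetical structure)."""
    claim: str
    is_true: bool  # ledger-only: never shown to the agent


@dataclass(frozen=True)
class AgentView:
    """What the creature is allowed to see: the claim, without its truth value."""
    claim: str


class TipLedger:
    """Keeps ground truth on the game side of the firewall."""

    def __init__(self):
        self._tips = {}

    def whisper(self, creature, tip):
        """Record the full tip and return only the agent-visible part."""
        self._tips.setdefault(creature, []).append(tip)
        return AgentView(claim=tip.claim)

    def audit(self, creature):
        """Game-side lookup, e.g. for a magistrate check (assumed use)."""
        return list(self._tips.get(creature, []))


if __name__ == "__main__":
    ledger = TipLedger()
    view = ledger.whisper("owl", Tip("acorn prices will rise", is_true=False))
    print(view)                 # the agent sees only the claim
    print(ledger.audit("owl"))  # the game still knows the tip was a lie
```

Because `AgentView` has no truth field at all, prompt-building code cannot leak it by accident.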

Why This Matters

The Thousand Token Wood v2 case study demonstrates that multi-model agent systems are infrastructure-constrained, not model-constrained. Teams building heterogeneous agent systems can shift engineering effort away from tokenizer compatibility (solve it once in the repair layer) and toward serving-layer concerns: container image selection, runtime dependencies, and per-model configuration. This pattern suggests that small-model farms, which orchestrate different vendors’ models within a single application, are viable today if serving infrastructure is treated as a first-class design problem rather than a post-hoc patch.