Multi-Agent Reinforcement Learning and Coordination: Game of Knights and Arrows
Multi-agent systems are everywhere — warehouse robots, traffic control, distributed services, autonomous defense. The hard part isn't making one agent smart. It's making multiple agents smart *together*.
This project builds a full MARL pipeline in PettingZoo's Knights-Archers-Zombies environment, trains two algorithms head-to-head (`DQN` vs `PPO`), and measures not just whether agents improved — but whether they actually coordinated.
What is Multi-Agent RL?
In standard RL, one agent learns a policy against a fixed environment. MARL extends this to multiple agents acting simultaneously:
```
Agent 1 ─┐
Agent 2 ─┼─→ actions → Environment → (next states, rewards) → all agents
Agent 3 ─┤
Agent 4 ─┘
```
The critical difference: the environment is no longer stationary. Every agent's best move keeps changing as teammates update their policies. That single fact creates three compounding problems:
- Non-stationarity — no agent ever sees a stable environment. Optimal behavior is a moving target.
- Credit assignment — when the team succeeds or fails, which agent was responsible? Shared reward can cause free-riding.
- Emergence — agents can develop complementary roles and coordinated tactics. The cost: emergence is unpredictable and takes time.
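The simultaneous-action loop above can be sketched in PettingZoo's parallel-API shape. The stub environment and its toy shared reward below are illustrative only — they are not the real KAZ dynamics — but they show why each agent's best move depends on its teammates:

```python
import random

class StubParallelEnv:
    """Toy stand-in for a PettingZoo ParallelEnv: all agents act at once."""
    agents = ["archer_0", "archer_1", "knight_0", "knight_1"]

    def reset(self):
        return {a: 0 for a in self.agents}  # dummy observations

    def step(self, actions):
        # Toy shared reward: the team scores when any two agents pick the
        # same action, so one agent's best move shifts whenever a teammate's
        # policy changes -- non-stationarity in miniature.
        team_reward = 1.0 if len(set(actions.values())) < len(actions) else 0.0
        obs = {a: 0 for a in self.agents}
        rewards = {a: team_reward for a in self.agents}
        return obs, rewards

env = StubParallelEnv()
obs = env.reset()
actions = {a: random.randrange(6) for a in env.agents}  # action count illustrative
obs, rewards = env.step(actions)
```

Note the shared-reward dict: every agent receives the same number, which is exactly the credit-assignment problem — the signal never says *who* earned it.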
Key Design Decisions
Environment — cooperative combat in knights_archers_zombies_v10. Four agents (archer_0, archer_1, knight_0, knight_1) share a single reward signal (zombie kill = team point). Each agent observes only its own 84×84 grayscale 3-frame stack — no global map, no communication. Stack: Ray RLlib + PettingZoo + SuperSuit.
Shared policy — one network serves all four agents, giving ~4× the training data of a single-agent setup. Trade-off: the network must generalize across agent types, which can break down when agent types require different optimal strategies (archers vs knights).
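In RLlib this choice reduces to a policy mapping function; a minimal sketch (the signature is simplified from RLlib's actual API, and the per-type alternative illustrates the trade-off, not the project's configuration):

```python
SHARED_POLICY_ID = "shared_policy"

def policy_mapping_fn(agent_id, *args, **kwargs):
    # Every agent ID -> the same policy, so all four agents' experience
    # flows into one network (~4x the data of a single-agent setup).
    return SHARED_POLICY_ID

def per_type_mapping_fn(agent_id, *args, **kwargs):
    # The alternative: one policy per agent type, trading data
    # efficiency for archer/knight-specific strategies.
    return "archer_policy" if agent_id.startswith("archer") else "knight_policy"
```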
Centralised training, decentralised execution (CTDE) — RLlib aggregates all agents' trajectories into one batch during training. At execution, each agent acts from its own local observation only. No communication channel required — the standard production-ready MARL paradigm.
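A minimal sketch of the CTDE split, with trivial stand-ins for the real trajectory batches (the tuples and the `act` helper are illustrative, not RLlib internals):

```python
# Centralised TRAINING: concatenate every agent's transitions into one batch.
def build_training_batch(trajectories):
    """trajectories: dict agent_id -> list of (obs, action, reward) tuples."""
    batch = []
    for agent_id, steps in trajectories.items():
        batch.extend(steps)  # one shared batch, regardless of which agent acted
    return batch

# Decentralised EXECUTION: each agent queries the policy with its own
# local observation only -- no global state, no communication channel.
def act(policy, local_obs):
    return policy(local_obs)

trajectories = {
    "archer_0": [((0,), 1, 0.0), ((1,), 2, 1.0)],
    "knight_0": [((0,), 3, 0.0)],
}
batch = build_training_batch(trajectories)  # 3 transitions -> one gradient step
```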
DQN (off-policy) — transitions stored in a replay buffer shared across all agents. Random mini-batch sampling breaks temporal correlation and smooths non-stationarity by mixing experience from different training stages. n_step=10, epsilon annealed 1.0→0.01 over 1M steps, 4 rollout workers.
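The shared n-step replay mechanics can be sketched as follows — a simplified buffer for illustration, not RLlib's actual implementation:

```python
import random
from collections import deque

class SharedNStepBuffer:
    """Replay buffer shared by all agents, with n-step returns
    (this project uses n_step=10)."""

    def __init__(self, capacity=100_000, n_step=10, gamma=0.99):
        self.buffer = deque(maxlen=capacity)
        self.n_step = n_step
        self.gamma = gamma

    def add_episode(self, transitions):
        """transitions: list of (obs, action, reward, next_obs) from ANY agent."""
        for t in range(len(transitions)):
            obs, action, _, _ = transitions[t]
            # Accumulate the discounted reward over the next n steps.
            ret, horizon = 0.0, min(self.n_step, len(transitions) - t)
            for k in range(horizon):
                ret += (self.gamma ** k) * transitions[t + k][2]
            next_obs = transitions[t + horizon - 1][3]
            self.buffer.append((obs, action, ret, next_obs))

    def sample(self, batch_size):
        # Random sampling breaks temporal correlation and mixes experience
        # from different training stages, smoothing non-stationarity.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```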
PPO (on-policy) — fresh experience each iteration; no replay. Entropy regularisation (entropy_coeff=0.15) keeps exploration alive longer, causing a mid-training dip but enabling stronger strategies later. Hyperparameters from Optuna search: lr=2e-5, gamma=0.90. 7 rollout workers.
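The entropy term's effect can be shown in isolation. The loss function below is schematic (not the full PPO clipped objective); entropy_coeff=0.15 is the value from the run:

```python
import math

def entropy(probs):
    """Shannon entropy of an action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def loss_with_entropy_bonus(policy_loss, action_probs, entropy_coeff=0.15):
    # Subtracting coeff * entropy lowers the loss for stochastic policies,
    # so gradient descent keeps exploration alive longer.
    return policy_loss - entropy_coeff * entropy(action_probs)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally exploratory
greedy = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic
```

With equal policy loss, the exploratory distribution receives the lower total loss, which is why the policy resists collapsing to a deterministic strategy mid-training.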
Results
What the training achieved
Two algorithms competed head-to-head over one hour of CPU training. PPO won clearly.
| Metric | DQN | PPO |
|---|---|---|
| Total episodes trained | 6,938 | 14,007 |
| Iterations | 408 | 156 |
| Reward mean (all episodes) | 8.661 | 8.601 |
| Final performance (last 100 eps) | 8.79 | 10.47 |
| Hit the success target (≥10 reward) | 36.8% | 39.7% |
PPO crossed the success threshold on average in its final window and was still improving at the one-hour cutoff. DQN stabilized earlier but at a lower level.
The gap comes down to two design choices:
- PPO’s entropy bonus keeps exploration alive longer.
- PPO’s 7 workers generate nearly twice as many training episodes in the same wall-clock time as DQN’s 4.


How coordinated were the agents?
Beyond reward, the project measures how the team performed:
- Action diversity (0.78): Agents consistently chose different actions — no collapse into redundant behavior.
- Synchronisation (~0.008 PPO, ~0.002 DQN): Near zero for both. Agents killed zombies in parallel, not in coordinated sequences.
- Role specialisation score (~0.01): Effectively none. Specialisation into distinct archer vs knight roles likely needs longer training or separate policies per agent type.
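Metrics like these can be computed from logged rollouts. The formulas below are illustrative reconstructions — not necessarily the project's exact definitions:

```python
def action_diversity(actions_per_step):
    """Fraction of steps where agents chose at least two distinct actions.
    actions_per_step: list of dicts {agent_id: action}."""
    distinct = sum(1 for step in actions_per_step if len(set(step.values())) > 1)
    return distinct / len(actions_per_step)

def reward_synchronisation(rewards_per_step):
    """Fraction of steps where two or more agents scored simultaneously
    (coordinated sequences vs. parallel solo kills)."""
    synced = sum(1 for step in rewards_per_step
                 if sum(1 for r in step.values() if r > 0) >= 2)
    return synced / len(rewards_per_step)

def role_specialisation(action_hist_by_agent):
    """Mean total-variation distance between each agent's action distribution
    and the team average: 0 = all agents behave identically, higher = more
    distinct roles. action_hist_by_agent: dict agent_id -> action counts."""
    agents = list(action_hist_by_agent)
    n_actions = len(next(iter(action_hist_by_agent.values())))
    norm = {a: [c / sum(h) for c in h] for a, h in action_hist_by_agent.items()}
    mean = [sum(norm[a][i] for a in agents) / len(agents)
            for i in range(n_actions)]
    tv = [0.5 * sum(abs(norm[a][i] - mean[i]) for i in range(n_actions))
          for a in agents]
    return sum(tv) / len(agents)
```

A shared policy pushes all agents toward the same action distribution, which is consistent with a specialisation score near zero.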
Honest takeaway: reward improved, coordination did not emerge yet. That distinction is the point.
What the data reveals about architecture
Archers scored ~91% of all kills; knights ~8.5%.
This isn’t a bug — it’s what happens when one shared network has to serve two agent types with fundamentally different combat mechanics. The network optimizes for ranged combat and applies that same strategy to melee knights, where it doesn’t fit.
Known fixes
- separate policy heads per agent type
- a role-conditioning input, so the shared policy knows which agent it is controlling
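A sketch of the role-conditioning fix, assuming a flattened observation vector for simplicity (the real pipeline uses 84×84 image stacks, where the tag would instead become an extra channel or a separate network input):

```python
AGENT_TYPES = ["archer", "knight"]

def role_conditioned_obs(agent_id, obs_vector):
    """Append a one-hot agent-type tag so one shared network can still
    learn different behaviour for archers and knights."""
    agent_type = agent_id.split("_")[0]  # "archer_0" -> "archer"
    one_hot = [1.0 if t == agent_type else 0.0 for t in AGENT_TYPES]
    return list(obs_vector) + one_hot

role_conditioned_obs("archer_0", [0.2, 0.7])  # -> [0.2, 0.7, 1.0, 0.0]
```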
Data provenance note: DQN training metrics and per-agent plots are sourced from run 2efaa (started 17:59). The DQN coordination JSON is from run 9ef25 (started 18:17) — a separate execution recorded in run_all.log. These are not the same DQN training run.
Why Reward Alone Is Not Enough
A single reward number hides failure modes that matter in production:
| What reward misses | How to measure it |
|---|---|
| One agent carrying the team | Per-agent contribution share |
| Agents doing the same thing | Action diversity |
| Parallel action, zero coordination | Reward synchronisation |
| Policy collapse | Role specialisation entropy |
| Lucky single run | Cross-seed variance |
This project implements all five. The result isn’t just “reward went up” — it’s a diagnostic showing exactly where the policy is strong, where it’s structurally limited, and what to fix next.
Real-World Transfer
The same coordination fingerprint — synchronisation, diversity, specialisation — maps directly onto production problems:
- Warehouse robotics: throughput, deadlock avoidance, idle-time fairness
- Autonomous traffic: turn-taking synchronisation, lane role allocation
- Cybersecurity: complementary detection coverage across IDS, firewall, honeypot layers
- Microservice orchestration: load-shedding coordination, SLA fairness across service types
The zombie kill signal is a proxy. The measurement infrastructure is the transferable artifact.
What This Demonstrates
A complete ML engineering loop — not just a model, but a system:
- reproducible training pipeline with environment and seed controls
- multi-dimensional evaluation beyond a single loss curve
- honest diagnostic interpretation, including when results fall short
- qualitative validation via rollout GIFs
The coordination metrics are the contribution. They exist because reward alone cannot tell you whether your multi-agent system is working for the right reasons.
