
Multi-Agent Reinforcement Learning and Coordination: Game of Knights and Arrows

Multi-agent systems are everywhere — warehouse robots, traffic control, distributed services, autonomous defense. The hard part isn't making one agent smart. It's making multiple agents smart *together*.


This project builds a full MARL pipeline in PettingZoo's Knights-Archers-Zombies environment, trains two algorithms head-to-head (`DQN` vs `PPO`), and measures not just whether agents improved — but whether they actually coordinated.

What is Multi-Agent RL?

In standard RL, one agent learns a policy against a fixed environment. MARL extends this to multiple agents acting simultaneously:

Agent 1 ─┐
Agent 2 ─┼─→ actions → Environment → (next states, rewards) → all agents
Agent 3 ─┤
Agent 4 ─┘

The critical difference: the environment is no longer stationary. Every agent's best move keeps changing as teammates update their policies. That single fact creates three compounding problems:

  • Non-stationarity — no agent ever sees a stable environment. Optimal behavior is a moving target.
  • Credit assignment — when the team succeeds or fails, which agent was responsible? Shared reward can cause free-riding.
  • Emergence — agents can develop complementary roles and coordinated tactics. The cost: emergence is unpredictable and takes time.

Key Design Decisions

Environment — cooperative combat in knights_archers_zombies_v10. Four agents (archer_0, archer_1, knight_0, knight_1) share a single reward signal (zombie kill = team point). Each agent observes only its own 84×84 grayscale 3-frame stack — no global map, no communication. Stack: Ray RLlib + PettingZoo + SuperSuit.
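The wrapper chain itself is SuperSuit's job, but the effect of the per-agent observation pipeline can be sketched in plain numpy. Function names and the raw frame size below are illustrative, not the SuperSuit API:

```python
import numpy as np
from collections import deque

def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB frame to (H, W) grayscale via channel mean."""
    return rgb.mean(axis=-1)

def resize_nearest(frame, size=84):
    """Nearest-neighbour resize of an (H, W) frame to (size, size)."""
    h, w = frame.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return frame[rows[:, None], cols]

class FrameStack:
    """Keep the last k processed frames; each agent sees only its own stack."""
    def __init__(self, k=3, size=84):
        self.frames = deque(maxlen=k)
        self.size = size

    def observe(self, rgb):
        self.frames.append(resize_nearest(to_grayscale(rgb), self.size))
        while len(self.frames) < self.frames.maxlen:  # pad at episode start
            self.frames.append(self.frames[-1])
        return np.stack(self.frames, axis=-1)         # (84, 84, 3)

stack = FrameStack()
obs = stack.observe(np.zeros((512, 720, 3)))  # raw frame size is made up
print(obs.shape)  # (84, 84, 3)
```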

Shared policy — one network serves all four agents, giving ~4× the training data of a single-agent setup. Trade-off: the network must generalize across agent types, which can break down when agent types require different optimal strategies (archers vs knights).

Centralised training, decentralised execution (CTDE) — RLlib aggregates all agents' trajectories into one batch during training. At execution, each agent acts from its own local observation only. No communication channel required — the standard production-ready MARL paradigm.
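A toy sketch of the CTDE split, with a tabular stand-in for the shared network. All names are illustrative, not RLlib code:

```python
import random

# Centralised training, decentralised execution in miniature:
# one shared "policy" is updated from a batch pooled across all agents,
# but each agent acts only on its own local observation.
N_ACTIONS = 4
policy = {}  # obs -> action-value estimates, shared by every agent

def act(local_obs, eps=0.1):
    """Decentralised execution: uses only the calling agent's observation."""
    q = policy.setdefault(local_obs, [0.0] * N_ACTIONS)
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=q.__getitem__)

def train_step(pooled_batch, lr=0.5):
    """Centralised training: transitions from all agents update one table."""
    for obs, action, reward in pooled_batch:
        q = policy.setdefault(obs, [0.0] * N_ACTIONS)
        q[action] += lr * (reward - q[action])

# Trajectories from all four agents are aggregated into a single batch.
batch = [("o1", 2, 1.0), ("o1", 2, 1.0), ("o2", 0, 0.0), ("o1", 1, 0.0)]
train_step(batch)
print(act("o1", eps=0.0))  # → 2 (the rewarded action)
```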

DQN (off-policy) — transitions stored in a replay buffer shared across all agents. Random mini-batch sampling breaks temporal correlation and smooths non-stationarity by mixing experience from different training stages. n_step=10, epsilon annealed 1.0→0.01 over 1M steps, 4 rollout workers.
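The n-step target that replay sampling feeds into can be written out directly; with n_step=10 the bootstrap value sits ten steps out, so a zombie kill propagates back faster. The numbers below are made up for illustration:

```python
# n-step TD target: G = sum_{k<n} gamma^k * r_k + gamma^n * max_a Q(s_{t+n}, a)
def n_step_target(rewards, bootstrap_q, gamma=0.99, n=10):
    g = sum(gamma ** k * r for k, r in enumerate(rewards[:n]))
    return g + gamma ** n * bootstrap_q

# One zombie kill (team reward 1.0) five steps into the 10-step window:
rewards = [0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0]
print(round(n_step_target(rewards, bootstrap_q=2.0), 4))  # → 2.7598
```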

PPO (on-policy) — fresh experience each iteration; no replay. Entropy regularisation (entropy_coeff=0.15) keeps exploration alive longer, causing a mid-training dip but enabling stronger strategies later. Hyperparameters from Optuna search: lr=2e-5, gamma=0.90. 7 rollout workers.
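The entropy bonus is the mechanism behind that mid-training dip. A minimal sketch, assuming a discrete action distribution and the entropy_coeff=0.15 from this project's config:

```python
import math

# PPO adds coeff * H(pi) to the objective: higher-entropy (still-exploring)
# action distributions earn a larger bonus, delaying premature collapse
# onto a single tactic.
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_bonus(probs, coeff=0.15):
    return coeff * entropy(probs)

uniform = [0.25] * 4               # still exploring all 4 actions
peaked = [0.97, 0.01, 0.01, 0.01]  # nearly collapsed policy
print(entropy_bonus(uniform) > entropy_bonus(peaked))  # → True
```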


Results

What the training achieved

Two algorithms competed head-to-head over one hour of CPU training. PPO won clearly.

                                         DQN       PPO
Total episodes trained                 6,938    14,007
Iterations                               408       156
Reward mean (all episodes)             8.661     8.601
Final performance (last 100 eps)        8.79     10.47
Hit the success target (≥10 reward)    36.8%     39.7%

PPO crossed the success threshold on average in its final window and was still improving at the one-hour cutoff. DQN stabilized earlier but at a lower level.

The gap comes down to two design choices:

  • PPO’s entropy bonus keeps exploration alive longer.
  • PPO’s 7 workers generate nearly twice as many training episodes in the same wall-clock time as DQN’s 4.

[Rollout GIF: DQN agents playing (DQN_marl_playing_latest.gif)]
[Figure: system design overview (SYSTEM_DESIGN.png)]

How coordinated were the agents?

Beyond reward, the project measures how the team performed:

  • Action diversity (0.78): Agents consistently chose different actions — no collapse into redundant behavior.
  • Synchronisation (~0.008 PPO, ~0.002 DQN): Near zero for both. Agents killed zombies in parallel, not in coordinated sequences.
  • Role specialisation score (~0.01): Effectively none. Specialisation into distinct archer vs knight roles likely needs longer training or separate policies per agent type.
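
Two of these metrics are easy to sketch. These are plausible implementations of the idea, not necessarily the project's exact formulas:

```python
from itertools import combinations

def action_diversity(actions_per_step):
    """Mean pairwise disagreement: 1.0 = agents never pick the same action."""
    total, diff = 0, 0
    for step in actions_per_step:
        for a, b in combinations(step, 2):
            total += 1
            diff += a != b
    return diff / total

def synchronisation(rewards_a, rewards_b):
    """Fraction of scoring timesteps where two agents score together."""
    both = sum(1 for x, y in zip(rewards_a, rewards_b) if x > 0 and y > 0)
    any_ = sum(1 for x, y in zip(rewards_a, rewards_b) if x > 0 or y > 0)
    return both / any_ if any_ else 0.0

# Four agents, three timesteps of discrete actions:
steps = [(0, 1, 2, 3), (1, 1, 2, 0), (2, 3, 0, 1)]
print(round(action_diversity(steps), 2))          # → 0.94
print(synchronisation([1, 0, 0, 1], [0, 1, 0, 0]))  # kills never coincide → 0.0
```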

Honest takeaway: reward improved, coordination did not emerge yet. That distinction is the point.


What the data reveals about architecture

Archers scored ~91% of all kills; knights ~8.5%.

This isn’t a bug — it’s what happens when one shared network has to serve two agent types with fundamentally different combat mechanics. The network optimizes for ranged combat and applies that same strategy to melee knights, where it doesn’t fit.

Known fixes

  • separate policy heads per agent type
  • or a role-conditioning input so the shared policy can condition on which agent it’s controlling
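
The role-conditioning option can be sketched as an observation transform; shapes and names here are illustrative:

```python
import numpy as np

# Append a one-hot agent-type vector to each flattened observation so one
# shared policy can still behave differently per role.
AGENT_TYPES = {"archer_0": 0, "archer_1": 0, "knight_0": 1, "knight_1": 1}
N_TYPES = 2

def condition_on_role(obs, agent_id):
    one_hot = np.zeros(N_TYPES)
    one_hot[AGENT_TYPES[agent_id]] = 1.0
    return np.concatenate([obs.ravel(), one_hot])

obs = np.zeros((84, 84, 3))
x = condition_on_role(obs, "knight_0")
print(x.shape)  # flattened 84*84*3 obs plus 2 role bits → (21170,)
print(x[-2:])   # [0. 1.] → knight
```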

Data provenance note: DQN training metrics and per-agent plots are sourced from run 2efaa (started 17:59). The DQN coordination JSON is from run 9ef25 (started 18:17) — a separate execution recorded in run_all.log. These are not the same DQN training run.


Why Reward Alone Is Not Enough

A single reward number hides failure modes that matter in production:

What reward misses                     How to measure it
One agent carrying the team            Per-agent contribution share
Agents doing the same thing            Action diversity
Parallel action, zero coordination     Reward synchronisation
Policy collapse                        Role specialisation entropy
Lucky single run                       Cross-seed variance
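
The first of these, per-agent contribution share, is a one-liner; the kill counts below are invented to echo the ~91% / ~8.5% split reported above:

```python
# Per-agent contribution share: the simplest "who is carrying the team" check.
def contribution_shares(kills):
    total = sum(kills.values())
    return {agent: n / total for agent, n in kills.items()}

kills = {"archer_0": 48, "archer_1": 43, "knight_0": 5, "knight_1": 4}
shares = contribution_shares(kills)
print(shares["archer_0"] + shares["archer_1"])  # archers carry ~0.91
```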

This project implements all five. The result isn’t just “reward went up” — it’s a diagnostic showing exactly where the policy is strong, where it’s structurally limited, and what to fix next.


Real-World Transfer

The same coordination fingerprint — synchronisation, diversity, specialisation — maps directly onto production problems:

  • Warehouse robotics: throughput, deadlock avoidance, idle-time fairness
  • Autonomous traffic: turn-taking synchronisation, lane role allocation
  • Cybersecurity: complementary detection coverage across IDS, firewall, honeypot layers
  • Microservice orchestration: load-shedding coordination, SLA fairness across service types

The zombie kill signal is a proxy. The measurement infrastructure is the transferable artifact.


What This Demonstrates

A complete ML engineering loop — not just a model, but a system:

  • reproducible training pipeline with environment and seed controls
  • multi-dimensional evaluation beyond a single loss curve
  • honest diagnostic interpretation, including when results fall short
  • qualitative validation via rollout GIFs

The coordination metrics are the contribution. They exist because reward alone cannot tell you whether your multi-agent system is working for the right reasons.
