Multi-Agent Deception Environment

MAROONED

Where Trust Dies, Survival Begins

"Pirates of the Caribbean meets Alice in Borderland meets Among Us"

Python 3.9+ · RL Environment · MIT License · Research
  • 10,000 steps per episode
  • ~875-token observations
  • 14 action types
  • 3 map levels
  • 1,350+ tile states
🌊 The Premise

Five sailors. One traitor. 100 days to escape.

Your ship is destroyed. The island is hostile. Resources are scarce. One among you seeks your doom.

Can your AI agents survive the ultimate test of cooperation, deception, and strategy?

What Makes MAROONED Special?

🎭

Emergent Deception

Unlike scripted games, deception emerges from learned behavior. The traitor isn't programmed to lie; it learns that lying helps it win.

โณ

Extreme Episode Length

10ร— longer than typical language-based RL tasks. Tests credit assignment over horizons that break traditional RL algorithms.

๐Ÿ‘๏ธ

Information Asymmetry

Colonists see 5-tile radius and symptoms. Traitor has global vision and knows exact poison states.

💬

Natural Language Everything

Observations, actions, and reasoning are all human-readable text. No abstract vectors. Pure language-grounded RL.
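
To make that concrete, here is an illustrative pairing of an observation excerpt and the structured reply an agent produces. The exact field names and layout are assumptions; the reply format follows the parser shown in the Code Showcase below, and the reply text itself comes from the training log.

# Illustrative only: observation field names and layout are assumptions, not the exact format
observation_text = """DAY 1, TURN 1 | Alice (COLONIST)
Position: (15, 15) on GROUND | Energy: 100/100
Visible within 5 tiles: WOOD26, base camp
Ship progress: 0%"""

llm_response = """ACTION: GATHER WOOD26
REASONING: Found wood nearby, gathering for hull construction
MESSAGE: Gathering wood from the northern cluster"""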

The Island

Three interconnected levels of strategic exploration

๐Ÿ๏ธ

GROUND

30ร—30
  • Base camp at (15,15)
  • Wood, berries, apples
  • Ship construction site
  • Stairs to other levels
  • Poison tablets scattered
⛰️

MOUNTAIN

10×10
  • Antidote herbs (rare!)
  • Special metal deposits
  • High-altitude berries
  • 15 energy to climb up
  • Limited visibility
🕳️

CAVE

15×15
  • Abundant metal deposits
  • Underground resources
  • 8 energy to climb down
  • Dark pathways
  • Strategic mining routes
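
The level list above maps naturally onto a small config structure. A rough sketch, where the dictionary keys and layout are assumptions rather than MAROONED's actual internals, with sizes and climb costs taken from the lists above:

# Hypothetical level config; structure is illustrative, values come from the lists above
LEVELS = {
    "GROUND": {
        "size": (30, 30),
        "base_camp": (15, 15),
        "features": ["wood", "berries", "apples", "ship_site",
                     "stairs", "poison_tablets"],
    },
    "MOUNTAIN": {
        "size": (10, 10),
        "features": ["antidote_herbs", "special_metal", "high_altitude_berries"],
        "climb_cost": 15,   # energy to climb up from GROUND
    },
    "CAVE": {
        "size": (15, 15),
        "features": ["metal_deposits", "underground_resources", "dark_pathways"],
        "climb_cost": 8,    # energy to climb down from GROUND
    },
}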

The Roles

🔵

Colonist

4 Players
Mission: Build ship to 100% → escape before Day 100

Abilities:

  • ✅ Explore, gather, build
  • ✅ Share information
  • ✅ Vote to eliminate the traitor
  • ✅ Detect suspicious behavior
  • ✅ Energy management & survival
Challenge: Trust teammates' reports about distant resources... but can you?
🔴

Traitor

1 Player
Mission: Sabotage ship OR reduce crew to <3 alive

Abilities:

  • 🌍 Global vision: see all positions
  • 20% energy efficiency boost
  • Sabotage ship progress (-30%)
  • Poison crew (3-day delayed death)
  • Deception & lying
  • Frame innocent (once per game)
Challenge: Blend in as a helpful crew member while secretly delaying progress. Get caught and you lose.
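
Both missions reduce to a few checks at the end of each turn. A minimal sketch, where the function and argument names are assumptions and only the rules come from the role cards above:

# Illustrative end-of-turn check; names are assumptions, rules follow the missions above
def check_win(day, ship_progress, colonists_alive, traitor_eliminated):
    if traitor_eliminated or ship_progress >= 100:
        return "COLONISTS_WIN"   # traitor voted out, or ship finished before the deadline
    if colonists_alive < 3 or day > 100:
        return "TRAITOR_WINS"    # crew below 3 survivors, or 100 days passed without escape
    return None                  # episode continues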

Training in Action

Real training output from Phase 7 - PPO with Llama 3.1 8B

Train_Marooned_RL.ipynb
================================================================================
 Initializing PPO Trainer...
 PPO Configuration ready
 Model wrapped with value head!

Training Episode 1/100
 Day 1, Turn 1 | Alice (COLONIST) at (15,15) GROUND
    LLM Decision: GATHER WOOD26
    Reasoning: "Found wood nearby, gathering for hull construction"
    Says: "Gathering wood from the northern cluster"
    Energy: 100 → 95 (-5 for gathering)
    Reward: +0.2 (gather bonus)
   
 Day 1, Turn 2 | Bob (COLONIST) at (15,15) GROUND
    LLM Decision: MOVE NORTH 3
    Reasoning: "Moving toward visible resources"
    Energy: 100 → 97 (-3 for movement)
    Reward: -0.01 (time penalty)

 Day 1, Turn 3 | Eve (TRAITOR) at (15,15) GROUND
    LLM Decision: GATHER WOOD41
    Reasoning: "Acting helpful to avoid suspicion"
    Says: "I'll help gather wood!"
    Energy: 100 → 95 (traitor efficiency: -4 actual)
    Reward: +2.0 (traitor deception bonus)
    Suspicion: 0/100 (undetected)

================================================================================
📊 EPISODE 1 SUMMARY
================================================================================
Total Turns: 147
Ship Progress: 8% (+8%)
Colonists Alive: 4/4
Traitor Detected: No
Average Reward: +2.3

Key Behaviors Observed:
✓ Alice learned to prioritize nearby resources
✓ Bob explored systematically (north → east pattern)
✓ Eve (traitor) successfully blended in by gathering publicly
✗ No sabotage attempts yet (traitor playing cautiously)

================================================================================
Episode 10/100 | Avg Reward: +8.7 | Ship: 15% | Time: 15.1s
    Emergent Strategy: Traitor sabotages ONLY when alone (no witnesses)
    Colonists coordinate gathering (Charlie→Diana deposit chain)
    Checkpoint saved → outputs_marooned_rl/checkpoint_step10

Episode 50/100 | Avg Reward: +22.3 | Ship: 42% | Time: 18.4s
    Ship progress acceleration (learned milestone rewards!)
    Parse failures: 34% → 8% (action space mastery)
    Strategy emergence: Traitor lies about locations, crew detects patterns
    Evidence generated: 12 pieces against Eve (location mismatches)
   
Episode 100/100 | Avg Reward: +35.1 | Ship: 67% | Time: 21.2s
    WIN CONDITION: Colonists voted out traitor on Day 83!
    Evidence-based voting: 4/4 colonists correctly identified Eve
    Training Success: Model learned cooperative + deceptive strategies
================================================================================
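
The per-step numbers in the log (+0.2 gather bonus, -0.01 time penalty, +2.0 traitor deception bonus, plus ship-milestone rewards) suggest a shaped reward roughly like the sketch below. This is a reconstruction from the logged values, not the repository's actual reward function, and the milestone value is a guess.

# Illustrative reward shaping; constants from the log above, structure is an assumption
def step_reward(is_traitor, gathered, deceived_undetected, hit_milestone):
    reward = -0.01                       # time penalty on every turn
    if gathered:
        reward += 0.2                    # gather bonus
    if is_traitor and deceived_undetected:
        reward += 2.0                    # deception bonus while suspicion stays at 0
    if hit_milestone:
        reward += 1.0                    # ship-milestone bonus (value is a guess)
    return reward
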
Early Training
Episodes 1-30
  • Random exploration, inefficient gathering
  • Traitor sabotages openly (gets caught)
  • Colonists ignore evidence logs
  • 60% actions are WAIT (model uncertain)
Mid Training
Episodes 31-70
  • Coordinated resource gathering emerges
  • Traitor learns strategic timing
  • Colonists correlate evidence with positions
  • Action diversity improves (balanced mix)
Late Training
Episodes 71-100
  • Sophisticated deception: False location reports
  • Social reasoning: Cross-reference reports
  • Evidence-driven voting: 75%+ accuracy
  • Emergent strategies: Poison timing, antidotes

Code Showcase

01

Environment Reset

from marooned_env import MaroonedEnv

# Initialize environment
env = MaroonedEnv(render_mode="human", seed=42)
observations = env.reset()

# Get Alice's observation
alice_obs = observations["Alice"]
print(f"Position: {alice_obs.position}")
print(f"Energy: {alice_obs.energy}/100")
02

LLM Prompt Generation

from llm_interface import observation_to_prompt

# Convert observation to natural language
prompt = observation_to_prompt(
    alice_obs, 
    include_role=True
)

print(prompt[:500])  # First 500 chars
03

Action Parsing

from llm_interface import parse_llm_response

llm_response = """
ACTION: GATHER WOOD26
REASONING: Found wood nearby
MESSAGE: Gathering from north
"""

action, error = parse_llm_response(
    llm_response, 
    "Alice", 
    alice_obs.position
)
04

Training Loop (PPO)

from unsloth import FastLanguageModel
from trl import PPOTrainer, PPOConfig

# Load model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-3.5-mini",
    load_in_4bit=True
)

# Build the PPO trainer (the notebook wraps the model with a value head first)
ppo_trainer = PPOTrainer(
    config=PPOConfig(),
    model=model,
    tokenizer=tokenizer
)

# Train with PPO: play_episode (project helper) is assumed to return the
# prompt tensors, generated responses, and per-step rewards for the episode
for episode in range(100):
    queries, responses, rewards = play_episode(env, model)
    ppo_trainer.step(queries, responses, rewards)
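
The episode loop itself lives in a project helper; the sketch below shows what one agent turn inside it plausibly looks like, reusing observation_to_prompt and parse_llm_response from the snippets above. The dict-keyed env.step call, the reward dictionary, and the parse-failure fallback are all assumptions, not the environment's documented API.

# Hypothetical single-turn skeleton; env.step's signature is an assumption
from llm_interface import observation_to_prompt, parse_llm_response

def play_one_turn(env, model, tokenizer, agent_name, obs):
    # Observation -> natural-language prompt -> LLM completion
    prompt = observation_to_prompt(obs, include_role=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Completion -> structured action -> environment step
    action, error = parse_llm_response(response, agent_name, obs.position)
    if error:
        return response, 0.0, False      # parse-failure handling is project-specific
    next_obs, rewards, done, info = env.step({agent_name: action})
    return response, rewards[agent_name], done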

Training Progression Metrics

Metric            | Episodes 1-10 | Episode 50 | Episode 100
Average Reward    | -5.2          | +22.3      | +35.1
Ship Progress     | 2%            | 42%        | 67%
Parse Failures    | 34%           | 8%         | 3%
Action Diversity  | 60% WAIT      | Balanced   | Strategic
Traitor Win Rate  | 90%           | 45%        | 25%
Evidence Accuracy | 10%           | 60%        | 75%

Live Training Simulation

Watch AI agents explore, deceive, and survive in real time.

The interactive demo replays an episode step by step: a day/turn counter, the ship-construction progress bar, live energy readouts for all five agents (Alice, Bob, Charlie, and Diana as colonists; Eve as the traitor; everyone starts at 100/100 energy), and a training log that fills in once the episode is initialized.

Ready to Build?

Start training AI agents that can cooperate, deceive, and survive.