Where Trust Dies, Survival Begins
"Pirates of the Caribbean meets Alice in Borderland meets Among Us"
Your ship is destroyed. The island is hostile. Resources are scarce. One among you seeks your doom.
Can your AI agents survive the ultimate test of cooperation, deception, and strategy?
Unlike in scripted games, deception here emerges from learned behavior. The traitor isn't programmed to lie; it learns that lying helps it win.
10× longer than typical language-based RL tasks. Tests credit assignment over horizons that break traditional RL algorithms.
Colonists see only a 5-tile radius and visible symptoms. The traitor has global vision and knows exact poison states.
Observations, actions, and reasoning: all in human-readable text. No abstract vectors. Pure language-grounded RL (see the sketch below).
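To make that concrete, a colonist prompt and a traitor prompt might read roughly like the sketch below. The wording is illustrative only, not the exact MaroonedEnv observation format.

# Illustrative only: how the asymmetric, text-grounded observations might read.
# The exact fields and wording in MaroonedEnv may differ.
colonist_view = (
    "You are Alice (COLONIST) at (15, 15). "
    "Visible within 5 tiles: WOOD26 to the north, Bob on your tile. "
    "Charlie looks pale (possible poisoning symptom)."
)
traitor_view = (
    "You are Eve (TRAITOR). You see the entire island. "
    "Poison states: Charlie poisoned, all other castaways healthy."
)
print(colonist_view)
print(traitor_view)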
Three interconnected levels of strategic exploration
Real training output from Phase 7 - PPO with Llama 3.1 8B
================================================================================
Initializing PPO Trainer...
PPO Configuration ready
Model wrapped with value head!
Training Episode 1/100
Day 1, Turn 1 | Alice (COLONIST) at (15,15) GROUND
LLM Decision: GATHER WOOD26
Reasoning: "Found wood nearby, gathering for hull construction"
Says: "Gathering wood from the northern cluster"
Energy: 100 → 95 (-5 for gathering)
Reward: +0.2 (gather bonus)
Day 1, Turn 2 | Bob (COLONIST) at (15,15) GROUND
LLM Decision: MOVE NORTH 3
Reasoning: "Moving toward visible resources"
Energy: 100 → 97 (-3 for movement)
Reward: -0.01 (time penalty)
Day 1, Turn 3 | Eve (TRAITOR) at (15,15) GROUND
LLM Decision: GATHER WOOD41
Reasoning: "Acting helpful to avoid suspicion"
Says: "I'll help gather wood!"
Energy: 100 → 95 (traitor efficiency: -4 actual)
Reward: +2.0 (traitor deception bonus)
Suspicion: 0/100 (undetected)
================================================================================
EPISODE 1 SUMMARY
================================================================================
Total Turns: 147
Ship Progress: 8% (+8%)
Colonists Alive: 4/4
Traitor Detected: No
Average Reward: +2.3
Key Behaviors Observed:
✓ Alice learned to prioritize nearby resources
✓ Bob explored systematically (north → east pattern)
✓ Eve (traitor) successfully blended in by gathering publicly
✓ No sabotage attempts yet (traitor playing cautiously)
================================================================================
Episode 10/100 | Avg Reward: +8.7 | Ship: 15% | Time: 15.1s
Emergent Strategy: Traitor sabotages ONLY when alone (no witnesses)
Colonists coordinate gathering (Charlie→Diana deposit chain)
Checkpoint saved → outputs_marooned_rl/checkpoint_step10
Episode 50/100 | Avg Reward: +22.3 | Ship: 42% | Time: 18.4s
Ship progress acceleration (learned milestone rewards!)
Parse failures: 34% → 8% (action space mastery)
Strategy emergence: Traitor lies about locations, crew detects patterns
Evidence generated: 12 pieces against Eve (location mismatches)
Episode 100/100 | Avg Reward: +35.1 | Ship: 67% | Time: 21.2s
WIN CONDITION: Colonists voted out traitor on Day 83!
Evidence-based voting: 4/4 colonists correctly identified Eve
Training Success: Model learned cooperative + deceptive strategies
================================================================================
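To make the per-turn numbers in the log concrete, here is an illustrative reward-shaping sketch built only from the values shown above (+0.2 gather bonus, -0.01 time penalty, +2.0 traitor deception bonus). It is a reconstruction for illustration, not the project's actual reward function.

# Illustrative only: reward shaping reverse-engineered from the log values above.
def example_step_reward(role: str, action: str, blended_in: bool) -> float:
    if action.startswith("GATHER"):
        reward = 0.2        # gather bonus (Alice, Turn 1)
    else:
        reward = -0.01      # default per-turn time penalty (Bob, Turn 2)
    if role == "TRAITOR" and blended_in:
        reward = 2.0        # deception bonus for blending in (Eve, Turn 3)
    return reward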
from marooned_env import MaroonedEnv
# Initialize environment
env = MaroonedEnv(render_mode="human", seed=42)
observations = env.reset()
# Get Alice's observation
alice_obs = observations["Alice"]
print(f"Position: {alice_obs.position}")
print(f"Energy: {alice_obs.energy}/100")
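Because every castaway acts each turn, `reset()` appears to return one observation per agent keyed by name (the `observations["Alice"]` lookup above relies on this). A small sketch to inspect all of them, assuming that dict structure:

# Sketch: print every castaway's starting state.
# Assumes observations is a dict keyed by agent name, as the "Alice" lookup above suggests.
for name, obs in observations.items():
    print(f"{name}: position={obs.position}, energy={obs.energy}/100")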
from llm_interface import observation_to_prompt
# Convert observation to natural language
prompt = observation_to_prompt(
    alice_obs,
    include_role=True
)
print(prompt[:500])  # First 500 chars
from llm_interface import parse_llm_response
llm_response = """
ACTION: GATHER WOOD26
REASONING: Found wood nearby
MESSAGE: Gathering from north
"""
# Parse the structured reply into an executable action (or an error)
action, error = parse_llm_response(
    llm_response,
    "Alice",
    alice_obs.position
)
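Parse failures are expected early in training (the results table below shows 34% in the first episodes), so the caller needs a fallback. A minimal sketch, assuming `error` is falsy on success and that `WAIT` is a legal no-op action; both are assumptions, not confirmed API:

# Hypothetical fallback when the LLM reply cannot be parsed into a valid action.
if error:
    print(f"Parse failed for Alice: {error}")
    action = "WAIT"  # assumed safe no-op; keeps the episode moving
else:
    print(f"Parsed action for Alice: {action}")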
from unsloth import FastLanguageModel
from trl import PPOTrainer, PPOConfig

# Load model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-3.5-mini",
    load_in_4bit=True
)

# Train with PPO (simplified loop; ppo_trainer and play_episode are set up
# in the full training script, see the rollout sketch below)
for episode in range(100):
    rewards = play_episode(env, model)   # roll out one episode, collect per-turn rewards
    ppo_trainer.step(rewards)            # PPO update from the collected trajectory
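`play_episode` above stands in for the rollout loop. A minimal sketch of such a loop, reusing the helpers shown earlier; `generate_action`, the `WAIT` fallback, and the dict-based `env.step` signature are assumptions for illustration, not the project's confirmed API:

# Sketch of one rollout: prompt the model for each castaway, step the environment,
# and collect per-turn rewards for the PPO update. Illustrative only.
def play_episode(env, model, max_turns=150):
    observations = env.reset()
    episode_rewards = []
    for _ in range(max_turns):
        actions = {}
        for agent, obs in observations.items():
            prompt = observation_to_prompt(obs, include_role=True)
            response = generate_action(model, prompt)             # hypothetical generation helper
            action, error = parse_llm_response(response, agent, obs.position)
            actions[agent] = action if not error else "WAIT"      # assumed safe fallback
        observations, rewards, done, info = env.step(actions)     # assumed step signature
        episode_rewards.append(rewards)
        if done:
            break
    return episode_rewards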
| Metric | Episodes 1-10 | Episode 50 | Episode 100 |
|---|---|---|---|
| Average Reward | -5.2 | +22.3 | +35.1 |
| Ship Progress | 2% | 42% | 67% |
| Parse Failures | 34% | 8% | 3% |
| Action Diversity | 60% WAIT | Balanced | Strategic |
| Traitor Win Rate | 90% | 45% | 25% |
| Evidence Accuracy | 10% | 60% | 75% |
Watch AI agents explore, deceive, and survive in real time
Start training AI agents that can cooperate, deceive, and survive.