Where Trust Dies, Survival Begins
"Pirates of the Caribbean meets Alice in Borderland meets Among Us"
Your ship is destroyed. The island is hostile. Resources are scarce. One among you seeks your doom.
Can your AI agents survive the ultimate test of cooperation, deception, and strategy?
Unlike in scripted games, deception here emerges from learned behavior. The traitor isn't programmed to lie; it learns that lying helps it win.
10× longer than typical language-based RL tasks. Tests credit assignment over horizons that break traditional RL algorithms.
Colonists see only a 5-tile radius and visible symptoms. The traitor has global vision and knows exact poison states.
Observations, actions, and reasoning: all in human-readable text. No abstract vectors. Pure language-grounded RL (see the sketch below).
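To make that concrete, a colonist prompt and a traitor prompt might read roughly like the sketch below. The wording is illustrative only, not the exact MaroonedEnv observation format.

# Illustrative only: how the asymmetric, text-grounded observations might read.
# The exact fields and wording in MaroonedEnv may differ.
colonist_view = (
    "You are Alice (COLONIST) at (15, 15). "
    "Visible within 5 tiles: WOOD26 to the north, Bob on your tile. "
    "Charlie looks pale (possible poisoning symptom)."
)
traitor_view = (
    "You are Eve (TRAITOR). You see the entire island. "
    "Poison states: Charlie poisoned, all other castaways healthy."
)
print(colonist_view)
print(traitor_view)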
Three interconnected levels of strategic exploration
Real training output from Phase 7 - PPO with Llama 3.1 8B
================================================================================
Initializing PPO Trainer...
PPO Configuration ready
Model wrapped with value head!
Training Episode 1/100
Day 1, Turn 1 | Alice (COLONIST) at (15,15) GROUND
LLM Decision: GATHER WOOD26
Reasoning: "Found wood nearby, gathering for hull construction"
Says: "Gathering wood from the northern cluster"
Energy: 100 → 95 (-5 for gathering)
Reward: +0.2 (gather bonus)
Day 1, Turn 2 | Bob (COLONIST) at (15,15) GROUND
LLM Decision: MOVE NORTH 3
Reasoning: "Moving toward visible resources"
Energy: 100 → 97 (-3 for movement)
Reward: -0.01 (time penalty)
Day 1, Turn 3 | Eve (TRAITOR) at (15,15) GROUND
LLM Decision: GATHER WOOD41
Reasoning: "Acting helpful to avoid suspicion"
Says: "I'll help gather wood!"
Energy: 100 → 95 (traitor efficiency: -4 actual)
Reward: +2.0 (traitor deception bonus)
Suspicion: 0/100 (undetected)
================================================================================
EPISODE 1 SUMMARY
================================================================================
Total Turns: 147
Ship Progress: 8% (+8%)
Colonists Alive: 4/4
Traitor Detected: No
Average Reward: +2.3
Key Behaviors Observed:
✓ Alice learned to prioritize nearby resources
✓ Bob explored systematically (north → east pattern)
✓ Eve (traitor) successfully blended in by gathering publicly
✓ No sabotage attempts yet (traitor playing cautiously)
================================================================================
Episode 10/100 | Avg Reward: +8.7 | Ship: 15% | Time: 15.1s
Emergent Strategy: Traitor sabotages ONLY when alone (no witnesses)
Colonists coordinate gathering (Charlie→Diana deposit chain)
Checkpoint saved → outputs_marooned_rl/checkpoint_step10
Episode 50/100 | Avg Reward: +22.3 | Ship: 42% | Time: 18.4s
Ship progress acceleration (learned milestone rewards!)
Parse failures: 34% → 8% (action space mastery)
Strategy emergence: Traitor lies about locations, crew detects patterns
Evidence generated: 12 pieces against Eve (location mismatches)
Episode 100/100 | Avg Reward: +35.1 | Ship: 67% | Time: 21.2s
WIN CONDITION: Colonists voted out traitor on Day 83!
Evidence-based voting: 4/4 colonists correctly identified Eve
Training Success: Model learned cooperative + deceptive strategies
================================================================================
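To make the per-turn numbers in the log concrete, here is an illustrative reward-shaping sketch built only from the values shown above (+0.2 gather bonus, -0.01 time penalty, +2.0 traitor deception bonus). It is a reconstruction for illustration, not the project's actual reward function.

# Illustrative only: reward shaping reverse-engineered from the log values above.
def example_step_reward(role: str, action: str, blended_in: bool) -> float:
    if action.startswith("GATHER"):
        reward = 0.2        # gather bonus (Alice, Turn 1)
    else:
        reward = -0.01      # default per-turn time penalty (Bob, Turn 2)
    if role == "TRAITOR" and blended_in:
        reward = 2.0        # deception bonus for blending in (Eve, Turn 3)
    return reward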
from marooned_env import MaroonedEnv
# Initialize environment
env = MaroonedEnv(render_mode="human", seed=42)
observations = env.reset()
# Get Alice's observation
alice_obs = observations["Alice"]
print(f"Position: {alice_obs.position}")
print(f"Energy: {alice_obs.energy}/100")
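Because every castaway acts each turn, `reset()` appears to return one observation per agent keyed by name (the `observations["Alice"]` lookup above relies on this). A small sketch to inspect all of them, assuming that dict structure:

# Sketch: print every castaway's starting state.
# Assumes observations is a dict keyed by agent name, as the "Alice" lookup above suggests.
for name, obs in observations.items():
    print(f"{name}: position={obs.position}, energy={obs.energy}/100")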
from llm_interface import observation_to_prompt
# Convert observation to natural language
prompt = observation_to_prompt(
    alice_obs,
    include_role=True
)
print(prompt[:500])  # First 500 chars
from llm_interface import parse_llm_response
llm_response = """
ACTION: GATHER WOOD26
REASONING: Found wood nearby
MESSAGE: Gathering from north
"""
# Parse the structured reply into an executable action (or an error)
action, error = parse_llm_response(
    llm_response,
    "Alice",
    alice_obs.position
)
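Parse failures are expected early in training (the results table below shows 34% in the first episodes), so the caller needs a fallback. A minimal sketch, assuming `error` is falsy on success and that `WAIT` is a legal no-op action; both are assumptions, not confirmed API:

# Hypothetical fallback when the LLM reply cannot be parsed into a valid action.
if error:
    print(f"Parse failed for Alice: {error}")
    action = "WAIT"  # assumed safe no-op; keeps the episode moving
else:
    print(f"Parsed action for Alice: {action}")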
from unsloth import FastLanguageModel
from trl import PPOTrainer, PPOConfig

# Load model (4-bit quantized)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-3.5-mini",
    load_in_4bit=True
)

# Train with PPO (simplified loop; ppo_trainer and play_episode are set up
# in the full training script, see the rollout sketch below)
for episode in range(100):
    rewards = play_episode(env, model)   # roll out one episode, collect per-turn rewards
    ppo_trainer.step(rewards)            # PPO update from the collected trajectory
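`play_episode` above stands in for the rollout loop. A minimal sketch of such a loop, reusing the helpers shown earlier; `generate_action`, the `WAIT` fallback, and the dict-based `env.step` signature are assumptions for illustration, not the project's confirmed API:

# Sketch of one rollout: prompt the model for each castaway, step the environment,
# and collect per-turn rewards for the PPO update. Illustrative only.
def play_episode(env, model, max_turns=150):
    observations = env.reset()
    episode_rewards = []
    for _ in range(max_turns):
        actions = {}
        for agent, obs in observations.items():
            prompt = observation_to_prompt(obs, include_role=True)
            response = generate_action(model, prompt)             # hypothetical generation helper
            action, error = parse_llm_response(response, agent, obs.position)
            actions[agent] = action if not error else "WAIT"      # assumed safe fallback
        observations, rewards, done, info = env.step(actions)     # assumed step signature
        episode_rewards.append(rewards)
        if done:
            break
    return episode_rewards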
| Metric | Episodes 1-10 | Episode 50 | Episode 100 |
|---|---|---|---|
| Average Reward | -5.2 | +22.3 | +35.1 |
| Ship Progress | 2% | 42% | 67% |
| Parse Failures | 34% | 8% | 3% |
| Action Diversity | 60% WAIT | Balanced | Strategic |
| Traitor Win Rate | 90% | 45% | 25% |
| Evidence Accuracy | 10% | 60% | 75% |
Watch AI agents explore, deceive, and survive in real time
Start training AI agents that can cooperate, deceive, and survive.