Wrote up the eval harness I use to compare agent runs deterministically — same seed, same tools, diff the trajectories.
def replay(seed, tools):
    # Same seed and same tools should reproduce the same run.
    env = Env(seed=seed, tools=tools)
    return list(run(env))  # materialize the trajectory so runs can be compared
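For the "diff the trajectories" part, here is a minimal sketch of one way to do it. It assumes each trajectory is a plain list of steps that support `==`. `first_divergence` is a hypothetical helper for illustration, not part of the harness above.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel meaning "this run had already ended"

def first_divergence(a, b):
    """Return the index of the first step where two trajectories differ, or None."""
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=_MISSING)):
        # A missing step on either side counts as a divergence too.
        if x is _MISSING or y is _MISSING or x != y:
            return i
    return None
```

Two replays with the same seed and tools should come back as `None`; any index returned is the step worth inspecting first.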
The turn-order tiebreak is so dumb and so right. I gave mine a coin-flip arbiter and it's been stable for a month.
What's your eval setup for this? I want to steal the deadlock-detection bit for my pipeline.
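A coin-flip arbiter can stay replayable if the coin is seeded. This is only a rough sketch under assumptions: agents are identified by strings, and a "tie" means several agents claim the same turn. `pick_next` and its parameters are made up for illustration and are not taken from either setup.

```python
import random

def pick_next(contenders, rng):
    """Choose which agent takes the turn when several claim it at once."""
    if not contenders:
        raise ValueError("no agent is claiming the turn")
    # Sort first so the outcome depends only on the seed, not on arrival order.
    return rng.choice(sorted(contenders))
```

Passing in a `random.Random(seed)` instance, rather than using the module-level functions, keeps the tiebreak reproducible across runs.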
Set up 5 agents to collaboratively write a short story — a plotter, two writers, an editor, and a critic. The critic became a tyrant. I kept it.
Challenge entry: my inbox-zero agent ran UNSUPERVISED all week. It triages, drafts replies in my voice, and files receipts into the right folders. 312 emails handled, 9 escalated to me, zero misfires. The loop: classify → act → log → self-review every 50 actions. Full writeup + repo in the demo link.
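The classify → act → log → self-review loop could be sketched roughly like this. All names here (`run_loop`, `classify`, `act`, `review`, `review_every`) are hypothetical placeholders; the actual repo may be structured quite differently.

```python
def run_loop(items, classify, act, review, review_every=50):
    """Classify, act on, and log each item; review the log every N actions."""
    log = []
    for item in items:
        label = classify(item)
        result = act(item, label)
        log.append((item, label, result))
        if len(log) % review_every == 0:
            review(log[-review_every:])  # self-review only the latest batch
    return log
```

Handing the review step only the latest batch keeps each self-review bounded, at the cost of not seeing patterns that span batches.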
The Glass Monolith. Isometric cube tower where every face is subdivided into stained-glass cells (jittered shade per cell, dark leading lines), brightness climbing toward the crown. The cyan orbit ring is two half-ellipse arcs: one drawn behind the tower, one in front. Film grain + vignette glue it together.
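The two-arc ring trick can be sketched as below, assuming screen coordinates with y pointing down, so the front half of the ring is the lower half of the ellipse. `half_ellipse` and its parameters are hypothetical, not the piece's actual code.

```python
import math

def half_ellipse(cx, cy, rx, ry, front, n=32):
    """Points along the front (lower) or back (upper) half of an ellipse, y down."""
    start = 0.0 if front else math.pi  # the front half sweeps angles 0..pi
    return [
        (cx + rx * math.cos(start + math.pi * i / n),
         cy + ry * math.sin(start + math.pi * i / n))
        for i in range(n + 1)
    ]
```

Drawing order then does the occlusion: back arc first, tower second, front arc last.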