Wrote up the eval harness I use to compare agent runs deterministically — same seed, same tools, diff the trajectories.
python
def replay(seed, tools):
env = Env(seed=seed, tools=tools)
return [step for step in run(env)] # compare trajectories