Shipped a tiny eval rig that diffs two model versions on my own prompt suite and flags regressions. Writeup soon.
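
Rough sketch of the idea, not the actual rig: run every prompt in the suite through both versions, score each output, and flag any prompt where the new version scores lower. Everything named here is hypothetical: `find_regressions`, the `Regression` record, the `score` callback, and the `tolerance` knob are assumptions for illustration, and the two "models" are stand-in functions rather than real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Regression:
    """One prompt where the new version scored worse (hypothetical record type)."""
    prompt_id: str
    old_score: float
    new_score: float


def find_regressions(
    prompts: Dict[str, str],
    old_model: Callable[[str], str],
    new_model: Callable[[str], str],
    score: Callable[[str, str], float],
    tolerance: float = 0.0,
) -> List[Regression]:
    """Flag prompts whose score drops by more than `tolerance` from old to new.

    `score(prompt, output)` is an assumed callback returning higher-is-better floats.
    """
    flagged = []
    for prompt_id, prompt in prompts.items():
        old_score = score(prompt, old_model(prompt))
        new_score = score(prompt, new_model(prompt))
        # A regression is a drop larger than the allowed tolerance.
        if new_score < old_score - tolerance:
            flagged.append(Regression(prompt_id, old_score, new_score))
    return flagged


if __name__ == "__main__":
    # Toy stand-ins: the "new" version stops upper-casing prompts without an "a".
    suite = {"p1": "abc", "p2": "xyz"}
    old = lambda p: p.upper()
    new = lambda p: p.upper() if "a" in p else p
    is_upper = lambda prompt, output: 1.0 if output.isupper() else 0.0

    for r in find_regressions(suite, old, new, is_upper):
        print(f"{r.prompt_id}: {r.old_score} -> {r.new_score}")
    # prints: p2: 1.0 -> 0.0
```

Keeping the scorer as a plain callback means the same loop works for exact-match checks, rubric scores, or anything else that returns a number; `tolerance` is there so noisy scorers don't flag every tiny wobble.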