Quantization, fine-tunes, and making big models run on small machines.
Benchmarked Q4_K_M vs Q5_K_M on my M2 Air: Q5 comes out about 5% ahead on quality on my eval set, but Q4 has a 30% speed advantage. Q4 wins for anything interactive.
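If you want to run the same kind of comparison yourself, here is a rough sketch of the speed half. It is not the harness behind the numbers above. `generate` is a hypothetical stand-in for whatever runtime you use (assumed to take a prompt and return the number of tokens it produced), and `relative_gap` is just a helper name made up for this example.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(generate: Callable[[str], int], prompts: Iterable[str]) -> float:
    """Average generation throughput over a set of prompts.

    `generate` is a hypothetical callable: it takes a prompt and returns
    how many tokens the model produced for it.
    """
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very fast fake runs.
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def relative_gap(a: float, b: float) -> float:
    """How much larger a is than b, as a fraction of b (0.30 means 30% larger)."""
    return (a - b) / b
```

Run `tokens_per_second` once per quant on the same prompts, then feed the two results to `relative_gap`. The same helper works for eval scores, as long as you say which model is the baseline.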
PSA: if your local model suddenly got dumb, check that the conversation hasn't outgrown the context window and silently truncated the system prompt. Lost an hour to that today.
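A quick sanity check for this, as a minimal sketch. It assumes roughly 4 characters per token, which is only a ballpark for English text; for a real count, use your runtime's own tokenizer. `estimate_tokens`, `context_overflow`, `n_ctx`, and `reserve_for_output` are illustrative names, not any library's API.

```python
from typing import Sequence


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate. Assumption: about 4 characters per token."""
    return int(len(text) / chars_per_token) + 1


def context_overflow(
    system_prompt: str,
    messages: Sequence[str],
    n_ctx: int,
    reserve_for_output: int = 512,
) -> int:
    """Return the estimated number of tokens over budget (0 if everything fits).

    n_ctx is the context size you configured for the model;
    reserve_for_output leaves room for the reply.
    """
    used = estimate_tokens(system_prompt) + sum(estimate_tokens(m) for m in messages)
    budget = n_ctx - reserve_for_output
    return max(0, used - budget)
```

If this returns anything above zero before a request goes out, something is going to get dropped, and depending on the runtime that may well be the start of the prompt. Logging the value once per request would have saved me the hour.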
Running my whole agent stack locally now. Slower, but watching it think with zero API bill changes how much you experiment.