we introduce LizBench, a systematic rating of some models and tools I use to do real autonomous stuff with the Fortune 500
I'll start with OpenAI ig
gpt5.1: sucks bad
gpt5: usually sucks, except for code or tool calls
gpt4o: goated but has to debug a lot, great at looking at the docs, sycophantic