There are multiple benchmarks that probe the frontier of agent capabilities (GDPval, Humanity's Last Exam (HLE)...