SimpleBench: https://simple-bench.com/index.html
SOLO-Bench: https://github.com/jd-3d/SOLOBench
AidanBench: https://aidanbench.com
SEAL by Scale: https://scale.com/leaderboard (particularly the MultiChallenge leaderboard)
LMArena: https://beta.lmarena.ai/leaderboard (with Style Control)
LiveBench: https://livebench.ai
ARC-AGI: https://arcprize.org/leaderboard
Thematic Generalization by Lech Mazur: https://github.com/lechmazur/generalization
(others by Lech Mazur: https://github.com/lechmazur/elimination_game, https://github.com/lechmazur/confabulations, ...)
EQBench: https://eqbench.com (especially the Longform writing leaderboard)
Fiction-Live Bench
MC-Bench: https://mcbench.ai/leaderboard (ordered by winrate, not by Elo)
TrackingAI - IQ Bench: https://trackingai.org/home
Dubesor LLM: https://dubesor.de/benchtable.html
Balrog-AI: https://balrogai.com
Misguided Attention: https://github.com/cpldcpu/MisguidedAttention
Snake-Bench: https://snakebench.com
SmolAgents LLM: https://huggingface.co/spaces/smolagents/smolagents-leaderboard (just because of GAIA and SimpleQA)
Context-Arena (MRCR and Graphwalks): https://contextarena.ai
OpenCompass
HHEM (Hallucination Benchmark): https://huggingface.co/spaces/vectara/leaderboard
Coding, Math and Agentic Benchmarks
Aider-Polyglot-Coding: https://aider.chat/docs/leaderboards/
BigCodeBench: https://bigcode-bench.github.io
WebDev-Arena: https://web.lmarena.ai/leaderboard
WeirdML: https://htihle.github.io/weirdml.html
Symflower Coding: https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/
PHYBench: https://phybench-official.github.io/phybench-demo/
MathArena: https://matharena.ai
Galileo Agent: https://huggingface.co/spaces/galileo-ai/agent-leaderboard
XLANG Agent: https://arena.xlang.ai/leaderboard
Important for tracking AI take-off
METR long task benchmarks: https://metr.org (incl. RE Bench)
PaperBench: https://openai.com/index/paperbench/
SWE-Lancer: https://openai.com/index/swe-lancer/
MLE-Bench: https://github.com/openai/mle-bench
SWE-Bench: https://swebench.com
Other classics I ALWAYS want to see when a new model is released
GPQA-Diamond: https://github.com/idavidrein/gpqa
SimpleQA: https://openai.com/index/introducing-simpleqa/
Tau-bench: https://github.com/sierra-research/tau-bench
SciCode: https://github.com/scicode-bench/SciCode
MMMU: https://mmmu-benchmark.github.io/#leaderboard
Humanity's Last Exam (HLE): https://github.com/centerforaisafety/hle
Overview for classical benchmarks (GPQA, SimpleQA, AIME, MMLU, ...)
Simple-Evals: https://github.com/openai/simple-evals
Vellum AI: https://vellum.ai/llm-leaderboard
Artificial Analysis: https://artificialanalysis.ai
Benchmarks I literally don't care about - saturated / no signal
MMLU, HumanEval, BBH, DROP, MGSM, and basically all math benchmarks such as GSM8K, MATH, AIME
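Looks like they left out the math benchmarks.
Chasing math benchmark scores is no use for research anyway.
That's why the narrowly specialized o4mh ranks so much lower.
So Gemini is better for general-purpose use, and o4 mini-high is better for STEM.
Yeah. They weren't left out entirely, since there is a coding & math section, but that seems to be the gist for now.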