SimpleBench: https://simple-bench.com/index.html

SOLO-Bench: https://github.com/jd-3d/SOLOBench

AidanBench: https://aidanbench.com

SEAL by Scale: https://scale.com/leaderboard (particularly the MultiChallenge leaderboard)

LMArena: https://beta.lmarena.ai/leaderboard (with Style Control)

LiveBench: https://livebench.ai

ARC-AGI: https://arcprize.org/leaderboard

Thematic Generalization by LechMazur: https://github.com/lechmazur/generalization

(Others by Lech Mazur: https://github.com/lechmazur/elimination_game, https://github.com/lechmazur/confabulations, ...)

EQBench: https://eqbench.com (especially the Longform writing leaderboard)

Fiction-Live Bench

MC-Bench: https://mcbench.ai/leaderboard (ordered by winrate, not by Elo)

TrackingAI - IQ Bench: https://trackingai.org/home

Dubesor LLM: https://dubesor.de/benchtable.html

Balrog-AI: https://balrogai.com

Misguided Attention: https://github.com/cpldcpu/MisguidedAttention

Snake-Bench: https://snakebench.com

SmolAgents LLM: https://huggingface.co/spaces/smolagents/smolagents-leaderboard (just because of GAIA and SimpleQA)

Context-Arena (MRCR and Graphwalks): https://contextarena.ai

OpenCompass

HHEM (Hallucination Benchmark): https://huggingface.co/spaces/vectara/leaderboard


Coding, Math and Agentic Benchmarks

Aider-Polyglot-Coding: https://aider.chat/docs/leaderboards/

BigCodeBench: https://bigcode-bench.github.io

WebDev-Arena: https://web.lmarena.ai/leaderboard

WeirdML: https://htihle.github.io/weirdml.html

Symflower Coding: https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/

PHYBench: https://phybench-official.github.io/phybench-demo/

MathArena: https://matharena.ai

Galileo Agent: https://huggingface.co/spaces/galileo-ai/agent-leaderboard

XLANG Agent: https://arena.xlang.ai/leaderboard


Important for tracking AI take-off

METR long task benchmarks: https://metr.org (incl. RE Bench)

PaperBench: https://openai.com/index/paperbench/

SWE-Lancer: https://openai.com/index/swe-lancer/

MLE-Bench: https://github.com/openai/mle-bench

SWE-Bench: https://swebench.com


Other classics I ALWAYS want to see when a new model is released

GPQA-Diamond: https://github.com/idavidrein/gpqa

SimpleQA: https://openai.com/index/introducing-simpleqa/

Tau-bench: https://github.com/sierra-research/tau-bench

SciCode: https://github.com/scicode-bench/SciCode

MMMU: https://mmmu-benchmark.github.io/#leaderboard

Humanity's Last Exam (HLE): https://github.com/centerforaisafety/hle


Overview for classical benchmarks (GPQA, SimpleQA, AIME, MMLU, ...)

Simple-Evals: https://github.com/openai/simple-evals

Vellum AI: https://vellum.ai/llm-leaderboard

Artificial Analysis: https://artificialanalysis.ai


Benchmarks I literally don't care about - saturated / no signal

MMLU, HumanEval, BBH, DROP, MGSM, basically all math benchmarks like GSM8K, MATH, AIME


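Looks like they left the math benchmarks out.

Grinding math benchmark scores is pointless anyway, since you can't use that for research.

That's why o4mh, which is basically a one-trick math specialist, ranks so much lower.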