
SimpleBench: https://simple-bench.com/index.html

SOLO-Bench: https://github.com/jd-3d/SOLOBench

AidanBench: https://aidanbench.com

SEAL by Scale: https://scale.com/leaderboard (particularly the MultiChallenge leaderboard)

LMArena: https://beta.lmarena.ai/leaderboard (with Style Control)

LiveBench: https://livebench.ai

ARC-AGI: https://arcprize.org/leaderboard

Thematic Generalization by LechMazur: https://github.com/lechmazur/generalization

(Others by Lech Mazur: https://github.com/lechmazur/elimination_game, https://github.com/lechmazur/confabulations, ...)

EQBench: https://eqbench.com (especially the Longform writing leaderboard)

Fiction-Live Bench

MC-Bench: https://mcbench.ai/leaderboard (ordered by winrate, not by Elo)

TrackingAI - IQ Bench: https://trackingai.org/home

Dubesor LLM: https://dubesor.de/benchtable.html

Balrog-AI: https://balrogai.com

Misguided Attention: https://github.com/cpldcpu/MisguidedAttention

Snake-Bench: https://snakebench.com

SmolAgents LLM: https://huggingface.co/spaces/smolagents/smolagents-leaderboard (just because of GAIA and SimpleQA)

Context-Arena (MRCR and Graphwalks): https://contextarena.ai

OpenCompass

HHEM (Hallucination Benchmark): https://huggingface.co/spaces/vectara/leaderboard

Coding, Math and Agentic Benchmarks

Aider-Polyglot-Coding: https://aider.chat/docs/leaderboards/

BigCodeBench: https://bigcode-bench.github.io

WebDev-Arena: https://web.lmarena.ai/leaderboard

WeirdML: https://htihle.github.io/weirdml.html

Symflower Coding: https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/

PHYBench: https://phybench-official.github.io/phybench-demo/

MathArena: https://matharena.ai

Galileo Agent: https://huggingface.co/spaces/galileo-ai/agent-leaderboard

XLANG Agent: https://arena.xlang.ai/leaderboard

Important for tracking AI take-off

METR long task benchmarks: https://metr.org (incl. RE Bench)

PaperBench: https://openai.com/index/paperbench/

SWE-Lancer: https://openai.com/index/swe-lancer/

MLE-Bench: https://github.com/openai/mle-bench

SWE-Bench: https://swebench.com

Other classics I ALWAYS want to see when a new model is released

GPQA-Diamond: https://github.com/idavidrein/gpqa

SimpleQA: https://openai.com/index/introducing-simpleqa/

Tau-bench: https://github.com/sierra-research/tau-bench

SciCode: https://github.com/scicode-bench/SciCode

MMMU: https://mmmu-benchmark.github.io/#leaderboard

Humanity's Last Exam (HLE): https://github.com/centerforaisafety/hle

Overview for classical benchmarks (GPQA, SimpleQA, AIME, MMLU, ...)

Simple-Evals: https://github.com/openai/simple-evals

Vellum AI: https://vellum.ai/llm-leaderboard

Artificial Analysis: https://artificialanalysis.ai

Benchmarks I literally don't care about - saturated / no signal

MMLU, HumanEval, BBH, DROP, MGSM, basically all math benchmarks like GSM8K, MATH, AIME

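Looks like math benchmarks were left out.

Grinding math benchmarks is no use for research anyway.

That is why o4-mini-high, which is narrowly tuned for that kind of task, scores so much lower.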

[General] Benchmarks used in the aggregate benchmark ranking posted below

