Last updated: April 5, 2026 | Reviewed by Daniel Ashford

The LLM Judge Index

Independent, multi-dimensional AI model evaluation by Daniel Ashford. 516+ models ranked.

Full Leaderboard - 516 Models

Data by Artificial Analysis | Updated hourly
# | Model | Provider | Intel | GPQA | Code | Input $/M | License
1 | GPT-5.5 (xhigh) | OpenAI | 60.2 | 93.5% | 59.1 | $5.0 | Prop.
2 | GPT-5.5 (high) | OpenAI | 58.9 | 93.2% | 58.5 | $5.0 | Prop.
3 | Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | 57.3 | 91.4% | 52.5 | $6.3 | Prop.
4 | Gemini 3.1 Pro Preview | Google | 57.2 | 94.1% | 55.5 | $2.0 | Prop.
5 | GPT-5.4 (xhigh) | OpenAI | 56.8 | 92% | 57.2 | $2.5 | Prop.
6 | GPT-5.5 (medium) | OpenAI | 56.7 | 92.6% | 56.2 | $5.0 | Prop.
7 | Kimi K2.6 | Kimi | 53.9 | 91.1% | 47.1 | $0.95 | Prop.
8 | MiMo-V2.5-Pro | Xiaomi | 53.8 | 86.6% | 45.5 | $1.0 | Prop.
9 | GPT-5.3 Codex (xhigh) | OpenAI | 53.6 | 91.5% | 53.1 | $1.8 | Prop.
10 | Grok 4.3 (high) | xAI | 53.2 | 90.1% | 41.0 | $1.3 | Prop.
11 | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 52.9 | 89.6% | 48.1 | $6.3 | Prop.
12 | Muse Spark | Meta | 52.2 | 88.4% | 47.5 | - | Prop.
13 | Claude Opus 4.7 (Non-reasoning, High Effort) | Anthropic | 51.8 | 88.5% | 53.1 | $6.3 | Prop.
14 | Qwen3.6 Max Preview | Alibaba | 51.8 | 88.8% | 44.9 | $1.3 | Prop.
15 | Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) | Anthropic | 51.7 | 87.5% | 50.9 | $3.8 | Prop.
16 | DeepSeek V4 Pro (Reasoning, Max Effort) | DeepSeek | 51.5 | 88.8% | 47.5 | $1.7 | Prop.
17 | GLM-5.1 (Reasoning) | Z AI | 51.4 | 86.8% | 43.4 | $1.4 | Prop.
18 | GPT-5.2 (xhigh) | OpenAI | 51.3 | 90.3% | 48.7 | $1.8 | Prop.
19 | GPT-5.5 (low) | OpenAI | 50.8 | 91% | 52.1 | $5.0 | Prop.
20 | Qwen3.6 Plus | Alibaba | 50.0 | 88.2% | 42.9 | $0.50 | Prop.
21 | DeepSeek V4 Pro (Reasoning, High Effort) | DeepSeek | 49.8 | 90.5% | 43.2 | $1.7 | Prop.
22 | GLM-5 (Reasoning) | Z AI | 49.8 | 82% | 44.2 | $1.0 | Prop.
23 | Claude Opus 4.5 (Reasoning) | Anthropic | 49.7 | 86.6% | 47.8 | $6.3 | Prop.
24 | MiniMax-M2.7 | MiniMax | 49.6 | 87.4% | 41.9 | $0.30 | Prop.
25 | Grok 4.20 0309 v2 (Reasoning) | xAI | 49.3 | 91.1% | 40.5 | $2.0 | Prop.
26 | MiMo-V2-Pro | Xiaomi | 49.2 | 87% | 41.4 | $1.0 | Prop.
27 | MiMo-V2.5 | Xiaomi | 49.0 | 84.9% | 42.1 | $0.36 | Prop.
28 | GPT-5.2 Codex (xhigh) | OpenAI | 49.0 | 89.9% | 43.0 | $1.8 | Prop.
29 | GPT-5.4 mini (xhigh) | OpenAI | 48.9 | 87.5% | 51.5 | $0.75 | Prop.
30 | Grok 4.20 0309 (Reasoning) | xAI | 48.5 | 88.5% | 42.2 | $2.0 | Prop.
31 | Gemini 3 Pro Preview (high) | Google | 48.4 | 90.8% | 46.5 | $2.0 | Prop.
32 | GPT-5.4 (low) | OpenAI | 47.9 | 87.1% | 45.6 | $2.5 | Prop.
33 | GPT-5.1 (high) | OpenAI | 47.7 | 87.3% | 44.7 | $1.3 | Prop.
34 | GLM-5-Turbo | Z AI | 46.8 | 84.7% | 36.8 | - | Prop.
35 | Kimi K2.5 (Reasoning) | Kimi | 46.8 | 87.9% | 39.6 | $0.54 | Prop.
36 | GPT-5.2 (medium) | OpenAI | 46.6 | 86.4% | 44.2 | $1.8 | Prop.
37 | DeepSeek V4 Flash (Reasoning, Max Effort) | DeepSeek | 46.5 | 89.4% | 38.7 | $0.14 | Prop.
38 | Claude Opus 4.6 (Non-reasoning, High Effort) | Anthropic | 46.5 | 84% | 47.6 | $6.3 | Prop.
39 | Gemini 3 Flash Preview (Reasoning) | Google | 46.4 | 89.8% | 42.6 | $0.50 | Prop.
40 | DeepSeek V4 Flash (Reasoning, High Effort) | DeepSeek | 46.0 | 86.7% | 39.8 | $0.14 | Prop.
41 | Qwen3.6 27B (Reasoning) | Alibaba | 45.8 | 84.2% | 36.5 | $0.60 | Prop.
42 | Qwen3.5 397B A17B (Reasoning) | Alibaba | 45.0 | 89.3% | 41.3 | $0.60 | Prop.
43 | MiMo-V2-Omni-0327 | Xiaomi | 44.9 | 85.5% | 36.9 | $0.40 | Prop.
44 | GPT-5 (high) | OpenAI | 44.6 | 85.4% | 36.0 | $1.3 | Prop.
45 | GPT-5 Codex (high) | OpenAI | 44.6 | 83.7% | 38.9 | $1.3 | Prop.
46 | Claude Sonnet 4.6 (Non-reasoning, High Effort) | Anthropic | 44.4 | 79.9% | 46.4 | $3.8 | Prop.
47 | GPT-5.4 nano (xhigh) | OpenAI | 44.0 | 81.7% | 43.9 | $0.20 | Prop.
48 | KAT Coder Pro V2 | KwaiKAT | 43.8 | 85.5% | 45.6 | $0.30 | Prop.
49 | GLM-5.1 (Non-reasoning) | Z AI | 43.8 | 83.9% | 35.8 | $1.4 | Prop.
50 | Qwen3.6 35B A3B (Reasoning) | Alibaba | 43.5 | 84.1% | 35.2 | $0.25 | Prop.
Showing top 50 of 516 models. Full data powered by Artificial Analysis.

Best LLM By Use Case

💻 Best for Code Generation (weighted ranking)
💬 Best for Customer Chatbot (weighted ranking)
✍️ Best for Content Writing (weighted ranking)
📊 Best for Data Analysis (weighted ranking)
🔬 Best for Research & RAG (weighted ranking)
🛡️ Best for Safety-Critical (weighted ranking)
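
This page does not state the weights behind its use-case rankings. As a rough illustration only, the Python sketch below shows how a weighted ranking over the leaderboard's Intel, GPQA and Code columns could be computed. The weights in `CODE_WEIGHTS` and the function names are hypothetical, and the sketch assumes all three columns are comparable 0-100 scores; the four rows are copied from the leaderboard.

```python
# Rows copied from the leaderboard: (model, intel, gpqa_percent, code).
ROWS = [
    ("GPT-5.5 (xhigh)", 60.2, 93.5, 59.1),
    ("Claude Opus 4.7 (Adaptive Reasoning, Max Effort)", 57.3, 91.4, 52.5),
    ("Gemini 3.1 Pro Preview", 57.2, 94.1, 55.5),
    ("Kimi K2.6", 53.9, 91.1, 47.1),
]

# Hypothetical weights for a code-generation use case (they sum to 1.0).
# These are an assumption for illustration, not the site's actual weighting.
CODE_WEIGHTS = {"intel": 0.3, "gpqa": 0.1, "code": 0.6}


def weighted_score(intel, gpqa, code, weights):
    """Combine the three column scores into one weighted sum."""
    return (
        weights["intel"] * intel
        + weights["gpqa"] * gpqa
        + weights["code"] * code
    )


def rank(rows, weights):
    """Return model names ordered from highest to lowest weighted score."""
    scored = [
        (weighted_score(intel, gpqa, code, weights), model)
        for model, intel, gpqa, code in rows
    ]
    return [model for _, model in sorted(scored, reverse=True)]


if __name__ == "__main__":
    for position, model in enumerate(rank(ROWS, CODE_WEIGHTS), start=1):
        print(position, model)
```

With these assumed weights, Gemini 3.1 Pro Preview moves ahead of Claude Opus 4.7 (Adaptive Reasoning, Max Effort) because of its higher Code score, even though it sits one place lower on the overall leaderboard. Changing the weights changes the order, which is the point of a per-use-case ranking.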

Best LLM By Industry

🎓 Education: Schools, tutoring and edtech
🏥 Healthcare: Hospitals, clinics and health tech
🏦 Financial Services: Banking, investment and fintech
⚖️ Legal: Law firms, contracts and legal tech
💬 Customer Support: Help desks, chatbots and CX

Popular Comparisons

Claude Opus 4 vs GPT-5.3 Codex
Claude Opus 4 vs Gemini 2.5 Ultra
Claude Opus 4 vs Claude Sonnet 4
Claude Opus 4 vs GPT-4o
Claude Opus 4 vs Llama 4 405B
Claude Opus 4 vs Mistral Large 3
Claude Opus 4 vs Qwen 3.5 Plus
Claude Opus 4 vs DeepSeek V3

LLM Glossary

47 AI and language model terms explained.

Large Language Model (LLM): An AI system trained on massive amounts of text data to understand and generate human language.
Tokens: The basic units of text that LLMs process. In English text, one token is roughly 3/4 of a word on average.
Context Window: The maximum amount of text, measured in tokens, that an LLM can process in a single request.
Hallucination: When an LLM generates plausible-sounding but factually incorrect information.
Inference: The process of an LLM generating a response to your input.
Prompt: The text input you send to an LLM to get a response.
Fine-Tuning: Further training a pre-trained LLM on your own data to improve performance for your use case.
RAG (Retrieval-Augmented Generation): A technique that retrieves relevant external documents and supplies them to the LLM with the prompt, to improve accuracy and reduce hallucination.
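
The token rule of thumb above can be turned into quick arithmetic. The Python sketch below estimates a token count from a word count and then an input cost from a leaderboard price. It assumes the "Input $/M" column means US dollars per million input tokens, which this page does not spell out, and the function names are hypothetical. Real tokenizers vary by model and language, so treat the result as an estimate.

```python
# Glossary rule of thumb: one token is roughly 3/4 of a word.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count):
    """Estimate the token count for a text with the given number of words."""
    return round(word_count / WORDS_PER_TOKEN)


def input_cost_usd(tokens, price_per_million):
    """Estimate input cost, assuming the price is USD per million input tokens."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    tokens = estimate_tokens(1500)  # a 1,500-word prompt
    print(tokens, "tokens (estimated)")
    # $5.0 is the listed Input $/M for GPT-5.5 (xhigh) on the leaderboard.
    print(f"${input_cost_usd(tokens, 5.0):.4f} estimated input cost")
```

Under these assumptions, a 1,500-word prompt is about 2,000 tokens, which costs about one cent of input at a listed price of $5.0.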

By Provider

Anthropic (3) | OpenAI (3) | Google (2) | Meta (1) | Mistral (1) | Alibaba (1) | DeepSeek (1)