DeepSeek · MoE · MIT

DeepSeek V3 (MoE)

DeepSeek V3 is the non-reasoning base model sharing R1's architecture — 671B total MoE with 37B active per token using MLA for compressed KV cache. It delivers frontier-class general performance on par with GPT-4o and Claude 3.5 Sonnet. MIT licensed.

Parameters: 671.0B
Active: 37.0B
Max Context: 64K
Architecture: MoE
Released: Dec 26, 2024
Modality: Text

About DeepSeek V3 (MoE)

DeepSeek V3 is the non-reasoning base model sharing R1's architecture — 671B total MoE with 37B active per token using MLA for compressed KV cache. It delivers frontier-class general performance on par with GPT-4o and Claude 3.5 Sonnet. MIT licensed, making it the most capable truly open-weight model available. Like R1, the full model requires server/cluster hardware (~370 GB at Q4_K_M), but its architecture innovations (MLA, auxiliary-loss-free load balancing, multi-token prediction) have influenced the entire field.
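
To make the 37B-active-of-671B figure concrete, here is a minimal Python sketch of top-k expert routing: each token is scored against the routed experts and only the top 8 of 256 run, so most parameters stay idle for any given token. The random gate and softmax weighting are purely illustrative; DeepSeek V3's actual router additionally uses a shared expert and the auxiliary-loss-free load balancing mentioned above.

```python
# Minimal sketch of top-k MoE routing (illustration only, not DeepSeek's code).
# Expert counts and hidden size come from the spec list below; the gate is random.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 256   # routed experts
TOP_K = 8           # experts activated per token
HIDDEN = 7168       # model hidden dimension

def route(hidden_state: np.ndarray, gate_weights: np.ndarray):
    """Pick the top-k experts for one token and return their mixing weights."""
    scores = gate_weights @ hidden_state           # (NUM_EXPERTS,) router logits
    top = np.argsort(scores)[-TOP_K:]              # indices of the 8 best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                       # normalize over selected experts only
    return top, weights

# Toy usage: route a single random token.
gate = rng.standard_normal((NUM_EXPERTS, HIDDEN)) / np.sqrt(HIDDEN)
token = rng.standard_normal(HIDDEN)
experts, weights = route(token, gate)
print(experts, weights.round(3))
```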

General Purpose · Code · Research · Enterprise

Technical Specifications

Total Parameters: 671.0B
Active Parameters: 37.0B per token
Architecture: Mixture of Experts
Total Experts: 256
Active Experts: 8 per token
Attention Type: MLA (Multi-head Latent Attention)
Hidden Dimension (d): 7,168
Transformer Layers: 61
Attention Heads: 56
KV Heads (n_kv): 8
Head Dimension (d_head): 128
Activation Function: SwiGLU
Normalization: RMSNorm
Position Embedding: YaRN-extended RoPE
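
The practical payoff of MLA is a much smaller KV cache: instead of storing full keys and values for every attention head, each layer caches one compressed latent vector per token. The back-of-the-envelope sketch below uses the layer, head, and head-dimension figures listed above; the latent width is a placeholder chosen only for illustration, not the official DeepSeek V3 value.

```python
# Rough KV-cache comparison per token: standard multi-head attention vs an
# MLA-style compressed latent. LATENT_DIM is a hypothetical placeholder.
LAYERS = 61
HEADS = 56
HEAD_DIM = 128
BYTES = 2                      # fp16/bf16 bytes per element
LATENT_DIM = 512               # assumed compressed KV width (not an official figure)

mha_per_token = LAYERS * HEADS * HEAD_DIM * 2 * BYTES   # full K and V for every head
mla_per_token = LAYERS * LATENT_DIM * BYTES             # one latent per layer

print(f"standard MHA cache: {mha_per_token / 1024:.0f} KiB per token")
print(f"MLA latent cache:   {mla_per_token / 1024:.0f} KiB per token")
```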

System Requirements

Estimated VRAM at 10% overhead for different quantization methods and context sizes.

Quantization   Bytes/weight   Quality          1K ctx        64K ctx
Q4_K_M         0.50           ~97% of FP16     347.1 GB      362.1 GB
Q8_0           1.00           ~100% of FP16    693.9 GB      708.9 GB
F16            2.00           reference        1387.6 GB     1402.6 GB

Every configuration above falls in the "requires cluster / multi-GPU" tier; none fits a 24 GB consumer GPU or an 80 GB datacenter GPU.
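
The table's figures follow from a simple estimate: total parameters times bytes per weight, plus the stated 10% overhead, with longer context adding KV-cache memory on top. The sketch below reproduces only the weights-plus-overhead term; treating the table's values as binary gigabytes is an assumption here, and the small remaining gap to the published numbers presumably comes from KV cache and fixed runtime buffers.

```python
# Rough reconstruction of the weights-only term behind the table above:
# total parameters x bytes-per-weight, plus the stated 10% overhead.
# Assumptions (not from the page): the table uses binary gigabytes, and the
# residual difference from the published figures is KV cache / buffers.
TOTAL_PARAMS = 671e9
OVERHEAD = 1.10                     # "10% overhead" from the section intro

BYTES_PER_WEIGHT = {                # bytes-per-weight column of the table
    "Q4_K_M": 0.50,
    "Q8_0":   1.00,
    "F16":    2.00,
}

def weights_plus_overhead_gib(quant: str) -> float:
    """Weight memory in GiB at the given quantization, including 10% overhead."""
    return TOTAL_PARAMS * BYTES_PER_WEIGHT[quant] / 2**30 * OVERHEAD

for quant in BYTES_PER_WEIGHT:
    print(f"{quant:>7}: ~{weights_plus_overhead_gib(quant):.0f} GiB before KV cache")
```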

Find the right GPU for DeepSeek V3 (MoE)

Use the interactive VRAM Calculator to see exactly how much memory you need at any quantization level, context length, and overhead setting.