DeepSeek V4-Pro (MoE)
DeepSeek V4-Pro is DeepSeek's April 2026 frontier model: 1.6 trillion total parameters with 49B active per token, introducing Dynamic Sparse Attention (DSA) and token compression for efficient 1M-token context processing.
Parameters: 1.6T
Active: 49.0B
Max Context: 1.0M
Architecture: MoE
Released: Apr 1, 2026
Modality: Text
About DeepSeek V4-Pro (MoE)
DeepSeek V4-Pro is the April 2026 frontier model, pushing to 1.6 trillion total parameters with 49B active per token. It introduces Dynamic Sparse Attention (DSA) and token compression for efficient 1M-token context processing. At this scale it is cluster-class only, requiring over 800 GB of VRAM even at Q4_K_M. The architecture represents the bleeding edge of open-weight AI: 80 transformer layers, an 8192-dimensional hidden state, and load-balanced MoE routing. It is primarily accessed via API, with open weights available for research and enterprise self-hosting.
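The "over 800 GB at Q4_K_M" figure follows from simple bytes-per-weight arithmetic, sketched below (1 GB taken as 10^9 bytes; Q4_K_M's ~0.50 bytes/weight is the figure used in the requirements table):

```python
def weight_vram_gb(total_params: float, bytes_per_weight: float) -> float:
    """Rough VRAM needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_weight / 1e9

# 1.6T parameters at ~0.5 bytes/weight (Q4_K_M) -> ~800 GB before any
# KV cache or runtime overhead is added, hence cluster-class only.
print(weight_vram_gb(1.6e12, 0.5))  # -> 800.0
```

Note this counts total parameters, not the 49B active per token: every expert's weights must be resident in memory even though only a fraction fire per token.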
Technical Specifications
System Requirements
Estimated VRAM (GB) at 10% overhead for different quantization methods and context sizes. Every configuration below is cluster / multi-GPU class.

| Quantization | Bytes/weight | Quality vs FP16 | 1K ctx | 195K ctx | 1.0M ctx | 1.0M ctx |
|---|---|---|---|---|---|---|
| Q4_K_M | 0.50 | ~97% | 827.3 | 888.0 | 1132.2 | 1147.0 |
| Q8_0 | 1.00 | ~100% | 1654.3 | 1715.1 | 1959.2 | 1974.0 |
| F16 | 2.00 | Reference | 3308.4 | 3369.1 | 3613.2 | 3628.1 |
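The context columns step up by roughly the same ~61 GB in every row (e.g. 888.0 vs 827.3 GB from 1K to 195K ctx), which suggests the KV cache is held unquantized while only the weights shrink with quantization. A rough estimator in that spirit, where the per-token KV-cache size is an assumed placeholder rather than a published figure, so outputs are illustrative and will not reproduce the table exactly:

```python
def estimate_vram_gb(
    total_params: float,
    bytes_per_weight: float,
    context_tokens: int,
    kv_bytes_per_token: float = 300e3,  # assumed placeholder; tune to measurements
    overhead: float = 0.10,             # the page's stated 10% overhead
) -> float:
    """Rough total VRAM in GB: quantized weights + unquantized KV cache + overhead."""
    weights_bytes = total_params * bytes_per_weight
    kv_cache_bytes = context_tokens * kv_bytes_per_token
    return (weights_bytes + kv_cache_bytes) * (1 + overhead) / 1e9

# Q4_K_M at 195K context, illustrative only:
print(estimate_vram_gb(1.6e12, 0.5, 195_000))
```

Because the KV term does not scale with `bytes_per_weight`, aggressive weight quantization helps less at very long contexts, which matches the table: the Q4_K_M row grows by ~39% from 1K to 1M ctx while F16 grows by only ~10%.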
Other DeepSeek Models
| Model | Params | Layers | Context |
|---|---|---|---|
| DeepSeek R1 (MoE) | 671.0B | 61 | 64K |
| DeepSeek V3 (MoE) | 671.0B | 61 | 64K |
| DeepSeek V3 0324 (MoE) | 685.0B | 61 | 64K |
| DeepSeek V4-Flash (MoE) | 284.0B | 48 | 1.0M |
| DeepSeek R1 Distill Qwen 1.5B | 1.5B | 28 | 32K |
| DeepSeek R1 Distill Qwen 7B | 7.6B | 28 | 32K |
Find the right GPU for DeepSeek V4-Pro (MoE)
Use the interactive VRAM Calculator to see exactly how much memory you need at any quantization level, context length, and overhead setting.