Llama-Nemotron 8B
Llama-Nemotron 8B is a dense transformer language model from the Nvidia family, containing 8B parameters across 32 layers. It supports up to 131K tokens of context with a hidden dimension of 4096 and 8 KV heads for efficient grouped-query attention (GQA).
- **Parameters:** 8.0B
- **Max Context:** 128K
- **Architecture:** Dense
- **Released:** —
- **Modality:** Text
About Llama-Nemotron 8B
Llama-Nemotron 8B is an Nvidia fine-tune of Llama 3.1 8B targeted at reasoning. It is a dense transformer language model containing 8B parameters across 32 layers, with a hidden dimension of 4096 and 8 KV heads for efficient grouped-query attention (GQA), and it supports up to 131K tokens of context.
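The main practical payoff of GQA is a smaller KV cache: key/value tensors are stored for the 8 KV heads rather than one per query head. A minimal sketch of the per-token cache comparison, assuming 32 attention heads, a head dimension of 128 (4096 / 32), and an FP16 cache (none of these are stated explicitly by the card):

```python
def kv_bytes_per_token(n_kv_heads, n_layers=32, head_dim=128, dtype_bytes=2):
    # 2x for the separate key and value tensors stored at each layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Full multi-head attention would cache one K/V pair per query head (32);
# this model's GQA caches only 8 KV heads.
mha = kv_bytes_per_token(n_kv_heads=32)
gqa = kv_bytes_per_token(n_kv_heads=8)
print(mha // gqa)  # → 4: under these assumptions GQA shrinks the KV cache 4x
```

At long contexts this 4x factor dominates, since the cache grows linearly with the number of tokens while the weights stay fixed.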
Technical Specifications
System Requirements
Estimated VRAM, in GB with 10% overhead, for different quantization methods and context sizes.
| Quantization | Bytes/Weight | Quality | 1K ctx | 128K ctx |
|---|---|---|---|---|
| Q4_K_M | 0.50 | ~97% of FP16 | 4.26 GB (consumer GPU) | 20.14 GB (consumer GPU) |
| Q8_0 | 1.00 | ~100% of FP16 | 8.40 GB (consumer GPU) | 24.27 GB (datacenter GPU) |
| F16 | 2.00 | Reference | 16.67 GB (consumer GPU) | 32.54 GB (datacenter GPU) |
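Estimates of this kind typically combine quantized weight size, KV-cache size at the target context length, and a fixed overhead factor. The exact formula behind the table isn't stated, so the sketch below is an assumed reconstruction (FP16 KV cache, head dimension 128, overhead applied to the total); its results land near, but not exactly on, the table's figures:

```python
GIB = 1024 ** 3

def kv_cache_bytes(ctx_tokens, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    # 2x for the separate key and value tensors stored at each layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value

def estimate_vram_gib(params, bytes_per_weight, ctx_tokens, overhead=0.10):
    # Quantized weights plus KV cache, inflated by a flat overhead factor.
    weights = params * bytes_per_weight
    return (weights + kv_cache_bytes(ctx_tokens)) * (1 + overhead) / GIB

# Q4_K_M (0.50 bytes/weight) on 8.0B parameters at 1K context:
print(round(estimate_vram_gib(8.0e9, 0.50, 1024), 2))  # → 4.24
```

The 1K-context result (~4.24 GiB) is close to the table's 4.26 GB; at 128K context the assumed formula drifts further from the table, which suggests the calculator applies overhead or KV-cache precision differently.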
Find the right GPU for Llama-Nemotron 8B
Use the interactive VRAM Calculator to see exactly how much memory you need at any quantization level, context length, and overhead setting.