Here we use Processor as the general name for CPU, GPU, AI accelerator, SoC...
This indicates that the processor does not have dedicated hardware units for this data type. To work around this, computations may be converted to a supported data type—typically from lower precision to higher precision.
Example: Running an FP16-based LLM on an older GPU like the GTX 1080 Ti (which has no Tensor Cores and lacks native FP16 support in its CUDA cores), the FP16 operations will be internally upcast to FP32 and run on the CUDA cores. You might even observe slight performance improvements in memory-bound workloads, since the FP16 model reduces memory usage; but in compute-bound workloads you should expect some performance regression compared to running in FP32, due to the conversion overhead.
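A minimal NumPy sketch of this fallback path (shapes and sizes are arbitrary): the weights are stored in FP16 to halve the memory footprint, but upcast to FP32 before the matmul, similar to what a GPU without native FP16 compute does internally.

```python
import numpy as np

# Weights kept in FP16: half the memory footprint of FP32.
weights_fp16 = np.random.randn(4096, 4096).astype(np.float16)
activations_fp16 = np.random.randn(1, 4096).astype(np.float16)

# No native FP16 compute: upcast both operands to FP32 and run the matmul there.
# This cast is the conversion overhead mentioned above.
out = activations_fp16.astype(np.float32) @ weights_fp16.astype(np.float32)

print(out.dtype)                                   # float32
print(weights_fp16.nbytes / 2**20, "MiB stored")   # 32 MiB, vs. 64 MiB in FP32
```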
Requires specific matrix patterns to activate hardware optimizations, so a majority of AI tasks will not benefit from this.
"NVIDIA Sparse Tensor Cores use a 2:4 pattern, meaning that two out of each contiguous block of four values must be zero. In other words, we follow a 50% fine-grained structured sparsity recipe, with no computations being done on zero-values due to the available support directly on the Tensor Cores. " --- NVIDIA Blogs
When a processor is labeled as supporting Unified Memory Architecture (UMA) in the pages, it typically means:
As AI booms, more processor vendors are producing these kinds of processors, such as:
Vendor | Processor |
---|---|
Apple | M1, M2, M3, M4 |
AMD | AMD Ryzen AI MAX+ (Strix Halo series) |
NVIDIA | DGX Spark (Project Digits) |
Unified Memory Notes:
- While the iGPU shares access to the full memory pool, not all of it may be available as VRAM. A portion is reserved for the operating system and general system use.
- In Apple’s M-series SoCs with ≥64GB memory, the iGPU can use up to 75% of the memory as VRAM. For models with <64GB, the iGPU is limited to ~66%.
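A quick sketch of that rule of thumb (the 75%/66% split matches the default behavior described above; the actual limit can vary with OS version and configuration):

```python
def usable_vram_gb(total_memory_gb: float) -> float:
    """Approximate default VRAM limit for Apple M-series unified memory."""
    fraction = 0.75 if total_memory_gb >= 64 else 0.66
    return total_memory_gb * fraction

for total in (16, 32, 64, 128):
    print(f"{total:>3} GB total -> ~{usable_vram_gb(total):.0f} GB usable as VRAM")
```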
If you prefer accuracy over speed and have enough memory, FP16 is usually the best choice; otherwise, you can choose quantization types such as:
- GGUF format's Q2~Q8, roughly corresponding to INT2~INT8
- GPTQ/AWQ, which typically use INT4
The calculation below is based on the assumption that batch size = 1, in which case memory bandwidth is the only spec that matters (the compute requirement can be met by almost any processor).
Token/s = Bandwidth / Model size
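A back-of-the-envelope sketch combining the two points above, assuming batch size = 1 and a fully memory-bound decode; the bytes-per-weight values are rough (quantization metadata is ignored) and the bandwidth figure is illustrative, so real throughput will be somewhat lower:

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}   # rough, ignores quant metadata

def tokens_per_second(params_billions: float, quant: str, bandwidth_gb_s: float) -> float:
    """Token/s ≈ Bandwidth / Model size, for batch size = 1."""
    model_size_gb = params_billions * BYTES_PER_WEIGHT[quant]
    return bandwidth_gb_s / model_size_gb

# e.g., a 7B model at Q4 on a processor with ~400 GB/s of memory bandwidth
print(f"~{tokens_per_second(7, 'Q4', 400):.0f} tok/s (upper bound)")   # ~114
```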
Integrated Memory refers to the following memory modules:
Integrated Memory offers faster access because it is physically closer to the compute units, reducing latency and increasing bandwidth. However, its capacity is limited, as increasing it significantly raises fabrication and packaging costs, as well as power and thermal requirements.
For processors with a Unified Memory Architecture, Integrated Memory can be shared by the iGPU and CPU (e.g., Apple’s M-series SoCs). For other CPUs with Integrated Memory, such as Intel’s Xeon Max series (which includes 64GB of HBM), the Integrated Memory behaves like Socketed Memory and cannot be directly used by the GPU; its specifications only matter if you’re using the CPU for AI workloads. Otherwise, the memory simply functions as system memory.
Unified Memory Notes:
- In Apple’s M-series SoCs with ≥64GB memory, the iGPU can use up to 75% of the memory as VRAM.
- For models with <64GB (e.g., ≤32GB), the iGPU is limited to ~66%.
Intel Xeon Max Series:
While its built-in HBM and support for Intel AMX (Advanced Matrix Extensions) theoretically provide a significant speedup for AI workloads, real-world performance has been underwhelming.
Matrix Compute (or Tensor Compute) refers to hardware units designed to accelerate matrix operations at lower precisions, which are the core of modern AI workloads (e.g., training and inference of deep neural networks). See the usage sketch after the table below.
Processor | Technology | Brief |
---|---|---|
NVIDIA GPU | Tensor Core | Volta (V100, 2017) and GeForce RTX (Turing, 2018) onward |
AMD GPU | Matrix Core | CDNA (Instinct MI100, 2020) / RDNA 3 (Radeon RX 7000) |
Intel GPU | XMX Engine | Xe-HPG (Arc series, 2022) |
Intel CPU | AMX (Advanced Matrix Extensions) | 4th Gen Xeon Scalable CPUs |
ARM CPUs | SME (Scalable Matrix Extension) / Apple’s custom AMX | Apple M1–M4 |
RISC-V | Vendor-specific matrix units (no standard yet) | Varies |
AI Accelerators | TPUs, Trainium, Maia | Custom accelerators from Google, AWS, Microsoft |
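As a usage sketch (assuming a recent PyTorch build; whether the work actually lands on the matrix units depends on the hardware and driver), this enables TF32 for FP32 matmuls and runs a matmul under BF16 autocast, which is how frameworks typically target Tensor Cores and their equivalents:

```python
import torch

# Allow FP32 matmuls to be computed as TF32 on Tensor Cores (Ampere and newer).
torch.backends.cuda.matmul.allow_tf32 = True

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# Autocast runs eligible ops (like matmul) in BF16, targeting the matrix units.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)   # torch.bfloat16 where autocast applies
```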
Type | Usage Notes |
---|---|
FP64 | Scientific workloads; mainly supported by server GPUs |
FP32 | Used in training; gradually giving way to TF32/BF16 |
TF32 / BF16 / FP16 | Training models with higher efficiency and acceptable precision |
FP8 / FP4 / INT8 / INT4 | Primarily used in inference; smaller size reduces memory and bandwidth demands, with some precision tradeoffs |
NVIDIA GPU Notes:
- FP8 FLOPS = "Peak FP8 Tensor TFLOPS with FP32 Accumulate" (doubles with FP16 accumulate).
- FP16 FLOPS = "Peak FP16 Tensor TFLOPS with FP32 Accumulate" (doubles with FP16 accumulate).
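To see why the accumulation precision in those figures matters, here is a small NumPy sketch: summing many small FP16 values with an FP16 accumulator stalls once the running sum gets large enough, while an FP32 accumulator does not (an illustration of rounding behavior, not of Tensor Core internals).

```python
import numpy as np

products = np.full(10_000, 0.01, dtype=np.float16)   # many small FP16 values

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for p in products:
    acc16 = np.float16(acc16 + p)    # FP16 accumulator: rounds after every add
    acc32 = acc32 + np.float32(p)    # FP32 accumulator

print(acc16)   # stalls around 32: further additions round away in FP16
print(acc32)   # ~100, as expected
```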
Vector Compute (also called Shader Compute, General-Purpose Compute, SIMD Compute, CUDA Compute, or Non-Tensor Compute) is not specifically built for AI, but it can execute AI tasks, especially when dedicated matrix accelerators are not available (see the sketch after the table below).
Processor | Technology | Brief |
---|---|---|
NVIDIA GPU | CUDA Core | Tesla architecture (2006) |
AMD GPU | Stream Processor | TeraScale architecture (2007) |
Intel GPU | Xe Vector Engine | Xe-HPG (Arc series, 2022) |
x86 CPU | SSE, AVX, AVX2, AVX-512 | |
ARM CPU | NEON and SVE/SVE2 | |
RISC-V | RVV (RISC-V Vector Extension) |
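A quick way to check which of these vector extensions your CPU advertises is to read the flags from /proc/cpuinfo (a Linux-only sketch; other platforms need tools like sysctl or CPU-Z):

```python
# Linux-only sketch: report which common vector ISA extensions the CPU advertises.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.lower().startswith(("flags", "features")):
            flags.update(line.split(":", 1)[1].split())

for ext in ("sse4_2", "avx", "avx2", "avx512f", "neon", "asimd", "sve"):
    print(f"{ext:>8}: {'yes' if ext in flags else 'no'}")
```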
Type | Usage Notes |
---|---|
FP64 (Double Precision) | Used in scientific computing and simulations requiring high numerical precision. |
FP32 (Single Precision) | Standard for general compute and legacy deep learning tasks. |
INT64 / INT32 | Used in tasks like cryptography, video & audio encoding, compression, and system-level software. |
FP16 / BF16 / INT16 / INT8 | Efficient formats increasingly used for AI workloads. |
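To see the range/precision tradeoff behind that table, NumPy's finfo/iinfo report the limits of each format (BF16 is omitted because stock NumPy has no bfloat16 dtype):

```python
import numpy as np

for t in (np.float64, np.float32, np.float16):
    info = np.finfo(t)
    print(f"{t.__name__:>7}: max={info.max:.3e}, ~{info.precision} decimal digits")

for t in (np.int64, np.int32, np.int16, np.int8):
    info = np.iinfo(t)
    print(f"{t.__name__:>7}: range [{info.min}, {info.max}]")
```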
Socketed Memory refers to memory modules that are removable and upgradable, such as:
While socketed memory generally offers significantly higher capacities (e.g., a modern server CPU can support up to 6 TB), it typically provides lower bandwidth than Integrated Memory.
Socketed Memory's bandwidth and capacity only play a big role if you're running AI workloads on the CPU. If you run AI workloads on GPUs, these specs are largely irrelevant: the memory just plays the 'system memory' role, and GPU performance is determined by the GPU's own specs and the host bus speed (e.g., PCIe), since GPUs may exchange data with system memory (see the sketch below).
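For a sense of scale, a rough sketch comparing how long it takes to move one layer's weights over the host bus versus reading them from local VRAM (the bandwidth numbers are nominal peaks and purely illustrative):

```python
def transfer_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move `size_gb` of data at a given bandwidth, in milliseconds."""
    return size_gb / bandwidth_gb_s * 1000

layer_gb = 0.5   # e.g., one large transformer layer in FP16
print(f"PCIe 4.0 x16 (~32 GB/s):  {transfer_ms(layer_gb, 32):.1f} ms")
print(f"GDDR6X VRAM (~1000 GB/s): {transfer_ms(layer_gb, 1000):.2f} ms")
```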
The impact of its specs depends on where the AI workload is running:
- If you're running CPU-based AI inference/training, its capacity and bandwidth specs are critical.
- If you're running GPU-based AI workloads, socketed memory mainly serves as system memory. GPU performance is then influenced more by: