FAQ about GPU & CPU AI Specs

Processor

Here we use Processor as the general name for CPU, GPU, AI accelerator, SoC...

NS(X) / No Native Support

This indicates that the processor does not have dedicated hardware units for this data type. To work around this, computations may be converted to a supported data type—typically from lower precision to higher precision.

Example: When running an FP16-based LLM on an older GPU like the GTX 1080 Ti (which has no Tensor Cores and lacks fast native FP16 support in its CUDA cores), FP16 operations will be internally upcast to FP32 and run on the CUDA cores. You might even observe slight performance improvements in memory-bound workloads, since the FP16 model reduces memory usage, but for compute-bound workloads you should expect some performance regression compared to running the model in FP32, due to conversion overhead.
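A quick way to check which path your GPU takes is to read its CUDA compute capability; Tensor Cores arrived with Volta (compute capability 7.0). A minimal sketch, assuming PyTorch with CUDA support is installed:

```python
# Minimal sketch, assuming PyTorch with CUDA support is installed.
# Compute capability >= 7.0 (Volta/Turing and newer) implies Tensor Cores;
# the GTX 1080 Ti (Pascal) reports 6.1 and falls back to its CUDA cores.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"{name}: compute capability {major}.{minor}")
    if major >= 7:
        print("Tensor Cores present: FP16/BF16 matrix math runs natively.")
    else:
        print("No Tensor Cores: FP16 work is upcast and run on CUDA cores.")
else:
    print("No CUDA device detected.")
```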

NI / No Information

This indicates that no official information about this specification is publicly available for the processor.

Sparse Acceleration

Sparse acceleration requires specific matrix sparsity patterns to activate the hardware optimization, so the majority of AI tasks will not benefit from it.

"NVIDIA Sparse Tensor Cores use a 2:4 pattern, meaning that two out of each contiguous block of four values must be zero. In other words, we follow a 50% fine-grained structured sparsity recipe, with no computations being done on zero-values due to the available support directly on the Tensor Cores. " --- NVIDIA Blogs

UMA(Unified Memory Architecture)

When a processor is labeled as supporting Unified Memory Architecture (UMA) in these pages, it typically means that the CPU and GPU (and other on-chip accelerators) share a single physical memory pool, so data does not have to be copied between separate system RAM and VRAM.

As AI booms, more processor vendors are producing these kinds of processors, for example:

| Vendor | Processor |
| --- | --- |
| Apple | M1, M2, M3, M4 |
| AMD | AMD Ryzen AI MAX+ (Strix Halo series) |
| NVIDIA | DGX Spark (Project Digits) |

Unified Memory Notes:

  • While the iGPU shares access to the full memory pool, not all of it may be available as VRAM. A portion is reserved for the operating system and general system use.
  • In Apple’s M-series SoCs with ≥64GB memory, the iGPU can use up to 75% of the memory as VRAM. For models with <64GB, the iGPU is limited to ~66% (see the sketch below).
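A minimal sketch of that rule of thumb (the thresholds are approximations of macOS defaults, and the function name here is just illustrative):

```python
def estimated_apple_vram_gb(total_memory_gb: float) -> float:
    """Approximate default VRAM budget for the iGPU on Apple M-series SoCs:
    ~75% of unified memory on machines with >= 64 GB, ~66% otherwise.
    Actual limits vary by macOS version and can be tuned at the OS level."""
    fraction = 0.75 if total_memory_gb >= 64 else 0.66
    return total_memory_gb * fraction

print(estimated_apple_vram_gb(32))   # ~21.1 GB usable as VRAM
print(estimated_apple_vram_gb(128))  # 96.0 GB usable as VRAM
```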

Theoretical LLM Token Speed

If you prefer accuracy over speed and have enough memory, FP16 is always the best choice; otherwise you can choose a quantization type such as:

  • GGUF format's Q2~Q8, roughly corresponding to INT2~INT8
  • GPTQ/AWQ, which typically use INT4

The calculation is based on the assumption that batch size = 1, in which case memory bandwidth is the only spec that matters (the computation requirement can be fulfilled by almost any processor).

Token/s = Bandwidth / Model size
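As a worked example, a minimal sketch of this estimate (the numbers are illustrative, not measured):

```python
def theoretical_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Token/s = memory bandwidth / model size, assuming batch size = 1
    and a purely memory-bound decode phase."""
    return bandwidth_gb_s / model_size_gb

# Illustrative: a 7B-parameter model in FP16 weighs ~14 GB, so a GPU with
# ~1000 GB/s of memory bandwidth tops out around 70 token/s in theory.
print(theoretical_tokens_per_second(bandwidth_gb_s=1000, model_size_gb=14))  # ~71 token/s
```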

Integrated Memory

Integrated Memory refers to the following types of memory modules:

  • Soldered Memory Modules – e.g., GDDR used in most gaming GPUs, soldered onto the PCB.
  • On-Package Memory – e.g., HBM (High Bandwidth Memory) used in server GPUs like NVIDIA A100, or LPDDR packaged within Apple's M-series processors.
  • On-Chip Memory – e.g., SRAM used as main memory in highly specialized processors like Cerebras’ WSE (Wafer Scale Engine).

Integrated Memory offers faster access because it is physically closer to the compute units, reducing latency and increasing bandwidth. However, its capacity is limited, as increasing it significantly raises fabrication and packaging costs, as well as power and thermal requirements.

For processors with a Unified Memory Architecture, Integrated Memory can be shared by the iGPU and CPU (e.g., Apple’s M-series SoCs). For other CPUs with Integrated Memory—such as Intel’s Xeon Max series, which includes 64GB of HBM—the Integrated Memory behaves like Socketed Memory and cannot be directly used by the GPU; its specifications only matter if you’re using the CPU for AI workloads, otherwise the memory simply functions as system memory.

Intel Xeon Max Series:
While its built-in HBM and support for Intel AMX (Advanced Matrix Extensions) theoretically provide a significant speedup for AI workloads, real-world performance has been underwhelming.

Matrix Compute Performance

Matrix Compute (or Tensor Compute) refers to hardware units designed to accelerate matrix operations, typically at lower precisions; these operations are the core of modern AI workloads (e.g., training and inference of deep neural networks).

Common Matrix Compute Technology:

| Processor | Technology | Brief |
| --- | --- | --- |
| NVIDIA GPU | Tensor Core | Volta (V100, 2017) and GeForce RTX (Turing, 2018) onward |
| AMD GPU | Matrix Core | CDNA (Instinct MI100, 2020) / RDNA 3 (Radeon RX 7000) |
| Intel GPU | XMX Engine | Xe-HPG (Arc series, 2022) |
| Intel CPU | AMX (Advanced Matrix Extensions) | 4th Gen Xeon Scalable CPUs |
| ARM CPUs | SME (Scalable Matrix Extension) / Apple’s custom AMX | Apple M1–M4 |
| RISC-V | Vendor-specific matrix units (no standard yet) | Varies |
| AI Accelerators | TPUs, Trainium, Maia | Custom accelerators from Google, AWS, Microsoft |

Common Matrix-Compute Data Types:

| Type | Usage Notes |
| --- | --- |
| FP64 | Scientific workloads; mainly supported by server GPUs |
| FP32 | Used in training; gradually giving way to TF32/BF16 |
| TF32 / BF16 / FP16 | Training models with higher efficiency and acceptable precision |
| FP8 / FP4 / INT8 / INT4 | Primarily used in inference; smaller size reduces memory and bandwidth demands, with some precision tradeoffs |

NVIDIA GPU Notes:

  • FP8 FLOPS = "Peak FP8 Tensor TFLOPS with FP32 Accumulate" (doubles with FP16 accumulate).
  • FP16 FLOPS = "Peak FP16 Tensor TFLOPS with FP32 Accumulate" (doubles with FP16 accumulate).
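To get a feel for how much faster the matrix units run reduced-precision math than plain FP32, here is a rough benchmarking sketch, assuming PyTorch with a CUDA GPU; results are only indicative, since clocks, power limits, and library versions all matter:

```python
# Rough matmul throughput sketch, assuming PyTorch with a CUDA GPU.
import time
import torch

def matmul_tflops(dtype: torch.dtype, n: int = 4096, iters: int = 50) -> float:
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b  # warm-up so kernels are selected/cached before timing
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    # One n x n matmul costs roughly 2 * n^3 floating-point operations.
    return 2 * n**3 * iters / (time.time() - start) / 1e12

print(f"FP16 (Tensor Cores on Volta and newer): {matmul_tflops(torch.float16):.1f} TFLOPS")
print(f"FP32 (CUDA cores):                      {matmul_tflops(torch.float32):.1f} TFLOPS")
```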

Vector Compute Performance

Vector Compute (also called Shader Compute, General-Purpose Compute, SIMD Compute, CUDA Compute, or Non-Tensor Compute) refers to general-purpose compute units that are not specifically built for AI but can still execute AI tasks, especially when dedicated matrix accelerators are not available.

Common Vector Compute Technology:

| Processor | Technology | Brief |
| --- | --- | --- |
| NVIDIA GPU | CUDA Core | Tesla architecture (2006) |
| AMD GPU | Stream Processor | TeraScale architecture (2007) |
| Intel GPU | Xe Vector Engine | Xe-HPG (Arc series, 2022) |
| x86 CPU | SSE, AVX, AVX2, AVX-512 | |
| ARM CPU | NEON and SVE/SVE2 | |
| RISC-V | RVV (RISC-V Vector Extension) | |
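On the CPU side, you can check which of these extensions a chip advertises. A small Linux-only, x86-oriented sketch that reads the flags from /proc/cpuinfo (flag names differ on other architectures and operating systems):

```python
# Minimal sketch (Linux, x86): list which vector/matrix extensions the CPU
# advertises via /proc/cpuinfo. Other platforms need a different approach.
def cpu_vector_extensions(path: str = "/proc/cpuinfo") -> dict:
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {ext: ext in flags
                        for ext in ("sse2", "avx", "avx2", "avx512f", "amx_tile")}
    return {}

print(cpu_vector_extensions())
# e.g. {'sse2': True, 'avx': True, 'avx2': True, 'avx512f': False, 'amx_tile': False}
```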

Common Vector-Compute Data Types (Precisions):

| Type | Usage Notes |
| --- | --- |
| FP64 (Double Precision) | Used in scientific computing and simulations requiring high numerical precision. |
| FP32 (Single Precision) | Standard for general compute and legacy deep learning tasks. |
| INT64 / INT32 | Used in tasks like cryptography, video & audio encoding, compression, and system-level software. |
| FP16 / BF16 / INT16 / INT8 | Efficient formats increasingly used for AI workloads. |

Socketed Memory

Socketed Memory refers to memory modules that are removable and upgradable, such as:

  • DIMMs - Used in desktops and servers.
  • SO-DIMMs/CAMM/LPCAMM - Used in laptops and compact systems.

While Socketed Memory generally offers significantly higher capacities (e.g., a modern server CPU can support up to 6 TB), it typically provides lower bandwidth than Integrated Memory.
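To put rough numbers on that, peak theoretical bandwidth can be estimated from transfer rate, channel count, and bus width; a small sketch with illustrative configurations:

```python
def memory_bandwidth_gb_s(transfer_rate_mt_s: float, channels: int,
                          bus_width_bits: int = 64) -> float:
    """Peak theoretical DRAM bandwidth = transfer rate x channels x bus width."""
    return transfer_rate_mt_s * 1e6 * channels * (bus_width_bits / 8) / 1e9

print(memory_bandwidth_gb_s(5600, channels=2))   # dual-channel DDR5-5600 desktop: ~89.6 GB/s
print(memory_bandwidth_gb_s(6400, channels=12))  # 12-channel DDR5-6400 server:   ~614.4 GB/s
```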

Socketed Memory's capacity and bandwidth play a big role only if the AI workload runs on the CPU; for GPU-based workloads they are largely trivial, with the memory simply filling the 'system memory' role.

In other words, its impact depends on where the AI workload is running:

  • If you're using CPU-based AI inference/training, their capacity and bandwidth specs are critical.

  • If you're using GPU-based AI workloads, socketed memory mainly serves as system memory. GPU performance is then influenced more by:

    • The GPU's own specs (and inter-GPU bandwidth such as NVLink, if you're using multiple GPUs).
    • Interconnect bandwidth (e.g., PCIe) between the CPU and GPU, as illustrated in the sketch below.
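As a rough illustration of why that interconnect matters, a sketch that estimates how long it takes to move model weights from system memory into VRAM at a bus's theoretical peak (real transfers are slower):

```python
# Sketch: time to move model weights over the host bus, at theoretical peak.
def transfer_time_s(model_size_gb: float, bus_bandwidth_gb_s: float) -> float:
    return model_size_gb / bus_bandwidth_gb_s

# PCIe 4.0 x16 peaks around ~32 GB/s; PCIe 5.0 x16 around ~64 GB/s.
print(transfer_time_s(model_size_gb=14, bus_bandwidth_gb_s=32))  # ~0.44 s for a 14 GB model
print(transfer_time_s(model_size_gb=14, bus_bandwidth_gb_s=64))  # ~0.22 s
```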