Here we use Processor as the general name for CPU, GPU, AI accelerator, SoC...
This indicates that the processor does not have dedicated hardware units for this data type. To work around this, computations may be converted to a supported data type—typically from lower precision to higher precision.
Example: Running an FP16-based LLM on an older GPU like the GTX 1080 Ti (which has no Tensor Cores and lacks native FP16 support in its CUDA cores), the FP16 operations will be internally upcast to FP32 and run on the CUDA cores. You might even observe slight performance improvements in memory-bound workloads, since the FP16 model reduces memory usage; but in compute-bound workloads you should expect some performance regression compared to running in FP32, due to the conversion overhead.
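A minimal NumPy sketch of this fallback path (shapes and sizes are arbitrary): the weights are stored in FP16 to halve the memory footprint, but upcast to FP32 before the matmul, similar to what a GPU without native FP16 compute does internally.

```python
import numpy as np

# Weights kept in FP16: half the memory footprint of FP32.
weights_fp16 = np.random.randn(4096, 4096).astype(np.float16)
activations_fp16 = np.random.randn(1, 4096).astype(np.float16)

# No native FP16 compute: upcast both operands to FP32 and run the matmul there.
# This cast is the conversion overhead mentioned above.
out = activations_fp16.astype(np.float32) @ weights_fp16.astype(np.float32)

print(out.dtype)                                   # float32
print(weights_fp16.nbytes / 2**20, "MiB stored")   # 32 MiB, vs. 64 MiB in FP32
```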
Requires specific matrix patterns to activate hardware optimizations, so a majority of AI tasks will not benefit from this.
"NVIDIA Sparse Tensor Cores use a 2:4 pattern, meaning that two out of each contiguous block of four values must be zero. In other words, we follow a 50% fine-grained structured sparsity recipe, with no computations being done on zero-values due to the available support directly on the Tensor Cores. " --- NVIDIA Blogs
When a processor is labeled as supporting Unified Memory Architecture (UMA) in the pages, it typically means:
As AI booms, more processor vendors are producing these kinds of processors, such as:
Vendor | Processor |
---|---|
Apple | M1, M2, M3, M4 |
AMD | AMD Ryzen AI MAX+ (Strix Halo series) |
NVIDIA | DGX Spark (Project Digits) |
Unified Memory Notes:
- While the iGPU shares access to the full memory pool, not all of it may be available as VRAM. A portion is reserved for the operating system and general system use.
- In Apple’s M-series SoCs with ≥64GB memory, the iGPU can use up to 75% of the memory as VRAM. For models with <64GB, the iGPU is limited to ~66%.
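A quick sketch of that rule of thumb (the 75%/66% split matches the default behavior described above; the actual limit can vary with OS version and configuration):

```python
def usable_vram_gb(total_memory_gb: float) -> float:
    """Approximate default VRAM limit for Apple M-series unified memory."""
    fraction = 0.75 if total_memory_gb >= 64 else 0.66
    return total_memory_gb * fraction

for total in (16, 32, 64, 128):
    print(f"{total:>3} GB total -> ~{usable_vram_gb(total):.0f} GB usable as VRAM")
```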
If you prefer accuracy over speed and have enough memory, FP16 is usually the best choice; otherwise, you can choose quantization types such as:
- GGUF format's Q2~Q8, roughly corresponding to INT2~INT8
- GPTQ/AWQ, which typically use INT4
The calculation below is based on the assumption that batch size = 1, in which case memory bandwidth is the only spec that matters (the compute requirement can be met by almost any processor).
Token/s = Bandwidth / Model size
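A back-of-the-envelope sketch combining the two points above, assuming batch size = 1 and a fully memory-bound decode; the bytes-per-weight values are rough (quantization metadata is ignored) and the bandwidth figure is illustrative, so real throughput will be somewhat lower:

```python
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}   # rough, ignores quant metadata

def tokens_per_second(params_billions: float, quant: str, bandwidth_gb_s: float) -> float:
    """Token/s ≈ Bandwidth / Model size, for batch size = 1."""
    model_size_gb = params_billions * BYTES_PER_WEIGHT[quant]
    return bandwidth_gb_s / model_size_gb

# e.g., a 7B model at Q4 on a processor with ~400 GB/s of memory bandwidth
print(f"~{tokens_per_second(7, 'Q4', 400):.0f} tok/s (upper bound)")   # ~114
```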
Integrated Memory refers to the following memory modules:
Integrated Memory offers faster access because it is physically closer to the compute units, reducing latency and increasing bandwidth. However, its capacity is limited, as increasing it significantly raises fabrication and packaging costs, as well as power and thermal requirements.
For processors with a Unified Memory Architecture, Integrated Memory can be shared by the iGPU and CPU (e.g., Apple’s M-series SoCs). For other CPUs with Integrated Memory, such as Intel’s Xeon Max series (which includes 64GB of HBM), the Integrated Memory behaves like Socketed Memory and cannot be directly used by the GPU; its specifications only matter if you’re using the CPU for AI workloads. Otherwise, the memory simply functions as system memory.
Unified Memory Notes:
- In Apple’s M-series SoCs with ≥64GB memory, the iGPU can use up to 75% of the memory as VRAM.
- For models with <64GB (e.g., ≤32GB), the iGPU is limited to ~66%.
Intel Xeon Max Series:
While its built-in HBM and support for Intel AMX (Advanced Matrix Extensions) theoretically provide a significant speedup for AI workloads, real-world performance has been underwhelming.
Matrix Compute (or Tensor Compute) refers to hardware units designed to accelerate matrix operations at lower precisions, which are the core of modern AI workloads (e.g., training and inference of deep neural networks). See the usage sketch after the table below.
Processor | Technology | Brief |
---|---|---|
NVIDIA GPU | Tensor Core | Volta (V100, 2017) and GeForce RTX (Turing, 2018) onward |
AMD GPU | Matrix Core | CDNA (Instinct MI100, 2020) / RDNA 3 (Radeon RX 7000) |
Intel GPU | XMX Engine | Xe-HPG (Arc series, 2022) |
Intel CPU | AMX (Advanced Matrix Extensions) | 4th Gen Xeon Scalable CPUs |
ARM CPUs | SME (Scalable Matrix Extension) / Apple’s custom AMX | Apple M1–M4 |
RISC-V | Vendor-specific matrix units (no standard yet) | Varies |
AI Accelerators | TPUs, Trainium, Maia | Custom accelerators from Google, AWS, Microsoft |
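As a usage sketch (assuming a recent PyTorch build; whether the work actually lands on the matrix units depends on the hardware and driver), this enables TF32 for FP32 matmuls and runs a matmul under BF16 autocast, which is how frameworks typically target Tensor Cores and their equivalents:

```python
import torch

# Allow FP32 matmuls to be computed as TF32 on Tensor Cores (Ampere and newer).
torch.backends.cuda.matmul.allow_tf32 = True

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# Autocast runs eligible ops (like matmul) in BF16, targeting the matrix units.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    c = a @ b

print(c.dtype)   # torch.bfloat16 where autocast applies
```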
Type | Usage Notes |
---|---|
FP64 | Scientific workloads; mainly supported by server GPUs |
FP32 | Used in training; gradually giving way to TF32/BF16 |
TF32 / BF16 / FP16 | Training models with higher efficiency and acceptable precision |
FP8 / FP4 / INT8 / INT4 | Primarily used in inference; smaller size reduces memory and bandwidth demands, with some precision tradeoffs |
NVIDIA GPU Notes:
- FP8 FLOPS = "Peak FP8 Tensor TFLOPS with FP32 Accumulate" (doubles with FP16 accumulate).
- FP16 FLOPS = "Peak FP16 Tensor TFLOPS with FP32 Accumulate" (doubles with FP16 accumulate).
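To see why the accumulation precision in those figures matters, here is a small NumPy sketch: summing many small FP16 values with an FP16 accumulator stalls once the running sum gets large enough, while an FP32 accumulator does not (an illustration of rounding behavior, not of Tensor Core internals).

```python
import numpy as np

products = np.full(10_000, 0.01, dtype=np.float16)   # many small FP16 values

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for p in products:
    acc16 = np.float16(acc16 + p)    # FP16 accumulator: rounds after every add
    acc32 = acc32 + np.float32(p)    # FP32 accumulator

print(acc16)   # stalls around 32: further additions round away in FP16
print(acc32)   # ~100, as expected
```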
Vector Compute (also called Shader Compute, General-Purpose Compute, SIMD Compute, CUDA Compute, or Non-Tensor Compute) is not specifically built for AI, but it can execute AI tasks, especially when dedicated matrix accelerators are not available (see the sketch after the table below).
Processor | Technology | Brief |
---|---|---|
NVIDIA GPU | CUDA Core | Tesla architecture (2006) |
AMD GPU | Stream Processor | TeraScale architecture (2007) |
Intel GPU | Xe Vector Engine | Xe-HPG (Arc series, 2022) |
x86 CPU | SSE, AVX, AVX2, AVX-512 | |
ARM CPU | NEON and SVE/SVE2 | |
RISC-V | RVV (RISC-V Vector Extension) |
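A quick way to check which of these vector extensions your CPU advertises is to read the flags from /proc/cpuinfo (a Linux-only sketch; other platforms need tools like sysctl or CPU-Z):

```python
# Linux-only sketch: report which common vector ISA extensions the CPU advertises.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.lower().startswith(("flags", "features")):
            flags.update(line.split(":", 1)[1].split())

for ext in ("sse4_2", "avx", "avx2", "avx512f", "neon", "asimd", "sve"):
    print(f"{ext:>8}: {'yes' if ext in flags else 'no'}")
```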
Type | Usage Notes |
---|---|
FP64 (Double Precision) | Used in scientific computing and simulations requiring high numerical precision. |
FP32 (Single Precision) | Standard for general compute and legacy deep learning tasks. |
INT64 / INT32 | Used in tasks like cryptography, video & audio encoding, compression, and system-level software. |
FP16 / BF16 / INT16 / INT8 | Efficient formats increasingly used for AI workloads. |
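To see the range/precision tradeoff behind that table, NumPy's finfo/iinfo report the limits of each format (BF16 is omitted because stock NumPy has no bfloat16 dtype):

```python
import numpy as np

for t in (np.float64, np.float32, np.float16):
    info = np.finfo(t)
    print(f"{t.__name__:>7}: max={info.max:.3e}, ~{info.precision} decimal digits")

for t in (np.int64, np.int32, np.int16, np.int8):
    info = np.iinfo(t)
    print(f"{t.__name__:>7}: range [{info.min}, {info.max}]")
```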
Socketed Memory refers to memory modules that are removable and upgradable, such as:
While socketed memory generally offers significantly higher capacities (e.g., a modern server CPU can support up to 6 TB), it typically provides lower bandwidth than Integrated Memory.
Socketed Memory's bandwidth and capacity only play a big role if you're running AI workloads on the CPU. If you run AI workloads on GPUs, these specs are largely irrelevant: the memory just plays the 'system memory' role, and GPU performance is determined by the GPU's own specs and the host bus speed (e.g., PCIe), since GPUs may exchange data with system memory (see the sketch below).
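For a sense of scale, a rough sketch comparing how long it takes to move one layer's weights over the host bus versus reading them from local VRAM (the bandwidth numbers are nominal peaks and purely illustrative):

```python
def transfer_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time to move `size_gb` of data at a given bandwidth, in milliseconds."""
    return size_gb / bandwidth_gb_s * 1000

layer_gb = 0.5   # e.g., one large transformer layer in FP16
print(f"PCIe 4.0 x16 (~32 GB/s):  {transfer_ms(layer_gb, 32):.1f} ms")
print(f"GDDR6X VRAM (~1000 GB/s): {transfer_ms(layer_gb, 1000):.2f} ms")
```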
The impact of its specs depends on where the AI workload is running:
- If you're running CPU-based AI inference/training, its capacity and bandwidth specs are critical.
- If you're running GPU-based AI workloads, socketed memory mainly serves as system memory. GPU performance is then influenced more by: