Llama-bench on some consumer-grade AI hardware
I have been on a bender this weekend experimenting with various LLM-capable
machines in my homelab, especially with the very capable yet fast
Qwen3-30B-A3B. I haven’t found good benchmarks for this hardware, though, so I
ran the small Gemma 3n E4B Q4_K model (4.62 GiB, 7.52B params) through
llama-bench. This reports two measures: prompt processing 512 (pp512) is how
quickly, in tokens/second, the LLM can read a 512-token prompt, i.e. how good
the LLM is at “reading”; token generation 128 (tg128) is how quickly it can
write 128 tokens’ worth of text, i.e. how fast it is at answering the question.
| Hostname | Backend | pp512 (t/s) | tg128 (t/s) | Machine |
|---|---|---|---|---|
| xhystos | ROCm | 291.48 | 6.65 | AMD Ryzen AI 7 350 Krackan Point 32GB |
| utumno | Metal,BLAS | 1172.93 | 69.73 | Mac Studio M1 Ultra 128GB |
| ai-x1-pro | ROCm | 568.54 | 21.16 | AMD AI 9 HX 370 Strix Point 96GB |
| dgx1 | CUDA | 3633.84 | 59.42 | NVIDIA DGX Spark 128GB |
| zanzibar | CUDA | 1831.78 | 51.92 | NVIDIA A2000 12GB |
The Mac Studio performs very well at token generation despite being a four-year-old machine, but perhaps that reflects how heavily llama.cpp is optimized for Apple Silicon. I was also surprised by how strong the A2000’s performance is, given that it is a fairly weak low-power graphics card meant for 2D CAD in small form-factor workstations like my HP Z2 Mini G9, where it lives. Conversely, the Strix Point results are underwhelming, even though it subjectively performs reasonably well with Qwen 3.