I have been on a bender this weekend experimenting with various LLM-capable machines in my homelab, especially the very capable yet fast Qwen3.6-35B-A3B. I haven’t found good benchmarks, though, so I ran the small Gemma4 E4B Q4_K model (4.62 GiB, 7.52B params) using llama-bench. This reports two measures: prompt processing 512 (pp512) is how quickly, in tokens per second, the LLM can read a 512-token prompt, i.e. how good the LLM is at “reading”; token generation 128 (tg128) is how quickly it can write 128 tokens’ worth of text, i.e. how fast it is at answering the question.
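To make the two measures concrete, here is a minimal sketch of how llama-bench-style throughput figures are derived: each number is simply tokens processed divided by wall-clock seconds for that phase. The timings below are hypothetical, chosen only so the results land near the figures in the table, not taken from real runs.

```python
# Hypothetical example: how pp512 and tg128 throughput figures are computed.
# The elapsed times below are made up for illustration, not real benchmark output.

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens/second for a phase that handled n_tokens."""
    return n_tokens / elapsed_s

# pp512: time to ingest ("read") a 512-token prompt in one go (prefill)
pp512 = tokens_per_second(512, 0.4366)

# tg128: time to generate ("write") 128 new tokens one at a time (decode)
tg128 = tokens_per_second(128, 1.8357)

print(f"pp512: {pp512:.2f} t/s, tg128: {tg128:.2f} t/s")
```

Prefill is parallel and compute-bound, while decode is sequential and memory-bandwidth-bound, which is why the two numbers differ so much on every machine.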

| Hostname  | Backend    | pp512 t/s | tg128 t/s | Machine                               |
|-----------|------------|-----------|-----------|---------------------------------------|
| xhystos   | ROCm       | 291.48    | 6.65      | AMD Ryzen AI 7 350 Krackan Point 32GB |
| utumno    | Metal,BLAS | 1172.93   | 69.73     | Mac Studio M1 Ultra 128GB             |
| ai-x1-pro | ROCm       | 568.54    | 21.16     | AMD AI 9 HX 370 Strix Point 96GB      |
| dgx1      | CUDA       | 3633.84   | 59.42     | NVIDIA DGX Spark 128GB                |
| zanzibar  | CUDA       | 1831.78   | 51.92     | NVIDIA A2000 12GB                     |

(click on the hostnames to get the raw report)

The Mac Studio performs very well at token generation despite being a four-year-old machine, though perhaps that reflects how well llama.cpp is optimized for Apple Silicon. I was also surprised by how strong the A2000 is, given that it is a fairly weak, low-power graphics card meant for 2D CAD in small-form-factor workstations like the HP Z2 Mini G9 where it lives. Conversely, the Strix Point numbers are underwhelming, even if subjectively it performs reasonably well with Qwen 3.
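As a quick sanity check on those impressions, the tg128 figures from the table can be normalized against the fastest machine. A small Python sketch (using the numbers above; the short host labels are mine):

```python
# Normalize the tg128 (token generation) results from the table
# against the best performer, the M1 Ultra.
tg128 = {
    "xhystos (Krackan Point)": 6.65,
    "utumno (M1 Ultra)": 69.73,
    "ai-x1-pro (Strix Point)": 21.16,
    "dgx1 (DGX Spark)": 59.42,
    "zanzibar (A2000)": 51.92,
}

best = max(tg128.values())
for host, tps in sorted(tg128.items(), key=lambda kv: -kv[1]):
    print(f"{host:26s} {tps:6.2f} t/s  ({tps / best:.0%} of the M1 Ultra)")
```

This makes the spread obvious: the little A2000 lands within about three quarters of the M1 Ultra's generation speed, while the Strix Point manages roughly 30% and the Krackan Point under 10%.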