With Apex Compute, we demonstrate real-time, efficient billion-parameter LLM inference on FPGA hardware, significantly outperforming the NVIDIA Jetson in both speed and energy efficiency. Running the Gemma 3 1B model via llama.cpp, our architecture achieves 15.6 tokens/sec at just 4.5 W: more than 2x faster and 3x more energy-efficient than CUDA on the Jetson Orin Nano. Despite operating with slower memory, a lower clock frequency, an older process node, and an FPGA rather than an ASIC, Apex Compute accelerates most core transformer operations by up to an order of magnitude or more compared to CUDA. With a straightforward transition to a 12nm ASIC and LPDDR5 bandwidth, we project more than a 20× system speed-up and the possibility of a sub-1-watt chip capable of running billion-parameter models, unlocking true edge-AI capability far beyond current GPU architectures. The following video shows a side-by-side comparison of our architecture and the NVIDIA Jetson.
**The Gemma 3 1B parameter model was tested using the llama.cpp framework on the NVIDIA Jetson with CUDA enabled.** llama.cpp still performs some computations on the CPU alongside GPU execution; to minimize this, we limited the CPU to a single thread (the lowest possible setting).
In a GPU-only configuration, performance is approximately 8 tokens/second, which is lower than in the real-time demo shown above. Below is the full comparison using GPU-only measurements.
| Operation | Apex μs | CUDA μs | Speedup | Speedup (12nm ASIC projection) |
| --- | --- | --- | --- | --- |
| RMSNorm of input (1,1152) | 0.785 | 21.448 | 27.33x | 65x |
| Q K V projection (1,1024)(1,256)(1,256) | 112.054 | 251.709 | 2.25x | 22.5x |
| RMSNorm for Q and K (1,1024)(1,256) | 2.313 | 46.363 | 20.04x | 48x |
| rope for Q and K (4,256)(1,256) | 2.837 | 19.161 | 6.75x | 16x |
| kqT + softmax (4,256)x(256,9) + transpose v (9,256) | 7.895 | 37.408 | 4.74x | 11.4x |
| matmul for attention 4x (1,9)x(9,256) | 3.426 | 71.1 | 20.75x | 8.2x |
| mvm for att output proj (1,1024)x(1024,1152) | 75.054 | 152.567 | 2.03x | 20x |
| RMSNorm output proj (1,1152) | 0.785 | 21.451 | 27.33x | 65x |
| residual add for att output (1,1152) | 2.313 | 12.388 | 5.36x | 12.9x |
| rms_norm for pre mlp (1,1152) | 0.836 | 21.06 | 25.19x | 60x |
| mvm for mlp gate + gelu (1,1152)x(1152,6912) | 505.622 | 836.437 | 1.65x | 16.5x |
| mvm for mlp up (1,1152)x(1152,6912) | 504.809 | 864.549 | 1.71x | 17.1x |
| mvm for mlp down (1,6912)x(6912,1152) + ff element-wise mult (1,6912) | 505.148 | 325.476 | 0.64x | 6.4x |
| rms_norm for post mlp (1,1152) | 0.785 | 20.988 | 26.74x | 64x |
| output residual add (1,1152) | 2.647 | 12.262 | 4.63x | 11x |
| large mvm (1,1152)x(1152,262144) | 19041.4 | 59500 | 3.12x | 31.2x |
| Model speed-up | 15.6 tokens/s | 7.8 tokens/s | 2.00x | ≥ 20x |
| Gemma 3 1B model | Apex (4.5 W) | CUDA on Jetson Orin Nano (7 W) | Speedup (Apex/CUDA) | 12nm ASIC projection* |
|---|---|---|---|---|
| tokens/sec | 15.63 | 7.59 | 2.06x faster | ≥ 20x faster |
| joules/token | 0.287 | 0.922 | 3.20x more efficient | 10x-20x (pending sim) |
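The joules/token figures above follow directly from measured power and throughput. A minimal sketch of the arithmetic, using the measured values from the table:

```python
# Energy per token = sustained power (W) / throughput (tokens/s).
# All inputs are the measured numbers reported in the table above.
apex_power_w, apex_tps = 4.5, 15.63
jetson_power_w, jetson_tps = 7.0, 7.59

apex_jpt = apex_power_w / apex_tps        # ~0.288 J/token (table truncates to 0.287)
jetson_jpt = jetson_power_w / jetson_tps  # ~0.922 J/token
efficiency_gain = jetson_jpt / apex_jpt   # ~3.2x more efficient
print(f"{apex_jpt:.3f} J/tok vs {jetson_jpt:.3f} J/tok -> {efficiency_gain:.2f}x")
```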
| Specification | Apex hardware design on Kintex UltraScale+ KU5P (FPGA) | NVIDIA Jetson Orin Nano 8GB | Comparison |
| --- | --- | --- | --- |
| Memory | 1333 MHz, 32-bit DDR4 | 2133 MHz, 128-bit LPDDR5 | 6.4x less bandwidth than NVIDIA |
| Engine speed | 333 MHz | 408 MHz (GPU) | 1.23x slower than NVIDIA |
| Precision | int4/fp4/int8 matrix, bf16 vector | bf16/fp32/int4/int8 | |
| Power | 4.5 W | 7 W (compute module only) | 1.56x less power |
| Process | 16nm FinFET | 8nm Samsung | NVIDIA has a much better process node |
Our design is primarily memory-bandwidth-bound for weight matrix multiplication and compute-bound for SRAM-based operations. By transitioning to LPDDR5, we can realize a proportional performance increase directly in our architecture, as there are no structural bottlenecks or external overheads in the hardware.
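A rough roofline-style bound illustrates the memory-bound regime: each generated token must stream the quantized weights from DRAM once, so throughput is capped at bandwidth divided by the weight footprint. The weight-footprint figure below is an illustrative assumption (roughly 1B parameters at ~4 bits/weight), not a measurement:

```python
# Roofline-style upper bound for memory-bound token generation:
#   tokens/s <= DRAM bandwidth / bytes of weights streamed per token.
ddr4_bw_gbs = 1333e6 * 2 * 4 / 1e9  # 32-bit DDR4 at 1333 MHz (2666 MT/s): ~10.7 GB/s peak
weights_gb = 0.5                    # ~1B params at ~4 bits/weight (assumed, not measured)

bound_tps = ddr4_bw_gbs / weights_gb
print(f"upper bound ~{bound_tps:.1f} tokens/s")
```

Our measured 15.6 tokens/s sits below this peak-bandwidth ceiling, consistent with a design that is limited by DRAM bandwidth rather than by compute.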
A standard LPDDR5 controller on a 12 nm process node can achieve 6400 Mbps per pin on a 128-bit interface, delivering approximately 102 GB/s of bandwidth, about 10× the roughly 10 GB/s of our current FPGA-based system. Since we already hold a ~2× lead at the current bandwidth, this improvement enables roughly a 20× speed-up compared to NVIDIA Jetson platforms.
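The bandwidth arithmetic behind this projection can be sketched directly from the interface widths and data rates stated above:

```python
# LPDDR5 at 6400 Mbps per pin on a 128-bit bus:
lpddr5_bw_gbs = 6400e6 * 128 / 8 / 1e9    # 102.4 GB/s
# Current FPGA baseline: 32-bit DDR4 at 1333 MHz (2666 MT/s):
ddr4_bw_gbs = 1333e6 * 2 * 32 / 8 / 1e9   # ~10.7 GB/s

bw_ratio = lpddr5_bw_gbs / ddr4_bw_gbs    # ~9.6x more bandwidth
current_speedup_vs_jetson = 2.0           # measured tokens/s ratio vs Jetson
projected = bw_ratio * current_speedup_vs_jetson
print(f"{bw_ratio:.1f}x bandwidth -> ~{projected:.0f}x vs Jetson")
```

Because the design is memory-bandwidth-bound, the bandwidth ratio multiplies the existing measured advantage, which is where the "roughly 20×" system projection comes from.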
At this bandwidth, our compute engine only needs to operate at 800 MHz to keep the 256-element-pair datapath of the Apex Compute Unified Engine fully fed, which is well within feasible limits. We already close timing at 333 MHz on a 16 nm FPGA, so scaling to ~800 MHz on a 12 nm ASIC is a realistic, expected step.
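The 800 MHz figure follows from matching the engine's per-cycle data consumption to the available bandwidth. A sketch, assuming each cycle consumes 256 int4 weights (128 bytes); the per-cycle byte count is an assumption for illustration:

```python
# Engine clock needed to consume the full LPDDR5 bandwidth without stalls:
#   f = bandwidth / bytes consumed per cycle.
lpddr5_bw_bytes = 102.4e9                  # 128-bit LPDDR5 at 6400 Mbps/pin
elems_per_cycle = 256                      # element pairs processed per cycle
bytes_per_cycle = elems_per_cycle * 4 / 8  # int4 weights: 128 bytes/cycle (assumed)

f_required_mhz = lpddr5_bw_bytes / bytes_per_cycle / 1e6
print(f"required engine clock ~{f_required_mhz:.0f} MHz")
```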
There is a high chance we can achieve a sub-1W compute chip capable of running billion-parameter AI models efficiently.
PyTorch is less efficient than llama.cpp's optimized CUDA kernels for LLM inference; we include PyTorch reference benchmarks below for comparison.
| Operation | Apex μs | CUDA μs | Speedup |
|---|---|---|---|
| RMSNorm of input (1,1152) | 1.678 | 102.309 | 60.97x |
| Q K V projection (1,1024)(1,256)(1,256) | 112.054 | 215.66 | 1.92x |
| RMSNorm for Q and K (1,1024)(1,256) | 2.313 | 189.582 | 81.96x |
| rope for Q and K (4,256)(1,256) | 2.837 | 114.915 | 40.51x |
| kqT + softmax (4,256)x(256,9) + transpose v (9,256) | 7.895 | 79.749 | 10.10x |
| matmul for attention 4x (1,9)x(9,256) | 3.426 | 20.736 | 6.05x |
| mvm for att output proj (1,1024)x(1024,1152) | 75.054 | 134.853 | 1.80x |
| RMSNorm output proj (1,1152) | 0.785 | 114.009 | 145.23x |
| residual add for att output (1,1152) | 2.313 | 9.44 | 4.08x |
| rms_norm for pre mlp (1,1152) | 0.836 | 101.668 | 121.61x |
| mvm for mlp gate + gelu (1,1152)x(1152,6912) | 505.622 | 655.281 | 1.30x |
| mvm for mlp up (1,1152)x(1152,6912) | 504.809 | 641.656 | 1.27x |
| mvm for mlp down (1,6912)x(6912,1152) + ff element-wise mult (1,6912) | 505.148 | 637.111 | 1.26x |
| rms_norm for post mlp (1,1152) | 0.785 | 117.07 | 149.13x |
| output residual add (1,1152) | 2.647 | 9.473 | 3.58x |
| large mvm (1,1152)x(1152,262144) | 19041.4 | 67140.2 | 3.53x |
| Model Speed up | 15.6 tokens/s | 2.9 tokens/s | 5.2x |
Contact: [email protected]