At Apex Compute, we demonstrate real-time, efficient billion-parameter LLM inference on FPGA hardware, significantly outperforming the NVIDIA Jetson in both speed and energy efficiency. Running the Gemma 3 1B model via llama.cpp, our architecture achieves 15.6 tokens/sec at just 4.5 W, more than 2x faster and 3x more energy-efficient than CUDA on the Jetson Orin Nano. Despite operating with slower memory, a lower clock frequency, an older process node, and an FPGA rather than an ASIC, Apex Compute accelerates core transformer operations by up to an order of magnitude or more compared to CUDA. With a straightforward transition to a 12nm ASIC and LPDDR5 bandwidth, we expect a more than 20x system speed-up and, potentially, a sub-1-watt chip capable of running billion-parameter models, unlocking true edge-AI capability far beyond current GPU architectures. The video below shows a side-by-side comparison of our architecture and the NVIDIA Jetson.

presentation.mp4

**The Gemma 3 1B model was tested using the llama.cpp framework on the NVIDIA Jetson with CUDA enabled.** Unfortunately, llama.cpp still runs some computations on the CPU alongside GPU execution. To minimize this, we limited the CPU to a single thread (the lowest possible setting).

In a GPU-only configuration, performance is approximately 8 tokens/second, which is actually lower than in the real-time demo shown above. Below is the full comparison using GPU-only measurements.

Apex Compute Architecture vs CUDA

| Operation | Apex μs | CUDA μs | Speedup | Speedup (12nm ASIC projection) |
| --- | --- | --- | --- | --- |
| RMSNorm of input (1,1152) | 0.785 | 21.448 | 27.33x | 65x |
| Q K V projection (1,1024)(1,256)(1,256) | 112.054 | 251.709 | 2.25x | 22.5x |
| RMSNorm for Q and K (1,1024)(1,256) | 2.313 | 46.363 | 20.04x | 48x |
| rope for Q and K (4,256)(1,256) | 2.837 | 19.161 | 6.75x | 16x |
| kqT + softmax (4,256)x(256,9) + transpose v (9,256) | 7.895 | 37.408 | 4.74x | 11.4x |
| matmul for attention 4x (1,9)x(9,256) | 3.426 | 71.1 | 20.75x | 8.2x |
| mvm for att output proj (1,1024)x(1024,1152) | 75.054 | 152.567 | 2.03x | 20x |
| RMSNorm output proj (1,1152) | 0.785 | 21.451 | 27.33x | 65x |
| residual add for att output (1,1152) | 2.313 | 12.388 | 5.36x | 12.9x |
| rms_norm for pre mlp (1,1152) | 0.836 | 21.06 | 25.19x | 60x |
| mvm for mlp gate + gelu (1,1152)x(1152,6912) | 505.622 | 836.437 | 1.65x | 16.5x |
| mvm for mlp up (1,1152)x(1152,6912) | 504.809 | 864.549 | 1.71x | 17.1x |
| mvm for mlp down (1,6912)x(6912,1152) + ff element-wise mult (1,6912) | 505.148 | 325.476 | 0.64x | 6.4x |
| rms_norm for post mlp (1,1152) | 0.785 | 20.988 | 26.74x | 64x |
| output residual add (1,1152) | 2.647 | 12.262 | 4.63x | 11x |
| large mvm (1,1152)x(1152,262144) | 19041.4 | 59500 | 3.12x | 31.2x |
| Model Speed up | 15.6 tokens/s | 7.8 tokens/s | 2.00x | ≥ 20x |
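The Speedup column is simply the ratio of the measured per-operation latencies. A minimal check against the first two rows of the table:

```python
# Speedup = CUDA latency / Apex latency, per operation.
def speedup(apex_us: float, cuda_us: float) -> float:
    return cuda_us / apex_us

# RMSNorm of input (1,1152):
print(round(speedup(0.785, 21.448), 2))    # ~27.32x
# Q K V projection:
print(round(speedup(112.054, 251.709), 2)) # ~2.25x
```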

| Gemma 3 1B model | Apex (4.5 W) | CUDA on Jetson Orin Nano (7 W) | Speedup (Apex/CUDA) | 12nm ASIC projection* |
| --- | --- | --- | --- | --- |
| tokens/sec | 15.63 | 7.59 | 2.06x faster | ≥ 20x faster |
| joules/token | 0.287 | 0.922 | 3.20x more efficient | 10x-20x (pending simulation) |
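The joules/token figures follow directly from the measured power draw and throughput:

```python
# Energy per token = sustained power (W) / throughput (tokens/s).
def joules_per_token(watts: float, tokens_per_sec: float) -> float:
    return watts / tokens_per_sec

apex   = joules_per_token(4.5, 15.63)  # ~0.288 J/token
jetson = joules_per_token(7.0, 7.59)   # ~0.922 J/token
print(round(jetson / apex, 2))         # ~3.2x efficiency advantage
```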

| Specification | Apex’s Hardware Design on Kintex UltraScale+ KU5P (FPGA) | NVIDIA Jetson Orin Nano 8GB | Specification Comparison |
| --- | --- | --- | --- |
| Memory | 1333 MHz, 32-bit DDR4 | 2133 MHz, 128-bit LPDDR5 | 6.4x slower than Nvidia |
| Engine Speed | 333 MHz | 408 MHz (GPU) | 1.23x slower than Nvidia |
| Precision | int4/fp4/int8 matrix, bf16 vector | bf16/fp32/int4/int8 | |
| Power | 4.5 W | 7 W (compute module only) | 1.56x less power |
| Process | 16nm FinFET | 8nm Samsung | Nvidia has much better process technology |

\* Notes on the Potential 12nm ASIC Speed-Up

Our design is primarily memory-bandwidth-bound for weight matrix multiplication and compute-bound for SRAM-based operations. By transitioning to LPDDR5, we can realize a proportional performance increase directly in our architecture, as there are no structural bottlenecks or external overheads in the hardware.
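As a rough sanity check on the bandwidth-bound claim, a back-of-the-envelope roofline (assuming, hypothetically, ~1B parameters stored as int4 and DDR4-2666 behavior on the 32-bit, 1333 MHz bus listed above) puts the decode ceiling near the measured throughput:

```python
# Roofline sketch: if decode is weight-bandwidth-bound, tokens/sec is capped by
# bandwidth / bytes-of-weights-read-per-token. Parameter count, int4 storage,
# and double-data-rate behavior are assumptions, not measured values.
bus_bits = 32
transfers_per_sec = 2 * 1333e6                           # DDR: 2 transfers/clock
bandwidth_gb_s = transfers_per_sec * bus_bits / 8 / 1e9  # ~10.66 GB/s
weight_bytes = 1.0e9 * 4 / 8                             # ~0.5 GB of int4 weights
ceiling_tok_s = bandwidth_gb_s * 1e9 / weight_bytes      # ~21 tokens/s ceiling
```

The measured 15.6 tokens/s sits below this ~21 tokens/s ceiling, consistent with the bandwidth-bound characterization.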

A standard LPDDR5 controller on a 12 nm process node can achieve 6400 Mbps per pin on a 128-bit interface, delivering approximately 102.4 GB/s of raw bandwidth, about 10x higher than our current FPGA-based system. With this improvement, the design enables roughly a 20x speed-up compared to NVIDIA Jetson platforms.
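The headline bandwidth is straightforward pin-rate arithmetic:

```python
# LPDDR5-6400 on a 128-bit interface: per-pin data rate x bus width.
pin_rate_bps = 6400e6  # 6400 Mbps per pin
bus_bits = 128
bandwidth_gb_s = pin_rate_bps * bus_bits / 8 / 1e9
print(bandwidth_gb_s)  # 102.4 GB/s raw peak
```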

At this bandwidth, our compute engine only needs to operate at 800 MHz to fully utilize the 256-pair element processing in the Apex Compute Unified Engine Unit, which is well within feasible limits. We already reach 333 MHz timing closure on a 16 nm FPGA, so scaling to ~800 MHz on a 12 nm ASIC is straightforward and expected.
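The 800 MHz target falls out of the same arithmetic if one assumes (hypothetically, since the exact datapath width is not given here) that each of the 256 pairs consumes one int4 weight per cycle, i.e. 128 bytes per cycle:

```python
# Clock needed to keep a 256-pair engine fed at raw LPDDR5-6400 x128 bandwidth.
bandwidth_bytes_s = 102.4e9     # 6400 Mbps/pin x 128 bits / 8
bytes_per_cycle = 256 * 4 // 8  # 256 int4 weights/cycle = 128 bytes (assumption)
required_mhz = bandwidth_bytes_s / bytes_per_cycle / 1e6
print(required_mhz)             # 800.0 MHz
```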

There is a high chance we can achieve a sub-1W compute chip capable of running billion-parameter AI models efficiently.

More Benchmark Results

PyTorch is less efficient than optimized CUDA kernels for LLMs. We've included reference benchmarks below for comparison.

Apex Compute Architecture vs PyTorch CUDA

| Operation | Apex μs | CUDA μs | Speedup |
| --- | --- | --- | --- |
| RMSNorm of input (1,1152) | 1.678 | 102.309 | 60.97x |
| Q K V projection (1,1024)(1,256)(1,256) | 112.054 | 215.66 | 1.92x |
| RMSNorm for Q and K (1,1024)(1,256) | 2.313 | 189.582 | 81.96x |
| rope for Q and K (4,256)(1,256) | 2.837 | 114.915 | 40.51x |
| kqT + softmax (4,256)x(256,9) + transpose v (9,256) | 7.895 | 79.749 | 10.10x |
| matmul for attention 4x (1,9)x(9,256) | 3.426 | 20.736 | 6.05x |
| mvm for att output proj (1,1024)x(1024,1152) | 75.054 | 134.853 | 1.80x |
| RMSNorm output proj (1,1152) | 0.785 | 114.009 | 145.23x |
| residual add for att output (1,1152) | 2.313 | 9.44 | 4.08x |
| rms_norm for pre mlp (1,1152) | 0.836 | 101.668 | 121.61x |
| mvm for mlp gate + gelu (1,1152)x(1152,6912) | 505.622 | 655.281 | 1.30x |
| mvm for mlp up (1,1152)x(1152,6912) | 504.809 | 641.656 | 1.27x |
| mvm for mlp down (1,6912)x(6912,1152) + ff element-wise mult (1,6912) | 505.148 | 637.111 | 1.26x |
| rms_norm for post mlp (1,1152) | 0.785 | 117.07 | 149.13x |
| output residual add (1,1152) | 2.647 | 9.473 | 3.58x |
| large mvm (1,1152)x(1152,262144) | 19041.4 | 67140.2 | 3.53x |
| Model Speed up | 15.6 tokens/s | 2.9 tokens/s | 5.2x |

Contact: [email protected]