University of Toronto Students Burn MicroGPT Onto FPGA, Hit 53,000 Tokens Per Second Without a GPU

Two University of Toronto undergraduate engineers, Luthira Abeykoon and Krish Chhajer, have published TALOS-V2, an open-source project that implements Andrej Karpathy's MicroGPT transformer entirely in FPGA hardware. Released on May 1, 2026, the design generates more than 50,000 tokens per second on a Terasic DE1-SoC board priced at around 300 US dollars, with no GPU, no PyTorch, and no CPU inference loop.
Key Highlights
- TALOS-V2 burns the full MicroGPT inference path into RTL on a Cyclone V FPGA, including embeddings, self-attention, normalization, the MLP, the language-model head, and token sampling.
- The team measured a sustained throughput of roughly 53,000 tokens per second on character-level name generation, running on a custom 56.25 MHz PLL clock.
- The codebase is released under an open-source license on GitHub, with the stated goal that "accelerator design is easier to learn when the full stack is visible."
Details
MicroGPT is the roughly 200-line educational transformer that Andrej Karpathy released earlier this year, with about 4,192 trainable parameters and a character-level vocabulary, trained on his classic names dataset. TALOS-V2 takes that small but complete architecture and translates each step into explicit fixed-point datapaths written in SystemVerilog.
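To make the character-level setup concrete, here is a minimal Python sketch of the kind of tokenization a names dataset implies: one token per character plus a terminator. The vocabulary and token ids below are illustrative assumptions, not the project's actual mapping.

```python
# Illustrative character-level tokenizer for a names dataset.
# Token id 0 is reserved as an end-of-name marker (an assumption).

names = ["emma", "olivia", "ava"]

# Build a vocabulary of every character seen in the dataset.
chars = sorted(set("".join(names)))
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # 0 reserved for terminator
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    """Map a name to a list of integer token ids, terminated by 0."""
    return [stoi[ch] for ch in name] + [0]

def decode(tokens):
    """Map token ids back to a string, stopping at the terminator."""
    out = []
    for t in tokens:
        if t == 0:
            break
        out.append(itos[t])
    return "".join(out)
```

With so few distinct symbols, the embedding table and language-model head stay tiny, which is what lets the whole network fit in on-chip memory.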
At the core of the design is a 16-lane streamed matrix-vector tile using Q4.12 fixed-point arithmetic. That single tile is time-multiplexed across the Q, K, and V projections, the MLP layers, and the language-model head, which is how the team fits the full network onto a teaching-grade Cyclone V chip. Weights are stored in on-chip ROM rather than fetched from external memory, eliminating the bandwidth bottleneck that usually dominates inference.
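The Q4.12 format described above packs a value into 16 bits: 4 integer bits and 12 fractional bits. The following Python sketch models the arithmetic of such a fixed-point matrix-vector step, with a wide accumulator and a single rescale per output element; the rounding and saturation choices here are assumptions, not the project's actual RTL behavior.

```python
# Python model of Q4.12 fixed-point matrix-vector arithmetic.
# Q4.12: signed 16-bit value, 12 fractional bits, so 1.0 is stored as 4096.

FRAC_BITS = 12
SCALE = 1 << FRAC_BITS

def to_q412(x):
    """Quantize a float to Q4.12, saturating to the signed 16-bit range."""
    q = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, q))

def from_q412(q):
    """Convert a Q4.12 integer back to a float."""
    return q / SCALE

def sat16(q):
    """Saturate to signed 16 bits, as a hardware write-back stage would."""
    return max(-(1 << 15), min((1 << 15) - 1, q))

def matvec_q412(W, x):
    """Matrix-vector product on Q4.12 inputs.
    Each product is Q8.24; the accumulator keeps full precision and is
    shifted back down to Q4.12 only once per output element."""
    out = []
    for row in W:
        acc = 0                        # wide accumulator, never truncated mid-sum
        for w, xi in zip(row, x):
            acc += w * xi              # Q4.12 * Q4.12 -> Q8.24
        out.append(sat16(acc >> FRAC_BITS))  # rescale to Q4.12 and saturate
    return out
```

Time-multiplexing, as the article describes it, then amounts to feeding this one datapath different weight ROMs on successive passes: Q, K, and V projections, the MLP layers, and the language-model head in turn.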
Attention was the hardest part to translate, the authors note. What is a single line in PyTorch becomes an eight-stage hardware pipeline: generate Q, K, and V; scan the dot products; track the running maximum; approximate the exponential; accumulate; divide; mix the values; and project back out.
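The stage sequence above maps onto the standard numerically stable softmax-attention recipe: compute scores while tracking a running maximum, exponentiate against that maximum, normalize, and mix the values. Here is a float-arithmetic Python sketch of that loop structure; it mirrors the stage ordering the authors describe, not their RTL or their exponential approximation.

```python
import math

def streamed_attention(q, keys, values, scale):
    """Attention for one query, organized like a streaming pipeline:
    score + running-max scan, then exp/accumulate, then divide and mix."""
    # Scan the dot products, tracking the running maximum for stability.
    scores = []
    running_max = -math.inf
    for k in keys:
        s = sum(qi * ki for qi, ki in zip(q, k)) * scale
        scores.append(s)
        running_max = max(running_max, s)

    # Exponentiate relative to the maximum and accumulate the denominator.
    exps = [math.exp(s - running_max) for s in scores]
    denom = sum(exps)

    # Divide and mix the value vectors into the output.
    out = [0.0] * len(values[0])
    for w, v in zip(exps, values):
        for i, vi in enumerate(v):
            out[i] += (w / denom) * vi
    return out
```

In hardware, each of these phases becomes its own pipeline stage with its own state registers, and the exponential is typically a lookup table or piecewise approximation rather than a library call, which is why one PyTorch line expands into eight stages.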
Impact
The project is small in absolute terms, but the demonstration matters. It shows that a complete transformer inference loop can run end-to-end as a hardware pipeline, with tokens streaming in and out of the chip with no software in the path. For edge AI, robotics, and any latency-sensitive embedded scenario, that is a meaningful proof point.
The benchmarks have already drawn pushback. Alex Cheema and other developers have shown that an M4 Max MacBook running pure C code on a single performance core hits more than 3.7 million tokens per second on the same model, and an M5 Pro reaches around 6.7 million. On raw throughput per dollar and per watt for this specific tiny workload, modern Apple silicon wins decisively.
The TALOS-V2 authors are not arguing otherwise. Their pitch is pedagogical and architectural rather than benchmark-driven. The point is to make every step of transformer inference visible as memories, counters, state machines, and lookup tables, rather than as opaque CUDA kernels.
Background
FPGA-based AI inference is not new at the data center scale. Microsoft has used Intel FPGAs for Bing inference for years, and AWS, Alibaba Cloud, and others offer FPGA instances for custom accelerators. What is unusual is a fully open-source, end-to-end transformer on a teaching-grade board, accompanied by readable RTL that students can clone and modify.
The release lands at a moment when the industry is openly debating whether the future of inference is more GPUs, custom ASICs like Groq's LPU and Nvidia's recently announced Vera Rubin systems, or reconfigurable fabric. TALOS-V2 is one more data point that the design space is still wide open.
What's Next
The authors have stated they intend to keep the project as a learning artifact rather than chase larger models, which would not fit on a Cyclone V regardless. Realistically, scaling the same approach to billion-parameter models would require either much larger FPGAs with HBM, or a move to custom ASICs. Several developers on X are already experimenting with porting the design to bigger boards and to other small open-source models, and a community formed around the GitHub repository within 48 hours of release.
For developers in MENA hardware programs and embedded AI startups, TALOS-V2 is a rare resource: a complete, readable, end-to-end FPGA transformer that can be studied, simulated, and extended on affordable hardware.
Source: TALOS-V2 official site and GitHub repository