Inspiration
Many tailwinds are pushing inference toward a distributed future. The same scaling laws that govern frontier cloud models also apply to open-weight models. The intelligence available on-premises, whether on your personal computer or an edge node, increasingly covers the tasks an average person needs done. Training data, distillation, quantization, and regularization continue to push capability into smaller parameter counts and memory budgets. Apple Silicon's unified memory and Nvidia's RTX Spark point towards consumer hardware scaling along the same lines.
Simply put, the set of useful tasks that fit on your machine grows every quarter. On-premises inference is orders of magnitude cheaper, private, and uses less power. Additional pressures, such as government regulation of cloud providers, give further reason to reduce dependence on these vendors.
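As a rough illustration of why quantization changes what fits on a machine, the sketch below estimates weight-only memory as parameter count times bits per weight. The function name and the 8-billion-parameter example are hypothetical, and the estimate deliberately ignores the KV cache, activations, and quantization metadata such as scales.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 2**30                      # bytes -> GiB


# Hypothetical 8B-parameter model at three common bit widths.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(8e9, bits):.1f} GiB")
```

Halving the bits per weight halves the footprint, which is why a model that is out of reach at 16-bit can fit comfortably on consumer hardware at 4-bit.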
What it does
Typhoon is a high-performance inference engine, built from scratch, that runs open-weight models on-device. A harness and agent framework wrap the raw engine to provide instant access to tool calls, app integration, and frictionless use. Typhoon comes with streaming, RAG grounding, local speech-to-text, native web search, reasoning and deep thinking, and agent workflows that can execute real multi-step tasks.
How we built it
The engine is written from scratch. We designed a kernel architecture around four things: the hardware, the model architecture, the prompt structure, and the prompt context. This allowed us to outperform llama.cpp, MLX, and vLLM in end-to-end tok/s.
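End-to-end tok/s can be defined in more than one way. Here is a minimal sketch, assuming it means generated tokens divided by total wall-clock time, including prompt processing. The `generate` callable and `fake_generate` are hypothetical stand-ins for illustration, not Typhoon's API.

```python
import time
from typing import Callable, Iterable


def end_to_end_tok_per_s(generate: Callable[[str], Iterable[str]], prompt: str) -> float:
    """Tokens produced per second of wall-clock time, prompt processing included."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate(prompt))  # drain the token stream
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else 0.0


def fake_generate(prompt: str) -> Iterable[str]:
    """Hypothetical engine stand-in: echoes prompt words with a small delay."""
    for word in prompt.split():
        time.sleep(0.001)
        yield word


print(f"{end_to_end_tok_per_s(fake_generate, 'a b c d'):.0f} tok/s")
```

Timing the whole call, rather than only the decode loop, keeps prompt-processing cost in the figure, which matters when comparing engines on long prompts.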
Challenges we ran into
Performance was the hardest part. Beating a mature, fast-moving open-source project like llama.cpp meant engineering the kernels and memory layout to be essentially perfect. We put a lot of thought into the degrees of freedom and into designing a structured architecture before we started writing kernels.
Accomplishments that we're proud of
It works! And we beat the incumbents by a noticeable margin. We're excited that we built a real inference engine from scratch, not a wrapper around llama.cpp. We're also incredibly happy with the final product: the app requires zero technical setup and works straight out of the box, which is a game-changer for the future of self-hosting and on-premises hosting.
What we learned
Investment in architecture and approach beats brute-force auto-research or tuning. We had less compute and manpower than llama.cpp, MLX, and vLLM, so we had to be intentional at every possible level.
What's next for Typhoon
There are a lot of models and a lot of potential tools. The local AI landscape moves extremely quickly, so we are confident that, even week to week, there will be new models, quantizations, distillations, regularizations, and other strategies for us to explore and implement. We are excited to keep building, improving, and optimizing!