Inspiration

Three weeks ago I needed to analyze customer chats with the gpt-oss-20B model to extract the most popular questions users were asking. Since these were private conversations, I couldn't send the logs to ChatGPT; I had to run everything locally. The challenge? My PC only has 8GB VRAM and 16GB RAM. That's when I decided to build a new library designed to make running LLMs feasible on consumer hardware like mine.

What it does

oLLM is a lightweight Python library for large-context LLM inference, built on top of Huggingface Transformers and PyTorch. It can run models like gpt-oss-20B or Llama-3.1-8B-Instruct with a 100k-token context on a ~$200 consumer GPU with 8GB VRAM. No quantization is used; precision stays at fp16/bf16.

Inference memory usage on an 8GB Nvidia 3060 Ti:

| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM disk (SSD) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss-20B | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3 GB | 15 GB |
| llama3-1B-chat | 2 GB (fp16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB (fp16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB (fp16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |

By "Baseline" we mean typical inference without any offloading.

How we achieved this

  • Loading layer weights from SSD directly to the GPU, one layer at a time
  • Offloading some layer weights to CPU for a speed boost (depending on available RAM)
  • Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
  • A FlashAttention-like implementation: the full attention matrix is never materialized
  • Chunked MLP: the intermediate up-projection output can get large, so we chunk the MLP as well
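The first bullet is the core trick: only one layer's weights need to live in GPU memory at a time. A minimal sketch (the file layout, layer shapes, and use of `torch.save`/`torch.load` here are illustrative stand-ins, not oLLM's actual on-disk format):

```python
import os
import tempfile
import torch

torch.manual_seed(0)
D, N_LAYERS = 8, 4

# Pretend each transformer layer's weights live in their own file on SSD.
ssd_dir = tempfile.mkdtemp()
layers = [torch.randn(D, D) for _ in range(N_LAYERS)]
for i, w in enumerate(layers):
    torch.save(w, os.path.join(ssd_dir, f"layer_{i}.pt"))

def forward_streamed(x, n_layers, device="cpu"):
    # Load one layer's weights at a time, run it, then drop the weights,
    # so only a single layer ever occupies (GPU) memory at once.
    for i in range(n_layers):
        w = torch.load(os.path.join(ssd_dir, f"layer_{i}.pt")).to(device)
        x = torch.relu(x @ w)
        del w  # weights are freed before the next layer is loaded
    return x
```

The trade-off is obvious: VRAM usage drops to roughly one layer's worth, but every forward pass pays SSD read latency per layer, which is why generation speed is bounded by disk bandwidth rather than compute.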

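The chunked-MLP idea can be sketched similarly. This toy version (dimensions, GELU activation, and chunk size are illustrative assumptions, not oLLM internals) processes the sequence in slices so the full up-projection activation is never materialized at once:

```python
import torch

def mlp_chunked(x, w_up, w_down, chunk_size=256):
    # x: (seq_len, d_model); w_up: (d_model, d_ff); w_down: (d_ff, d_model).
    # With 100k tokens, the full (seq_len, d_ff) intermediate would be huge,
    # so we only ever hold a (chunk_size, d_ff) slice of it.
    outs = []
    for chunk in x.split(chunk_size, dim=0):
        h = torch.nn.functional.gelu(chunk @ w_up)  # (chunk, d_ff) slice only
        outs.append(h @ w_down)                     # back to (chunk, d_model)
    return torch.cat(outs, dim=0)
```

Because the MLP acts on each token position independently, chunking along the sequence dimension changes peak memory but not the result.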
Typical use cases include:

  • Analyze contracts, regulations, and compliance reports in one pass
  • Summarize or extract insights from massive patient histories or medical literature
  • Process very large log files or threat reports locally
  • Analyze historical chats to extract the most common issues/questions users have

Challenges we ran into

  • Unpacking the original int8 and bf16 weights was challenging. Luckily, I found a good existing implementation for it.
  • Efficient KV cache offloading: we write the KV cache to SSD only once, for the first token; after that we only read it back from SSD, keeping the rest of it on the GPU.
  • The original Huggingface code uses standard attention with full matrix materialization, so I had to integrate a FlashAttention-like implementation.
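The write-once/read-many KV-cache scheme above can be sketched as follows (the cache path, tensor shapes, and function names are hypothetical, not oLLM's API):

```python
import os
import tempfile
import torch

CACHE_DIR = tempfile.mkdtemp()  # stands in for a directory on the SSD

def offload_kv(layer_idx, k, v):
    # Write one layer's prefill KV cache to SSD once, after the first token.
    torch.save({"k": k.cpu(), "v": v.cpu()},
               os.path.join(CACHE_DIR, f"kv_layer_{layer_idx}.pt"))

def load_kv(layer_idx, device="cpu"):
    # On every subsequent decode step, stream the prefill KV back for this
    # layer; KV entries for newly generated tokens stay resident on the GPU.
    blob = torch.load(os.path.join(CACHE_DIR, f"kv_layer_{layer_idx}.pt"))
    return blob["k"].to(device), blob["v"].to(device)
```

Since the prefill KV for a long prompt dominates cache size but never changes during decoding, paying the SSD write cost once and re-reading it per step avoids both re-computation and permanent VRAM residency.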

Accomplishments that we're proud of

  • Over 🏅 650 downloads and ✨ 81 stars on GitHub in 2 weeks (~500 from pip: https://pypistats.org/packages/ollm)
  • It takes less than an hour to generate 500 tokens (with a large input), which makes oLLM useful for offline analytical tasks where the GPU is the constraint

What's next for oLLM - LLM inference for large-context offline tasks

  • Running gpt-oss-20b with 100k context on a GPU with 12GB VRAM (8GB only got me to 10k, which is not so bad)
  • Improve MoE KV Cache SSD offloading algorithm
  • Add support for more LLMs

Built With

  • gpt-oss-20b
  • huggingface
  • python
  • torch
  • transformers