Inspiration
3 weeks ago I needed to analyze customer chats with the gpt-oss-20B model to extract the most popular questions users were asking. Since these were private conversations, I couldn't send the logs to ChatGPT; I had to run everything locally. The challenge? My PC has only 8GB of VRAM and 16GB of RAM. That's when I decided to build a new library designed to make running LLMs feasible on consumer hardware like mine.
What it does
oLLM is a lightweight Python library for large-context LLM inference, built on top of Huggingface Transformers and PyTorch. It enables running models like gpt-oss-20B or Llama-3.1-8B-Instruct with 100k context on a ~$200 consumer GPU with 8GB VRAM. No quantization is used, only fp16/bf16 precision.
Inference memory usage on an 8GB Nvidia 3060 Ti:
| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| gpt-oss-20B | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3GB | 15 GB |
| llama3-1B-chat | 2 GB (fp16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB (fp16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB (fp16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |
By "Baseline" we mean typical inference without any offloading.
How did we achieve this
- Loading layer weights from SSD directly to the GPU, one layer at a time
- Offloading some layer weights to CPU RAM for a speed boost (depending on available RAM)
- Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
- A FlashAttention-like implementation: the full attention matrix is never materialized
- Chunked MLP: the intermediate up-projection output can get large, so we chunk the MLP as well
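The first point above can be illustrated with a minimal sketch. This is not oLLM's actual code; it assumes a toy model whose per-layer weights are saved to separate files (as if sharded on SSD), then streamed in one layer at a time during the forward pass so that only a single layer's weights occupy GPU memory at any moment:

```python
# Hypothetical sketch of layer-by-layer weight streaming (not oLLM's real API).
import os
import tempfile

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"


class TinyLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.fc = nn.Linear(d, d)

    def forward(self, x):
        return torch.relu(self.fc(x))


d, n_layers = 16, 4
shard_dir = tempfile.mkdtemp()

# "Export": save each layer's state dict to its own file, as if sharded on SSD.
for i in range(n_layers):
    torch.save(TinyLayer(d).state_dict(), os.path.join(shard_dir, f"layer_{i}.pt"))


def forward_streamed(x):
    # One reusable layer object: only one layer's weights live on the device.
    layer = TinyLayer(d).to(device)
    x = x.to(device)
    for i in range(n_layers):
        state = torch.load(os.path.join(shard_dir, f"layer_{i}.pt"),
                           map_location=device)
        layer.load_state_dict(state)  # overwrite weights in place
        with torch.no_grad():
            x = layer(x)
    return x


out = forward_streamed(torch.randn(2, d))
print(out.shape)  # torch.Size([2, 16])
```

A real implementation would keep the model's own layer modules and swap only their parameter tensors, and would overlap SSD reads with compute, but the peak-memory idea is the same.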
Typical use cases include:
- Analyze contracts, regulations, and compliance reports in one pass
- Summarize or extract insights from massive patient histories or medical literature
- Process very large log files or threat reports locally
- Analyze historical chats to extract the most common issues/questions users have
Challenges we ran into
- Unpacking the original int8/bf16 weights was challenging. Luckily, I found a good implementation for it.
- Efficient KV cache offloading. We write the KV cache to SSD only once, for the first token; after that we only read it from SSD and keep the rest of it on the GPU.
- The original Huggingface code uses standard attention with full matrix materialization, so I had to integrate a FlashAttention-like implementation.
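The FlashAttention-like idea can be sketched as chunked attention with an online softmax: keys and values are processed in blocks, running maxima and denominators are rescaled as each block arrives, and the full (L x L) score matrix is never materialized. This is a simplified, hypothetical version (single head, no mask, no kernel fusion), not oLLM's actual implementation:

```python
# Minimal online-softmax chunked attention sketch (FlashAttention-style).
import torch


def chunked_attention(q, k, v, block=64):
    L, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)                # running unnormalized output
    m = torch.full((L, 1), float("-inf"))    # running row-wise max of scores
    s = torch.zeros(L, 1)                    # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale          # (L, block): only a small tile
        m_new = torch.maximum(m, scores.max(dim=1, keepdim=True).values)
        correction = torch.exp(m - m_new)    # rescale previous accumulators
        p = torch.exp(scores - m_new)
        s = s * correction + p.sum(dim=1, keepdim=True)
        out = out * correction + p @ vb
        m = m_new
    return out / s


torch.manual_seed(0)
q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
print(torch.allclose(chunked_attention(q, k, v), ref, atol=1e-4))  # True
```

Because each tile is discarded after its contribution is folded into the running accumulators, peak memory scales with the block size rather than with the full context length.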
Accomplishments that we're proud of
- Over 🏅650 downloads and ✨81 stars on GitHub in 2 weeks (~500 from pip, https://pypistats.org/packages/ollm)
- It takes less than an hour to generate 500 tokens (with a large input), which makes oLLM useful for offline analytical tasks where GPU memory is the constraint
What's next for oLLM - LLM inference for large-context offline tasks
- Running gpt-oss-20b on 100k context using a GPU with 12GB VRAM (8GB only got me to 10k, which is not so bad)
- Improving the MoE KV cache SSD-offloading algorithm
- Adding support for more LLMs
Built With
- gpt-oss-20b
- huggingface
- python
- torch
- transformers
