Inspiration

Three weeks ago I needed to analyze customer chats with the gpt-oss-20B model to extract the most popular questions users were asking. Since these were private conversations, I couldn't send the logs to ChatGPT; I had to run everything locally. The challenge? My PC only has 8GB VRAM and 16GB RAM. That's when I decided to build a new library designed to make running LLMs feasible on consumer hardware like mine.

What it does

oLLM is a lightweight Python library for large-context LLM inference, built on top of Huggingface Transformers and PyTorch. It can run models like gpt-oss-20B or Llama-3.1-8B-Instruct with a 100k-token context on a ~$200 consumer GPU with 8GB VRAM. No quantization is used; precision stays at fp16/bf16.

Inference memory usage on an 8GB Nvidia 3060 Ti:

| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM disk (SSD) |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-oss-20B | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3 GB | 15 GB |
| llama3-1B-chat | 2 GB (fp16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB (fp16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB (fp16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |

By "Baseline" we mean typical inference without any offloading.

How we achieved this

  • Loading layer weights from SSD directly to the GPU, one layer at a time
  • Offloading some layer weights to CPU for a speed boost (depending on available RAM)
  • Offloading the KV cache to SSD and loading it back directly to the GPU, with no quantization or PagedAttention
  • A FlashAttention-like implementation: the full attention matrix is never materialized
  • Chunked MLP: the intermediate up-projection output can get large, so we chunk the MLP as well
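The first bullet is the core trick: only one layer's weights need to live in GPU memory at a time. A minimal sketch (the file layout, layer shapes, and use of `torch.save`/`torch.load` here are illustrative stand-ins, not oLLM's actual on-disk format):

```python
import os
import tempfile
import torch

torch.manual_seed(0)
D, N_LAYERS = 8, 4

# Pretend each transformer layer's weights live in their own file on SSD.
ssd_dir = tempfile.mkdtemp()
layers = [torch.randn(D, D) for _ in range(N_LAYERS)]
for i, w in enumerate(layers):
    torch.save(w, os.path.join(ssd_dir, f"layer_{i}.pt"))

def forward_streamed(x, n_layers, device="cpu"):
    # Load one layer's weights at a time, run it, then drop the weights,
    # so only a single layer ever occupies (GPU) memory at once.
    for i in range(n_layers):
        w = torch.load(os.path.join(ssd_dir, f"layer_{i}.pt")).to(device)
        x = torch.relu(x @ w)
        del w  # weights are freed before the next layer is loaded
    return x
```

The trade-off is obvious: VRAM usage drops to roughly one layer's worth, but every forward pass pays SSD read latency per layer, which is why generation speed is bounded by disk bandwidth rather than compute.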

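The chunked-MLP idea can be sketched similarly. This toy version (dimensions, GELU activation, and chunk size are illustrative assumptions, not oLLM internals) processes the sequence in slices so the full up-projection activation is never materialized at once:

```python
import torch

def mlp_chunked(x, w_up, w_down, chunk_size=256):
    # x: (seq_len, d_model); w_up: (d_model, d_ff); w_down: (d_ff, d_model).
    # With 100k tokens, the full (seq_len, d_ff) intermediate would be huge,
    # so we only ever hold a (chunk_size, d_ff) slice of it.
    outs = []
    for chunk in x.split(chunk_size, dim=0):
        h = torch.nn.functional.gelu(chunk @ w_up)  # (chunk, d_ff) slice only
        outs.append(h @ w_down)                     # back to (chunk, d_model)
    return torch.cat(outs, dim=0)
```

Because the MLP acts on each token position independently, chunking along the sequence dimension changes peak memory but not the result.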
Typical use cases include:

  • Analyze contracts, regulations, and compliance reports in one pass
  • Summarize or extract insights from massive patient histories or medical literature
  • Process very large log files or threat reports locally
  • Analyze historical chats to extract the most common issues/questions users have

Challenges we ran into

  • Unpacking the original int8 and bf16 weights was challenging. Luckily, I found a good existing implementation for it.
  • Efficient KV cache offloading: we write the KV cache to SSD only once, for the first token; after that we only read it back from SSD, keeping the rest of it on the GPU.
  • The original Huggingface code uses standard attention with full matrix materialization, so I had to integrate a FlashAttention-like implementation.
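The write-once/read-many KV-cache scheme above can be sketched as follows (the cache path, tensor shapes, and function names are hypothetical, not oLLM's API):

```python
import os
import tempfile
import torch

CACHE_DIR = tempfile.mkdtemp()  # stands in for a directory on the SSD

def offload_kv(layer_idx, k, v):
    # Write one layer's prefill KV cache to SSD once, after the first token.
    torch.save({"k": k.cpu(), "v": v.cpu()},
               os.path.join(CACHE_DIR, f"kv_layer_{layer_idx}.pt"))

def load_kv(layer_idx, device="cpu"):
    # On every subsequent decode step, stream the prefill KV back for this
    # layer; KV entries for newly generated tokens stay resident on the GPU.
    blob = torch.load(os.path.join(CACHE_DIR, f"kv_layer_{layer_idx}.pt"))
    return blob["k"].to(device), blob["v"].to(device)
```

Since the prefill KV for a long prompt dominates cache size but never changes during decoding, paying the SSD write cost once and re-reading it per step avoids both re-computation and permanent VRAM residency.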

Accomplishments that we're proud of

  • Over 🏅 650 downloads and ✨ 81 stars on GitHub in 2 weeks (~500 from pip: https://pypistats.org/packages/ollm)
  • It takes less than an hour to generate 500 tokens (with a large input), which makes oLLM useful for offline analytical tasks where the GPU is the constraint

What's next for oLLM - LLM inference for large-context offline tasks

  • Running gpt-oss-20b with 100k context on a GPU with 12GB VRAM (8GB only got me to 10k, which is not so bad)
  • Improve MoE KV Cache SSD offloading algorithm
  • Add support for more LLMs

Built With

  • gpt-oss-20b
  • huggingface
  • python
  • torch
  • transformers