Inspiration

As a software performance test engineer who recently moved into machine learning, I found this challenge intriguing. I am used to identifying CPU- and memory-related issues in applications during load tests, and the PLiOPS challenge inspired me to research and build up knowledge about LLMs, GPUs, and the KV cache.

What it does

The 'GPU Memory Enhancer' solution proposes quantizing the KV cache to reduce GPU memory usage when serving LLMs with large context windows.
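To see why this matters, here is a back-of-the-envelope sizing of the KV cache at FP16 versus 2-bit (a minimal sketch; the Llama-2-7B-like configuration and the sequence/batch sizes are illustrative assumptions, not measured figures):

```python
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim
#                  x seq_len x batch x bits / 8
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bits):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bits / 8

# Assumed Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128
fp16 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bits=16)
int2 = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8, bits=2)
print(f"FP16 KV cache:  {fp16 / 2**30:.1f} GiB")  # 16.0 GiB
print(f"2-bit KV cache: {int2 / 2**30:.1f} GiB")  # 2.0 GiB (ignoring scale/zero-point overhead)
```

An 8x reduction on the cache alone, which is often what limits batch size and context length once the model weights are loaded.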

How we built it

The solution is based on the KIVI algorithm, a tuning-free 2-bit KV cache quantization scheme that quantizes the key cache per channel and the value cache per token.
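To illustrate the per-channel versus per-token distinction, here is a minimal asymmetric 2-bit quantization sketch (my own simplification for intuition, not the KIVI reference implementation; the tensor shapes and grouping granularity are assumptions):

```python
import torch

def quantize_2bit(x, dim):
    """Asymmetric 2-bit quantization: map values along `dim` onto 4 levels (0..3)."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / 3  # 2 bits -> 3 quantization steps
    q = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
    return q, scale, xmin

def dequantize(q, scale, xmin):
    return q.float() * scale + xmin

# Toy KV tensors shaped (num_tokens, num_channels)
keys, values = torch.randn(1024, 128), torch.randn(1024, 128)

# KIVI observes that key-cache outliers cluster in a few channels, while the
# value cache shows no such pattern, so keys are grouped per channel
# (statistics computed across tokens) and values per token (across channels).
k_q, k_scale, k_zero = quantize_2bit(keys, dim=0)    # per-channel keys
v_q, v_scale, v_zero = quantize_2bit(values, dim=1)  # per-token values

err = (dequantize(k_q, k_scale, k_zero) - keys).abs().mean()
print(f"mean key reconstruction error: {err:.4f}")
```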

Challenges we ran into

  • Building up knowledge about LLMs, token generation, vLLM, GPUs, and the KV cache
  • Gathering relevant information and articles/research papers

Accomplishments that we're proud of

  • I am extremely proud that I could learn about new ideas and topics as part of this challenge.
  • I am confident that this learning will help in further research and work.

What we learned

  1. How Large Language Models generate tokens
  2. How GPUs accelerate LLM inference, and the memory bottlenecks involved
  3. How critical the KV cache is in LLM inference (see the sketch after this list)
  4. How vLLM works
  5. How KV cache quantization can reduce memory pressure on GPUs
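To make the role of the KV cache concrete, here is a toy single-head decode loop (a minimal sketch; the dimensions and random weights are illustrative, and this is not how vLLM implements its paged attention):

```python
import torch

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x_t):
    """One autoregressive step; x_t is the new token's hidden state, shape (1, d)."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)  # project only the newest token; older K/V are reused
    v_cache.append(x_t @ W_v)
    K = torch.cat(k_cache)     # (t, d): the tensor that grows with context length
    V = torch.cat(v_cache)
    scores = torch.softmax(q @ K.T / d**0.5, dim=-1)
    return scores @ V          # attention output for the new token

for _ in range(16):
    decode_step(torch.randn(1, d))
print("cached K entries:", len(k_cache))  # one per generated token
```

Without the cache, every step would re-project keys and values for the entire prefix; with it, each step does only the new token's projections, but the cache grows linearly with context, and that is exactly the memory quantization shrinks.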

What's next for GPU Memory Enhancer

  • Explore more about Large Language Models, GPUs, vLLM, and KV Cache

Built With

  • gpu
  • kvcache
  • llm
  • memory
  • vllm