Inspiration

As LLMs become widely used for tasks ranging from writing papers to solving complex math problems, it becomes increasingly important that users get accurate responses as quickly as possible and that context is carried effectively from the beginning to the end of a conversation. In this write-up, we propose strategies both for inference-time speedups, such as combining pruning + SVD with quantization, and for Chain-of-Thought memory improvements.

What it does

InferX accelerates large language model inference through a hybrid optimization strategy that integrates pruning, low-rank factorization (SVD), and quantization while maintaining model reasoning quality via a recursive Chain-of-Thought (CoT) memory system. It enables:

  • Dynamic pruning with consensus-based mask selection
  • Low-rank SVD compression for pruned layers
  • Hardware-aware quantization with mixed precision
  • Recursive CoT refinement using a smaller auxiliary model

Together, these components allow InferX to deliver significantly faster inference without degrading the contextual accuracy of generated outputs.
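The pruning and SVD steps can be pictured with a short PyTorch sketch. This is a minimal illustration under assumptions, not the InferX implementation: the function names, the majority-vote rule, and the stand-in importance scores are all invented for the example.

```python
import torch

def consensus_prune_mask(weight, importance_scores, keep_ratio=0.5):
    """Toy consensus mask: a weight survives only if a strict majority of
    importance criteria independently rank it in the top keep_ratio."""
    k = int(weight.numel() * keep_ratio)
    votes = torch.zeros_like(weight)
    for scores in importance_scores:              # one score tensor per criterion
        threshold = scores.flatten().kthvalue(scores.numel() - k).values
        votes += (scores > threshold).float()
    return votes > len(importance_scores) / 2     # strict majority vote

def svd_compress(weight, rank):
    """Re-densify a sparse, pruned matrix as a low-rank product W ~ A @ B."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]                    # fold singular values into A
    B = Vh[:rank, :]
    return A, B

W = torch.randn(1024, 1024)
criteria = [W.abs(), torch.rand_like(W), torch.rand_like(W)]  # stand-in scores
mask = consensus_prune_mask(W, criteria)
A, B = svd_compress(W * mask, rank=128)
print((W * mask - A @ B).norm() / (W * mask).norm())          # relative error
```

In a full system, the low-rank factors would replace the original linear layer as two smaller matrix multiplications, which is where the inference speedup comes from.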

How we built it

We implemented two interconnected modules:

  • HardwareOptimizer — a PyTorch-based optimization engine that performs mixed-precision quantization and structured pruning using custom calibration routines (a minimal sketch follows this list).
  • Recursive CoT Controller — a hybrid inference loop linking a large model (Qwen3-8B) with a smaller attention model (70M parameters) that refines prompts mid-generation, compressing and reorganizing the reasoning context (see the loop sketch after this list). The smaller model was pre-trained on the TinyStories and SlimPajama datasets to ensure coherence despite its compact size, then fine-tuned on the OpenThoughts dataset with REINFORCE to teach effective CoT compression. All fine-tuning and evaluation were run on a MacBook Air M2 GPU backend.
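To make the HardwareOptimizer side concrete, here is one way a mixed-precision calibration pass can look: quantize each weight matrix at a low bit width, measure the calibration loss, and promote the most sensitive layers to more bits. The 4/8-bit split, `fake_quantize`, and `calib_loss_fn` are illustrative assumptions, not our actual calibration routines.

```python
import torch

def fake_quantize(w, n_bits):
    """Symmetric uniform quantization, returned de-quantized so it can be
    swapped into the original fp32 module during calibration."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

@torch.no_grad()
def assign_bit_widths(model, calib_loss_fn, avg_bits=6.0):
    """Rank weight matrices by calibration loss when each is quantized to
    4 bits in turn (the baseline is constant, so this ranks sensitivity),
    then keep the most sensitive at 8 bits to hit the average budget."""
    sensitivity = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                           # skip biases / norm params
            continue
        original = p.clone()
        p.copy_(fake_quantize(p, 4))
        sensitivity[name] = calib_loss_fn(model)  # loss with this layer at 4 bits
        p.copy_(original)
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_high = int(len(ranked) * (avg_bits - 4) / 4)  # share promoted to 8 bits
    return {n: (8 if n in set(ranked[:n_high]) else 4) for n in ranked}
```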
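The recursive CoT loop itself alternates between the two models: the large model reasons in chunks, and between chunks the small model rewrites the accumulated chain of thought into a shorter context. The sketch below captures that shape; the `LM` interface, the compression prompt, and the stop condition are assumptions made for illustration.

```python
from typing import Protocol

class LM(Protocol):
    """Assumed minimal interface over both the large and small models."""
    def generate(self, prompt: str, max_new_tokens: int = 256) -> str: ...

def recursive_cot(large: LM, small: LM, prompt: str,
                  max_rounds: int = 8, chunk_tokens: int = 128) -> str:
    """Alternate large-model reasoning with small-model context compression,
    bounding prompt growth while keeping the reasoning aligned."""
    context = prompt
    for _ in range(max_rounds):
        chunk = large.generate(context, max_new_tokens=chunk_tokens)
        if "FINAL ANSWER" in chunk:               # illustrative stop condition
            return context + chunk
        context = small.generate(
            "Rewrite this reasoning more compactly, keeping every fact "
            "needed to finish the problem:\n" + context + chunk
        )
    return context
```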

Challenges we ran into

  • Precision loss: Initial quantization caused noticeable drops in reasoning accuracy.
  • Model stability: Aggressive pruning destabilized layer outputs, requiring careful consensus mask selection.
  • Limited compute: Training and testing on limited hardware demanded efficient batching and scaling.
  • Prompt drift: Maintaining consistent chain-of-thought alignment between the small and large models was nontrivial.

Accomplishments that we're proud of

  • Achieved up to 1.74× inference speedup with minimal loss in reasoning quality.
  • Successfully implemented a hybrid reasoning loop between two differently scaled models.
  • Developed a novel consensus-based pruning algorithm with SVD re-densification.
  • Trained a 70M-parameter model from scratch that effectively refines chain-of-thought reasoning.

What we learned

We learned that efficient LLM deployment requires more than just compression — it needs system-level balance between compute, memory, and reasoning fidelity. Recursive reasoning and context management can offset quality loss from quantization and pruning. Even small auxiliary models can deliver large performance and interpretability gains when orchestrated intelligently.

What's next for InferX

  • Extend hybrid reasoning to multi-turn dialogue systems.
  • Implement adaptive quantization with dynamic bit-width adjustment.
  • Add hardware introspection for real-time device-specific optimization.
  • Train on more diverse datasets to enhance reasoning robustness across languages and modalities.
