Inspiration

I’ve always felt like most AI assistants are impressive but kind of fake. They don’t actually know you. They don’t know how you talk, what you care about, or how you respond to things. They just autocomplete in a vacuum.

I wanted something that actually understands me — but without shipping my private conversations to the cloud.

So Jarvis started as a simple question: what if I could build a local, privacy-first AI assistant that actually learns from my real conversations and retrieves relevant context before generating anything?

Also honestly I just wanted to try building something hard. I didn’t fully know how to do it. That was part of the point.

What it does

Jarvis is a local AI assistant that:

Analyzes past conversations

Retrieves relevant context using hybrid search

Reranks results with a cross-encoder

Generates responses that match tone and intent

Runs locally instead of defaulting to cloud APIs

Instead of a bare "LLM, generate a reply" call, the flow is:

retrieve → rerank → inject context → generate

It tries to feel less generic and more aligned with how I actually communicate.
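The retrieve → rerank → inject → generate flow can be sketched as a thin pipeline. Every stage below is a toy stand-in (the real system uses BM25 + embeddings, a cross-encoder, and a local LLM), so treat the function bodies as illustrative, not Jarvis's actual code:

```python
# Minimal sketch of the retrieve → rerank → inject → generate pipeline.
# All stages are toy stand-ins for the real components.

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Cheap first pass: keep messages sharing words with the query."""
    q = set(query.lower().split())
    scored = [(len(q & set(c.lower().split())), c) for c in corpus]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for s, c in scored[:k] if s > 0]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Stand-in for the cross-encoder: crude length-similarity ordering."""
    return sorted(candidates, key=lambda c: abs(len(c) - len(query)))

def inject(query: str, context: list[str]) -> str:
    """Build the prompt the local model would actually see."""
    ctx = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{ctx}\n\nUser: {query}\nAssistant:"

def generate(prompt: str) -> str:
    """Stand-in for local LLM generation."""
    return f"(local model reply conditioned on {prompt.count('- ')} context items)"

corpus = [
    "can you review my PR tomorrow",
    "loved that song you sent",
    "meeting moved to 3pm",
]
query = "when is the meeting"
prompt = inject(query, rerank(query, retrieve(query, corpus)))
print(generate(prompt))
```

The point of the shape, not the stubs: context selection happens before generation, so the model never answers in a vacuum.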

How we built it

Jarvis is built as a structured pipeline.

Messages are segmented into structured units

Each message is labeled (question, commitment, reaction, etc.)

Embeddings are generated locally

Hybrid retrieval runs (BM25 + vector similarity)

A cross-encoder reranks the candidates

The best context is injected into a local model for generation
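The segmentation-and-labeling step (step 2 above) produces structured units tagged with intent. A keyword heuristic like the one below is only a sketch of that output shape; the label set (question, commitment, reaction) comes from the text, but the rules and the `MessageUnit` name are illustrative, not the real classifier:

```python
# Toy intent labeler for segmented message units. The real pipeline's
# labeling logic is not shown here; this illustrates the structured output.
from dataclasses import dataclass

@dataclass
class MessageUnit:
    text: str
    label: str

QUESTION_STARTERS = {"who", "what", "when", "where", "why", "how", "can", "do"}

def label_message(text: str) -> MessageUnit:
    t = text.strip().lower()
    if t.endswith("?") or (t and t.split()[0] in QUESTION_STARTERS):
        label = "question"
    elif any(p in t for p in ("i'll", "i will", "count me in")):
        label = "commitment"
    elif any(p in t for p in ("lol", "haha", "nice", "wow", "love")):
        label = "reaction"
    else:
        label = "statement"
    return MessageUnit(text=text, label=label)

for msg in ["when does it start?", "i'll handle the deploy", "haha nice"]:
    print(label_message(msg))
```

Once every unit carries a label, retrieval can filter by intent (e.g. surface past commitments when drafting a reply to a request) instead of matching raw text alone.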

Retrieval scoring is roughly:

Score(c) = α · BM25(c) + β · EmbeddingSim(c)

Then reranked for semantic precision.
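In code, the weighted score above looks roughly like this. The BM25 term here is a simplified tf-saturation stand-in (no corpus-level IDF), the "embeddings" are toy bag-of-words vectors rather than bge-small outputs, and α/β are illustrative weights, not tuned values:

```python
# Sketch of Score(c) = α·BM25(c) + β·EmbeddingSim(c) for hybrid retrieval.
import math
from collections import Counter

ALPHA, BETA = 0.5, 0.5  # illustrative weights, not tuned values

def bm25_score(query: str, doc: str, k1: float = 1.5, b: float = 0.75,
               avgdl: float = 10.0) -> float:
    """BM25-shaped term-frequency saturation, minus the corpus IDF term."""
    tf = Counter(doc.lower().split())
    dl = sum(tf.values())
    score = 0.0
    for term in set(query.lower().split()):
        f = tf[term]
        score += (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
    return score

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the real system uses bge-small."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query: str, candidate: str) -> float:
    return ALPHA * bm25_score(query, candidate) + BETA * cosine(embed(query), embed(candidate))
```

Candidates are ranked by `hybrid_score`, and only the top few go on to the (much more expensive) cross-encoder pass.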

It’s less about fancy prompts and more about designing the system correctly.

Everything was built under hardware constraints (8GB RAM), so model size, routing, and latency actually mattered.

Challenges we ran into

A lot.

Tried models too large for my machine

Built retrieval pipelines that didn’t actually improve metrics

Fought latency constantly

Realized reranking isn’t magic

Had to rethink architecture multiple times

Working with limited RAM forced discipline. Every model choice mattered. Every token mattered. There’s no hiding behind a giant 70B model.

Also, evaluation is harder than generation. Measuring whether you're actually improving something is way more annoying than just getting output.

Accomplishments that we're proud of

Fully local-first architecture

Hybrid retrieval working reliably

Cross-encoder reranking integrated and measurable

Structured labeling pipeline for message intent

Real evaluation metrics instead of vibes

Jarvis isn’t just a chatbot wrapper. It’s a system with measurable retrieval and generation improvements.

And I built it from scratch while figuring things out as I went.

What we learned

Retrieval quality > prompt cleverness

Smaller models + better routing can outperform brute-force scaling

Evaluation frameworks are non-optional

Hardware constraints make you a better system designer

Also building something when you don’t fully know what you’re doing is frustrating but extremely educational.

What's next for Jarvis

Stronger long-term memory abstraction

Cleaner model routing

Lower latency generation

More formal benchmarking

Expanding beyond messaging use cases

Jarvis is still evolving. The goal isn’t just a chatbot — it’s building deeply personal, privacy-aligned AI systems that actually understand context instead of pretending to.

Built With

  • bge-small
  • bm25
  • cross-encoder
  • embeddings
  • faiss
  • fastapi
  • json
  • llms
  • mlx
  • python
  • reranking
  • sqlite