Inspiration

As a runner, I've always wanted deeper insights and someone to talk to about my running. This inspired me to create a RAG system that can provide answers to your running questions based on your data.

What it does

The system reads strava_data.csv and treats each row as a separate chunk (one narrative per line). For each activity it generates a short, human-readable narrative and computes an embedding for that narrative. The narrative, embedding, and rich metadata are stored together in a persistent Chroma collection so semantic search can retrieve the most relevant chunks. At query time, the user's prompt is converted into an embedding, the top matches are retrieved from Chroma, and those results (plus any tool outputs) are passed to an LLM to produce the final answer.
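The pipeline above can be sketched in a few functions. This is a minimal outline, not the actual main.py: the CSV column names (date, distance_km, moving_time_min, avg_hr) and the "llama3" generation model are illustrative assumptions; only the nomic-embed-text embedding model comes from the real setup.

```python
def make_narrative(row: dict) -> str:
    """Turn one strava_data.csv row into a short human-readable chunk.
    Column names here are assumptions about the CSV schema."""
    return (
        f"On {row['date']} I ran {row['distance_km']} km "
        f"in {row['moving_time_min']} minutes "
        f"at an average heart rate of {row['avg_hr']} bpm."
    )

def ingest_row(collection, row: dict, row_id: str) -> None:
    """Embed the narrative and store it with its metadata in Chroma."""
    import ollama  # external dependency, needed only at ingestion time
    narrative = make_narrative(row)
    vector = ollama.embeddings(model="nomic-embed-text", prompt=narrative)["embedding"]
    collection.add(ids=[row_id], embeddings=[vector],
                   documents=[narrative], metadatas=[row])

def answer(collection, question: str, tool_outputs=None) -> str:
    """Embed the question, retrieve the top chunks, and ask the LLM."""
    import ollama  # external dependency, needed only at query time
    q_vec = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    hits = collection.query(query_embeddings=[q_vec], n_results=9)
    context = "\n".join(hits["documents"][0])
    tools = "\n".join(tool_outputs or [])
    prompt = (f"Context:\n{context}\nTool results:\n{tools}\n"
              f"Question: {question}\nAnswer using only the context above.")
    return ollama.generate(model="llama3", prompt=prompt)["response"]
```

Keeping make_narrative pure makes the one-row-per-chunk design easy to debug: you can eyeball exactly what text each activity contributes to retrieval.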

How we built it

Language & tooling: Python 3.8+, ollama (for embeddings and LLM generation), chromadb (persistent vector store), and pandas for CSV-based aggregations.

Ingestion (main.py): iterate over the CSV rows → build a readable narrative → call ollama.embeddings(model="nomic-embed-text", prompt=narrative) → store the id, embedding, document (narrative), and metadata (original row plus derived fields) in Chroma.

Query flow: embed the user question → run a semantic search with collection.query(query_embeddings=[q_vector], n_results=9, where=...) → pass the retrieved narratives and tool outputs into ollama.generate with a concise prompt to produce the final answer.

Metadata & filtering: derive month, year, day_of_week, and time_of_day at ingestion, and use a small LLM-based extractor (build_where) to translate natural-language filters into Chroma where clauses.

Tools: implement deterministic functions (e.g., total) that compute aggregates directly from the CSV using pandas, and include those results in the LLM prompt to improve numeric accuracy.

Challenges we ran into

Numeric noise in narratives: many numeric fields (distance, speed, heart rate) appear in the narratives and can confuse the LLM when precise aggregates are requested. We also ran into LLM hallucinations and retrieval that sometimes failed to fetch the relevant context.
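The deterministic tools that counter numeric noise can be plain pandas functions. A minimal sketch of a total-style tool, using an assumed distance_km column name:

```python
import io

import pandas as pd

def total(df: pd.DataFrame, column: str) -> float:
    """Deterministic aggregate: the LLM never does this arithmetic itself.
    Coercing to numeric guards against stray non-numeric cells in the CSV."""
    return float(pd.to_numeric(df[column], errors="coerce").sum())

# Hypothetical usage with an illustrative two-row CSV:
csv = io.StringIO("date,distance_km\n2024-07-01,5.0\n2024-07-02,10.5\n")
df = pd.read_csv(csv)
print(total(df, "distance_km"))  # 15.5
```

The exact number is then pasted into the LLM prompt as a tool result, so the model only has to phrase the answer, not compute it.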

Accomplishments that we're proud of

End-to-end RAG pipeline: from raw CSV to LLM answers with minimal code in main.py.

Simple, reproducible ingestion: one narrative per CSV line makes debugging and retrieval transparent.

Tool-augmented answers: a small plugin-style tools layer lets us compute deterministic aggregates and include them in LLM prompts, making numeric answers reliable.

What we learned

Keep retrieval chunks small and semantically complete (one activity per chunk) to avoid fragmenting context.

Deterministic tools are essential for numeric reliability: use the LLM for reasoning and narrative, and tools for arithmetic and exact counts.

Good prompt engineering for extraction and routing (e.g., build_where, build_tool) simplifies the pipeline and reduces brittle logic in application code.
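An extractor like build_where still needs its output validated before it reaches Chroma. The real build_where prompt isn't shown here; this is a hedged sketch of the post-processing step, using the metadata keys derived at ingestion:

```python
import json

# Metadata keys derived at ingestion time.
ALLOWED_KEYS = {"month", "year", "day_of_week", "time_of_day"}

def parse_where(llm_json: str):
    """Turn the extractor LLM's JSON reply into a safe Chroma where clause.

    Unknown keys are dropped and unparseable replies fall back to no filter,
    so a flaky extraction never crashes the query path.
    """
    try:
        raw = json.loads(llm_json)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(raw, dict):
        return None
    filters = {k: v for k, v in raw.items() if k in ALLOWED_KEYS}
    if not filters:
        return None
    if len(filters) == 1:
        return filters  # single equality filter, e.g. {"month": "July"}
    # Chroma requires multiple equality filters to be joined with $and.
    return {"$and": [{k: v} for k, v in filters.items()]}
```

Failing open (returning None, i.e. no filter) is the safer default here: an over-broad retrieval degrades the answer, while a malformed where clause would break it.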

What's next for Strava Rag

Experiment with different chunking techniques and prompt tuning to get more accurate results. I also want to add many more tool calls so the system can manipulate the data more flexibly (e.g., min, max, and totals within a given date range).
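One of the planned date-range tools might look like this pure-pandas sketch; the column names and signature are assumptions, not existing code:

```python
import io

import pandas as pd

def total_between(df, column, start, end, date_column="date"):
    """Planned tool: exact sum of a column between two dates, inclusive."""
    dates = pd.to_datetime(df[date_column])
    mask = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    return float(df.loc[mask, column].sum())

# Hypothetical usage with an illustrative three-row CSV:
csv = io.StringIO(
    "date,distance_km\n2024-06-30,4.0\n2024-07-01,5.0\n2024-07-02,10.5\n"
)
df = pd.read_csv(csv)
print(total_between(df, "distance_km", "2024-07-01", "2024-07-02"))  # 15.5
```

min and max variants would follow the same shape, swapping .sum() for .min() or .max().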

Built With

Python, ollama, chromadb, pandas