🌾 Krishak.ai: The Voice of the Farm
"What if a farmer could just speak their question, in their own language, and get expert advice back, in their own voice?"
That one question drove this entire project.
💡 The Inspiration
India has over 100 million farming households. The knowledge they need (when to sow, whether to sell now or wait, how to fix their soil, what the rain is going to do this week) exists. It lives in ICAR research papers, in government databases, in meteorological forecasts. But almost none of it reaches the farmer who needs it, in the moment they need it, in the language they speak.
A wheat farmer in Punjab speaks Punjabi. A mustard grower in Rajasthan speaks Hindi. A paddy cultivator in Tamil Nadu speaks Tamil. Existing agri-advisory tools expect them all to navigate English-language dashboards. That's a broken assumption, and it locks out the very people who need the information most.
Here's the shift Krishak.ai makes: every ICAR research paper, every government advisory, every market report that was previously accessible only to someone who reads English is now accessible to anyone who can speak. A farmer who has never read a research paper in their life can now ask a question in Punjabi and get the answer that was sitting inside that paper all along. The language barrier between scientific knowledge and the people it was meant to serve is gone.
Krishak.ai (krishak is Sanskrit for "cultivator") is our attempt to fix that. A farmer speaks a question into their phone. The system listens, thinks, and speaks a real answer back in their language, backed by actual science and live data.
🏗️ How We Built It
🥉 Bronze: Raw Data Ingestion
Four data sources landing in Delta tables:
- ICAR Research PDFs: ingested as binary files using Auto Loader. Every new PDF dropped into a volume gets picked up automatically as a stream.
- Market Prices: mandi price data for 5 major crops across 5 cities over 30 days. In production, this would come from Agmarknet.
- Weather Forecasts: 7-day forecasts tied to farmer pincodes, covering rainfall, temperature, humidity, and conditions.
- Farmer Profiles: structured records including soil type, pH, NPK values, and preferred language.
🥈 Silver: Cleaning, Enrichment & Embeddings
PDF text extraction uses PyMuPDF as a PySpark UDF, converting raw binary content into clean text. Text is capped at 50,000 characters per document.
Then comes the vector knowledge base. Each ICAR document is chunked into 500-character segments with 50-character overlap:
$$\text{chunk}_i = \text{text}[i \cdot (500 - 50) \; : \; i \cdot (500 - 50) + 500]$$
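A minimal sketch of this chunking scheme in plain Python (the project applies it at scale inside Spark, but the slicing logic is the same):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    stride = size - overlap  # 450: each chunk starts 450 characters after the previous one
    return [text[start:start + size]
            for start in range(0, max(len(text) - overlap, 1), stride)]
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which keeps retrieval from missing answers that happen to sit on a boundary.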
Each chunk is embedded using paraphrase-multilingual-mpnet-base-v2, a model that handles all 9 of our target Indian languages in a shared vector space. This is what makes the language barrier disappear: a question asked in Punjabi retrieves relevant text from an English ICAR PDF without any translation step in the retrieval itself. The question and the document meet in the same mathematical space regardless of what language either is written in.
Embeddings are indexed in FAISS using cosine similarity:
$$\text{similarity}(q, d) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}||\vec{d}|}$$
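The similarity computation can be sketched with NumPy; in FAISS the usual equivalent is an inner-product index over L2-normalized embeddings, which yields the same ranking:

```python
import numpy as np

def cosine_similarity(q: np.ndarray, d: np.ndarray) -> float:
    # dot product divided by the product of the vector norms
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
```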
🥇 Gold: Aggregations & Soil Health
- `market_price_trends`: 7-day averages, min/max, profitability label
- `farmer_soil_health`: pH-based soil classification:
$$\text{Soil Status} = \begin{cases} \text{Good} & 6.0 \leq \text{pH} \leq 7.5 \\ \text{Acceptable} & 5.5 \leq \text{pH} \leq 8.0 \\ \text{Needs Amendment} & \text{otherwise} \end{cases}$$
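The cases are checked top-down, so the tighter "Good" band wins before the wider "Acceptable" band. A sketch (not the project's exact Spark expression):

```python
def soil_status(ph: float) -> str:
    """Classify soil health from pH, checking the narrower band first."""
    if 6.0 <= ph <= 7.5:
        return "Good"
    if 5.5 <= ph <= 8.0:
        return "Acceptable"
    return "Needs Amendment"
```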
🤖 The Agent
Given a farmer's query, the agent pulls three context streams:
- ICAR knowledge via FAISS semantic search (top-3 relevant chunks)
- 7-day weather for their pincode
- Market price trend for their primary crop
All packed into a structured prompt sent to Meta Llama 4 Maverick, instructed to be concise, cite sources, and end with a clear recommendation.
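The prompt assembly might look like the following sketch; the field names and wording are illustrative, not the project's actual template:

```python
def build_prompt(query: str, chunks: list[str], weather: str, market: str) -> str:
    """Pack the three context streams plus the farmer's question into one prompt."""
    context = "\n".join(f"- {c}" for c in chunks)  # top-3 ICAR chunks from FAISS
    return (
        "You are an agricultural advisor for Indian farmers.\n"
        f"ICAR knowledge:\n{context}\n"
        f"7-day weather: {weather}\n"
        f"Market trend: {market}\n"
        f"Question: {query}\n"
        "Answer in 3-5 sentences, cite sources, end with a clear recommendation."
    )
```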
🛡️ Safety & RAG Metrics
We don't just generate an answer; we validate it. After every response, the same LLM runs a second pass as a safety checker, evaluating whether the advice is sound for an Indian farmer and returning a confidence score (high / medium / low) plus a warning if anything looks off. This is shown to the user as a visible badge on every response.
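The re-evaluation pass can be sketched like this; `llm` stands in for the actual Llama 4 Maverick call, and the JSON response shape is an assumption:

```python
import json

def safety_check(llm, question: str, answer: str) -> dict:
    """Second LLM pass: grade generated advice before showing it to the farmer."""
    verdict = llm(
        "Evaluate whether this advice is sound for an Indian farmer. "
        'Reply as JSON: {"confidence": "high|medium|low", "warning": "..."}\n'
        f"Question: {question}\nAdvice: {answer}"
    )
    result = json.loads(verdict)
    if result.get("confidence") not in {"high", "medium", "low"}:
        result["confidence"] = "low"  # fail closed if the checker misbehaves
    return result
```

Failing closed matters here: if the checker returns something unparseable or out of range, the badge drops to "low" rather than silently passing the advice through.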
We also expose RAG retrieval metrics in the debug panel: which ICAR chunks were retrieved, their cosine similarity scores, and a preview of the content. This lets us verify that the right parts of the knowledge base are being surfaced: not just that an answer was generated, but that it came from the right source.
🎙️ The Voice Pipeline
Farmer speaks (regional language)
→ STT (saaras:v3) → regional text
→ Translate (mayura:v1) → English
→ Agent → English answer
→ Safety re-evaluation → confidence badge
→ Translate → regional answer
→ TTS (bulbul:v3, voice: priya) → audio
→ Farmer hears the answer
We support 9 Indian languages: Hindi, Punjabi, Tamil, Telugu, Marathi, Kannada, Gujarati, Bengali, and Odia.
⚡ One-Word Fallback Mode
Not every farmer will type a full sentence. A farmer with a 2G connection, limited literacy, or just no time can type a single word (wheat, बारिश "rain", pest) and the system expands it into a full query before processing. This was a deliberate design decision: the interface should meet the farmer where they are, not demand that they meet the interface.
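A sketch of the fallback: treat any single-token input as a topic and wrap it in a full question template (the template wording here is illustrative, not the project's actual prompt):

```python
def expand_query(raw: str) -> str:
    """Turn a one-word input into a full question; leave real sentences untouched."""
    raw = raw.strip()
    if len(raw.split()) > 1:
        return raw  # already a full question
    return f"Give practical farming advice about '{raw}' for my farm and current conditions."
```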
🚧 Challenges We Faced
Getting the FAISS index to the serving app was trickier than expected. The index lives in a volume, but the app runtime doesn't mount volumes the same way. We wrote a cache-on-startup pattern using the Workspace SDK to pull the index to /tmp on first load, with size validation to catch silent 0-byte failures.
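The pattern, stripped of the Databricks specifics: the `fetch` callable below stands in for the Workspace SDK download, and the default cache path is a placeholder.

```python
import os
import tempfile

def cache_on_startup(fetch, cache_path: str = "/tmp/faiss.index") -> str:
    """Download the index once, write it atomically, and refuse silent 0-byte results."""
    if os.path.exists(cache_path) and os.path.getsize(cache_path) > 0:
        return cache_path  # warm cache: skip the download
    data = fetch()
    if not data:
        raise RuntimeError("index download returned 0 bytes")
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(cache_path))
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, cache_path)  # atomic rename: readers never see a partial file
    return cache_path
```

The size check is the important part: a failed download that leaves an empty file would otherwise poison every later startup.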
The multilingual embedding model needed careful batching: 32 chunks per call kept memory usage predictable while staying fast enough to process the full corpus.
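The batching itself is simple; a sketch of slicing the chunk list into fixed-size groups before each encode call:

```python
def batched(items: list, size: int = 32):
    """Yield successive fixed-size batches (e.g. to feed an embedding model without memory spikes)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```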
Keeping streaming costs low meant using .trigger(once=True) on the Bronze writeStream. Without it, the cluster runs indefinitely waiting for new files.
Prompt engineering for conciseness took several iterations. Llama 4 Maverick is capable but verbose. Explicit constraints (3–5 sentences, cite sources, end with a recommendation) plus a hard max_tokens cap of 300 enforced the brevity the TTS pipeline needs.
📚 What We Learned
- Multilingual embeddings are genuinely powerful. The fact that a Hindi question retrieves the right passage from an English ICAR PDF, without any translation in the retrieval step, is not a trick. It's a real capability that changes who can access scientific knowledge.
- Validation matters as much as generation. Generating an answer is easy. Knowing whether to trust it is the hard part. The re-evaluation pass made us significantly more confident in what we were showing farmers.
- Voice is the right interface for rural users. Typing is a bottleneck. Speaking is not. We'd start with the voice pipeline next time, not end with it.
- One-word queries are a feature, not an edge case. Designing for the lowest-friction input made the system more honest about who it's actually for.
🔭 What's Next
Real farmer profiles from PM-Kisan data. Live market prices from Agmarknet APIs. The full ICAR publications corpus, not just a test PDF. And a proper phone-call interface, so a farmer without a smartphone can still call a number, speak their question, and hear an answer.
The core loop works: a farmer speaks, the system listens, science answers, and the farm gets a little smarter.
Jai Kisan. 🌱
Built With
- llama
- pyspark
- rag
- sarvam