🌾 Krishi Dhwani — The Voice of the Farm

"What if a farmer could just speak their question — in their own language — and get expert advice back, in their own voice?"

That one question drove this entire project.


💡 The Inspiration

India has over 100 million farming households. The knowledge they need — when to sow, whether to sell now or wait, how to fix their soil, what the rain is going to do this week — exists. It lives in ICAR (Indian Council of Agricultural Research) papers, in government databases, in meteorological forecasts. But almost none of it reaches the farmer who needs it, in the moment they need it, in the language they speak.

A wheat farmer in Punjab speaks Punjabi. A mustard grower in Rajasthan speaks Hindi. A paddy cultivator in Tamil Nadu speaks Tamil. Existing agri-advisory tools expect them all to navigate English-language dashboards. That's a broken assumption.

Krishi Dhwani ("voice of farming" in Hindi) is our attempt to fix it. A farmer speaks a question into their phone. The system listens, thinks, and speaks a real answer back — in their language, backed by actual science and live data.


🏗️ How We Built It

We used a Medallion Architecture on Databricks, moving data from raw ingestion all the way to an intelligent agent layer.

🥉 Bronze — Raw Data Ingestion

The foundation is four data streams, all landing in Delta tables:

  1. ICAR Research PDFs — ingested as binary files using Databricks Auto Loader with cloudFiles format. Every new PDF dropped into a volume gets picked up automatically in a streaming fashion.

  2. Market Prices — simulated mandi (market) price data for 5 major crops (Wheat, Mustard, Paddy, Soybean, Maize) across 5 cities over 30 days. In production, this would come from Agmarknet or a government price API.

  3. Weather Forecasts — 7-day forecasts tied to farmer pincodes, covering rainfall, temperature, humidity, and conditions.

  4. Farmer Profiles — structured records including farmer_id, pincode, soil_type, pH, NPK values, and most importantly, their preferred language.
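In code, the PDF stream amounts to a short Auto Loader job. This is a sketch only: the volume paths and table names are illustrative, and it assumes an ambient `spark` session rather than being runnable standalone.

```python
# Sketch only: paths and table names are illustrative, not the real ones.
raw_pdfs = (
    spark.readStream
    .format("cloudFiles")                        # Databricks Auto Loader
    .option("cloudFiles.format", "binaryFile")   # ingest PDFs as raw bytes
    .option("pathGlobFilter", "*.pdf")           # only pick up PDF files
    .load("/Volumes/krishi/raw/icar_pdfs")
)

(raw_pdfs.writeStream
    .option("checkpointLocation", "/Volumes/krishi/raw/_checkpoints/icar")
    .trigger(once=True)   # process the available files, then stop the stream
    .toTable("krishi.bronze.icar_pdfs"))
```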

🥈 Silver — Cleaning, Enrichment & Embeddings

This is where the interesting work begins.

PDF text extraction runs PyMuPDF (fitz) inside a PySpark UDF, converting raw binary content into clean text and filtering out documents whose extracted text is too short to be useful. The text is capped at 50,000 characters per document to keep processing costs sane.

Then comes the vector knowledge base. We chunk each ICAR document into 500-character segments with a 50-character overlap, so context doesn't get cut off abruptly at a boundary:

$$\text{chunk}_i = \text{text}[i \cdot (500 - 50) \; : \; i \cdot (500 - 50) + 500]$$
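As a sketch, that chunking step reduces to a small helper (the function name and parameters are ours, not from the pipeline):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks whose tails overlap, so a
    sentence cut at one boundary survives intact in the next chunk."""
    step = size - overlap  # with the defaults: 450 new characters per chunk
    return [text[i:i + size] for i in range(0, len(text), step)]
```

With the defaults, chunk *i* starts at i·450, exactly as in the formula above.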

Each chunk is embedded using paraphrase-multilingual-mpnet-base-v2 — a model that handles all 9 of our target Indian languages in a shared vector space. The embeddings are indexed in FAISS (using inner-product similarity after L2 normalization, which equals cosine similarity):

$$\text{similarity}(q, d) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}||\vec{d}|}$$
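The equivalence FAISS relies on here — inner product over L2-normalized vectors equals cosine similarity — can be checked numerically. This is a standalone NumPy check with toy vectors, not the pipeline code:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so that dot product == cosine."""
    return v / np.linalg.norm(v)

q = np.array([3.0, 4.0])   # toy "query" embedding
d = np.array([1.0, 2.0])   # toy "document" embedding

cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
inner_product = l2_normalize(q) @ l2_normalize(d)

assert np.isclose(cosine, inner_product)
```

This is why a plain inner-product FAISS index over normalized embeddings behaves like cosine search.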

The index and metadata (chunk text + source path) are persisted to a Databricks Volume so the serving app can load them at startup without rebuilding everything.

Silver also cleans up market prices (converting rupees-per-quintal to rupees-per-kg: \( p_{kg} = p_{quintal} / 100 \)) and filters the weather table to only keep future dates.

🥇 Gold — Aggregations & Soil Health

The gold layer produces two analytical tables used directly by the agent:

  • market_price_trends — 7-day averages, min/max, and a simple profitability label (Profitable if average price > ₹40/kg, Below Average otherwise)
  • farmer_soil_health — farmer profiles enriched with a soil health classification based on pH:

$$\text{Soil Status} = \begin{cases} \text{Good} & 6.0 \leq \text{pH} \leq 7.5 \\ \text{Acceptable} & 5.5 \leq \text{pH} \leq 8.0 \\ \text{Needs Amendment} & \text{otherwise} \end{cases}$$
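Both Gold-layer labels reduce to a couple of pure functions (the names are illustrative; the thresholds are the ones stated above):

```python
def soil_status(ph: float) -> str:
    """Classify soil health from pH. The 'Good' band sits inside the
    wider 'Acceptable' band, so it must be checked first."""
    if 6.0 <= ph <= 7.5:
        return "Good"
    if 5.5 <= ph <= 8.0:
        return "Acceptable"
    return "Needs Amendment"

def price_label(avg_price_per_kg: float) -> str:
    """Profitability label used by market_price_trends (threshold in rupees/kg)."""
    return "Profitable" if avg_price_per_kg > 40 else "Below Average"
```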

🤖 The Agent — Putting It All Together

Given a farmer's query, the agent pulls three context streams in parallel:

  • ICAR knowledge via FAISS semantic search (top-3 relevant chunks)
  • 7-day weather for their pincode
  • Market price trend for their primary crop

All of this gets packed into a structured prompt sent to Meta Llama 4 Maverick (via the Databricks model serving endpoint), which has been instructed to act as Krishi Dhwani: concise, source-citing, ending with a clear recommendation.
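A minimal sketch of how the three context streams could be packed into one prompt; the structure and wording here are illustrative, not the production template:

```python
def build_prompt(query: str, icar_chunks: list[str],
                 weather: str, market: str) -> str:
    """Assemble retrieved ICAR context, weather, and market data plus
    the farmer's question into one structured prompt for the LLM."""
    knowledge = "\n".join(f"- {c}" for c in icar_chunks)
    return (
        "You are Krishi Dhwani, an agricultural advisor. "
        "Answer in 3-5 sentences, cite your sources, and end "
        "with a clear recommendation.\n\n"
        f"ICAR knowledge:\n{knowledge}\n\n"
        f"7-day weather forecast: {weather}\n"
        f"Market price trend: {market}\n\n"
        f"Farmer's question: {query}"
    )
```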

🎙️ The Voice Pipeline

This is the part we're most proud of. Using Sarvam AI, the full round trip looks like this:

Farmer speaks (regional language)
    → Sarvam STT (saaras:v3) → regional text
    → Sarvam Translate (mayura:v1) → English
    → Krishi Agent → English answer
    → Sarvam Translate → regional answer
    → Sarvam TTS (bulbul:v3, voice: priya) → audio file
        → Farmer hears the answer

We support 9 Indian languages: Hindi, Punjabi, Tamil, Telugu, Marathi, Kannada, Gujarati, Bengali, and Odia. The frontend is a Gradio app deployed directly as a Databricks App.
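The round trip above can be sketched as a chain of pluggable stages. The callables below stand in for Sarvam's STT, translation, and TTS APIs (their real signatures differ; this only shows the orchestration):

```python
from typing import Callable

def voice_round_trip(audio: bytes, lang: str,
                     stt: Callable, translate: Callable,
                     agent: Callable, tts: Callable) -> bytes:
    """Speech in, speech out: STT -> translate to English -> agent ->
    translate back -> TTS, all in the farmer's own language."""
    regional_text = stt(audio, lang)
    english_query = translate(regional_text, src=lang, tgt="en")
    english_answer = agent(english_query)
    regional_answer = translate(english_answer, src="en", tgt=lang)
    return tts(regional_answer, lang)
```

Keeping each stage injectable also makes the pipeline testable without calling any external API.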


🧗 Challenges We Faced

Getting the FAISS index to the serving app was trickier than expected. The index lives in a Databricks Volume (great for notebooks), but the Databricks App runtime doesn't mount volumes the same way. We ended up writing a cache-on-startup pattern using the Workspace SDK (w.files.download) to pull the index to /tmp on first load, with size validation to catch silent 0-byte failures.

The multilingual embedding model needed careful batching. Encoding thousands of chunks one-by-one would have been painfully slow; batching at 32 chunks per call made it manageable while keeping memory usage predictable.
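The batching itself is just slicing the chunk list into fixed-size groups before each encode call (a generic helper, with batch size 32 as in our pipeline):

```python
def batched(items: list, size: int = 32):
    """Yield successive fixed-size slices of a list; the last may be short."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# usage sketch (model is assumed to be a loaded SentenceTransformer):
# vectors = [model.encode(batch) for batch in batched(chunks, 32)]
```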

Keeping costs low on a streaming pipeline meant using .trigger(once=True) on the Bronze writeStream — a small but important detail. Without it, the cluster would run indefinitely waiting for new PDFs.

Prompt engineering for conciseness took several iterations. Llama 4 Maverick is very capable but tends to be verbose. We had to be explicit: 3–5 sentences, cite your sources, end with a recommendation — and even then, we cap max_tokens at 300 to enforce brevity, since the TTS character limit is 500.
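Since a token cap doesn't map exactly to a character count, a defensive truncation that prefers a sentence boundary can guard the TTS limit; this helper is ours, not from the codebase:

```python
def fit_for_tts(text: str, limit: int = 500) -> str:
    """Trim text to the TTS character limit, cutting at the last
    sentence-ending period inside the limit when one exists."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last_period = cut.rfind(".")
    return cut[:last_period + 1] if last_period > 0 else cut
```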


📚 What We Learned

  • Medallion architecture pays off even at small scale. When we needed to fix the market price unit conversion, we only had to change the Silver transformation; Bronze stayed untouched, and the Gold tables downstream updated cleanly.
  • Multilingual embeddings are genuinely good now. The fact that a Hindi question can retrieve relevant text from an English-language ICAR PDF — without any translation step in the retrieval — still feels a little magical to us.
  • Voice is the right interface for rural users. Typing is a bottleneck. Speaking is not. Building the voice pipeline last was probably the wrong order; we'd start there next time.
  • Databricks Apps made the deployment story surprisingly clean. Going from a notebook prototype to a shareable web app required almost no infrastructure work.

🔭 What's Next

The current farmer profiles are hand-written rows in a Delta table. The real version would integrate with PM-Kisan beneficiary data or Aadhaar-linked farmer registries. The market prices are simulated — connecting to live Agmarknet APIs would make the advice truly actionable. And we'd love to run this on actual ICAR publications, all of them, not just our test PDF.

But the core loop works: a farmer speaks, the system listens, science answers, and the farm gets a little smarter.

Jai Kisan. 🌱

Built With

  • apache
  • auto-loader
  • databricks-apps
  • databricks-connect
  • delta-lake
  • faiss
  • model-serving
  • particle
  • pyspark
  • volumes