An Agentic Student Risk Monitoring System

💡 The Inspiration

Every year, thousands of students fall through the cracks of the education system—not because they lack potential, but because help arrives too late. Traditional academic monitoring is reactive: we notice a problem only after a student fails an exam.

We were inspired to build a system that acts as a proactive digital guardian. By combining the industrial-scale data processing of Databricks with the cultural and linguistic nuance of Sarvam AI (Indic LLM), we wanted to create a "Lakehouse-to-Action" pipeline that identifies struggling students and intervenes in their mother tongue before a crisis occurs.

🏗️ How We Built It

We implemented the project using a Medallion Architecture on the Databricks Data Intelligence Platform, ensuring a clean flow from raw data to AI-driven decisions.

  1. Bronze Layer (Ingestion): We ingested high-fidelity synthetic datasets containing academic scores, behavioral metrics (hand-raising, resource visits), and attendance records into Delta Lake.
  2. Silver Layer (Feature Engineering): Using PySpark, we aggregated event-level data into student-centric features. We handled null values and normalized metrics to prepare for machine learning.
  3. Gold Layer (Machine Learning): We trained a Random Forest classifier to predict student risk from engagement and performance features: $$\text{Risk Score} = f(\text{Attendance Rate}, \text{Assignment Completion}, \text{Engagement Index})$$ The risk score is the model's predicted probability $P(\text{Risk})$, which is stored in our final Gold table.
  4. Agentic Layer (Sarvam AI): This is the "brain" of the system. An autonomous agent identifies students where $P(\text{Risk}) > 0.5$, analyzes their specific context (e.g., "High Absenteeism"), and calls the Sarvam-2.0 LLM to generate empathetic, multilingual interventions in Hindi and English.
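
The selection step of the agentic layer can be sketched roughly as follows. This is a minimal pure-Python illustration, not the project's actual code: the field names (`student_id`, `p_risk`, `attendance_rate`, `assignment_completion`), the 0.75 cut-offs used to pick a context label, and the helper names are all hypothetical. The call to Sarvam AI itself is left out, since its request format is not described here.

```python
from dataclasses import dataclass

RISK_THRESHOLD = 0.5  # from the write-up: intervene when P(Risk) > 0.5


@dataclass
class GoldRow:
    """One row of the Gold table (hypothetical column names)."""
    student_id: str
    p_risk: float
    attendance_rate: float        # fraction in [0, 1]
    assignment_completion: float  # fraction in [0, 1]


def risk_context(row: GoldRow) -> str:
    """Pick a short label for the weakest signal (illustrative cut-offs)."""
    if row.attendance_rate < 0.75:
        return "High Absenteeism"
    if row.assignment_completion < 0.75:
        return "Low Assignment Completion"
    return "Low Engagement"


def build_prompt(row: GoldRow, language: str) -> str:
    """Compose the text that would be sent to the LLM."""
    return (
        f"Write a short, empathetic message in {language} for a student "
        f"whose main difficulty is: {risk_context(row)}."
    )


def select_at_risk(rows):
    """Keep only the students whose predicted risk exceeds the threshold."""
    return [r for r in rows if r.p_risk > RISK_THRESHOLD]


rows = [
    GoldRow("S1", 0.82, 0.60, 0.90),
    GoldRow("S2", 0.30, 0.95, 0.85),
    GoldRow("S3", 0.55, 0.90, 0.40),
]
prompts = {r.student_id: build_prompt(r, "Hindi") for r in select_at_risk(rows)}
```

Here only S1 and S3 cross the threshold, so only they get a prompt. Keeping the context label separate from the prompt text makes it easy to log why each student was flagged.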

🚩 Challenges We Faced

  • The "Data Explosion" Join: Initially, joining daily attendance records directly with weekly homework records produced a many-to-many match, effectively a Cartesian product, that bloated our data. We solved this by aggregating each source in PySpark before joining, which flattens the data into a single "Feature Vector" per student.
  • Vector Incompatibility: We found that we could not write the MLlib Vector column holding the model's class probabilities directly to our Gold Delta table. We overcame this by writing a User Defined Function (UDF) that extracts the risk probability into a standard Float column before the final write.
  • LLM Hallucinations: Getting the LLM to return strictly valid JSON for our automated pipeline took extensive prompt engineering. We implemented a "Strict Schema" prompt and a Python-based parser that validates each reply, so the agent's decisions can be saved back to Delta tables without breaking the schema.
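
A strict parse-and-validate step of the kind described in the last bullet might look like the sketch below. The expected keys (`student_id`, `risk_reason`, `message_hi`, `message_en`) are assumptions made for illustration, not the project's actual schema. The point is only that any reply failing validation is rejected before it reaches a Delta write.

```python
import json

# Hypothetical schema: every key is required and must hold a string.
REQUIRED_KEYS = ("student_id", "risk_reason", "message_hi", "message_en")


def parse_agent_reply(raw: str):
    """Return a validated dict, or None if the reply does not fit the schema."""
    text = raw.strip()
    # Models often wrap JSON in prose or code fences; keep only the outermost object.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if set(data) != set(REQUIRED_KEYS):
        return None  # missing or unexpected keys would break the table schema
    if not all(isinstance(data[k], str) for k in REQUIRED_KEYS):
        return None
    return data


good = parse_agent_reply(
    'Sure! {"student_id": "S1", "risk_reason": "High Absenteeism", '
    '"message_hi": "नमस्ते", "message_en": "Hello"}'
)
bad = parse_agent_reply('{"student_id": "S1"}')
```

In this example `good` is a dict with exactly the four expected keys, while `bad` is `None` because three keys are missing. A caller could retry the LLM or skip the student whenever `None` comes back.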

🧠 What We Learned

  • Data vs. Intelligence: We learned that a Machine Learning model is just a "labeler," but an AI Agent is a "doer." Moving from a prediction (Gold table) to an action (Agent intervention) is the future of software.
  • The Power of Delta Lake: Using Delta Lake's overwriteSchema and merge capabilities made iterating on our features incredibly fast compared to traditional SQL databases.
  • Localization Matters: Seeing the AI generate a supportive message in Hindi made us realize that for an intervention to be effective, it must speak the student's heart language, not just the system's default language.

📈 The Impact

Margdarshak AI transforms the "Lakehouse" from a storage unit into a Reasoning Engine. By automating the path from $\text{Data} \rightarrow \text{Insight} \rightarrow \text{Action}$, we allow educators to focus on what they do best: teaching and mentoring.


🛠️ Tech Stack Used:

  • Platform: Databricks
  • Engine: Apache Spark (PySpark)
  • Storage: Delta Lake
  • ML: Spark MLlib (Random Forest)
  • Agentic AI: Sarvam AI (Indic LLM)
  • Visualization: Databricks SQL Dashboards
