Seva Agent: Real-Time Autonomous Prayer Assistant
Inspiration
The Problem: 30M+ Sikhs attend prayer services recited in Punjabi, which is written in the Gurmukhi script. Many in the younger generations understand spoken Punjabi but cannot read Gurmukhi or grasp the authentic spiritual meanings, so they listen passively through 2-3 hour services without full engagement.
Vision: What if AI could listen to live prayers and instantly display the original verses with English meanings, creating an immersive spiritual experience for everyone?
The Solution: A local AI agent that listens to live prayer recitation and keeps the projector display synchronized with the original Punjabi text and its English meaning. It turns passive listeners into active participants who can immerse themselves in the service while improving their Punjabi reading skills.
How we built it
Architecture
Audio Input → ASR Engine → Ensemble Verse Matching → Desktop Control → Display Output
Core Components
1. Real-Time Speech Recognition
- Fine-tuned ASR on curated Gurbani dataset: 60+ hours, 10+ epochs
- Custom vocabulary/tokenizer constraining inference to domain-specific output
- Preprocessed ground truth to match real-world recitation patterns
- ASR transcripts aligned back to the original verses, so the display always shows the original text rather than raw ASR output
2. Ensemble Verse Matching
- Multi-algorithm real-time alignment: Fuzzy matching, LCS, SequenceMatcher, Levenshtein
- Leading indicators for verse identification at recitation start
- Consensus-based matching with confidence thresholds
3. Autonomous Desktop Integration
- AppleScript UI automation + OCR for SikhiToTheMax control
- Socket.IO communication replacing manual typing/scrolling
- Automated operator workflow: listen → search → display
4. Smart Navigation
- Anchor Mode: Initial positioning via consensus matching
- Paath Mode: Sequential reading with drift detection
- Leading Trigger: Predictive verse transitions
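The interplay between Anchor Mode and Paath Mode can be illustrated with a small state machine. This is a minimal sketch and not the project's actual code: the `Navigator` class, its thresholds, and the use of `difflib.SequenceMatcher` as the only similarity measure are assumptions made for illustration, and the Leading Trigger is left out to keep it short.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between a transcript and a verse."""
    return SequenceMatcher(None, a, b).ratio()


class Navigator:
    """Hypothetical tracker for a position in an ordered list of verses."""

    def __init__(self, verses, anchor_threshold=0.7, drift_threshold=0.4, max_misses=3):
        self.verses = verses
        self.anchor_threshold = anchor_threshold  # assumed value
        self.drift_threshold = drift_threshold    # assumed value
        self.max_misses = max_misses              # assumed value
        self.mode = "anchor"
        self.position = None
        self.misses = 0

    def update(self, transcript: str):
        """Feed one transcript window; return the (mode, position) pair."""
        if self.mode == "anchor":
            self._anchor(transcript)
        else:
            self._follow(transcript)
        return self.mode, self.position

    def _anchor(self, transcript):
        # Anchor mode: search every verse and lock on only if the best is strong.
        scores = [similarity(transcript, v) for v in self.verses]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= self.anchor_threshold:
            self.position, self.mode, self.misses = best, "paath", 0

    def _follow(self, transcript):
        # Paath mode: compare only against the current and the next verse.
        current = similarity(transcript, self.verses[self.position])
        nxt_index = self.position + 1
        nxt = 0.0
        if nxt_index < len(self.verses):
            nxt = similarity(transcript, self.verses[nxt_index])
        if nxt > current and nxt >= self.drift_threshold:
            self.position, self.misses = nxt_index, 0
        elif current >= self.drift_threshold:
            self.misses = 0
        else:
            # Drift: neither verse fits; after enough misses, re-anchor.
            self.misses += 1
            if self.misses >= self.max_misses:
                self.mode, self.misses = "anchor", 0
```

Restricting Paath Mode to the current and next verse keeps each update cheap, while the miss counter lets the tracker fall back to a full search when the reciter skips ahead or the transcript degrades.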
Tech Stack
- ASR: PyTorch, Transformers, HuggingFace, Wandb
- Audio: SoundDevice, NumPy, soundfile, librosa
- Matching: RapidFuzz, Levenshtein, difflib
- Automation: PyTesseract, PIL, AppleScript, Socket.IO
- Data: YouTube (yt-dlp), curated datasets, synthetic augmentation
Challenges we ran into
1. Low-Resource ASR: Limited Punjabi training data. Solution: Domain-specific fine-tuning + ensemble alignment techniques.
2. Real-Time Performance: Sub-second latency requirements. Solution: Sliding window processing + leading verse detection.
3. Verse Disambiguation: Similar phrases across verses. Solution: Contextual matching + drift monitoring.
4. API-less Integration: No SikhiToTheMax APIs. Solution: OCR + Socket.IO reverse engineering + AppleScript automation.
5. Sacred Text Accuracy: Zero tolerance for errors. Solution: Post-processing alignment ensuring original text preservation.
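The sliding window processing named in challenge 2 can be sketched as a generator that emits overlapping chunks of audio samples for the ASR engine. This is an illustrative sketch only: the `sliding_windows` name and the `window` and `hop` parameters are assumptions, and the project's real window sizes are not stated here.

```python
from collections import deque
from typing import Iterable, Iterator, List


def sliding_windows(samples: Iterable[float], window: int, hop: int) -> Iterator[List[float]]:
    """Yield overlapping windows of `window` samples, advancing `hop` samples each time."""
    if window <= 0 or hop <= 0 or hop > window:
        raise ValueError("require 0 < hop <= window")
    buf = deque(maxlen=window)  # oldest samples fall out automatically
    since_last = 0              # samples received since the last emitted window
    for sample in samples:
        buf.append(sample)
        since_last += 1
        if len(buf) == window and since_last >= hop:
            yield list(buf)
            since_last = 0


# Usage: a window of 4 samples with a hop of 2 gives 50% overlap.
for chunk in sliding_windows(range(10), window=4, hop=2):
    print(chunk)
```

A smaller hop gives the matcher fresher transcripts at the cost of more ASR calls per second, so the hop is the main knob for trading latency against compute.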
What we learned
Domain Adaptation: Base ASR models fail for low-resource languages. General Punjabi ASR ≠ religious-domain ASR. High-quality domain-specific data is essential.
Ensemble Methods: Noisy ASR requires multiple alignment techniques. Single algorithms fail; consensus delivers production reliability.
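A consensus vote across several similarity measures might look like the sketch below. It is an assumption-laden illustration, not the project's implementation: the function names, the `min_votes` and `min_score` thresholds, and the pure-Python Levenshtein and LCS routines are all hypothetical stand-ins (the project lists RapidFuzz and Levenshtein libraries for this), and the sample strings are English placeholders rather than Gurbani.

```python
from difflib import SequenceMatcher


def levenshtein_ratio(a: str, b: str) -> float:
    """Normalized edit-distance similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def lcs_ratio(a: str, b: str) -> float:
    """Longest-common-subsequence length divided by the longer string's length."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / max(len(a), len(b))


def sequence_ratio(a: str, b: str) -> float:
    """difflib's matching-blocks similarity."""
    return SequenceMatcher(None, a, b).ratio()


SCORERS = (levenshtein_ratio, lcs_ratio, sequence_ratio)


def consensus_match(transcript, verses, min_votes=2, min_score=0.6):
    """Return (verse_index, mean_score), or None when the scorers do not agree."""
    votes = {}
    for scorer in SCORERS:
        scores = [scorer(transcript, v) for v in verses]
        best = max(range(len(verses)), key=scores.__getitem__)
        if scores[best] >= min_score:  # a scorer only votes when it is confident
            votes.setdefault(best, []).append(scores[best])
    if not votes:
        return None
    winner = max(votes, key=lambda i: (len(votes[i]), sum(votes[i])))
    if len(votes[winner]) < min_votes:
        return None
    return winner, sum(votes[winner]) / len(votes[winner])


# Placeholder verses; a noisy transcript should still resolve to index 0.
verses = ["the river runs to the sea", "the mountain stands in silence"]
print(consensus_match("the rivr runs too the sea", verses))
```

Because the function returns an index into the verse list rather than the transcript itself, whatever is displayed comes from the stored original text, and returning `None` lets the caller hold the current display instead of guessing.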
Real-World Performance: Lab performance ≠ production performance. Models trained on studio audio fail in halls with AV systems and ambient noise, so synthetic augmentation is crucial.
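One common form of synthetic augmentation is mixing recorded noise into clean audio at a chosen signal-to-noise ratio. The sketch below shows that idea under stated assumptions: the `mix_at_snr` helper is hypothetical, the source does not say which augmentations were actually used, and plain Python lists stand in for the NumPy arrays a real pipeline would more likely use.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise ratio equals `snr_db`, then add it to `clean`."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[: len(clean)]
    clean_rms, noise_rms = rms(clean), rms(noise)
    if clean_rms == 0 or noise_rms == 0:
        raise ValueError("signals must be non-silent")
    # SNR (dB) = 20 * log10(clean_rms / noise_rms), solved for the noise level.
    target_noise_rms = clean_rms / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [c + gain * n for c, n in zip(clean, noise)]


# Usage: a synthetic tone plus Gaussian noise at 10 dB SNR.
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
hall_noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(tone, hall_noise, snr_db=10)
```

Sampling `snr_db` from a range during training, rather than fixing it, exposes the model to both quiet and loud rooms.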
Cultural AI: Religious contexts demand strict accuracy standards transferable to medical/legal domains. AI for underrepresented minorities has significant community impact.
Accomplishments that we're proud of
- First autonomous Gurbani recognition system serving the global Sikh community
- Inference-time alignment using an ensemble approach for real-time verse identification and synchronization
- Zero operator dependency - fully autonomous projector displays
- Educational impact: Improved Punjabi literacy and spiritual engagement
- Open source contribution: The agent is open source, and the model is hosted on Hugging Face.
What's next for Seva Agent
- Fine-tune gpt-oss to augment the ensemble of techniques for leading verse identification
- Federated learning across global deployments at Sikh temples (Gurdwaras)
- Mobile integration for worldwide radio/internet listeners
- Edge optimization: Quantization for limited compute with real-time latency
- Multi-language translation (10+ languages) preserving original context
- More modes, including keertan (recitation with singing and musical instruments)
- Multi-modal learning: Audio + contextual signals (time, ceremony type)
- Custom ChatGPTs leveraging gpt-oss for personalized religious chatbots
Impact: Demonstrates production AI for low-resource languages, cultural preservation, and autonomous religious technology applicable to education, healthcare, and legal domains requiring strict accuracy.
Built With
- agent
- asr
- gpt-oss
- openai
- python