Logos
Inspiration
The study of Ancient Greek is often seen as an intimidating mountain of complex morphology and rigid grammar. We wanted to transform this experience from solitary rote memorization into a dynamic, interactive dialogue. Our inspiration was to create Logos (ΛΟΓΟΣ): not just a generic chatbot, but a world-class, specialized Ancient Greek scholar and live philological companion. We aimed to build a tool that feels less like a search engine and more like a patient mentor sitting beside you as you read Homer or decode a weathered inscription.
What it does
Logos is a multimodal AI console designed for classical philology.
- Live Scholarly Dialogue: Users can engage in low-latency, bidirectional voice or text conversations about Greek literature, history, and culture.
- Multimodal Analysis: Users can upload or point their camera at manuscripts and printed pages; Logos transcribes, translates, and analyzes the text in real time.
- Specialized Philological Tools: It provides structured morphological parsing of any Greek word (via the parse_greek tool), metrical scansion for verse, and reconstructed Attic pronunciation guidance using IPA.
- Adaptive Learning: The system adapts its scholarly nuance based on the user's level, from providing full morphology tables for beginners to engaging in peer-level textual criticism for advanced scholars.
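To make "structured morphological parsing" concrete, here is a minimal sketch of what a parse result could look like once it reaches the UI. The class name `MorphologicalParse`, its field names, and the helper `to_ui_json` are assumptions made for illustration; the actual schema returned by the parse_greek tool is not documented here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MorphologicalParse:
    """Hypothetical shape of one parsed word (field names are assumptions)."""
    form: str                      # the inflected form as it appears in the text
    lemma: str                     # dictionary headword
    part_of_speech: str
    case: Optional[str] = None     # only meaningful for nominal forms
    number: Optional[str] = None
    gender: Optional[str] = None
    gloss: Optional[str] = None


def to_ui_json(parse: MorphologicalParse) -> str:
    """Serialize a parse, dropping fields that do not apply to the word."""
    payload = {k: v for k, v in asdict(parse).items() if v is not None}
    # ensure_ascii=False keeps polytonic Greek readable in the JSON output.
    return json.dumps(payload, ensure_ascii=False)


example = MorphologicalParse(
    form="λόγου",
    lemma="λόγος",
    part_of_speech="noun",
    case="genitive",
    number="singular",
    gender="masculine",
    gloss="word, speech, reason",
)
```

Calling `to_ui_json(example)` yields a flat JSON object that a frontend could render as a morphology table, while an indeclinable word such as καί would simply omit the case, number, and gender keys.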
How we built it
- Frontend: Built with Next.js and Tailwind CSS, utilizing the Web Audio API for real-time PCM audio capture and rendering.
- Backend: A FastAPI (Python) server acting as a gateway that manages session lifecycles, executes complex tool lookups, and relays binary audio streams.
- AI Engine: Powered by the Gemini 2.0 Multimodal Live API, allowing for seamless, low-latency "barge-in" interruptions and concurrent text, audio, and visual processing.
- Structured Data: For specialized linguistic tasks like parsing, we implemented a dual-call architecture where the Live session triggers non-streaming Gemini API calls to return high-precision JSON data for the UI.
- Deployment: Fully containerized with Docker and configured for scalable deployment on Google Cloud Run.
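The dual-call idea described above can be sketched as a small dispatcher inside the gateway: when the live session requests a tool, the gateway routes the request to a handler and sends the structured result back. Everything named here (`TOOLS`, `handle_tool_call`, the message shape, and the stubbed `parse_greek` handler) is hypothetical, and the handler returns a placeholder where the real system would make its second, non-streaming model call.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

ToolHandler = Callable[[Dict[str, Any]], Awaitable[Dict[str, Any]]]


async def parse_greek(args: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for the second, non-streaming call that returns JSON."""
    word = args["word"]
    # A real handler would call the model here; this stub only echoes the word.
    return {"form": word, "analysis": "stubbed"}


# Registry mapping tool names to handlers (assumed structure).
TOOLS: Dict[str, ToolHandler] = {"parse_greek": parse_greek}


async def handle_tool_call(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Route a tool request from the live session to its handler."""
    handler = TOOLS.get(name)
    if handler is None:
        # Report unknown tools instead of raising, so the session stays alive.
        return {"name": name, "error": f"unknown tool: {name}"}
    result = await handler(args)
    return {"name": name, "response": result}


if __name__ == "__main__":
    print(asyncio.run(handle_tool_call("parse_greek", {"word": "λόγος"})))
```

Keeping the registry in the gateway is one way to let the backend decide which tools exist and what they may return, rather than forwarding model requests blindly.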
Challenges we ran into
- Protocol Nuances: One of our biggest hurdles was navigating the transition between Gemini API versions (v1 vs. v1alpha) to enable the bidirectional bidiGenerateContent protocol.
- Real-time Audio Latency: Managing 16-bit 16kHz mono PCM streams between the browser's ScriptProcessorNode and the backend required careful synchronization to avoid audio artifacts.
- Environment Orchestration: Debugging the 404 and 1008 (Policy Violation) errors during WebSocket handshakes forced us to deeply audit how Next.js bakes environment variables at build time versus runtime within Docker containers.
- Domain Specificity: Engineering prompts that ensure Logos stays "in character" as a specialized scholar and refuses non-Greek queries required rigorous system instruction tuning.
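As an illustration of the 16-bit mono PCM format mentioned in the audio challenge above, the sketch below converts floating-point samples in the range [-1.0, 1.0] (the form the Web Audio API exposes) into little-endian 16-bit integers. In Logos this conversion would happen in the browser; it is shown in Python only to keep the examples in one language, and the function name `float_to_pcm16` is an assumption.

```python
import struct
from typing import Iterable


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM."""
    out = bytearray()
    for sample in samples:
        # Clamp first so out-of-range input cannot overflow the 16-bit range.
        clamped = max(-1.0, min(1.0, sample))
        out += struct.pack("<h", int(round(clamped * 32767)))
    return bytes(out)
```

Each sample occupies exactly two bytes, so a buffer of N float samples always becomes 2N bytes on the wire. Skipping the clamp is a common source of the wrap-around clicks that show up as audio artifacts.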
Accomplishments that we're proud of
- Seamless Multimodality: We successfully integrated vision, audio, and text so a user can show a physical book to the camera and ask, "How do I pronounce this line?" and receive an immediate vocal response in reconstructed Attic Greek.
- The "Mock Mode": We built a complete mock session handler that replicates the entire Gemini Live WebSocket protocol, allowing for frontend development and testing without incurring API costs or requiring constant connectivity.
- Specialized Tooling: We developed the parse_greek tool-calling logic, which provides structured, human-readable morphological data rather than just prose explanations.
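The mock mode described above can be pictured as an object with the same send and receive surface as a live session, but backed by canned replies. This is a simplified sketch, not the project's handler: the class name `MockLiveSession`, the method names, and the message dictionaries are all assumptions, and the real mock mirrors the Gemini Live WebSocket messages rather than this toy format.

```python
import asyncio
from collections import deque
from typing import Any, AsyncIterator, Deque, Dict


class MockLiveSession:
    """Toy stand-in for a live session that never touches the network."""

    def __init__(self, canned_reply: str = "χαῖρε") -> None:
        self._canned_reply = canned_reply
        self._pending: Deque[Dict[str, Any]] = deque()

    async def send_text(self, text: str) -> None:
        """Queue a canned reply followed by an end-of-turn marker."""
        self._pending.append(
            {"type": "text", "text": f"{self._canned_reply} (echo: {text})"}
        )
        self._pending.append({"type": "turn_complete"})

    async def receive(self) -> AsyncIterator[Dict[str, Any]]:
        """Yield queued messages until the current turn is complete."""
        while self._pending:
            message = self._pending.popleft()
            await asyncio.sleep(0)  # yield control, as a real socket read would
            yield message
            if message["type"] == "turn_complete":
                break
```

Because the mock exposes the same coroutine-based surface, frontend code can be pointed at it through a configuration switch without knowing whether a real model is behind the session.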
What we learned
- The Power of v1alpha: We gained deep experience with the cutting-edge google-genai SDK and the specific model requirements (like gemini-2.0-flash-exp) needed for live bidirectional streaming.
- Audio Engineering: We learned the intricacies of handling raw binary audio data over WebSockets and the importance of sample rate matching (16kHz for input, 24kHz for output).
- Stateless Gateway Design: We validated that using the backend as a gateway rather than a blind proxy is essential for managing tool execution and session security.
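The sample rate point above comes down to simple arithmetic, sketched here for 16-bit mono audio. The helper names and the 100 ms chunk size are illustrative assumptions; only the 16kHz input and 24kHz output rates come from the write-up.

```python
INPUT_RATE_HZ = 16_000   # microphone audio sent to the model
OUTPUT_RATE_HZ = 24_000  # synthesized audio coming back
BYTES_PER_SAMPLE = 2     # 16-bit PCM


def chunk_bytes(sample_rate_hz: int, duration_ms: int, channels: int = 1) -> int:
    """Size in bytes of a PCM chunk covering duration_ms of audio."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * channels


def chunk_duration_ms(num_bytes: int, sample_rate_hz: int, channels: int = 1) -> float:
    """Playback time of num_bytes of PCM when played at sample_rate_hz."""
    samples = num_bytes / (BYTES_PER_SAMPLE * channels)
    return 1000.0 * samples / sample_rate_hz
```

A 100 ms chunk is 3200 bytes at 16kHz but 4800 bytes at 24kHz. Playing a 3200-byte input chunk at the output rate would last only about 66.7 ms, which is why mismatched rates are heard as sped-up or slowed-down speech.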
What's next for Ancient Greek Scholar Console
- Audio Performance: Migrating from the deprecated ScriptProcessorNode to AudioWorklets to further reduce latency and improve browser performance.
- Expanded Lexicography: Integrating the Logeion or LSJ databases directly into our lookup_lexicon tool for more authoritative scholarly references.
- Haptic Feedback: Exploring ways to provide visual metrical "tapping" to help students learn the rhythm of dactylic hexameter more intuitively.