Logos

Inspiration

The study of Ancient Greek is often seen as an intimidating mountain of complex morphology and rigid grammar. We wanted to transform this experience from solitary rote memorization into a dynamic, interactive dialogue. Our inspiration was to create Logos (ΛΟΓΟΣ): not just a generic chatbot, but a world-class, specialized Ancient Greek scholar and live philological companion. We aimed to build a tool that feels less like a search engine and more like a patient mentor sitting beside you as you read Homer or decode a weathered inscription.

What it does

Logos is a multimodal AI console designed for classical philology.

  • Live Scholarly Dialogue: Users can engage in low-latency, bidirectional voice or text conversations about Greek literature, history, and culture.
  • Multimodal Analysis: Users can upload manuscripts and printed pages or point their camera at them; Logos transcribes, translates, and analyzes the text in real time.
  • Specialized Philological Tools: It provides structured morphological parsing of any Greek word (via the parse_greek tool), metrical scansion for verse, and reconstructed Attic pronunciation guidance using IPA.
  • Adaptive Learning: The system adapts the depth of its explanations to the user's level, from full morphology tables for beginners to peer-level textual criticism for advanced scholars.

How we built it

  • Frontend: Built with Next.js and Tailwind CSS, utilizing the Web Audio API for real-time PCM audio capture and rendering.
  • Backend: A FastAPI (Python) server acting as a gateway that manages session lifecycles, executes complex tool lookups, and relays binary audio streams.
  • AI Engine: Powered by the Gemini 2.0 Multimodal Live API, allowing for seamless, low-latency "barge-in" interruptions and concurrent text, audio, and visual processing.
  • Structured Data: For specialized linguistic tasks like parsing, we implemented a dual-call architecture where the Live session triggers non-streaming Gemini API calls to return high-precision JSON data for the UI.
  • Deployment: Fully containerized with Docker and configured for scalable deployment on Google Cloud Run.
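
The dual-call pattern from the Structured Data item can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `handle_tool_call`, `ParseResult`, the `word` argument, and the field names are all hypothetical, and the non-streaming model call is injected as a plain callable (`generate_json`) and stubbed out, so no real SDK call is shown.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

# Hypothetical shape of the structured result handed to the UI.
@dataclass
class ParseResult:
    form: str
    lemma: str
    part_of_speech: str
    features: dict

REQUIRED_KEYS = {"form", "lemma", "part_of_speech", "features"}


def handle_tool_call(name: str, args: dict, generate_json: Callable[[str], str]) -> dict:
    """Turn a tool call from the live session into a one-shot JSON request."""
    if name != "parse_greek":
        return {"error": f"unknown tool: {name}"}
    word = str(args.get("word", "")).strip()
    if not word:
        return {"error": "missing 'word' argument"}

    # Second call: a non-streaming request whose only job is to return JSON.
    raw = generate_json(f"Return a JSON morphological parse of the Greek word: {word}")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "model did not return valid JSON"}
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return {"error": "model response is missing required fields"}

    # Keep only the expected fields so the UI always sees the same shape.
    return asdict(ParseResult(**{key: data[key] for key in REQUIRED_KEYS}))


def fake_generate_json(prompt: str) -> str:
    """Stub standing in for the non-streaming model call (assumption)."""
    return json.dumps({
        "form": "λόγος",
        "lemma": "λόγος",
        "part_of_speech": "noun",
        "features": {"case": "nominative", "number": "singular", "gender": "masculine"},
    })


result = handle_tool_call("parse_greek", {"word": "λόγος"}, fake_generate_json)
print(result["lemma"], result["features"]["case"])
```

Validating the JSON in the gateway, rather than forwarding it untouched, means a malformed model response becomes an explicit error for the UI instead of a rendering failure.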

Challenges we ran into

  • Protocol Nuances: One of our biggest hurdles was navigating the transition between Gemini API versions (v1 vs. v1alpha) to enable the bidirectional bidiGenerateContent protocol.
  • Real-time Audio Latency: Managing 16-bit 16kHz mono PCM streams between the browser's ScriptProcessorNode and the backend required careful synchronization to avoid audio artifacts.
  • Environment Orchestration: Debugging HTTP 404 responses and WebSocket 1008 (Policy Violation) close codes during WebSocket handshakes forced us to audit how Next.js bakes in environment variables at build time versus runtime within Docker containers.
  • Domain Specificity: Engineering prompts that ensure Logos stays "in character" as a specialized scholar and refuses non-Greek queries required rigorous system instruction tuning.
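
The 16-bit mono PCM format named in the audio latency item can be illustrated with a small conversion sketch. In the browser this step would be written in JavaScript; the Python version below only shows the arithmetic. `float_to_pcm16` is a hypothetical helper, and little-endian byte order is an assumption rather than something the writeup states.

```python
import struct


def float_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # clip so out-of-range input cannot overflow
        ints.append(int(round(clipped * 32767)))
    return struct.pack(f"<{len(ints)}h", *ints)


pcm = float_to_pcm16([0.0, 1.0, -1.0, 2.5])
print(len(pcm))  # 4 samples * 2 bytes each = 8
```

Clipping before scaling matters because an unclipped sample outside the range would not fit in a signed 16-bit integer, which is one way audible artifacts can creep in.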

Accomplishments that we're proud of

  • Seamless Multimodality: We successfully integrated vision, audio, and text so a user can show a physical book to the camera and ask, "How do I pronounce this line?" and receive an immediate vocal response in reconstructed Attic Greek.
  • The "Mock Mode": We built a complete mock session handler that replicates the entire Gemini Live WebSocket protocol, allowing for frontend development and testing without incurring API costs or requiring constant connectivity.
  • Specialized Tooling: We developed the parse_greek tool-calling logic, which returns structured, human-readable morphological data rather than just prose explanations.
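
The idea behind the mock mode can be sketched as a session object that replays a fixed script instead of talking to the live API. Everything here is an assumption made for illustration: `MockLiveSession`, `run_turn`, and the message shapes (`type`, `text`, `turn_complete`) are hypothetical and do not reproduce the real Gemini Live wire format.

```python
import asyncio


class MockLiveSession:
    """Replays scripted server messages instead of calling a live API."""

    def __init__(self, script):
        self._script = list(script)
        self.sent = []  # everything the client sent, kept for test assertions

    async def send(self, message):
        self.sent.append(message)

    async def receive(self):
        for reply in self._script:
            await asyncio.sleep(0)  # yield to the event loop, as a socket read would
            yield reply


async def run_turn(session, user_text):
    """Send one text turn and collect replies until the turn is marked complete."""
    await session.send({"type": "text", "text": user_text})
    replies = []
    async for reply in session.receive():
        replies.append(reply)
        if reply.get("type") == "turn_complete":
            break
    return replies


script = [{"type": "text", "text": "χαῖρε"}, {"type": "turn_complete"}]
session = MockLiveSession(script)
replies = asyncio.run(run_turn(session, "Greet me in Greek"))
print(len(replies))
```

Because the mock exposes the same `send` and `receive` surface as the code that consumes it, the frontend and gateway logic can be exercised offline by swapping the session object.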

What we learned

  • The Power of v1alpha: We gained deep experience with the cutting-edge google-genai SDK and the specific model requirements (like gemini-2.0-flash-exp) needed for live bidirectional streaming.
  • Audio Engineering: We learned the intricacies of handling raw binary audio data over WebSockets and the importance of sample rate matching (16kHz for input, 24kHz for output).
  • Stateless Gateway Design: We validated that using the backend as a gateway rather than a blind proxy is essential for managing tool execution and session security.
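
The sample rate point can be made concrete with a little arithmetic. The sketch below assumes 16-bit (2-byte) mono samples, as described earlier in the writeup; the function names are hypothetical.

```python
INPUT_RATE_HZ = 16_000   # microphone audio sent to the model
OUTPUT_RATE_HZ = 24_000  # model audio played back to the user
BYTES_PER_SAMPLE = 2     # 16-bit mono


def pcm_bytes_for(duration_ms: int, sample_rate_hz: int) -> int:
    """Number of PCM bytes needed to hold duration_ms of audio."""
    return sample_rate_hz * duration_ms // 1000 * BYTES_PER_SAMPLE


def playback_speed(recorded_rate_hz: int, playback_rate_hz: int) -> float:
    """Speed factor heard when audio is played at a rate other than its own."""
    return playback_rate_hz / recorded_rate_hz


print(pcm_bytes_for(100, INPUT_RATE_HZ))   # 3200 bytes per 100 ms of input
print(pcm_bytes_for(100, OUTPUT_RATE_HZ))  # 4800 bytes per 100 ms of output
print(playback_speed(OUTPUT_RATE_HZ, INPUT_RATE_HZ))
```

Playing 24 kHz output through a 16 kHz pipeline gives a speed factor of about 0.67, so speech sounds slowed down and lower in pitch. That is why the two directions need separately configured rates.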

What's next for Ancient Greek Scholar Console

  • Audio Performance: Migrating from the deprecated ScriptProcessorNode to AudioWorklets to further reduce latency and improve browser performance.
  • Expanded Lexicography: Integrating the Logeion or LSJ databases directly into our lookup_lexicon tool for more authoritative scholarly references.
  • Haptic Feedback: Exploring visual and haptic metrical "tapping" cues to help students learn the rhythm of dactylic hexameter more intuitively.