## Inspiration

The idea for MichoScribe was born out of frustration during a late-night crunch session. We were trying to review a 90-minute recording to find a single specific technical requirement. We spent valuable time scrubbing back and forth, listening at 2x speed, just to locate a 10-second clip.
We realized two things:
- **Audio is a black box:** It is the most natural way to communicate but the least efficient way to retrieve information.
- **Existing tools are too slow:** We tried other transcription services, but they suffered from major latency. They forced us to wait for the entire file to upload and process sequentially before we could see a single word.
We wanted a tool that didn't just transcribe, but transformed audio into a searchable knowledge base—and we wanted it fast.
## What it does

MichoScribe is an AI-powered audio intelligence platform. It transforms spoken content into actionable, searchable, and exportable knowledge in real time.
- **Real-Time Transcription:** Uses Google Chirp models to transcribe audio with industry-leading accuracy as it streams.
- **Instant Translation:** Translates transcripts into 10+ languages, displayed side-by-side with the original.
- **Chat with Your Audio:** A RAG-powered chatbot lets users ask questions like "What were the marketing goals?" and receive answers with cited timestamps (a minimal sketch follows this list).
- **AI Analysis:** Auto-generates summaries, key points, and topic tags.
- **Smart Playback:** Click any sentence in the text to jump immediately to that moment in the audio.
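
To make the timestamp-cited chat concrete, here is a minimal sketch of one grounded query, assuming a Python backend with the google-cloud-firestore and google-generativeai SDKs. The collection layout (`recordings/{id}/chunks`), the field names (`embedding`, `start_sec`, `text`), and the embedding model are illustrative choices, not the actual MichoScribe schema.

```python
# Minimal "Chat with Your Audio" sketch; collection/field names and the
# embedding model are illustrative, not the actual MichoScribe schema.
import google.generativeai as genai
from google.cloud import firestore
from google.cloud.firestore_v1.base_vector_query import DistanceMeasure
from google.cloud.firestore_v1.vector import Vector

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder
db = firestore.Client()

def chat_with_audio(recording_id: str, question: str) -> str:
    # 1. Embed the question with the same model used when indexing chunks.
    query_vec = genai.embed_content(
        model="models/text-embedding-004", content=question
    )["embedding"]

    # 2. Pull the nearest transcript chunks via Firestore Vector Search.
    chunks = (
        db.collection("recordings").document(recording_id).collection("chunks")
        .find_nearest(
            vector_field="embedding",
            query_vector=Vector(query_vec),
            distance_measure=DistanceMeasure.COSINE,
            limit=5,
        )
        .get()
    )

    # 3. Ground the prompt: each chunk carries its start timestamp so the
    #    model can cite exactly where in the audio an answer comes from.
    context = "\n".join(f"[{c.get('start_sec')}s] {c.get('text')}" for c in chunks)
    prompt = (
        "Answer ONLY from the transcript excerpts below, citing the [Ns] "
        "timestamp of every claim. If the answer is not in the excerpts, "
        "say so.\n\n" + context + "\n\nQuestion: " + question
    )
    return genai.GenerativeModel("gemini-1.5-flash").generate_content(prompt).text
```

Grounding lives in two places here: retrieval only surfaces chunks from the selected recording, and the prompt forbids answers from outside the excerpts.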
## How we built it

The core challenge was speed. To achieve a truly "real-time" experience, we moved away from standard sequential processing and embraced an event-streaming architecture using Confluent Kafka.
Here is our pipeline:
- **The Event Stream:** As audio is uploaded, Confluent Kafka decouples the data flow, allowing us to process multiple heavy tasks simultaneously rather than one by one.
- **Parallel Processing:**
  - **Stream A (Transcription):** Sends audio chunks to Google Cloud Speech-to-Text V2 (Chirp), sketched just after this list.
  - **Stream B (Translation):** Instantly pipes transcriptions to Google Gemini for multi-language translation.
  - **Stream C (Vectorization):** Feeds text into Firestore Vector Search to power our RAG chatbot.
- **The Intelligence:** We use Google Gemini to generate summaries and handle natural-language Q&A for the chat feature.
- **The Frontend:** Built with React and Firebase to render live updates without page refreshes.
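
For Stream A, a single recognition call with Chirp could look like the sketch below, assuming the google-cloud-speech V2 Python client. The project ID and region are placeholders, and a true live feed would use the V2 streaming API rather than per-chunk recognition.

```python
# Stream A sketch: transcribe one audio chunk with Chirp via the
# Speech-to-Text V2 API. PROJECT_ID and the region are placeholders.
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

PROJECT_ID = "my-gcp-project"  # placeholder
client = SpeechClient()

def transcribe_chunk(audio_bytes: bytes) -> str:
    # Inline config against the default recognizer ("_") in a Chirp region.
    response = client.recognize(
        request=cloud_speech.RecognizeRequest(
            recognizer=f"projects/{PROJECT_ID}/locations/us-central1/recognizers/_",
            config=cloud_speech.RecognitionConfig(
                auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
                language_codes=["en-US"],
                model="chirp",
            ),
            content=audio_bytes,
        )
    )
    return " ".join(r.alternatives[0].transcript for r in response.results)
```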
By using Kafka to run these streams in parallel, we drastically reduced the "time-to-value" compared to competitors.
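
Here is a simplified sketch of the fan-out itself, assuming the confluent-kafka Python client. Topic names, consumer-group IDs, and the downstream handlers are illustrative; the point is that separate consumer groups on the same topic each receive every message, which is what lets the streams run side by side.

```python
# Simplified fan-out sketch using the confluent-kafka client.
# Topic and consumer-group names are illustrative.
from confluent_kafka import Consumer, Producer

CONF = {"bootstrap.servers": "<confluent-bootstrap-servers>"}

def make_consumer(group_id: str, topic: str) -> Consumer:
    consumer = Consumer({**CONF, "group.id": group_id,
                         "auto.offset.reset": "earliest"})
    consumer.subscribe([topic])
    return consumer

def transcription_worker():
    """Stream A: consume raw audio chunks, publish transcript segments."""
    consumer = make_consumer("transcribe", "audio.chunks")
    producer = Producer(CONF)
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        segment = transcribe_chunk(msg.value())  # e.g. the Chirp call above
        # Keying by recording ID keeps each recording's segments in order.
        producer.produce("transcripts", key=msg.key(), value=segment)
        producer.poll(0)  # serve delivery callbacks

# Streams B (translation) and C (vectorization) both subscribe to
# "transcripts" under *different* consumer groups, so Kafka delivers every
# segment to each of them independently: neither stream waits on the other.
```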
## Challenges we ran into

- **Managing latency:** Orchestrating the Kafka streams so the UI stayed responsive while heavy AI processing ran in the background was tricky. We had to tune our chunk sizes to balance speed against context accuracy.
- **RAG hallucinations:** Initially, the "Chat with Audio" feature would answer general-knowledge questions instead of sticking to the audio context. We had to refine our system prompts and vector retrieval strategy to strictly ground answers in the transcript data.
- **Timestamp synchronization:** Aligning translated-text timestamps with the original audio for the "click-to-seek" feature required complex mapping logic (a simplified version is sketched below).
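
To illustrate that last point, here is a much-simplified version of the mapping idea in Python: translate segment by segment rather than as one blob, so every translated sentence inherits the start/end offsets of its source segment. The `Segment` type and its fields are ours for illustration, and the hard cases (sentences that split or merge in translation) are omitted.

```python
# Much-simplified click-to-seek mapping; Segment and its fields are
# illustrative, not the actual MichoScribe data model.
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    start_sec: float  # offsets carried over from Speech-to-Text results
    end_sec: float

def align_translation(source: list[Segment], translated: list[str]) -> list[Segment]:
    """Translate segment by segment (never the whole transcript at once)
    so each translated sentence inherits its source segment's offsets.
    Clicking a translated sentence then seeks the player to start_sec."""
    if len(source) != len(translated):
        raise ValueError("translation must preserve segment boundaries")
    return [
        Segment(text=t, start_sec=s.start_sec, end_sec=s.end_sec)
        for s, t in zip(source, translated)
    ]
```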
## Accomplishments that we're proud of

- **Parallel architecture:** Successfully implementing Confluent Kafka to reduce processing time significantly. Seeing the transcription, translation, and summary generation happen almost simultaneously is a huge win.
- **The "Chat" accuracy:** Getting the RAG system to correctly cite timestamps, so users can jump to the exact source of an answer, feels like magic.
- **Google Chirp integration:** Leveraging the Chirp models allowed us to handle different accents and background noise much better than standard models.
## What we learned

- **Event-driven > monolith:** For AI applications involving multiple models (transcription + LLMs + translation), an event-driven architecture is essential for performance.
- **The power of context:** Transcribing is easy; understanding is hard. We learned how to use LLMs not just to generate text, but to structure unstructured data into usable formats (JSON, summaries, etc.); a sketch of that structuring step follows.
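
As an example of that structuring step, here is one way it could look with the google-generativeai SDK's JSON output mode; the model name and the schema keys are illustrative.

```python
# One way to coerce structured analysis out of Gemini, using the
# google-generativeai SDK's JSON output mode; the keys are illustrative.
import json

import google.generativeai as genai

model = genai.GenerativeModel(
    "gemini-1.5-flash",
    generation_config={"response_mime_type": "application/json"},
)

def analyze_transcript(transcript: str) -> dict:
    prompt = (
        "Return JSON with keys: summary (string), key_points (string[]), "
        "topic_tags (string[]).\n\nTranscript:\n" + transcript
    )
    # With the JSON MIME type set, the response text parses as JSON.
    return json.loads(model.generate_content(prompt).text)
```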
## What's next for MichoScribe

- **Live meeting integration:** Building a bot that can join Zoom or Google Meet calls to transcribe in real time.
- **Speaker diarization:** Enhancing the UI to visually distinguish between different speakers more clearly.
- **Mobile app:** Bringing the recording and "chat" experience to mobile for on-the-go professionals.
## Built With

- firebase
- firestore-vector
- google-cloud-speech-to-text
- google-gemini-ai