Inspiration
Increase User Seconds for Grok!
I read a lot online but rarely finish articles. I skim the first few paragraphs, get distracted, and move on. Meanwhile, I listen to podcasts constantly while commuting, cooking, and working out. The format just works better for passive consumption.
Google's NotebookLM showed that people want AI-generated podcasts. But it takes 3-5 minutes to generate one. By the time it's ready, you've moved on. That delay kills user retention.
I wanted something that could take whatever I was browsing and turn it into audio I could listen to later. Want to read some product reviews? Just open the tab, let Grok dissect it, and move on to another tab while you listen to the review. The cherry on top: it can personalize the podcast based on your X profile, including your bio and your posts.
For example, if I search for housing in SF, then based on my interest in AI startups on X, it will recommend listings on Apartments.com in areas where AI startups thrive. Cool, right? No other top AI name has captured this market yet.
It's more than that: you can talk to Grok live, share your tab, and play with it! Ask whatever! And the best part? It goes unhinged! Haha, yes, the humor is unmatched.
What it does
Grokcaster is a Chrome extension, but wait, it's more than that. You land on any webpage, click the extension, and it generates a short podcast about that page. Two AI hosts (Alex and Sam) discuss the content in a natural back-and-forth conversation. The whole thing takes a few seconds.
You can control the duration (45 seconds to 10 minutes), the tone (formal to completely unhinged), and the format (podcast, summary, or debate). If you connect your X account, it personalizes the content based on your interests.
There's also a Live Talk feature where you can have a real-time voice conversation with Grok about whatever you're reading.
How we built it
Backend (FastAPI + Python)
The backend handles three things:
- Script generation using the Grok Chat API (grok-4-1-fast-non-reasoning). I built a context builder that takes the page content, user preferences, and X interests, then constructs a prompt that generates dialogue in a specific format:
  ALEX: [question or reaction]
  SAM: [explanation or response]
- Audio synthesis using the Grok TTS API. I parse the script by speaker and call the TTS endpoint with different voices. Alex uses "Ara" (female), Sam uses "Rex" (male). The audio segments get concatenated and sent back as a base64 data URL.
- Live conversation using the Grok Realtime API. This required a WebSocket proxy because browsers can't set custom headers on WebSocket connections. The backend fetches an ephemeral token, connects to wss://api.x.ai/v1/realtime, and forwards audio bidirectionally.
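As a rough illustration of the "parse the script by speaker" step, here is a minimal sketch. It assumes the model returns one `ALEX:` or `SAM:` line per turn, exactly as in the format above. The function name `parse_script` and the `VOICES` table are hypothetical, not the project's actual code, and the TTS call itself is left out because its request shape is not described here.

```python
import re
from typing import List, Tuple

# Voice names come from the write-up; the lookup table itself is illustrative.
VOICES = {"ALEX": "Ara", "SAM": "Rex"}

# Matches lines such as "ALEX: some text" (assumed output format).
LINE_RE = re.compile(r"^(ALEX|SAM):\s*(.+)$")


def parse_script(script: str) -> List[Tuple[str, str, str]]:
    """Return (speaker, voice, text) for each dialogue line, in order."""
    segments = []
    for raw in script.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip blank lines and anything outside the expected format
        speaker, text = match.groups()
        segments.append((speaker, VOICES[speaker], text.strip()))
    return segments
```

Each tuple could then be sent to the TTS endpoint with its voice, and the resulting audio segments joined in the same order.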
Frontend (Chrome Extension)
The extension injects a panel into the page using Shadow DOM to avoid CSS conflicts. It captures page content (either the full page or a user-selected region via a snipping tool), sends it to the backend, and plays back the generated audio.
For Live Talk, it uses the Web Audio API to capture microphone input at 24kHz, converts Float32 samples to Int16 PCM, base64 encodes the chunks, and sends them over WebSocket. Incoming audio goes through the reverse process.
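The extension does this conversion in JavaScript, but the arithmetic is the same in any language. The sketch below shows it in Python to match the backend. The function names are hypothetical, and little-endian byte order is an assumption, since the write-up does not state the byte order used on the wire.

```python
import base64
import struct
from typing import List, Sequence


def float_to_pcm16_base64(samples: Sequence[float]) -> str:
    """Clamp to [-1, 1], scale to signed 16-bit, pack, and base64-encode."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))
        # Negative samples scale to -32768, positive samples to 32767.
        ints.append(int(s * 32768) if s < 0 else int(s * 32767))
    pcm = struct.pack(f"<{len(ints)}h", *ints)  # "<h" = little-endian int16 (assumed)
    return base64.b64encode(pcm).decode("ascii")


def pcm16_base64_to_float(payload: str) -> List[float]:
    """Reverse process for incoming audio: base64 -> int16 -> float."""
    pcm = base64.b64decode(payload)
    ints = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return [i / 32768 if i < 0 else i / 32767 for i in ints]
```

Clamping before scaling matters because microphone samples can slightly exceed the nominal range, and an unclamped value would overflow the 16-bit integer.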
Challenges we ran into
Two distinct voices
Both hosts sounded identical at first. I tested every voice combination until Ara and Rex worked. They're clearly different (female/male) but share a similar cadence that makes conversation flow naturally.
Duration accuracy
Setting "45 seconds" gave me 2-minute podcasts. LLMs don't understand audio duration. I had to map word counts to time (150 words per minute) and enforce hard limits in the prompt with caps: "DO NOT exceed 90 words."
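The duration-to-words mapping can be sketched as below. Only the 150 words-per-minute rate and the wording of the cap come from the text above. The `margin_percent` parameter is a hypothetical safety margin added for illustration, and both function names are assumptions.

```python
def word_budget(duration_seconds: int,
                words_per_minute: int = 150,
                margin_percent: int = 100) -> int:
    """Convert a target duration into a maximum word count.

    Integer arithmetic avoids floating-point rounding; the result is floored.
    margin_percent below 100 leaves headroom (an assumption, not from the source).
    """
    return duration_seconds * words_per_minute * margin_percent // (60 * 100)


def length_instruction(duration_seconds: int, margin_percent: int = 100) -> str:
    """Build the hard-limit sentence that goes into the prompt."""
    limit = word_budget(duration_seconds, margin_percent=margin_percent)
    return f"DO NOT exceed {limit} words."
```

At 150 words per minute, 45 seconds works out to 112 words with no margin, and to 90 words with an assumed 80% margin.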
WebSocket auth from browser
Spent hours debugging failed connections. Browsers don't allow custom headers on WebSockets. Grok Realtime needs Bearer auth. Had to build a server-side proxy to handle authentication and forward messages both directions.
Live Talk interruption
Interrupting Grok meant waiting 5 seconds for queued audio to finish playing. Fixed it by tracking all active audio sources and calling stop() immediately when speech_started fires from the API.
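The bookkeeping behind that fix can be modeled in a few lines. The real code lives in the extension and works with Web Audio source nodes. This Python sketch only mirrors the logic: `PlaybackTracker`, `FakeSource`, and the shape of the event dictionary are all hypothetical stand-ins.

```python
class FakeSource:
    """Stand-in for an audio source node that exposes stop()."""

    def __init__(self) -> None:
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


class PlaybackTracker:
    """Tracks every source that is queued or playing so all can be cut at once."""

    def __init__(self) -> None:
        self._active = set()

    def start(self, source) -> None:
        self._active.add(source)

    def finished(self, source) -> None:
        # Called when a source ends on its own.
        self._active.discard(source)

    def interrupt(self) -> int:
        """Stop everything still active and return how many were stopped."""
        stopped = len(self._active)
        for source in list(self._active):
            source.stop()
        self._active.clear()
        return stopped


def handle_event(event: dict, tracker: PlaybackTracker) -> None:
    # Assumed event shape: a dict whose "type" ends with "speech_started".
    if event.get("type", "").endswith("speech_started"):
        tracker.interrupt()
```

Removing a source when it finishes naturally keeps the set small, so an interruption only touches audio that has not finished yet.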
Chrome Extension CSP
Inline JavaScript in popup HTML got blocked. Manifest V3 doesn't allow it. Had to move everything to external files. Small fix, hour of confused Googling.
Accomplishments that we're proud of
100% Grok-native stack. No OpenAI, no ElevenLabs, no third-party AI anywhere. Three xAI APIs working together: Chat writes the script, TTS speaks it, Realtime handles live conversation. The whole thing runs on Grok.
Generation speed. A 45-second podcast generates in under 10 seconds. NotebookLM takes 3-5 minutes for similar output. That speed difference matters for user retention.
Real two-voice conversation. Not just alternating text-to-speech. Two distinct voices (Ara and Rex) with natural conversational flow. It actually sounds like two people talking.
Live Talk actually works. Real-time bidirectional audio streaming through a WebSocket proxy. You can interrupt Grok mid-sentence and it responds immediately. The audio processing (24kHz PCM, base64 encoding, proper buffering) took work to get right.
X personalization. The podcast adapts to your interests based on your Twitter/X profile. Reading an article about real estate? If you're into tech, the hosts mention startup hubs. Same article, different user, different angle.
What we learned
I learned how different the three Grok APIs are. Chat is straightforward request/response. TTS is synchronous but requires thinking about voice selection and audio concatenation. Realtime is a completely different paradigm with bidirectional streaming, VAD, and real-time audio processing.
I also learned that prompt engineering for dialogue is harder than for prose. Getting two hosts to sound like they're having a real conversation (not just alternating monologues) took a lot of iteration.
What's next for Grokcaster
Increase User Seconds Even More!
I want to add more sophisticated X integration. Right now it pulls your likes and bio. I'd like to incorporate your recent posts and the accounts you engage with most.
I'm also thinking about a "podcast queue" where you can save articles throughout the day and generate a single combined podcast for your commute.
Built With
- chrome-extension-manifest-v3
- fastapi
- javascript
- python
- web-audio-api
- x-api-v2
- xai-grok-chat-api
- xai-grok-realtime-api
- xai-grok-tts-api