Inspiration
Working in eLearning, I’ve seen the same failure repeated constantly: subject matter experts write dense, 50-page manuals that engagement-starved employees never read. Traditional conversion into interactive content takes weeks and a full creative team. I was inspired to build an autonomous Creative Director that could handle that entire pipeline—storyboarding, illustrating, and narrating—in under two minutes.
What it does
Doc2SCORM Director transforms static PDF, DOCX, or text files into immersive, story-driven eLearning courses. It generates original illustrations, professional voice narration, and interactive branching decision points. The final output is a standards-compliant SCORM 1.2 package ready for any LMS, or a public shareable link hosted on Google Cloud Storage.
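For context, a SCORM 1.2 package is a zip whose root contains an `imsmanifest.xml` describing the course. A minimal skeleton looks roughly like this (the identifiers, title, and file names here are illustrative, not the project's actual values):

```xml
<manifest identifier="doc2scorm.course" version="1.2"
    xmlns="http://www.imsproject.org/xsd/imscp_rootv1p1p2"
    xmlns:adlcp="http://www.adlnet.org/xsd/adlcp_rootv1p2">
  <metadata>
    <schema>ADL SCORM</schema>
    <schemaversion>1.2</schemaversion>
  </metadata>
  <organizations default="org1">
    <organization identifier="org1">
      <title>Course Title</title>
      <item identifier="item1" identifierref="res1">
        <title>Course</title>
      </item>
    </organization>
  </organizations>
  <resources>
    <resource identifier="res1" type="webcontent"
        adlcp:scormtype="sco" href="index.html">
      <file href="index.html"/>
    </resource>
  </resources>
</manifest>
```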
How we built it
The project uses a multimodal pipeline powered by the Google GenAI SDK and three Gemini models:
- Planning: `gemini-2.5-flash` analyzes documents to propose three narrative directions.
- Creation: `gemini-3.1-flash-image-preview` uses interleaved generation to produce story text and original illustrations in a single API pass.
- Narration: `gemini-2.5-flash-preview-tts` generates per-screen audio, encoded from PCM to WAV.
- Infrastructure: The backend is built with Express and TypeScript on Google Cloud Run, while the frontend uses Vue 3 and Pinia.
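The PCM-to-WAV step amounts to prepending a 44-byte RIFF header to the raw audio bytes. A minimal sketch in TypeScript, assuming 16-bit mono PCM at 24 kHz (the sample rate is an assumption here, not something stated in the write-up):

```typescript
// Wrap raw PCM from the TTS model in a minimal WAV (RIFF) container.
function pcmToWav(
  pcm: Buffer,
  sampleRate = 24000, // assumed rate; adjust to what the TTS model actually returns
  channels = 1,
  bitsPerSample = 16,
): Buffer {
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // total file size minus 8
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);  // fmt chunk size for plain PCM
  header.writeUInt16LE(1, 20);   // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40); // size of the PCM payload
  return Buffer.concat([header, pcm]);
}
```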
Challenges we ran into
Mapping generated images to specific story scenes required precise sequential parsing of the interleaved inlineData parts returned by Gemini. I also had to refine prompt engineering to prevent the model from referencing its own filenames (like "image_0.png") within the course text. Additionally, ensuring TTS reliability meant building a per-screen error-handling system that falls back to text transcripts if audio generation fails.
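The sequential parsing can be sketched as below. The `Part` shape mirrors the SDK's text/`inlineData` response fields; `partsToScenes` and the pairing rule (each image attaches to the preceding text part) are illustrative assumptions rather than the project's exact code:

```typescript
// Minimal structural types for interleaved Gemini output parts.
interface Part {
  text?: string;
  inlineData?: { mimeType: string; data: string }; // base64 image bytes
}
interface Scene {
  text: string;
  imageBase64?: string;
}

// Walk the parts in order: each text part opens a new scene, and each
// inlineData part is attached to the most recently opened scene.
function partsToScenes(parts: Part[]): Scene[] {
  const scenes: Scene[] = [];
  for (const part of parts) {
    if (part.text !== undefined) {
      scenes.push({ text: part.text.trim() });
    } else if (part.inlineData && scenes.length > 0) {
      scenes[scenes.length - 1].imageBase64 = part.inlineData.data;
    }
  }
  return scenes;
}
```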
Accomplishments that we're proud of
I am particularly proud of the adaptive color theming. The agent picks five CSS tokens that evoke the course subject (e.g., navy for cybersecurity, amber for cooking), and the UI smoothly transitions using @property CSS interpolation. Achieving narrative-visual coherence through a single "creative mind" via interleaved output is a major technical win.
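A sketch of how five agent-chosen tokens might be serialized to CSS custom properties; the token names (`--c-primary` and friends) are placeholders, not the project's actual tokens:

```typescript
// Five theme tokens chosen by the agent for a given course subject.
type Theme = {
  primary: string;
  secondary: string;
  accent: string;
  surface: string;
  text: string;
};

// Serialize the tokens as CSS custom property declarations. In the browser,
// writing these onto :root lets @property-registered properties interpolate
// smoothly when the theme changes.
function themeToCss(theme: Theme): string {
  return Object.entries(theme)
    .map(([name, value]) => `--c-${name}: ${value};`)
    .join("\n");
}
```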
What we learned
Interleaved generation is transformative; getting text and images from a single call produces significantly more coherent results than separate calls. I also learned how to leverage Google Cloud Storage for static hosting, allowing the app to bypass an LMS entirely by publishing courses to a public gallery with a direct URL.
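Publishing to the gallery boils down to uploading the built course to a bucket and linking its public object URL. A small helper for the URL form, with the bucket and path purely illustrative:

```typescript
// Build the public HTTPS URL for an object in a Google Cloud Storage bucket
// (the bucket must allow public reads). Path segments are percent-encoded
// while the "/" separators are preserved.
function publicCourseUrl(bucket: string, objectPath: string): string {
  const encoded = encodeURIComponent(objectPath).replace(/%2F/g, "/");
  return `https://storage.googleapis.com/${bucket}/${encoded}`;
}
```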
What's next for Doc2SCORM Director
The next step is moving into video with Google's Veo model to turn static illustrations into animated cinematic scenes. I also plan to implement a "Reviewer Agent" that can take feedback from users and perform surgical edits to the generated course content using the same multimodal context.
Built With
- css-@property
- docker
- express.js
- gemini-2.5-flash
- gemini-2.5-flash-preview-tts
- gemini-3.1-flash-image-preview
- google-cloud
- google-cloud-run
- google-genai-sdk
- node.js
- pdf-parse
- pinia
- scorm-1.2
- typescript
- vite
- vue-3