Papertrail

Inspiration

Reading a PDF is still mostly a flat, linear experience. We wanted to explore what it would feel like if a document could become a place instead of just a file.

What it does

Papertrail turns an uploaded PDF into an explorable 3D story world. It extracts key scenes, objects, characters, emotional beats, and source-grounded quotes from the document, then assembles them into connected immersive environments that users can explore in the browser or in WebXR. Each scene includes generated voice narration, interactive objects, and transitions that let users experience the document as a guided spatial story.

How we built it

We built Papertrail with Next.js and React Three Fiber for the interactive web experience, with WebXR support for immersive exploration.

The core of the system is Backboard, which we use as the orchestration and RAG layer over the uploaded PDF. After extracting text from the PDF, we chunk the document into retrievable passages with page-level source metadata, then use Backboard to retrieve the most relevant passages for each candidate scene, character, object, and emotional beat. Backboard coordinates the reasoning steps that turn raw document text into structured world data: scene summaries, source-grounded quotes, object descriptions, narration scripts, interaction copy, mood tags, layout hints, and transitions.
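To make the chunking step concrete, here is a minimal sketch of splitting extracted page text into passages that keep their page number. The `Passage` shape, the `chunkPages` name, the ID format, and the 800-character default are assumptions for illustration, not Papertrail's actual code or anything Backboard requires.

```typescript
// Hypothetical shape for a retrievable passage; field names are assumptions.
interface Passage {
  id: string;
  page: number; // 1-based page number the text came from
  text: string;
}

// Split each page's text into passages of roughly maxChars characters,
// breaking on whitespace so words stay intact. A single word longer than
// maxChars is kept whole in its own passage.
function chunkPages(pages: string[], maxChars = 800): Passage[] {
  const passages: Passage[] = [];
  pages.forEach((pageText, pageIndex) => {
    const words = pageText.split(/\s+/).filter((w) => w.length > 0);
    let current = "";
    let chunkIndex = 0;
    // Emit the passage built so far, if any, and start a new one.
    const flush = (): void => {
      if (current.length === 0) return;
      passages.push({
        id: `p${pageIndex + 1}-c${chunkIndex}`,
        page: pageIndex + 1,
        text: current,
      });
      chunkIndex += 1;
      current = "";
    };
    for (const word of words) {
      const candidate = current.length === 0 ? word : `${current} ${word}`;
      if (candidate.length > maxChars && current.length > 0) {
        flush();
        current = word;
      } else {
        current = candidate;
      }
    }
    flush();
  });
  return passages;
}
```

Because passages never span pages in this sketch, every retrieved passage maps to exactly one source page, which keeps later source anchors simple.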

We used Gemini through Backboard as one of the main reasoning models for document understanding and scene planning. Gemini helped analyze long-form PDF content, identify narratively important passages, extract characters and objects, summarize emotional beats, and convert retrieved context into structured scene data. Backboard handled the orchestration around Gemini: retrieval, prompt construction, schema-constrained generation, validation, retries, and fallback behavior. This let us use Gemini for the high-level reasoning while keeping the app output predictable enough for the 3D frontend.
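The validation, retry, and fallback behavior can be sketched as a small generic loop. This is an illustration under assumptions: `generateWithRetry` is a hypothetical helper, and `generate` is an injected stand-in for a model call, so nothing here reflects the real Backboard or Gemini APIs.

```typescript
// Generic "generate, validate, retry, fall back" loop.
// `generate` stands in for a model call; it is injected so this sketch
// does not assume any particular provider API.
async function generateWithRetry<T>(
  generate: (attempt: number) => Promise<unknown>,
  isValid: (value: unknown) => value is T,
  fallback: T,
  maxAttempts = 3,
): Promise<{ value: T; attempts: number; usedFallback: boolean }> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const candidate = await generate(attempt);
      if (isValid(candidate)) {
        return { value: candidate, attempts: attempt, usedFallback: false };
      }
      // Invalid output: fall through and try again.
    } catch {
      // Treat provider errors like invalid output and try again.
    }
  }
  // Every attempt failed or produced invalid output.
  return { value: fallback, attempts: maxAttempts, usedFallback: true };
}
```

Returning `usedFallback` alongside the value lets the caller tell the user when a scene came from a default rather than from the document.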

Instead of asking the frontend to interpret free-form AI output, Backboard produces typed scene plans that match our app schema. Each generated scene includes a stable ID, layout type, mood, environmental dressing, source anchors, narration, and a small set of interactable objects. Those objects include labels, visual types, descriptions, quotes, and explanations tied back to the original PDF. This structured output lets the frontend reliably render the world while preserving traceability to the source material.
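As a rough illustration, a typed scene plan and a runtime check for it might look like the following. The field names are assumptions inferred from the description above, not the project's actual schema, and a schema library could replace the hand-written guard.

```typescript
// Hypothetical scene plan types; field names are assumptions.
interface SourceAnchor {
  page: number;
  quote: string;
}

interface InteractableObject {
  label: string;
  visualType: string;
  description: string;
  quote: string;
  explanation: string;
}

interface ScenePlan {
  id: string;
  layout: string;
  mood: string;
  dressing: string[];
  sourceAnchors: SourceAnchor[];
  narration: string;
  objects: InteractableObject[];
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// True when every listed key holds a string.
function hasStrings(v: Record<string, unknown>, keys: string[]): boolean {
  return keys.every((k) => typeof v[k] === "string");
}

// Runtime guard: rejects model output that does not match ScenePlan.
function isScenePlan(value: unknown): value is ScenePlan {
  if (!isRecord(value)) return false;
  if (!hasStrings(value, ["id", "layout", "mood", "narration"])) return false;
  const dressing = value["dressing"];
  const anchors = value["sourceAnchors"];
  const objects = value["objects"];
  return (
    Array.isArray(dressing) &&
    dressing.every((d) => typeof d === "string") &&
    Array.isArray(anchors) &&
    anchors.every(
      (a) => isRecord(a) && typeof a["page"] === "number" && typeof a["quote"] === "string",
    ) &&
    Array.isArray(objects) &&
    objects.every(
      (o) =>
        isRecord(o) &&
        hasStrings(o, ["label", "visualType", "description", "quote", "explanation"]),
    )
  );
}
```

With a guard like this at the boundary, rendering code can rely on the `ScenePlan` type instead of re-checking fields in every component.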

We also use Backboard as the coordinator between multiple generative services. It decides what each scene should represent, which passages support it, what visual prompt should be sent to World Labs, and what narration script should be sent to ElevenLabs. World Labs generates immersive spatial environments for each scene, while ElevenLabs turns the narration scripts into voiceover so the world feels like a guided story experience. Generation progress is streamed to the user as each stage completes: PDF parsing, RAG retrieval, Gemini scene planning, environment generation, narration, object creation, and final WebXR assembly.
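The stage-by-stage progress reporting could be modeled along these lines. The stage identifiers, the `StageProgress` shape, and `runPipeline` are assumptions for illustration, and the stages run strictly in sequence here for simplicity, whereas the real pipeline has providers finishing at different times.

```typescript
// Hypothetical stage identifiers, in the order listed in the text.
const STAGES = [
  "pdf-parsing",
  "rag-retrieval",
  "scene-planning",
  "environment-generation",
  "narration",
  "object-creation",
  "webxr-assembly",
] as const;

type Stage = (typeof STAGES)[number];

interface StageProgress {
  stage: Stage;
  completed: number; // stages finished so far
  total: number; // total number of stages
}

// Run one step per stage and report progress after each one completes.
// How events reach the browser (for example a streamed HTTP response)
// is left to the onProgress callback.
async function runPipeline(
  steps: Record<Stage, () => Promise<void>>,
  onProgress: (update: StageProgress) => void,
): Promise<void> {
  let completed = 0;
  for (const stage of STAGES) {
    await steps[stage]();
    completed += 1;
    onProgress({ stage, completed, total: STAGES.length });
  }
}
```

A caller would pass real work for each stage and an `onProgress` callback that forwards each update to the UI.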

Challenges we ran into

The hardest part was connecting generative AI outputs to a reliable interactive world. PDF understanding, source grounding, Gemini reasoning, world generation, narration, and WebXR rendering all produce different types of artifacts, so we had to design a pipeline where Backboard creates structured scene data that the frontend can actually use.

Another challenge was keeping the generated world faithful to the source document. A purely generative pipeline can easily drift into pretty but unsupported scenes, so we used RAG and page-level source anchors to keep scenes, objects, quotes, and narration tied to retrieved passages from the PDF. That source grounding gave us a way to make the experience imaginative without letting it become disconnected from the text.
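One simple way to enforce this kind of grounding is to check that each generated quote actually appears on the page it cites. The sketch below is an assumed approach rather than a description of what Papertrail does; `QuoteAnchor`, `normalize`, and `isGrounded` are hypothetical names, and exact substring matching is a deliberately strict choice.

```typescript
// Hypothetical anchor tying a quote to a page of the source PDF.
interface QuoteAnchor {
  page: number;
  quote: string;
}

// Collapse whitespace and lowercase so PDF line breaks and
// capitalization differences do not cause false mismatches.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// True when the quote is non-empty and appears in the cited page's text.
function isGrounded(anchor: QuoteAnchor, pageTexts: Map<number, string>): boolean {
  const pageText = pageTexts.get(anchor.page);
  if (pageText === undefined) return false;
  const quote = normalize(anchor.quote);
  return quote.length > 0 && normalize(pageText).includes(quote);
}
```

Objects whose quotes fail a check like this could be dropped or regenerated before the scene reaches the frontend.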

We also had to coordinate asynchronous generation across several providers. Gemini scene planning, World Labs environment generation, object rendering, and ElevenLabs narration all finish at different times, so we built the app around a streamed generation pipeline that can report progress, recover from provider failures, and still fall back to a playable demo world when needed.
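Recovering from provider failures can be sketched by running provider calls concurrently and giving each one its own fallback. Everything named here (`settle`, `buildSceneAssets`, `SceneAssets`, the URL fields) is hypothetical, and the injected functions stand in for World Labs and ElevenLabs requests without assuming their actual APIs.

```typescript
// Hypothetical per-scene asset bundle; field names are assumptions.
interface SceneAssets {
  environmentUrl: string;
  narrationUrl: string | null; // null means the scene plays without voiceover
  degraded: boolean; // true if any provider fell back
}

// Run a task and substitute a fallback value instead of throwing.
async function settle<T>(
  task: () => Promise<T>,
  fallback: T,
): Promise<{ value: T; failed: boolean }> {
  try {
    return { value: await task(), failed: false };
  } catch {
    return { value: fallback, failed: true };
  }
}

// Start both provider calls at once; a failure in one does not
// block or discard the result of the other.
async function buildSceneAssets(
  generateEnvironment: () => Promise<string>,
  generateNarration: () => Promise<string>,
  demoEnvironmentUrl: string,
): Promise<SceneAssets> {
  const [environment, narration] = await Promise.all([
    settle(generateEnvironment, demoEnvironmentUrl),
    settle<string | null>(generateNarration, null),
  ]);
  return {
    environmentUrl: environment.value,
    narrationUrl: narration.value,
    degraded: environment.failed || narration.failed,
  };
}
```

The `degraded` flag gives the UI a way to indicate that a scene is using demo content while still remaining playable.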

Accomplishments that we're proud of

We are proud of building an end-to-end pipeline from PDF upload to source-grounded story extraction, Gemini-powered scene planning, generated immersive environments, narrated exploration, interactive objects, shareable worlds, and WebXR interaction.

The biggest win is that the world is not just decorative. Each scene is produced from retrieved document context, each object has a source-grounded quote or explanation, and each narration track is generated from the same structured plan. That makes Papertrail feel less like a random AI visualization and more like a spatial reading experience.

We are also proud of how modular the pipeline became. Backboard owns orchestration and source grounding, Gemini handles long-context reasoning and scene planning, World Labs owns spatial generation, ElevenLabs owns voice narration, and the React Three Fiber frontend owns interaction, rendering, portals, and WebXR controls. Separating those responsibilities made the system easier to debug and easier to extend.

What we learned

We learned that the best use of AI in this project is orchestration. The important part is not just generating assets, but deciding what should be generated, why it belongs in the world, and how it connects back to the source PDF.

Backboard helped us treat the PDF as a knowledge base rather than a blob of text. By combining retrieval, structured prompting, Gemini reasoning, schema validation, and provider coordination, we could turn a document into a reliable set of world-building instructions. World Labs then handled spatial environment generation, ElevenLabs brought the story to life through voice, and the frontend tied everything together into an immersive WebXR experience.

What's next for Papertrail

Next, we want to add richer object interactions, smoother scene transitions, persistent story worlds, multiplayer exploration, and more ways for users to customize the tone, style, and structure of the worlds generated from their documents.

We also want to deepen the RAG layer so users can ask questions inside the world, inspect why a scene was generated, jump from an object back to its source page, and regenerate individual scenes with different tones or levels of detail while preserving source grounding.