This markdown is entirely human generated
AristoBites
AristoBites makes short-form videos completely autonomously from a text input. It uses agentic RAG to write the script, then image, audio, and video generation models to produce the video. Check out some of the videos generated here: link
Sponsor technology used
- LlamaIndex: LlamaCloud, workflow, ✨agentic RAG✨
- Reflex: Frontend to display videos generated
The Pipeline

As the pipeline graph indicates, the system has several moving parts, which are needed to make video and audio generation appear seamless. Once the user has entered a prompt, the rest of the pipeline is autonomous, with no human intervention involved.
Some of the tech/framework/models involved:
- OpenAI and Claude: Mainly used for structured output and script writing
- ElevenLabs for audio generation
- Luma AI for image-to-video generation
- Flux for image generation
- A video retalking model from Replicate (for the talking head at the beginning of the video)
- LlamaCloud and LlamaIndex's workflow module used for agentic RAG pipelining
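To make the hand-offs between these components concrete, here is a minimal sketch of how the stages could be chained. It is not the project's actual code: the `Stages`, `Scene`, and `run_pipeline` names are hypothetical, the ordering of the steps is an assumption, and each stage is an injected function standing in for a real model call (script LLM, ElevenLabs, Flux, Luma AI, the retalking model). The only detail taken from the description above is that the talking head appears at the beginning of the video.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stages:
    """Hypothetical stage functions; each would wrap one external model."""
    write_script: Callable[[str], List[str]]  # prompt -> one narration per scene
    synthesize_audio: Callable[[str], str]    # narration -> audio file path
    generate_image: Callable[[str], str]      # narration -> image file path
    animate_image: Callable[[str], str]       # image path -> video clip path
    talking_head: Callable[[str], str]        # audio path -> talking-head clip path


@dataclass
class Scene:
    narration: str
    audio: str
    clip: str


def run_pipeline(prompt: str, stages: Stages) -> List[Scene]:
    """Run every stage for every scene with no human input after the prompt."""
    scenes = []
    for index, narration in enumerate(stages.write_script(prompt)):
        audio = stages.synthesize_audio(narration)
        if index == 0:
            # Assumption: only the opening scene uses the talking head.
            clip = stages.talking_head(audio)
        else:
            # Assumption: later scenes are a generated image animated into video.
            clip = stages.animate_image(stages.generate_image(narration))
        scenes.append(Scene(narration=narration, audio=audio, clip=clip))
    return scenes
```

Passing the stages in as plain functions keeps the orchestration testable with stubs, and any one model can be swapped without touching the loop.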
A Quick Note on How RAG Was Used

The knowledge base is the Stanford Encyclopedia of Philosophy, which was scraped and indexed in LlamaCloud; it comprises several hundred documents. The agentic step is not too complicated. Running a single user query against the vector database does not yield enough information to generate a comprehensive script, so the pipeline first creates several subquestions from the user's query and gathers as much context as possible before passing it along to a script-writing LLM.
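The subquestion step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's LlamaIndex workflow: `gather_context` and `write_script` are hypothetical names, and `make_subquestions`, `retrieve`, and `script_llm` are injected placeholders for the LLM that proposes subquestions, the LlamaCloud index lookup, and the script-writing LLM.

```python
from typing import Callable, List


def gather_context(
    query: str,
    make_subquestions: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Retrieve passages for the query and each subquestion, dropping repeats."""
    questions = [query] + make_subquestions(query)
    seen = set()
    passages = []
    for question in questions:
        for passage in retrieve(question):
            if passage not in seen:  # the same passage often matches several questions
                seen.add(passage)
                passages.append(passage)
    return passages


def write_script(
    query: str,
    make_subquestions: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    script_llm: Callable[[str], str],
) -> str:
    """Build one prompt from all gathered context and hand it to the script LLM."""
    context = gather_context(query, make_subquestions, retrieve)
    prompt = "Topic: " + query + "\n\nContext:\n" + "\n".join(context)
    return script_llm(prompt)
```

De-duplicating before building the prompt matters because overlapping subquestions tend to pull back the same passages, which would otherwise crowd the script writer's context.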
Moving Forward
The use of philosophy documents is merely a sandbox example. The underlying tech used to generate these videos is highly extensible.
- Turn boring instruction manuals into engaging short videos.
- Convert lecture notes or transcripts into supplementary video content for students.
- Turn text-based recipes into visual cooking guides with animated steps.
- Convert long-form travel articles or guidebooks into concise video itineraries.
- and much much more...

NotebookLM by Google proved that turning dense content into audio podcasts has strong product-market fit. People described it as Google's "ChatGPT" moment. I think we can go a lot further than that by turning unstructured content into engaging video content. AristoBites is just the start.
Built With
- anthropic
- elevenlabs
- flux
- llamacloud
- llamaindex
- lumaai
- openai
- reflex