This markdown is entirely human generated

AristoBites

AristoBites generates short-form videos fully autonomously from a text input. It uses agentic RAG to write the script, plus image, audio, and video generation models to produce the footage. Check out some of the generated videos here: link

Sponsor technology used

  • LlamaIndex: LlamaCloud, workflow, ✨agentic RAG✨
  • Reflex: Frontend to display the generated videos

The Pipeline

Screenshot 2024-10-13 at 2.26.01 PM.png

As the pipeline graph indicates, the system has several moving parts, which are needed to make the video and audio generation appear seamless. Once the user has entered a prompt, the rest of the pipeline runs autonomously, with no human intervention.

Some of the technologies, frameworks, and models involved:

  • OpenAI and Claude: Mainly used for structured output and script writing
  • ElevenLabs for audio generation
  • Luma AI for image-to-video generation
  • Flux for image generation
  • A video retalking model from Replicate (for the talking head at the beginning of the video)
  • LlamaCloud and LlamaIndex's workflow module for the agentic RAG pipeline
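
To make the hand-offs between these components concrete, here is a minimal sketch of how the stages might be chained. It is an illustration under assumptions, not the project's actual code: every function below is a hypothetical stub that returns a labeled string instead of calling the real service, and the ordering of stages is a guess based on the list above.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    narration: str      # text spoken over the scene
    image_prompt: str   # prompt for the still image
    audio: str = ""     # filled in by the audio stage
    clip: str = ""      # filled in by the image-to-video stage


def write_script(prompt: str) -> list[Scene]:
    """Hypothetical stub for the LLM step that returns a structured script."""
    return [
        Scene(f"Intro to {prompt}", f"Title card for {prompt}"),
        Scene(f"Key idea of {prompt}", f"Illustration of {prompt}"),
    ]


def synthesize_audio(text: str) -> str:
    """Hypothetical stub for the audio generation service."""
    return f"audio({text})"


def generate_image(image_prompt: str) -> str:
    """Hypothetical stub for the image generation model."""
    return f"image({image_prompt})"


def animate_image(image: str) -> str:
    """Hypothetical stub for the image-to-video model."""
    return f"video({image})"


def add_talking_head(audio: str) -> str:
    """Hypothetical stub for the retalking model used for the opening shot."""
    return f"talking_head({audio})"


def run_pipeline(prompt: str) -> list[str]:
    """Run every stage in order with no human input after the prompt."""
    scenes = write_script(prompt)
    for scene in scenes:
        scene.audio = synthesize_audio(scene.narration)
        scene.clip = animate_image(generate_image(scene.image_prompt))
    # The talking head opens the video, driven by the first scene's audio.
    intro = add_talking_head(scenes[0].audio)
    return [intro] + [scene.clip for scene in scenes]


timeline = run_pipeline("free will")
```

Keeping each stage behind its own function means any single model can be swapped without touching the rest of the chain.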

A Quick Note on How RAG Was Used

Screenshot 2024-10-13 at 1.41.00 PM.png

The knowledge base is the Stanford Encyclopedia of Philosophy, which was scraped and indexed in LlamaCloud; it comprises several hundred documents. The agentic step is not too complicated. Running a single user query against the vector database does not yield enough information to generate a comprehensive script, so the pipeline first creates several subquestions from the user's query and gathers as much context as possible before passing it to a script-writing LLM.
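
The subquestion idea can be sketched in a few lines. This is a toy illustration, not the project's implementation: the passages are placeholders, `retrieve` uses simple word overlap as a stand-in for the vector search, and `make_subquestions` is a hypothetical template-based stand-in for the LLM call that would write the subquestions.

```python
import re

# Placeholder passages standing in for the indexed knowledge base.
PASSAGES = [
    "stoicism: overview of the school and its central claims",
    "key thinkers: notes on major figures and their writings",
    "objections: summary of common criticisms and replies",
    "unrelated entry on a different topic",
]

STOPWORDS = {"what", "are", "the", "of", "who", "in", "to", "and", "on", "a",
             "its", "their"}


def tokens(text: str) -> set[str]:
    """Lowercase word set with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank passages by word overlap (toy stand-in for vector search)."""
    query_tokens = tokens(query)
    scored = [(len(query_tokens & tokens(p)), p) for p in PASSAGES]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0]


def make_subquestions(query: str) -> list[str]:
    """Hypothetical stand-in for the LLM that expands the user's query."""
    return [
        f"What are the central claims of {query}?",
        f"Who are the key thinkers in {query}?",
        f"What are common objections to {query}?",
    ]


def gather_context(query: str) -> list[str]:
    """Retrieve for the query and each subquestion, dropping duplicates."""
    context: list[str] = []
    for question in [query, *make_subquestions(query)]:
        for passage in retrieve(question):
            if passage not in context:
                context.append(passage)
    return context


single = retrieve("stoicism")
combined = gather_context("stoicism")
```

In this toy setup the bare query matches only one passage, while the subquestions pull in the passages on thinkers and objections as well, which is the extra context the script writer needs.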

Moving Forward

The use of philosophy documents is merely a sandbox example. The underlying tech used to generate these videos is highly extensible.

  • Turn boring instruction manuals into engaging short videos.
  • Convert lecture notes or transcripts into supplementary video content for students.
  • Turn text-based recipes into visual cooking guides with animated steps.
  • Convert long-form travel articles or guidebooks into concise video itineraries.
  • And much, much more...

Screenshot 2024-10-13 at 2.29.57 PM.png

NotebookLM by Google showed that turning dense content into audio podcasts has strong product-market fit. People described it as Google's "ChatGPT" moment. I think we can go a lot further than that by turning unstructured content into engaging video content. AristoBites is just the start.

Built With

  • anthropic
  • elevenlabs
  • flux
  • llamacloud
  • llamaindex
  • lumaai
  • openai
  • reflex