Inspiration

The internet made information accessible, and AI is now making knowledge accessible. But knowledge is multimodal, so I wanted to build a chat app where you can use multimodal inputs to learn new things by practicing, in your own style.

What it does

It takes videos, pictures, and PDFs as input and teaches you new things. You can upload an hour-long lecture video, and it will watch the video, extract lecture notes, and prepare quiz cards to help you learn from it.

It uses multiple agents for different capabilities:

  • Extract lecture notes: Uses Gemini to watch the lecture video and extract notes
  • Analyze image: Uses Gemini to describe the image
  • Analyze PDF: Uses the Gemini API to analyze the PDF
  • Ask the math agent: Uses DeepSeek-R1-Distill-Llama-70B for help with math questions
  • Ask the code agent: Uses Claude 3.5 Sonnet for help with code questions
  • Ask the science agent: Uses DeepSeek-R1-Distill-Llama-70B for help with science questions
  • Search web: Uses exa.ai to retrieve information from the web and feeds the text from the first 5 URLs to the model
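
The capability list above can be pictured as a small registry that maps each tool to the backend it relies on. The sketch below is only an illustration of that idea, not the project's actual code: the tool names, the `Tool` class, and the stub handlers are all hypothetical, and a real handler would call the corresponding API instead of echoing its input.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    """One capability the chat model can delegate to (hypothetical structure)."""
    name: str
    backend: str                   # model or service named in the list above
    handler: Callable[[str], str]  # stand-in for the real API call


def _stub(backend: str) -> Callable[[str], str]:
    # Placeholder handler: echoes the payload, tagged with the backend name.
    return lambda payload: f"[{backend}] {payload}"


_DEEPSEEK = "DeepSeek-R1-Distill-Llama-70B"

# Tool names are assumptions made for this example.
TOOLS: Dict[str, Tool] = {
    tool.name: tool
    for tool in [
        Tool("extract_lecture_notes", "gemini", _stub("gemini")),
        Tool("analyze_image", "gemini", _stub("gemini")),
        Tool("analyze_pdf", "gemini", _stub("gemini")),
        Tool("ask_math_agent", _DEEPSEEK, _stub(_DEEPSEEK)),
        Tool("ask_code_agent", "claude", _stub("claude")),
        Tool("ask_science_agent", _DEEPSEEK, _stub(_DEEPSEEK)),
        Tool("search_web", "exa.ai", _stub("exa.ai")),
    ]
}


def dispatch(tool_name: str, payload: str) -> str:
    """Route a request to the handler registered under tool_name."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        raise KeyError(f"unknown tool: {tool_name}")
    return tool.handler(payload)
```

Keeping the routing in one table means adding a new agent is a one-line change, and the base chat model only needs to pick a tool name.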

How we built it

We used agno.ai for processing multimodal data and SambaNova with function calling for the base chat model.
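
To make the function-calling step more concrete, here is a minimal sketch of how a tool could be described to the chat model and how a returned tool call could be executed. It assumes an OpenAI-style `tools` schema and tool-call message shape, which may differ from what the project actually sends; `tool_spec`, `run_tool_call`, and the `search_web` handler are hypothetical names, and no network request is made.

```python
import json
from typing import Any, Callable, Dict


def tool_spec(name: str, description: str, params: Dict[str, str]) -> Dict[str, Any]:
    """Describe one tool in an OpenAI-style function schema (assumed format).

    params maps each argument name to a short description; every argument
    is treated as a required string to keep the sketch small.
    """
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    arg: {"type": "string", "description": desc}
                    for arg, desc in params.items()
                },
                "required": list(params),
            },
        },
    }


def run_tool_call(
    call: Dict[str, Any], handlers: Dict[str, Callable[..., str]]
) -> Dict[str, str]:
    """Execute one tool call emitted by the model and wrap the result
    as a tool message that can be appended to the conversation."""
    function = call["function"]
    arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
    result = handlers[function["name"]](**arguments)
    return {"role": "tool", "tool_call_id": call["id"], "content": result}
```

For example, a call such as `{"id": "c1", "function": {"name": "search_web", "arguments": "{\"query\": \"fourier series\"}"}}` would be routed to the `search_web` handler, and its return value sent back to the model as a tool message.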

Challenges we ran into

Building the interface and processing long videos.

Accomplishments that we're proud of

Turning videos into lecture notes: we split long videos into small chunks and use Gemini to prepare a complete set of lecture notes in Markdown.
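
A rough sketch of the chunking idea follows. It assumes fixed-length, non-overlapping time windows and a caller-supplied `summarize` function standing in for the Gemini call; the 10-minute default, the function names, and the Markdown layout are illustrative choices, not details taken from the project.

```python
from typing import Callable, List, Tuple


def chunk_windows(duration_s: float, chunk_s: float = 600.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s] into consecutive (start, end) windows of at most chunk_s seconds."""
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)  # last window may be shorter
        windows.append((start, end))
        start = end
    return windows


def _timestamp(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_lecture_notes(
    duration_s: float,
    summarize: Callable[[float, float], str],
    chunk_s: float = 600.0,
    title: str = "Lecture notes",
) -> str:
    """Summarize each window and join the results into one Markdown document.

    summarize(start, end) is a placeholder for the model call that
    produces notes for one chunk of the video.
    """
    sections = [f"# {title}"]
    for start, end in chunk_windows(duration_s, chunk_s):
        heading = f"## {_timestamp(start)} - {_timestamp(end)}"
        sections.append(f"{heading}\n\n{summarize(start, end).strip()}")
    return "\n\n".join(sections) + "\n"
```

With these defaults, an hour-long video yields six windows, each summarized independently and then stitched together under timestamped headings.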

What we learned

Set aside more time for building the UI...

What's next for Spilazzola

Adding a whiteboard and shared sessions.

Built With

  • agno
  • exa.ai
  • sambanova