Inspiration

Imagine walking into a room full of VCs and getting destroyed because you only practiced with Claude, ChatGPT, or your roommates. Real investors interrupt you, pressure-test your assumptions, and react emotionally in real time. We wanted to recreate that experience with AI. We have experienced this problem ourselves as builders: preparing for pitches and demos can be a struggle, and the lack of tools makes it extremely difficult to get real feedback. Rather than finding a cheap workaround, we wanted to solve this problem directly.

What it does

Shark Tank is a realtime AI pitch simulator where founders pitch to virtual VC judges inspired by well-known investors. Users join a live audio/video session, pitch their startup, and get interrupted with tough questions. The system analyzes webcam-based confidence signals, adapts investor behavior in real time, and streams live responses, transcripts, and reactions directly in the browser.

How we built it

The frontend is a lightweight single-page application built in vanilla HTML and JavaScript. It uses the Tencent TRTC Web SDK to handle realtime audio and video streaming between the user and the system. The browser also captures webcam frames at regular intervals and sends them to the backend over a websocket connection. In return, it receives live judge responses, transcript updates, mood signals, and UI events that control avatar switching and realtime pitch visualization.
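As a rough illustration of the websocket traffic described above, the sketch below shows one way the backend could encode and validate the events it sends to the browser. The event type names, the `seq` field, and the envelope layout are assumptions made for this example, not the project's actual wire format.

```python
import json

# Hypothetical event types mirroring the kinds of messages described above.
EVENT_TYPES = {"judge_response", "transcript", "mood", "ui"}
REQUIRED_KEYS = {"type", "seq", "payload"}


def encode_event(event_type: str, payload: dict, seq: int) -> str:
    """Serialize one backend-to-browser event as a JSON envelope."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return json.dumps({"type": event_type, "seq": seq, "payload": payload})


def decode_event(raw: str) -> dict:
    """Parse an envelope and reject anything that does not match the format."""
    msg = json.loads(raw)
    if not isinstance(msg, dict):
        raise ValueError("event must be a JSON object")
    missing = REQUIRED_KEYS - msg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if msg["type"] not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {msg['type']}")
    return msg
```

Keeping every message in one envelope shape means the browser can dispatch on `type` alone, and malformed messages fail loudly at the boundary instead of deep inside the UI code.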

On the backend, FastAPI serves both the application and the realtime infrastructure. It exposes endpoints for serving the frontend, generating TRTC room credentials, and managing websocket communication. A central agent system runs concurrently with the server and maintains session state, including the current judge, turn index, transcript history, and user mood score. This agent orchestrates the flow of the entire pitch experience in realtime using asynchronous Python.
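The session state listed above could be modeled roughly as follows. This is a minimal sketch: the class name `PitchSession`, its methods, and the moving-average smoothing of the mood score are assumptions for illustration, not the project's real agent code.

```python
from dataclasses import dataclass, field


@dataclass
class PitchSession:
    """Hypothetical per-session state held by the agent."""
    judges: list[str]
    judge_index: int = 0
    turn_index: int = 0
    transcript: list[tuple[str, str]] = field(default_factory=list)
    mood_score: float = 0.5  # assumed range: 0.0 (nervous) to 1.0 (confident)

    @property
    def current_judge(self) -> str:
        return self.judges[self.judge_index]

    def record_turn(self, speaker: str, text: str) -> None:
        """Append one utterance and advance the turn counter."""
        self.transcript.append((speaker, text))
        self.turn_index += 1

    def update_mood(self, sample: float, alpha: float = 0.3) -> None:
        """Blend a new confidence sample into the score (exponential moving average)."""
        self.mood_score = (1 - alpha) * self.mood_score + alpha * sample

    def next_judge(self) -> str:
        """Rotate to the next judge, wrapping around at the end of the panel."""
        self.judge_index = (self.judge_index + 1) % len(self.judges)
        return self.current_judge
```

Holding all of this in one object gives the async agent loop a single place to read and update, which makes state transitions easier to reason about than values scattered across handlers.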

The intelligence layer is powered by GetStream Vision Agents, which drive both perception and conversational coordination through realtime multimodal processing. Instead of separate models per judge, personality differences are handled through prompt-level control and session context within the agent system. This allows the system to dynamically switch between investor personas while maintaining continuity in the conversation and adapting responses based on user performance and realtime visual confidence signals.
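To make the idea of prompt-level persona control concrete, here is a small sketch of how one shared transcript could be combined with a per-judge instruction and the current mood score. The persona names, the prompt wording, and the 0.6 threshold are invented for this example; it does not call or represent the Vision Agents API.

```python
# Hypothetical persona instructions; the real judges and wording may differ.
PERSONAS = {
    "skeptic": "You are a blunt investor who challenges every number.",
    "visionary": "You are an optimistic investor who probes the long-term vision.",
}


def build_judge_prompt(
    persona_key: str,
    mood_score: float,
    transcript: list[tuple[str, str]],
    max_turns: int = 6,
) -> str:
    """Build a system prompt for one judge from the shared session context."""
    persona = PERSONAS[persona_key]  # raises KeyError for an unknown persona
    # Assumed policy: push confident founders harder, ease off nervous ones.
    if mood_score >= 0.6:
        tone = "The founder seems confident. Interrupt and press on weak points."
    else:
        tone = "The founder seems nervous. Ask one focused question at a time."
    # Only the most recent turns are included to keep the prompt short.
    history = "\n".join(f"{speaker}: {text}" for speaker, text in transcript[-max_turns:])
    return f"{persona}\n{tone}\n\nConversation so far:\n{history}"
```

Because every persona is built from the same transcript, switching judges changes only the instruction text while the conversation history carries over unchanged.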

Challenges we ran into

One of the biggest challenges was connecting Tencent TRTC with GetStream Vision Agents in a way that felt truly realtime and synchronized. TRTC handles the audio/video stream, while Vision Agents process webcam frames separately, so we had to carefully bridge two independent pipelines that were never designed to work together. Making sure that what the user was saying, what the system was seeing, and what the judges were responding with all stayed in sync required a lot of coordination between websocket events, session state, and async backend logic.

Another difficulty was managing timing and latency between the vision feedback loop and the TRTC conversation flow. Webcam frames had to be captured in the browser, sent to the backend, processed by Vision Agents, and then reflected back into judge behavior—all without introducing noticeable delay. Even small lags caused mismatches where judges would react to outdated confidence signals or interrupt at the wrong moment, which broke the realism of the pitch experience.
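One simple guard against judges reacting to outdated confidence signals is to discard any signal older than a fixed age before it influences behavior. The sketch below assumes each signal carries a capture timestamp comparable to the backend clock; the `ConfidenceSignal` type and the 1.5 second cutoff are illustrative choices, not values from the project.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ConfidenceSignal:
    captured_at: float  # capture time in seconds (assumed comparable to `now`)
    score: float        # confidence estimate for that frame


def freshest_usable(
    signals: Iterable[ConfidenceSignal],
    now: float,
    max_age_s: float = 1.5,
) -> Optional[ConfidenceSignal]:
    """Return the newest signal that is at most max_age_s old, or None."""
    fresh = [s for s in signals if 0 <= now - s.captured_at <= max_age_s]
    return max(fresh, key=lambda s: s.captured_at, default=None)
```

When this returns `None`, the agent can fall back to the last smoothed mood score rather than acting on a stale reading.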

We also had to deal with the complexity of debugging across multiple systems at once. Issues could originate in the browser capture layer, the websocket transport, the backend agent state machine, or the TRTC session itself, making failures hard to trace. Getting stable integration between these systems required tightening message formats, enforcing strict event ordering, and simplifying state transitions in the agent loop until the full pipeline became reliable under realtime load.
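Strict event ordering of the kind mentioned above can be sketched with a small reorder buffer keyed on sequence numbers. This assumes each event carries a monotonically increasing `seq` assigned by the sender, which is an assumption for this example rather than a documented detail of the project.

```python
class OrderedEventBuffer:
    """Release events strictly in sequence order, holding back early arrivals."""

    def __init__(self, first_seq: int = 0) -> None:
        self._next = first_seq
        self._pending: dict[int, object] = {}

    def push(self, seq: int, event: object) -> list[object]:
        """Accept one event and return every event that is now deliverable."""
        if seq < self._next or seq in self._pending:
            return []  # duplicate or already delivered: ignore it
        self._pending[seq] = event
        ready = []
        # Drain the run of consecutive sequence numbers starting at _next.
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready
```

For example, if event 1 arrives before event 0, the buffer returns nothing for the first push and both events, in order, for the second.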

Accomplishments that we're proud of

We are very proud of successfully setting up Tencent TRTC, since we initially ran into several issues getting realtime audio and video streaming working correctly. A lot of the early debugging involved understanding how TRTC sessions initialize and the order in which participants must join the room for everything to function reliably.

Connecting TRTC with GetStream Vision Agents was also difficult, because the two systems operate in completely different parts of the pipeline—TRTC handles media transport while Vision Agents handle webcam-based perception. To solve this, we went through the GetStream repository and documentation in detail to understand how the Tencent TRTC connector was designed to work internally. That helped us figure out how to properly bridge the two systems without breaking the realtime flow, and allowed us to adapt our implementation to align with their intended integration pattern.

What we learned

On the technical side, we learned how to get the best out of multiple tools by carefully combining them instead of trying to force a single system to do everything. Working with TRTC and GetStream Vision Agents showed us that realtime systems are really about coordination: making sure different services, each with their own strengths, can work together smoothly through well-defined interfaces, timing, and state management.

On the non-technical side, we realized more than ever that a regular chatbot alone is not enough to replace truly interactive, personalized experiences. The real value comes when AI can respond to context, behavior, and emotion in real time, making the experience feel adaptive and human rather than static or one-dimensional.

What's next for Shark Tank

Next, we plan to add a multiplayer mode so multiple founders can join the same room and pitch in a shared, competitive environment. We also want to expand the system with more AI judges to increase variety and unpredictability in feedback, making each session feel less repetitive and more dynamic. On the technical side, we’ll focus on reducing latency and improving overall system speed to make the realtime interaction feel even more seamless. Finally, we aim to make the grilling experience more intense by enhancing judge behavior so follow-up questions feel sharper, more adaptive, and more challenging based on the user’s performance.
