Inspiration
Vietnam's e-commerce market is booming, but the "last mile" of actually buying something is still painfully manual. Procurement managers at SMEs spend hours tabbing between Shopee, Lazada, and Tiki, comparing prices, checking seller ratings, and copy-pasting shipping details — only to get distracted and miss a flash sale. We asked: what if you could just say what you need, and an agent handles the rest — including calling you back when it's found the deal?
The cuckoo bird is famous for nesting inside other environments to achieve its goals. That metaphor stuck. Our agent doesn't own any storefront; it nests inside existing platforms and acts on your behalf.
What it does
Cuckoo is an autonomous transactional voice agent — the world's first procurement worker you can brief in plain language and trust to execute. Here's the full loop:
- You give a natural language instruction — e.g., "Source 20 office chairs, budget 50M VND, Shopee Mall sellers only."
- Cuckoo launches parallel browser instances via TinyFish, scraping live prices and seller ratings across Shopee and Tiki simultaneously — no cached data, no hallucinated links.
- The Brain (GPT-4o) evaluates the results, ranks deals by value-for-money, and composes a confident recommendation.
- Cuckoo calls you. Not a push notification — an actual outbound voice call, powered by Agora's ultra-low-latency audio stack and ElevenLabs' Vietnamese/English TTS voice.
- You say "Yes." Cuckoo navigates the checkout flow, fills in shipping and VAT details, and completes the order.
It's not a chatbot. It's a procurement workforce of one.
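In Cuckoo the ranking step is handled by GPT-4o, so the snippet below is only a deterministic stand-in that illustrates what "value-for-money" could mean for the office chair example. The `Listing` shape, the `rankDeals` name, and the scoring formula (seller rating per million VND) are all assumptions made for illustration, not the project's actual logic.

```typescript
// Hypothetical listing shape; field names are assumptions, not Cuckoo's real contract.
interface Listing {
  platform: "shopee" | "tiki";
  title: string;
  unitPriceVnd: number;
  sellerRating: number; // 0 to 5
}

// Keep only listings that fit the budget for the requested quantity,
// then sort by an assumed value score: rating per million VND, higher first.
function rankDeals(listings: Listing[], quantity: number, budgetVnd: number): Listing[] {
  const score = (l: Listing): number => l.sellerRating / (l.unitPriceVnd / 1_000_000);
  return listings
    .filter((l) => l.unitPriceVnd * quantity <= budgetVnd)
    .sort((a, b) => score(b) - score(a));
}

// "Source 20 office chairs, budget 50M VND" with made-up sample listings.
const ranked = rankDeals(
  [
    { platform: "shopee", title: "Chair A", unitPriceVnd: 2_400_000, sellerRating: 4.8 },
    { platform: "tiki", title: "Chair B", unitPriceVnd: 2_000_000, sellerRating: 4.5 },
    { platform: "shopee", title: "Chair C", unitPriceVnd: 3_000_000, sellerRating: 4.9 },
  ],
  20,
  50_000_000,
);
console.log(ranked.map((l) => l.title)); // [ 'Chair B', 'Chair A' ]
```

Chair C is dropped because 20 units would cost 60M VND, over the 50M budget. A hard filter like this could also run before the LLM sees the results, so the model only ever recommends options that fit the brief.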
How we built it
We split the stack across three focused domains and connected them through a shared contract layer:
The Hands (TinyFish Runner): Goal-based browser automation handles authenticated Shopee/Tiki sessions, bypasses bot detection, and returns clean structured JSON for each listing: price, seller rating, and estimated shipping. A second script takes a product URL all the way through checkout, either stopping just before final payment or selecting Cash on Delivery.
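As a rough sketch of what "clean structured JSON for each listing" might look like on the receiving side, the snippet below narrows untrusted JSON into a typed listing. The three fields come from the description above, but the `ListingResult` name, the field names, the units, and the hand-written type guard are assumptions; the real runner output and validation approach may differ.

```typescript
// Hypothetical listing contract: the three fields are named in the write-up,
// but these identifiers and units are assumptions.
interface ListingResult {
  priceVnd: number;
  sellerRating: number; // 0 to 5
  estimatedShippingDays: number;
}

function isFiniteNumber(v: unknown): v is number {
  return typeof v === "number" && Number.isFinite(v);
}

// Narrow an unknown JSON value to ListingResult, rejecting malformed payloads.
function isListingResult(value: unknown): value is ListingResult {
  if (typeof value !== "object" || value === null) return false;
  const { priceVnd, sellerRating, estimatedShippingDays } = value as Record<string, unknown>;
  return (
    isFiniteNumber(priceVnd) && priceVnd > 0 &&
    isFiniteNumber(sellerRating) && sellerRating >= 0 && sellerRating <= 5 &&
    isFiniteNumber(estimatedShippingDays) && estimatedShippingDays >= 0
  );
}

// Parse a JSON array and keep only the entries that satisfy the contract.
function parseListings(json: string): ListingResult[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) throw new Error("expected a JSON array of listings");
  return data.filter(isListingResult);
}

// Made-up payload: the second entry is malformed and gets dropped.
const sampleJson =
  '[{"priceVnd":2400000,"sellerRating":4.8,"estimatedShippingDays":3},{"priceVnd":"cheap"}]';
console.log(parseListings(sampleJson).length); // 1
```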
The Brain (Orchestrator API): A FastAPI/Node hub runs a state machine with four stages: search triggered, results received, voice call initiated, decision received. GPT-4o with function calling decides when to fire trigger_search() and trigger_checkout(). All state is passed explicitly; the LLM is stateless by design.
The Mouth & Ears (Voice Console): An Agora Web SDK stream handles full-duplex audio with sub-200ms latency. ElevenLabs Turbo v2.5 generates the agent's voice in real time. A browser-based STT layer transcribes the user's reply and pipes it back to the orchestrator. The UI is a "glass box": a live waveform on the left, and a real-time stream of the TinyFish browser actually clicking on the right.
All three services share typed contracts via a shared-contracts package, so every JSON payload from browser to brain to voice is validated end-to-end.
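The four-stage state machine described for the orchestrator can be sketched as a pure transition function, which also matches the "all state is passed explicitly" design. The stage identifiers, the event names, and the `advance` helper are illustrative assumptions rather than the project's actual code.

```typescript
// The four stages named in the write-up; identifiers are illustrative.
type Stage = "SEARCH_TRIGGERED" | "RESULTS_RECEIVED" | "CALL_INITIATED" | "DECISION_RECEIVED";

// Hypothetical event names that move a session forward.
type OrchestratorEvent = "results_ready" | "call_started" | "user_decided";

interface Session {
  readonly stage: Stage;
  readonly history: readonly Stage[];
}

// Allowed transitions: any event missing from a stage's entry is invalid there.
const transitions: Record<Stage, Partial<Record<OrchestratorEvent, Stage>>> = {
  SEARCH_TRIGGERED: { results_ready: "RESULTS_RECEIVED" },
  RESULTS_RECEIVED: { call_started: "CALL_INITIATED" },
  CALL_INITIATED: { user_decided: "DECISION_RECEIVED" },
  DECISION_RECEIVED: {},
};

// Pure transition: a session goes in, a new session comes out, nothing is stored globally.
function advance(session: Session, event: OrchestratorEvent): Session {
  const next = transitions[session.stage][event];
  if (next === undefined) {
    throw new Error(`event "${event}" is not valid in stage ${session.stage}`);
  }
  return { stage: next, history: [...session.history, session.stage] };
}

let session: Session = { stage: "SEARCH_TRIGGERED", history: [] };
for (const event of ["results_ready", "call_started", "user_decided"] as const) {
  session = advance(session, event);
}
console.log(session.stage); // DECISION_RECEIVED
```

Rejecting out-of-order events in one place means, for example, that a checkout-style action could not be reached before a decision has been received, regardless of what the LLM asks for.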
Challenges we ran into
- Bot detection on Vietnamese platforms is aggressive and inconsistent. Shopee's anti-scraping layer changes behavior between sessions, requiring TinyFish's goal-based approach rather than brittle CSS selectors. We burned several hours here before committing to pre-saved authenticated sessions for the demo.
- Latency in the voice loop was the hairiest problem. ElevenLabs generation + Agora transmission + STT round-trip had to stay under ~2 seconds or the "call" felt broken. We cut latency by streaming ElevenLabs audio in chunks rather than waiting for the full synthesis, and by keeping the orchestrator co-located with the voice console.
- State synchronization across three independent services with no shared database required careful design. We ended up with a simple in-memory store with event-driven hooks — pragmatic for a hackathon, but something we'd harden in production.
- The checkout "stop before payment" gate was delicate — we needed the agent to get as far as the order summary page without accidentally firing a real transaction during demos.
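The "in-memory store with event-driven hooks" mentioned in the state synchronization point might look roughly like the sketch below. The `InMemoryStore` class, its method names, and the session payload are assumptions for illustration; the write-up does not describe the actual implementation.

```typescript
type Listener<T> = (value: T, key: string) => void;

// Minimal in-memory key/value store that notifies subscribers on every write.
class InMemoryStore<T> {
  private readonly data = new Map<string, T>();
  private readonly listeners = new Set<Listener<T>>();

  // Register a hook; the returned function removes it again.
  subscribe(listener: Listener<T>): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  set(key: string, value: T): void {
    this.data.set(key, value);
    this.listeners.forEach((listener) => listener(value, key));
  }

  get(key: string): T | undefined {
    return this.data.get(key);
  }
}

// Hypothetical usage: one service reacts when another writes a session update.
const store = new InMemoryStore<{ stage: string }>();
const seen: string[] = [];
const unsubscribe = store.subscribe((value, key) => {
  seen.push(`${key}:${value.stage}`);
});
store.set("session-1", { stage: "results_received" });
unsubscribe();
store.set("session-1", { stage: "call_initiated" }); // no hook fires after unsubscribe
console.log(seen); // [ 'session-1:results_received' ]
```

Everything here lives in one process and is lost on restart, which is the trade-off the team calls out as acceptable for a hackathon but worth hardening for production.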
Accomplishments that we're proud of
- A genuinely end-to-end transactional loop: voice in → browser acts → voice out → purchase confirmed. No mocked steps.
- Sub-2-second voice round-trip latency on a live Agora stream, tested repeatedly under demo conditions.
- The "glass box" UI — watching the TinyFish browser visibly navigate Shopee in real time while the agent speaks is the clearest possible proof that this is real automation, not a slide deck.
- A clean monorepo architecture with shared typed contracts that let three developers work in parallel for 21 hours without merge conflicts breaking the integration.
What we learned
- Goal-based browser automation (TinyFish) is genuinely superior to selector-based scraping for hostile, frequently updated UIs; the abstraction is worth the overhead.
- Voice as an output channel is underrated. A phone call cuts through notification fatigue in a way no push alert can. High-stakes, time-sensitive decisions belong in voice.
- Building the state machine before touching any API saved us from spaghetti integration hell. When Archit's voice layer and Nhat's browser layer were both ready, James's hub connected them cleanly in under two hours.
- Real-time audio + LLM orchestration + browser automation is a genuinely hard latency puzzle. Every millisecond saved in one layer shows up as a better user experience in the next.
What's next for Cuckoo
- Multi-platform expansion: Lazada, Sendo, and B2B supplier portals with login federation.
- Proactive monitoring: Cuckoo watches a wishlist and calls you only when a price drops below your threshold — true autonomous procurement.
- Enterprise procurement workflows: PO generation, VAT invoice capture, ERP integration (MISA, SAP B1).
- Voice-first mobile app: So the "incoming call" UX works natively on iOS/Android, not just in a browser tab.
- Payment execution: Move past Cash on Delivery to full card/e-wallet checkout with secure credential vaulting — the true end-to-end transactional agent.
Built With
- agorawebsdk
- docker
- elevenlabs
- node.js
- openaiplatformapi
- pnpm
- react
- tinyfish
- typescript