Inspiration

In Nigeria, customer service is a daily frustration. Small businesses lose sales because they can't staff phone lines around the clock. Customers call, wait on hold, get dropped, and move on. WhatsApp has become the backbone of commerce across Africa — but it's manual, unscalable, and falls apart when transactions get complex. Trade-ins, visual inspections, price negotiations, payments — none of that works over text alone.

I wanted to build something that could handle the entire customer journey through a single phone call, the way a skilled salesperson would — but available 24/7, in any industry, for any business with a phone number.

Ekaette is a traditional Ibibio/Efik name from southern Nigeria. It represents the resourceful, sharp Nigerian businesswoman who can sell anything, remember every customer, and close every deal. That's exactly what this agent does.

What it does

Ekaette is a voice-first AI commerce agent that handles the full transaction lifecycle — product discovery, visual device inspection, real-time price negotiation, payment collection, and delivery booking — all within a single phone call, with cross-channel WhatsApp integration.

A typical call flow:

  1. Customer calls in and asks about available phones
  2. Ekaette searches the catalog and recommends options
  3. Customer wants to trade in their old device — Ekaette sends a WhatsApp message requesting a video
  4. Customer sends a video of their phone mid-call — the AI analyzes it in real time (brand, model, condition, damage)
  5. Ekaette generates a trade-in valuation and negotiates the price naturally through conversation
  6. Customer agrees — Ekaette generates a preview image of a matching phone case and sends it via WhatsApp
  7. Delivery is quoted and booked through TopShip
  8. Payment is collected via Paystack — all without leaving the call

Any business can onboard in minutes through a self-service setup wizard: pick an industry, upload products, connect payment and logistics, and go live with an AI agent on a real phone number and WhatsApp channel.

How we built it

Architecture: A root orchestrator agent delegates to 5 specialized sub-agents (vision, valuation, booking, catalog, support) using Google's Agent Development Kit (ADK). Voice channels use Gemini Live 2.5 Flash Native Audio for real-time conversational speech. Text channels (WhatsApp, SMS) use Gemini 3 Flash for reasoning. Image generation uses Imagen 4 on Vertex AI.
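
A minimal sketch of the delegation pattern, written in plain Python rather than against ADK itself. The five sub-agent names come from the architecture above; the `SubAgent` class, the keyword lists, and the `route` function are illustrative assumptions. In the real system the root agent's model chooses a transfer from the sub-agent descriptions, not from keyword matching.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SubAgent:
    name: str
    description: str
    keywords: Tuple[str, ...]  # hypothetical stand-in for model-driven routing


# Names match the write-up; descriptions and keywords are assumed for illustration.
SUB_AGENTS = (
    SubAgent("vision", "Analyzes customer photos and videos", ("photo", "video", "picture")),
    SubAgent("valuation", "Prices trade-ins and negotiates", ("trade", "worth", "offer")),
    SubAgent("booking", "Quotes and books delivery", ("deliver", "shipping", "pickup")),
    SubAgent("catalog", "Searches products and recommends", ("available", "stock", "price of")),
    SubAgent("support", "Handles everything else", ()),
)


def route(utterance: str) -> str:
    """Return the first sub-agent whose keywords match, else 'support'."""
    text = utterance.lower()
    for agent in SUB_AGENTS:
        if any(keyword in text for keyword in agent.keywords):
            return agent.name
    return "support"
```

For example, `route("I want to trade in my old phone")` picks the valuation agent, and anything unmatched falls through to support.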

Backend: Python 3.13 with FastAPI, deployed as dual Cloud Run services — one for HTTP/webhooks (short requests) and one for voice WebSocket streams (long-lived connections). Firestore powers a registry-driven multi-tenant configuration system where adding a new industry or vendor is pure config.
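
To illustrate what "pure config" can look like, here is a small sketch of merging an industry template with vendor overrides. The template contents, field names, and `build_tenant_config` are hypothetical, and an in-memory dict stands in for Firestore.

```python
from copy import deepcopy
from typing import Optional

# Hypothetical industry templates; in the real system these would live in Firestore.
INDUSTRY_TEMPLATES = {
    "electronics": {
        "tools": ["catalog", "vision", "valuation"],
        "greeting": "Welcome to {vendor}.",
    },
    "hotel": {
        "tools": ["catalog", "booking"],
        "greeting": "Thank you for calling {vendor}.",
    },
}


def build_tenant_config(vendor: str, industry: str, overrides: Optional[dict] = None) -> dict:
    """Merge an industry template with vendor-specific overrides (overrides win)."""
    if industry not in INDUSTRY_TEMPLATES:
        raise KeyError(f"unknown industry: {industry}")
    config = deepcopy(INDUSTRY_TEMPLATES[industry])  # never mutate the shared template
    config.update(overrides or {})
    config["vendor"] = vendor
    config["greeting"] = config["greeting"].format(vendor=vendor)
    return config
```

Under this assumption, adding a new industry means adding one more entry to the registry; no agent code changes.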

Frontend: React 19 with Vite 7 and Tailwind CSS v4. The web app supports browser-based voice calls via WebSocket with AudioWorklet for PCM streaming (separate 16 kHz recording / 24 kHz playback contexts). It also includes a vendor onboarding wizard and a real-time call UI with live transcription.
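
The two sample rates imply different chunk sizes on the wire. The sketch below works through that arithmetic in Python (the browser code itself is JavaScript). It assumes 16-bit mono PCM and a 20 ms chunk, neither of which is stated above; `chunk_bytes` and `float_to_pcm16` are illustrative helpers.

```python
import struct
from typing import List

RECORD_RATE_HZ = 16_000    # microphone capture rate, from the write-up
PLAYBACK_RATE_HZ = 24_000  # agent audio playback rate, from the write-up
BYTES_PER_SAMPLE = 2       # assumption: 16-bit mono PCM


def chunk_bytes(sample_rate_hz: int, chunk_ms: int) -> int:
    """Size in bytes of one mono 16-bit PCM chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000 * BYTES_PER_SAMPLE


def float_to_pcm16(samples: List[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)
```

With these assumptions a 20 ms chunk is 640 bytes on the recording side and 960 bytes on the playback side, which is one reason to keep the two audio contexts separate.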

Cross-channel bridge: When a customer sends media via WhatsApp during an active voice call, the media bridge detects the live session in Firestore and injects the image or video directly into the voice pipeline for real-time analysis — no manual handoff needed.
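
The bridge logic can be sketched roughly as follows. This is a simplified illustration, not the production code: an in-memory dict stands in for the Firestore session lookup, and `LiveSession`, `ACTIVE_SESSIONS`, and `handle_whatsapp_media` are hypothetical names.

```python
import queue
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LiveSession:
    """Hypothetical record of an active voice call, keyed by caller phone number."""
    call_id: str
    media_inbox: queue.Queue = field(default_factory=queue.Queue)


# Stand-in for the Firestore lookup of live sessions.
ACTIVE_SESSIONS: Dict[str, LiveSession] = {}


def handle_whatsapp_media(sender_phone: str, media: bytes, mime_type: str) -> str:
    """Inject media into a live call if one exists, else use the text channel."""
    session = ACTIVE_SESSIONS.get(sender_phone)
    if session is None:
        return "text_channel"  # no live call: handle as a normal WhatsApp message
    session.media_inbox.put((mime_type, media))  # voice pipeline drains this inbox
    return f"voice_call:{session.call_id}"
```

The key design point is that the webhook handler only needs the sender's phone number to decide where the media goes.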

Telephony: Africa's Talking SIP for voice calls, WhatsApp Cloud API for messaging, with a VM-based bridge service connecting SIP audio streams to the Gemini Live API.

Payments & Logistics: Paystack for payment collection, TopShip for delivery quoting and booking — both integrated as agent tools callable mid-conversation.

Testing: Full TDD approach — 641 passing tests (463 backend, 178 frontend) covering tools, agents, hooks, components, and end-to-end flows.

Challenges we ran into

Native audio function calling regression: The GA release of Gemini's native audio model had a function-calling success rate of roughly 2%, compared with 90-100% on the previous preview model. The model hallucinated sub-agent names as direct function calls instead of using the proper transfer mechanism. I fixed this with explicit error recovery callbacks, detailed agent descriptions for ADK's auto-injected transfer instructions, and negative instructions to prevent the hallucination pattern.
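
The recovery idea can be sketched like this. It is a plain-Python illustration of the pattern, not ADK's callback API: `recover_function_call`, the returned dict shape, and the tool names in `KNOWN_TOOLS` are all hypothetical.

```python
SUB_AGENT_NAMES = {"vision", "valuation", "booking", "catalog", "support"}
KNOWN_TOOLS = {"search_catalog", "create_payment_link"}  # hypothetical tool names


def recover_function_call(name: str, args: dict) -> dict:
    """Rewrite a hallucinated sub-agent 'function call' into a transfer request."""
    if name in KNOWN_TOOLS:
        return {"action": "call_tool", "tool": name, "args": args}
    if name in SUB_AGENT_NAMES:
        # The model named a sub-agent as if it were a function: treat it as a transfer.
        return {"action": "transfer", "agent": name}
    # Unknown name: return an error the model can read and retry against.
    return {
        "action": "error",
        "message": f"Unknown function '{name}'. Use a listed tool or transfer.",
    }
```

Instead of failing the turn, a call such as `valuation(...)` is converted into a transfer to the valuation agent.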

Cloud Run scaling for telephony: Voice calls use long-lived WebSocket connections that tie up Cloud Run instances. With minimum instances set to 1, a single active call blocked all webhook callbacks — returning 429 errors from Google's frontend with zero application logs. The fix was a dual-service architecture with separate scaling for voice and HTTP traffic.
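
A toy model of why splitting the services helps. It assumes each service has a single request slot, which is a deliberate simplification; real Cloud Run behavior depends on the configured concurrency and instance limits. `ServicePool` and `simulate` are illustrative names.

```python
class ServicePool:
    """Toy model of a service with a fixed number of request slots."""

    def __init__(self, slots: int) -> None:
        self.slots = slots
        self.in_use = 0

    def accept(self) -> int:
        """Return 200 and hold a slot, or 429 if every slot is busy."""
        if self.in_use >= self.slots:
            return 429
        self.in_use += 1
        return 200

    def release(self) -> None:
        """Free one slot when a request or call ends."""
        self.in_use = max(0, self.in_use - 1)


def simulate(shared: bool) -> int:
    """Status a webhook receives while one voice call is in progress."""
    voice = ServicePool(slots=1)
    http = voice if shared else ServicePool(slots=1)
    voice.accept()        # long-lived WebSocket call holds its slot
    return http.accept()  # webhook arrives mid-call
```

In this model `simulate(shared=True)` rejects the webhook while `simulate(shared=False)` serves it, mirroring the move to separate voice and HTTP services.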

Cross-channel media timing: Bridging WhatsApp media into an active voice session required careful coordination — the voice pipeline needs to suppress its own output while waiting for media analysis, then resume naturally. Getting the audio suppression gates, deterministic reply windows, and silence recovery timers to work together without muting the agent or creating awkward pauses took significant iteration.
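
One way to picture the suppression gate is as a small state machine with a timeout, so that a stalled analysis can never mute the agent indefinitely. The sketch below is an assumption about the shape of that logic, not the production code; `SuppressionGate` and the 8-second default are hypothetical, and the clock is passed in explicitly to keep the behavior deterministic.

```python
from typing import Optional


class SuppressionGate:
    """Mutes agent audio while media is analyzed, with a timeout safety net."""

    def __init__(self, max_wait_s: float = 8.0) -> None:
        self.max_wait_s = max_wait_s  # hypothetical ceiling on enforced silence
        self._closed_at: Optional[float] = None

    def close(self, now: float) -> None:
        """Start suppressing output (media analysis has begun)."""
        self._closed_at = now

    def open(self) -> None:
        """Stop suppressing output (analysis finished)."""
        self._closed_at = None

    def allows_audio(self, now: float) -> bool:
        """True if the agent may speak; reopens itself after the timeout."""
        if self._closed_at is None:
            return True
        if now - self._closed_at >= self.max_wait_s:
            self.open()  # silence recovery: never stay muted past the ceiling
            return True
        return False
```

A caller would pass `time.monotonic()` as `now`, close the gate when media arrives, and open it when the analysis result is ready.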

Vertex AI migration under pressure: Preview model instability (random 1008 WebSocket crashes mid-call) forced a migration to Vertex AI GA models days before the deadline. Model IDs, TTS endpoints, and authentication all changed — but the env-only switchover architecture meant zero code changes for the rollback path.
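
An env-only switchover can be sketched as below. The variable names (`EKAETTE_BACKEND`, `VERTEX_VOICE_MODEL`, and so on) and the `load_model_config` function are hypothetical; the point is only that the backend choice and model IDs are read from the environment, so switching or rolling back is a redeploy with different values.

```python
import os
from typing import Mapping

VALID_BACKENDS = ("vertex", "gemini_api")


def load_model_config(env: Mapping[str, str] = os.environ) -> dict:
    """Pick backend and model IDs from environment variables only."""
    backend = env.get("EKAETTE_BACKEND", "vertex")  # hypothetical variable name
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    prefix = "VERTEX" if backend == "vertex" else "GEMINI_API"
    # A missing model variable raises KeyError, so misconfiguration fails at startup.
    return {
        "backend": backend,
        "voice_model": env[f"{prefix}_VOICE_MODEL"],
        "text_model": env[f"{prefix}_TEXT_MODEL"],
    }
```

Both sets of model IDs can stay defined side by side, which is what makes the rollback path a one-variable change.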

Accomplishments we're proud of

  • Built a fully functional voice commerce platform — not a prototype — that handles real phone calls, real payments, and real delivery bookings end-to-end
  • Cross-channel media bridge: a customer can send a video on WhatsApp during a live voice call and the AI analyzes it in real time, something we have not seen any other voice agent do
  • 6 industry templates (electronics, fashion, automotive, hotel, telecom, aviation) all powered by a single registry-driven config system — adding a new industry is config, not code
  • 641 passing tests across backend and frontend, built with full TDD from day one
  • Migrated from Gemini API preview models to Vertex AI GA under hackathon deadline pressure with zero downtime — env-only switchover, no code rollback needed
  • Natural voice negotiation: the agent haggles trade-in prices conversationally, just like a real Nigerian market vendor would

What we learned

  • Voice AI is a fundamentally different UX challenge from text. Silence is failure: every moment of dead air erodes trust. Building robust silence recovery, filler responses, and deterministic reply systems was as important as the AI itself.
  • Multi-tenant architecture pays for itself immediately. The registry-driven approach meant adding new industries (telecom, aviation) took hours instead of days, and vendor onboarding became a config problem, not a code problem.
  • African infrastructure constraints drive better architecture. Building for SIP calls (not just WebRTC), unreliable networks, and cross-channel workflows forced design decisions that made the system more resilient overall.

What's next for Ekaette

  • Self-improving agent loop: Close the feedback loop between call outcomes and agent behavior — track conversion rates and drop-off points to automatically refine prompts and routing
  • Multilingual voice: Expand to Pidgin, Yoruba, Igbo, and Hausa using Gemini's multilingual native audio, with automatic language detection on the opening turn
  • Proactive outbound: Use purchase history to trigger outbound calls and WhatsApp messages — trade-in reminders, restock alerts, and post-purchase check-ins
  • Advanced visual commerce: Real-time product comparison, AR-style previews using the customer's actual device photo, and damage assessment for insurance claims
  • Analytics dashboard: Real-time vendor intelligence — call funnels, revenue per call, sentiment trends — all derived from the structured event stream already flowing through the platform
