Inspiration
Every consultant and support team knows the pain: a customer emails a question, gets a partial answer, then calls to follow up — and has to repeat themselves from scratch. Context is lost. Conversations are fragmented across inboxes, chat tools, and ticketing systems. Consultants spend the majority of their day triaging and responding to routine queries instead of solving the complex, high-value problems they were hired for.
I asked myself: what if a customer could start a conversation in email and continue it in chat — and the AI already knows everything? What if consultants woke up each morning to find 80% of overnight queries already resolved, with only the critical escalations waiting for them?
That vision of one intelligent agent, every channel, and zero fragmentation is what inspired OmniCQM.
What it does
OmniCQM is an omnichannel agentic AI framework that unifies customer query management across communication channels through a single, intelligent backend powered by Amazon Bedrock and deployed at nova-api.ksasalam.com.
For users:
- Start a conversation from Outlook — the AI agent responds autonomously within minutes via Power Automate
- Continue the same conversation in a chat UI with full history preserved, including the original email thread
- Use voice input powered by Amazon Transcribe to send spoken queries directly into the chat interface
- Receive context-aware, RAG-grounded responses 24/7 — no waiting for business hours
For consultants and admins:
- An admin chat interface accepts complex natural language queries and returns structured insights, charts, and tables
- Automatic escalation routing flags queries that exceed the agent's confidence threshold, surfacing them for human review with full context attached
- Full MLflow observability — every agent run is traced, evaluated, and logged with metrics, artifacts, and LLM-as-judge scores
For engineering teams:
- The backend is exposed as a clean REST API over HTTPS, integrated via Power Automate HTTP Connector, meaning any new channel can be connected without modifying core logic
- Unified data storage ensures no fragmented threads or broken conversation chains across channels
Mathematically, if $Q$ is the total query volume and $\alpha$ is the autonomous resolution rate, the consultant workload $W$ reduces to:
$$W = Q \cdot (1 - \alpha)$$
With $\alpha \geq 0.80$, consultants handle at most 20% of queries — freeing significant capacity for high-impact work.
How we built it
OmniCQM is built on a modular, cloud-native architecture deployed on AWS eu-west-1, with a custom domain, HTTPS, and production-grade infrastructure from day one.
Core AI Layer
- `global.amazon.nova-2-lite-v1:0` via Amazon Bedrock is the primary LLM for query understanding, response generation, and escalation decisions — chosen for its speed and cost efficiency in high-frequency agentic loops
- `amazon.titan-embed-text-v2` via Amazon Bedrock generates semantic embeddings for all knowledge base documents
- A custom orchestration layer built with `boto3` drives the agentic reasoning loop — invoking Bedrock, managing tool calls, and routing decisions without relying on a third-party agent framework for the core logic (a minimal sketch follows this list)
- LlamaIndex with ChromaDB powers the RAG pipeline, indexing 5 knowledge base documents stored in `s3://s3-cqm-bucket/knowledge-base/` and retrieving semantically relevant context at query time
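A minimal sketch of that loop, assuming the Bedrock Converse API; the `tools` registry of Python callables and the `run_agent` signature are illustrative, not the production code:

```python
# Minimal agentic loop sketch, assuming the Bedrock Converse API.
# `tools` maps tool names to plain Python callables (an assumption).
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")
MODEL_ID = "global.amazon.nova-2-lite-v1:0"

def run_agent(messages, tool_config, tools, max_turns=5):
    """Invoke the model, dispatch any requested tool calls, feed the
    results back, and stop when the model produces a final answer."""
    for _ in range(max_turns):
        response = bedrock.converse(
            modelId=MODEL_ID, messages=messages, toolConfig=tool_config
        )
        message = response["output"]["message"]
        messages.append(message)
        if response["stopReason"] != "tool_use":
            return message  # final, user-facing answer
        results = []
        for block in message["content"]:
            if "toolUse" in block:
                use = block["toolUse"]
                output = tools[use["name"]](**use["input"])  # dispatch the tool
                results.append({"toolResult": {
                    "toolUseId": use["toolUseId"],
                    "content": [{"json": output}],
                }})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent exceeded max_turns without a final answer")
```

Owning this loop directly is what makes the retry and escalation behaviour described later possible: every branch point is ordinary Python rather than framework internals.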
Speech-to-Text
- Amazon Transcribe handles all voice input, converting spoken queries to text before they enter the agent pipeline
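A hedged sketch of how a voice clip might enter the pipeline via `boto3`; the job name, S3 object key, and media format here are assumptions:

```python
# Kick off a Transcribe job for a staged audio clip (names are hypothetical).
import boto3

transcribe = boto3.client("transcribe", region_name="eu-west-1")
transcribe.start_transcription_job(
    TranscriptionJobName="cqm-voice-query-0001",
    Media={"MediaFileUri": "s3://s3-cqm-bucket/voice/query-0001.webm"},
    MediaFormat="webm",
    LanguageCode="en-US",
)
# The job is then polled with get_transcription_job; the transcript text
# enters the same pipeline as typed chat messages.
```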
Omnichannel Integration
- Microsoft Power Automate with an HTTP Connector bridges Outlook to the OmniCQM API — no custom email server required
- The backend exposes a unified `/query` REST endpoint accepting a `channel_id`, `user_id`, and message payload, ensuring every channel shares the same processing pipeline (sketched below)
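An illustrative shape of that endpoint; the `handle_query` stub stands in for the real agent pipeline and is an assumption, not the deployed code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    channel_id: str  # e.g. "outlook", "chat", "admin"
    user_id: str     # canonical identity after cross-channel resolution
    message: str

async def handle_query(channel_id: str, user_id: str, message: str) -> str:
    # Placeholder for the shared pipeline (RAG retrieval + Bedrock call).
    # channel_id only tags the conversation; it never branches core logic.
    return f"echo[{channel_id}/{user_id}]: {message}"

@app.post("/query")
async def query(q: Query) -> dict:
    return {"response": await handle_query(q.channel_id, q.user_id, q.message)}
```

Because the payload is channel-agnostic, wiring up a new channel means pointing it at `/query` with a fresh `channel_id` and nothing more.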
Infrastructure
- Three containers built and deployed on AWS ECS Fargate (`email-agent-cluster`):
  - `email-agent-backend` — FastAPI backend, custom orchestration, Bedrock calls, RAG pipeline
  - Frontend container — chat UI serving both user and admin interfaces
  - `cqm-mlflow` (nginx-proxied) — self-hosted MLflow tracking server
- Application Load Balancer (`email-agent-alb`) handles HTTPS termination and HTTP→HTTPS redirect
- AWS Certificate Manager issues and auto-renews the TLS certificate for `api.ksasalam.com`
- Route 53 manages DNS for the hosted zone
- Amazon ECR stores container images (`email-agent-backend:latest`, `cqm-mlflow:nginx`)
- Amazon S3 (`s3-cqm-bucket`) serves dual purpose: knowledge base document storage (`/knowledge-base/`) and MLflow artifact storage (`/mlartifacts/`)
- All images built for `linux/amd64` on ECS Fargate
Observability
- Every Bedrock invocation is wrapped with a custom `boto3` logging layer that captures inputs, outputs, latency, and token counts, then logs them as MLflow runs (a sketch follows this list)
- LLM-as-judge evaluation scores (relevance, accuracy, safety) are computed per response and persisted as MLflow metrics and artifacts
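A minimal sketch of the logging wrapper, assuming standard MLflow APIs; the metric and parameter names here are illustrative, not the project's exact schema:

```python
import time
import boto3
import mlflow

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

def logged_converse(**kwargs):
    """Invoke Bedrock and record the call as a nested MLflow run."""
    with mlflow.start_run(run_name="bedrock-invocation", nested=True):
        mlflow.log_param("model_id", kwargs.get("modelId"))
        start = time.perf_counter()
        response = bedrock.converse(**kwargs)
        mlflow.log_metric("latency_s", time.perf_counter() - start)
        usage = response.get("usage", {})
        mlflow.log_metric("input_tokens", usage.get("inputTokens", 0))
        mlflow.log_metric("output_tokens", usage.get("outputTokens", 0))
        # Persist the full request/response pair as a run artifact.
        mlflow.log_dict({"request": kwargs.get("messages"),
                         "response": response["output"]}, "trace.json")
        return response
```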
Challenges we ran into
1. Cross-channel session continuity
The hardest engineering problem was reliably linking an Outlook email sender identity to a chat UI user identity. We solved this with an identity resolution layer that normalises identifiers at ingestion time and stores a canonical user_id, ensuring both channels hydrate from the same conversation history.
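A minimal sketch of that normalisation step; the hashing scheme and the Outlook-specific rule are assumptions, not the project's code:

```python
import hashlib

def canonical_user_id(channel_id: str, raw_identity: str) -> str:
    """Normalise a channel-specific identity (an email address, a chat
    login) into one stable user_id shared by every channel."""
    normalised = raw_identity.strip().lower()
    if channel_id == "outlook":
        # Strip display names like 'Jane Doe <jane@example.com>'.
        if "<" in normalised:
            normalised = normalised.split("<", 1)[1].rstrip(">")
    return "user-" + hashlib.sha256(normalised.encode()).hexdigest()[:16]
```

Both channels then load conversation history keyed by this canonical `user_id`, so an email thread and a chat session hydrate from the same store.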
2. Custom orchestration without a framework
Rather than using LangChain or a managed agent service, we built the agentic loop directly with boto3. This gave us full control over tool call sequencing, context window management, and escalation logic — but required careful design to handle partial tool results and retry behaviour gracefully.
3. Power Automate HTTP Connector timeout
Power Automate enforces a 30-second HTTP timeout. For multi-step agent reasoning this was a hard wall. We resolved it with an async pattern: the connector receives an immediate 202 Accepted, and the agent posts the completed response back via a webhook once processing is done.
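A sketch of that hand-off, assuming FastAPI background tasks and a hypothetical `callback_url` registered by the Power Automate flow:

```python
import httpx
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

async def run_agent(message: str) -> str:
    return "..."  # stands in for the multi-step reasoning loop

async def run_and_post_back(callback_url: str, message: str) -> None:
    answer = await run_agent(message)  # may take well over 30 seconds
    async with httpx.AsyncClient() as client:
        await client.post(callback_url, json={"response": answer})

@app.post("/query-async", status_code=202)
async def query_async(payload: dict, tasks: BackgroundTasks) -> dict:
    # Acknowledge immediately so the connector's 30 s timeout never fires;
    # the finished answer arrives later via the webhook.
    tasks.add_task(run_and_post_back, payload["callback_url"], payload["message"])
    return {"status": "accepted"}
```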
4. LlamaIndex + ChromaDB on Fargate
ChromaDB's default persistence mode requires a local filesystem, which conflicts with stateless Fargate tasks. We configured ChromaDB to use an ephemeral in-memory index at container startup, re-indexed from S3 on cold start, and accepted the warm-up latency trade-off for the simplicity it provided.
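A cold-start sketch of that re-indexing, assuming current LlamaIndex and ChromaDB APIs; the local path and collection name are illustrative:

```python
import os
import boto3
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

def build_index(bucket: str = "s3-cqm-bucket", prefix: str = "knowledge-base/",
                local_dir: str = "/tmp/kb") -> VectorStoreIndex:
    """Rebuild the ephemeral vector index from S3 on container start."""
    os.makedirs(local_dir, exist_ok=True)
    s3 = boto3.client("s3")
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        name = os.path.basename(obj["Key"])
        if name:  # skip the prefix "folder" entry itself
            s3.download_file(bucket, obj["Key"], os.path.join(local_dir, name))
    chroma = chromadb.EphemeralClient()  # in-memory: lost on restart, by design
    store = ChromaVectorStore(chroma_collection=chroma.create_collection("kb"))
    docs = SimpleDirectoryReader(local_dir).load_data()
    # The Titan embedding model is assumed to be configured via llama_index
    # Settings elsewhere in the app.
    return VectorStoreIndex.from_documents(
        docs, storage_context=StorageContext.from_defaults(vector_store=store)
    )
```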
5. MLflow on ECS with S3 artifact backend
Routing MLflow's artifact store to S3 (/mlartifacts/) required correct IAM task role permissions and careful environment variable configuration inside the cqm-mlflow container. Getting the nginx reverse proxy to correctly forward the MLflow UI behind the ALB also required non-obvious path rewrite rules.
6. Prompt consistency across channels
An email query arrives with signatures, quoted threads, and HTML artefacts. A voice query from Amazon Transcribe arrives with filler words and no punctuation. We built a prompt normalisation layer that strips channel-specific noise before any text reaches the LLM, ensuring consistent reasoning quality regardless of input source.
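An illustrative normalisation pass; the patterns below are assumptions about typical channel noise, not the production rules:

```python
import re

FILLERS = re.compile(r"\b(um+|uh+|you know)\b", re.IGNORECASE)

def normalise(channel_id: str, text: str) -> str:
    """Strip channel-specific noise before the text reaches the LLM."""
    if channel_id == "outlook":
        text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
        text = text.split("\nOn ", 1)[0]          # drop quoted reply thread
        text = re.split(r"\n--\s*\n", text)[0]    # drop signature block
    elif channel_id == "voice":
        text = FILLERS.sub("", text)              # drop transcript fillers
    return re.sub(r"\s+", " ", text).strip()
```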
Accomplishments that we're proud of
- True omnichannel continuity — a user can send an email, receive an AI response, then open the chat UI and see the full thread in context. This works end-to-end in production at `nova-api.ksasalam.com`.
- Three-container production deployment — backend, frontend, and MLflow all running as independent ECS Fargate services behind a single ALB, with HTTPS and auto-renewing certificates.
- Custom agentic orchestration with boto3 — no third-party agent framework for the core loop. Full control over reasoning, tool dispatch, and escalation logic, with every step observable in MLflow.
- RAG pipeline grounded in real documents — LlamaIndex + ChromaDB + Titan Embeddings delivering semantically relevant context from the knowledge base on every query.
- LLM-as-judge evaluation pipeline — every response is automatically scored and persisted to MLflow, providing a continuous quality signal without manual labelling. A sketch of the judge follows this list.
- Zero backend changes to add a new channel — validated by connecting the admin interface as a second channel using only a different `channel_id`. Core agent logic untouched.
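A hedged sketch of the judge call, reusing the same Converse API; the rubric prompt and the bare-JSON reply format are illustrative assumptions:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

JUDGE_PROMPT = (
    "Score the assistant answer for relevance, accuracy and safety, each 0-1. "
    'Reply with JSON only, e.g. {{"relevance": 0.9, "accuracy": 0.8, "safety": 1.0}}.'
    "\n\nQuestion: {q}\nAnswer: {a}"
)

def judge(question: str, answer: str) -> dict:
    """Return per-response scores; assumes the judge replies with bare JSON."""
    response = bedrock.converse(
        modelId="global.amazon.nova-2-lite-v1:0",
        messages=[{"role": "user",
                   "content": [{"text": JUDGE_PROMPT.format(q=question, a=answer)}]}],
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```

Each returned score is then written to the run with `mlflow.log_metric`, giving the continuous quality signal described above.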
What we learned
- Custom orchestration beats framework magic for agentic systems. Building the reasoning loop directly with `boto3` was harder upfront but gave us precise control over context, retries, and escalation — and made debugging vastly simpler than black-box framework behaviour.
- Observability must be designed in, not bolted on. Wrapping every Bedrock call with MLflow logging from the start meant we always had a trace to inspect. This changed how fast we could iterate and diagnose regressions.
- RAG quality depends on chunking strategy, not just retrieval. Early versions of the LlamaIndex pipeline returned irrelevant chunks because documents were split at fixed character counts. Switching to semantic chunking with Titan Embeddings significantly improved answer grounding. A sketch of this follows the list.
- Power Automate is powerful but opinionated. Its constraints forced us into an async webhook pattern that actually made the architecture cleaner and more resilient than a synchronous design would have been.
- Container cold starts matter in agentic systems. Re-indexing ChromaDB from S3 on Fargate cold start added noticeable latency to the first query after a scale event. In production this would warrant a persistent vector store like OpenSearch Serverless.
- Channel identity is a first-class concern. Cross-channel user resolution is harder than it looks and deserves a dedicated identity service rather than ad-hoc normalisation logic.
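A sketch of the chunking switch, assuming LlamaIndex's semantic splitter and the llama-index-embeddings-bedrock integration (constructor details may vary between versions):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.bedrock import BedrockEmbedding

embed_model = BedrockEmbedding(model_name="amazon.titan-embed-text-v2:0")

# Split where embedding similarity between adjacent sentences drops, so chunk
# boundaries follow topics instead of fixed character counts.
parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)
docs = SimpleDirectoryReader("/tmp/kb").load_data()
nodes = parser.get_nodes_from_documents(docs)
```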
What's next for OmniCQM
Short term
- Integrate Microsoft Graph API to programmatically manage and orchestrate email at scale — moving beyond Power Automate's single-flow trigger to full mailbox management, threading, and priority routing across large consultant teams
- Add WhatsApp Business API and similar B2C messaging platforms as first-class channels, enabling OmniCQM to serve direct customer communications at consumer scale
- Build a consultant dashboard showing real-time query volume, autonomous resolution rate $\alpha$, and escalation trends per channel
Medium term
- Expand the RAG knowledge base with automatic document ingestion — new files dropped into S3 trigger re-indexing via Lambda, keeping the agent's knowledge current without manual intervention
- Replace the ephemeral ChromaDB index with Amazon OpenSearch Serverless for persistent, scalable vector search that survives Fargate task restarts
Long term
- Achieve $\alpha \geq 0.95$ autonomous resolution rate through continuous fine-tuning on organisation-specific query data, reducing consultant workload to:
$$W = Q \cdot (1 - 0.95) = 0.05Q$$
Five percent of queries reaching a human. That's the target.
Built With
- alb
- amazon-cloudwatch
- amazon-dynamodb
- amazon-rds-relational-database-service
- bedrock
- chromadb
- ecr
- ecs
- fargate
- fastapi
- javascript
- llamaindex
- powerautomate
- python
- react
- route53