Inspiration
Every consultant and support team knows the pain: a customer emails a question, gets a partial answer, then calls to follow up — and has to repeat themselves from scratch. Context is lost. Conversations are fragmented across inboxes, chat tools, and ticketing systems. Consultants spend the majority of their day triaging and responding to routine queries instead of solving the complex, high-value problems they were hired for.
I asked myself: what if a customer could start a conversation in email and continue it in chat — and the AI already knows everything? What if consultants woke up each morning to find 80% of overnight queries already resolved, with only the critical escalations waiting for them?
That vision of one intelligent agent, every channel, and zero fragmentation is what inspired OmniCQM.
What it does
OmniCQM is an omnichannel agentic AI framework that unifies customer query management across communication channels through a single, intelligent backend powered by Amazon Bedrock and deployed at nova-api.ksasalam.com.
For users:
- Start a conversation from Outlook — the AI agent responds autonomously within minutes via Power Automate
- Continue the same conversation in a chat UI with full history preserved, including the original email thread
- Use voice input powered by Amazon Transcribe to send spoken queries directly into the chat interface
- Receive context-aware, RAG-grounded responses 24/7 — no waiting for business hours
For consultants and admins:
- An admin chat interface accepts complex natural language queries and returns structured insights, charts, and tables
- Automatic escalation routing flags queries that exceed the agent's confidence threshold, surfacing them for human review with full context attached
- Full MLflow observability — every agent run is traced, evaluated, and logged with metrics, artifacts, and LLM-as-judge scores
For engineering teams:
- The backend is exposed as a clean REST API over HTTPS, integrated via Power Automate HTTP Connector, meaning any new channel can be connected without modifying core logic
- Unified data storage ensures no fragmented threads or broken conversation chains across channels
Mathematically, if $Q$ is the total query volume and $\alpha$ is the autonomous resolution rate, the consultant workload $W$ reduces to:
$$W = Q \cdot (1 - \alpha)$$
With $\alpha \geq 0.80$, consultants handle at most 20% of queries — freeing significant capacity for high-impact work.
How we built it
OmniCQM is built on a modular, cloud-native architecture deployed on AWS eu-west-1, with a custom domain, HTTPS, and production-grade infrastructure from day one.
Core AI Layer
- `global.amazon.nova-2-lite-v1:0` via Amazon Bedrock is the primary LLM for query understanding, response generation, and escalation decisions — chosen for its speed and cost efficiency in high-frequency agentic loops
- `amazon.titan-embed-text-v2` via Amazon Bedrock generates semantic embeddings for all knowledge base documents
- A custom orchestration layer built with `boto3` drives the agentic reasoning loop — invoking Bedrock, managing tool calls, and routing decisions without relying on a third-party agent framework for the core logic (a minimal sketch follows this list)
- LlamaIndex with ChromaDB powers the RAG pipeline, indexing 5 knowledge base documents stored in `s3://s3-cqm-bucket/knowledge-base/` and retrieving semantically relevant context at query time
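A minimal sketch of that loop, assuming the Bedrock Converse API; the `tools` registry of Python callables and the `run_agent` signature are illustrative, not the production code:

```python
# Minimal agentic loop sketch, assuming the Bedrock Converse API.
# `tools` maps tool names to plain Python callables (an assumption).
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")
MODEL_ID = "global.amazon.nova-2-lite-v1:0"

def run_agent(messages, tool_config, tools, max_turns=5):
    """Invoke the model, dispatch any requested tool calls, feed the
    results back, and stop when the model produces a final answer."""
    for _ in range(max_turns):
        response = bedrock.converse(
            modelId=MODEL_ID, messages=messages, toolConfig=tool_config
        )
        message = response["output"]["message"]
        messages.append(message)
        if response["stopReason"] != "tool_use":
            return message  # final, user-facing answer
        results = []
        for block in message["content"]:
            if "toolUse" in block:
                use = block["toolUse"]
                output = tools[use["name"]](**use["input"])  # dispatch the tool
                results.append({"toolResult": {
                    "toolUseId": use["toolUseId"],
                    "content": [{"json": output}],
                }})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent exceeded max_turns without a final answer")
```

Owning this loop directly is what makes the retry and escalation behaviour described later possible: every branch point is ordinary Python rather than framework internals.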
Speech-to-Text
- Amazon Transcribe handles all voice input, converting spoken queries to text before they enter the agent pipeline
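A hedged sketch of how a voice clip might enter the pipeline via `boto3`; the job name, S3 object key, and media format here are assumptions:

```python
# Kick off a Transcribe job for a staged audio clip (names are hypothetical).
import boto3

transcribe = boto3.client("transcribe", region_name="eu-west-1")
transcribe.start_transcription_job(
    TranscriptionJobName="cqm-voice-query-0001",
    Media={"MediaFileUri": "s3://s3-cqm-bucket/voice/query-0001.webm"},
    MediaFormat="webm",
    LanguageCode="en-US",
)
# The job is then polled with get_transcription_job; the transcript text
# enters the same pipeline as typed chat messages.
```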
Omnichannel Integration
- Microsoft Power Automate with an HTTP Connector bridges Outlook to the OmniCQM API — no custom email server required
- The backend exposes a unified `/query` REST endpoint accepting a `channel_id`, `user_id`, and message payload, ensuring every channel shares the same processing pipeline (sketched below)
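An illustrative shape of that endpoint; the `handle_query` stub stands in for the real agent pipeline and is an assumption, not the deployed code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    channel_id: str  # e.g. "outlook", "chat", "admin"
    user_id: str     # canonical identity after cross-channel resolution
    message: str

async def handle_query(channel_id: str, user_id: str, message: str) -> str:
    # Placeholder for the shared pipeline (RAG retrieval + Bedrock call).
    # channel_id only tags the conversation; it never branches core logic.
    return f"echo[{channel_id}/{user_id}]: {message}"

@app.post("/query")
async def query(q: Query) -> dict:
    return {"response": await handle_query(q.channel_id, q.user_id, q.message)}
```

Because the payload is channel-agnostic, wiring up a new channel means pointing it at `/query` with a fresh `channel_id` and nothing more.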
Infrastructure
- Three containers built and deployed on AWS ECS Fargate (`email-agent-cluster`):
  - `email-agent-backend` — FastAPI backend, custom orchestration, Bedrock calls, RAG pipeline
  - Frontend container — chat UI serving both user and admin interfaces
  - `cqm-mlflow` (nginx-proxied) — self-hosted MLflow tracking server
- Application Load Balancer (`email-agent-alb`) handles HTTPS termination and HTTP→HTTPS redirect
- AWS Certificate Manager issues and auto-renews the TLS certificate for `api.ksasalam.com`
- Route 53 manages DNS for the hosted zone
- Amazon ECR stores container images (`email-agent-backend:latest`, `cqm-mlflow:nginx`)
- Amazon S3 (`s3-cqm-bucket`) serves dual purpose: knowledge base document storage (`/knowledge-base/`) and MLflow artifact storage (`/mlartifacts/`)
- All images built for `linux/amd64` on ECS Fargate
Observability
- Every Bedrock invocation is wrapped with a custom `boto3` logging layer that captures inputs, outputs, latency, and token counts, then logs them as MLflow runs (a sketch follows this list)
- LLM-as-judge evaluation scores (relevance, accuracy, safety) are computed per response and persisted as MLflow metrics and artifacts
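A minimal sketch of the logging wrapper, assuming standard MLflow APIs; the metric and parameter names here are illustrative, not the project's exact schema:

```python
import time
import boto3
import mlflow

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

def logged_converse(**kwargs):
    """Invoke Bedrock and record the call as a nested MLflow run."""
    with mlflow.start_run(run_name="bedrock-invocation", nested=True):
        mlflow.log_param("model_id", kwargs.get("modelId"))
        start = time.perf_counter()
        response = bedrock.converse(**kwargs)
        mlflow.log_metric("latency_s", time.perf_counter() - start)
        usage = response.get("usage", {})
        mlflow.log_metric("input_tokens", usage.get("inputTokens", 0))
        mlflow.log_metric("output_tokens", usage.get("outputTokens", 0))
        # Persist the full request/response pair as a run artifact.
        mlflow.log_dict({"request": kwargs.get("messages"),
                         "response": response["output"]}, "trace.json")
        return response
```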
Challenges we ran into
1. Cross-channel session continuity
The hardest engineering problem was reliably linking an Outlook email sender identity to a chat UI user identity. We solved this with an identity resolution layer that normalises identifiers at ingestion time and stores a canonical user_id, ensuring both channels hydrate from the same conversation history.
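A minimal sketch of that normalisation step; the hashing scheme and the Outlook-specific rule are assumptions, not the project's code:

```python
import hashlib

def canonical_user_id(channel_id: str, raw_identity: str) -> str:
    """Normalise a channel-specific identity (an email address, a chat
    login) into one stable user_id shared by every channel."""
    normalised = raw_identity.strip().lower()
    if channel_id == "outlook":
        # Strip display names like 'Jane Doe <jane@example.com>'.
        if "<" in normalised:
            normalised = normalised.split("<", 1)[1].rstrip(">")
    return "user-" + hashlib.sha256(normalised.encode()).hexdigest()[:16]
```

Both channels then load conversation history keyed by this canonical `user_id`, so an email thread and a chat session hydrate from the same store.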
2. Custom orchestration without a framework
Rather than using LangChain or a managed agent service, we built the agentic loop directly with boto3. This gave us full control over tool call sequencing, context window management, and escalation logic — but required careful design to handle partial tool results and retry behaviour gracefully.
3. Power Automate HTTP Connector timeout
Power Automate enforces a 30-second HTTP timeout. For multi-step agent reasoning this was a hard wall. We resolved it with an async pattern: the connector receives an immediate 202 Accepted, and the agent posts the completed response back via a webhook once processing is done.
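A sketch of that hand-off, assuming FastAPI background tasks and a hypothetical `callback_url` registered by the Power Automate flow:

```python
import httpx
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

async def run_agent(message: str) -> str:
    return "..."  # stands in for the multi-step reasoning loop

async def run_and_post_back(callback_url: str, message: str) -> None:
    answer = await run_agent(message)  # may take well over 30 seconds
    async with httpx.AsyncClient() as client:
        await client.post(callback_url, json={"response": answer})

@app.post("/query-async", status_code=202)
async def query_async(payload: dict, tasks: BackgroundTasks) -> dict:
    # Acknowledge immediately so the connector's 30 s timeout never fires;
    # the finished answer arrives later via the webhook.
    tasks.add_task(run_and_post_back, payload["callback_url"], payload["message"])
    return {"status": "accepted"}
```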
4. LlamaIndex + ChromaDB on Fargate
ChromaDB's default persistence mode requires a local filesystem, which conflicts with stateless Fargate tasks. We configured ChromaDB to use an ephemeral in-memory index at container startup, re-indexed from S3 on cold start, and accepted the warm-up latency trade-off for the simplicity it provided.
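A cold-start sketch of that re-indexing, assuming current LlamaIndex and ChromaDB APIs; the local path and collection name are illustrative:

```python
import os
import boto3
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

def build_index(bucket: str = "s3-cqm-bucket", prefix: str = "knowledge-base/",
                local_dir: str = "/tmp/kb") -> VectorStoreIndex:
    """Rebuild the ephemeral vector index from S3 on container start."""
    os.makedirs(local_dir, exist_ok=True)
    s3 = boto3.client("s3")
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
        name = os.path.basename(obj["Key"])
        if name:  # skip the prefix "folder" entry itself
            s3.download_file(bucket, obj["Key"], os.path.join(local_dir, name))
    chroma = chromadb.EphemeralClient()  # in-memory: lost on restart, by design
    store = ChromaVectorStore(chroma_collection=chroma.create_collection("kb"))
    docs = SimpleDirectoryReader(local_dir).load_data()
    # The Titan embedding model is assumed to be configured via llama_index
    # Settings elsewhere in the app.
    return VectorStoreIndex.from_documents(
        docs, storage_context=StorageContext.from_defaults(vector_store=store)
    )
```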
5. MLflow on ECS with S3 artifact backend
Routing MLflow's artifact store to S3 (/mlartifacts/) required correct IAM task role permissions and careful environment variable configuration inside the cqm-mlflow container. Getting the nginx reverse proxy to correctly forward the MLflow UI behind the ALB also required non-obvious path rewrite rules.
6. Prompt consistency across channels
An email query arrives with signatures, quoted threads, and HTML artefacts. A voice query from Amazon Transcribe arrives with filler words and no punctuation. We built a prompt normalisation layer that strips channel-specific noise before any text reaches the LLM, ensuring consistent reasoning quality regardless of input source.
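An illustrative normalisation pass; the patterns below are assumptions about typical channel noise, not the production rules:

```python
import re

FILLERS = re.compile(r"\b(um+|uh+|you know)\b", re.IGNORECASE)

def normalise(channel_id: str, text: str) -> str:
    """Strip channel-specific noise before the text reaches the LLM."""
    if channel_id == "outlook":
        text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
        text = text.split("\nOn ", 1)[0]          # drop quoted reply thread
        text = re.split(r"\n--\s*\n", text)[0]    # drop signature block
    elif channel_id == "voice":
        text = FILLERS.sub("", text)              # drop transcript fillers
    return re.sub(r"\s+", " ", text).strip()
```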
Accomplishments that we're proud of
- True omnichannel continuity — a user can send an email, receive an AI response, then open the chat UI and see the full thread in context. This works end-to-end in production at `nova-api.ksasalam.com`.
- Three-container production deployment — backend, frontend, and MLflow all running as independent ECS Fargate services behind a single ALB, with HTTPS and auto-renewing certificates.
- Custom agentic orchestration with boto3 — no third-party agent framework for the core loop. Full control over reasoning, tool dispatch, and escalation logic, with every step observable in MLflow.
- RAG pipeline grounded in real documents — LlamaIndex + ChromaDB + Titan Embeddings delivering semantically relevant context from the knowledge base on every query.
- LLM-as-judge evaluation pipeline — every response is automatically scored and persisted to MLflow, providing a continuous quality signal without manual labelling. A sketch of the judge follows this list.
- Zero backend changes to add a new channel — validated by connecting the admin interface as a second channel using only a different `channel_id`. Core agent logic untouched.
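A hedged sketch of the judge call, reusing the same Converse API; the rubric prompt and the bare-JSON reply format are illustrative assumptions:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-west-1")

JUDGE_PROMPT = (
    "Score the assistant answer for relevance, accuracy and safety, each 0-1. "
    'Reply with JSON only, e.g. {{"relevance": 0.9, "accuracy": 0.8, "safety": 1.0}}.'
    "\n\nQuestion: {q}\nAnswer: {a}"
)

def judge(question: str, answer: str) -> dict:
    """Return per-response scores; assumes the judge replies with bare JSON."""
    response = bedrock.converse(
        modelId="global.amazon.nova-2-lite-v1:0",
        messages=[{"role": "user",
                   "content": [{"text": JUDGE_PROMPT.format(q=question, a=answer)}]}],
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])
```

Each returned score is then written to the run with `mlflow.log_metric`, giving the continuous quality signal described above.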
What we learned
- Custom orchestration beats framework magic for agentic systems. Building the reasoning loop directly with `boto3` was harder upfront but gave us precise control over context, retries, and escalation — and made debugging vastly simpler than black-box framework behaviour.
- Observability must be designed in, not bolted on. Wrapping every Bedrock call with MLflow logging from the start meant we always had a trace to inspect. This changed how fast we could iterate and diagnose regressions.
- RAG quality depends on chunking strategy, not just retrieval. Early versions of the LlamaIndex pipeline returned irrelevant chunks because documents were split at fixed character counts. Switching to semantic chunking with Titan Embeddings significantly improved answer grounding. A sketch of this follows the list.
- Power Automate is powerful but opinionated. Its constraints forced us into an async webhook pattern that actually made the architecture cleaner and more resilient than a synchronous design would have been.
- Container cold starts matter in agentic systems. Re-indexing ChromaDB from S3 on Fargate cold start added noticeable latency to the first query after a scale event. In production this would warrant a persistent vector store like OpenSearch Serverless.
- Channel identity is a first-class concern. Cross-channel user resolution is harder than it looks and deserves a dedicated identity service rather than ad-hoc normalisation logic.
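A sketch of the chunking switch, assuming LlamaIndex's semantic splitter and the llama-index-embeddings-bedrock integration (constructor details may vary between versions):

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.bedrock import BedrockEmbedding

embed_model = BedrockEmbedding(model_name="amazon.titan-embed-text-v2:0")

# Split where embedding similarity between adjacent sentences drops, so chunk
# boundaries follow topics instead of fixed character counts.
parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=embed_model,
)
docs = SimpleDirectoryReader("/tmp/kb").load_data()
nodes = parser.get_nodes_from_documents(docs)
```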
What's next for OmniCQM
Short term
- Integrate Microsoft Graph API to programmatically manage and orchestrate email at scale — moving beyond Power Automate's single-flow trigger to full mailbox management, threading, and priority routing across large consultant teams
- Add WhatsApp Business API and similar B2C messaging platforms as first-class channels, enabling OmniCQM to serve direct customer communications at consumer scale
- Build a consultant dashboard showing real-time query volume, autonomous resolution rate $\alpha$, and escalation trends per channel
Medium term
- Expand the RAG knowledge base with automatic document ingestion — new files dropped into S3 trigger re-indexing via Lambda, keeping the agent's knowledge current without manual intervention
- Replace the ephemeral ChromaDB index with Amazon OpenSearch Serverless for persistent, scalable vector search that survives Fargate task restarts
Long term
- Achieve $\alpha \geq 0.95$ autonomous resolution rate through continuous fine-tuning on organisation-specific query data, reducing consultant workload to:
$$W = Q \cdot (1 - 0.95) = 0.05Q$$
Five percent of queries reaching a human. That's the target.
Built With
- alb
- amazon-cloudwatch
- amazon-dynamodb
- amazon-rds-relational-database-service
- bedrock
- chromadb
- ecr
- ecs
- fargate
- fastapi
- javascript
- llamaindex
- powerautomate
- python
- react
- route53