Inspiration

In the dynamic landscape of healthcare, the rapid advancement of large language models (LLMs) has ushered in a new era of possibilities. In this traditionally data-intensive sector, LLMs hold immense potential to transform diagnostics, patient care, and research.

However, a profound challenge became apparent: the inherent "black box" nature of these powerful AI tools, coupled with the critical need for accuracy in a domain where errors can have life-altering consequences. This pointed to a significant market gap: the lack of a dedicated, robust mechanism to continuously validate the reliability of LLM-generated content in sensitive clinical contexts.

This realization, sitting at the intersection of cutting-edge AI and the uncompromising demands of patient safety, was the core technical inspiration for ClinicalCompass – a specialized tool designed to bring transparency, accountability, and verifiable accuracy to LLM deployments in healthcare.

What it does

ClinicalCompass is a real-time LLM evaluation and advisory platform crafted for the demands of mission-critical healthcare environments, with AWS Lambda services as its core engine. It systematically scrutinizes prompts and their corresponding responses from a diverse array of LLM applications, including those powered by Bedrock Nova, OpenAI, Perplexity, and Gemini.

At its core, ClinicalCompass provides a multi-faceted approach to LLM assurance, powered by its robust microservices architecture:

  1. Comprehensive Data Collection via AWS Lambda Microservices: The platform intelligently captures all prompts and the ensuing LLM-generated responses from integrated applications, alongside their vital telemetry data, forming a rich dataset for analysis. This process is orchestrated through dedicated microservices, each handling specific LLM interactions and data streams.

  2. Rigorous and Grounded Evaluation: It employs a comprehensive suite of metrics, from standard NLP scores (BLEU, ROUGE, METEOR) to advanced DeepEval metrics (Hallucination, Faithfulness, Answer Relevance). This evaluation is fortified by reference text and grounding data, ensuring LLM responses are factually accurate, medically consistent, and clinically relevant.

  3. Intelligent, Actionable Advisory: Through continuous, real-time evaluation, ClinicalCompass delivers intelligent advisory. For example, it might recommend that "Bedrock Nova is better suited for X-ray analysis, while Gemini excels in blood report medical diagnosis," thereby empowering healthcare providers and developers to confidently select the optimal LLM for specific clinical situations.

  4. Integrated Micro-Security: Built from the ground up with robust micro-security features, the application ensures the highest level of protection for sensitive patient information and adheres to stringent healthcare data security standards, a critical aspect of its microservices design.
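To illustrate the grounding idea behind these checks, here is a deliberately simplified, pure-Python proxy for faithfulness scoring. The actual platform relies on DeepEval's model-based metrics; this token-overlap heuristic is only a sketch of the underlying intuition:

```python
def grounding_overlap(response: str, grounding: str) -> float:
    """Toy faithfulness proxy: the fraction of response tokens that
    also appear in the grounding text. A low score suggests content
    not supported by the grounding data (a possible hallucination).
    The real system uses DeepEval's model-based metrics instead."""
    resp_tokens = set(response.lower().split())
    ground_tokens = set(grounding.lower().split())
    if not resp_tokens:
        return 0.0
    return len(resp_tokens & ground_tokens) / len(resp_tokens)
```

In practice a response scoring below a chosen threshold would be flagged for review rather than surfaced to clinicians.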

How we built it

ClinicalCompass is engineered as a highly scalable and resilient platform, primarily leveraging Python and AWS serverless architecture.

The user interface is powered by Streamlit, providing an intuitive and interactive dashboard for healthcare professionals to monitor LLM performance and receive advisories. This Streamlit application is deployed on an EC2 instance, serving as the central front-end for our users.

At its core, ClinicalCompass operates on a microservices architecture orchestrated by AWS Lambda functions and connected via a robust event-driven design:

  1. Prompt Collection & Response Retrieval:
    • extract_clinical_telemetry_prompts: This Lambda function is triggered to identify and collect prompts from various integrated healthcare applications.
    • get_responses_from_openai_gpt, get_responses_from_perplexity_sonar, get_responses_from_gemini_flash, get_responses_from_bedrock_nova: These distinct Lambda functions forward the collected prompts to their respective LLM APIs (OpenAI, Perplexity, Gemini, Bedrock Nova) and retrieve the generated responses.

  2. Data Storage & Event Orchestration:
    • AWS EventBridge acts as the central nervous system, orchestrating the flow between these microservices. It's used to trigger Lambda functions based on events (e.g., a new prompt being extracted, an LLM response being received).
    • An S3 bucket serves as the primary storage for raw telemetry data, collected prompts, LLM responses, and grounding data, ensuring data persistence and easy access for analytics.
    • DynamoDB tables are utilized for storing metadata, configuration settings, real-time evaluation results, and micro-events, providing low-latency access for the UI.

  3. Evaluation & Feedback Loop:
    • validate_responses_from_llm: This Lambda function performs initial validation of the LLM responses before deeper evaluation.
    • evaluate_llm_metric_scores: This critical Lambda function is the heart of our evaluation engine. Leveraging powerful Python libraries and the DeepEval framework, it calculates sophisticated metric scores. Beyond traditional measures like BLEU, ROUGE, and METEOR, it crucially quantifies hallucination, faithfulness, and answer relevance by rigorously comparing LLM outputs against reference text and grounding data.

  4. update_microevents_to_ui: This Lambda function is responsible for pushing evaluation results and advisory insights from DynamoDB back to the Streamlit UI, providing real-time updates to users.

This cohesive integration of Python, Streamlit, and AWS serverless components (Lambda, S3, EventBridge, DynamoDB) along with specialized NLP libraries ensures a robust, scalable, and highly accurate evaluation pipeline.
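As a minimal sketch, one of the get_responses_from_* Lambda handlers invoked by EventBridge might look like the following. The event field names (prompt, prompt_id) and the stubbed provider call are illustrative assumptions, not the project's actual schema:

```python
import json

def handler(event, context):
    """Illustrative get_responses_from_* Lambda handler.
    EventBridge delivers the extracted prompt in event["detail"];
    the real handler would call the provider's API and emit a
    response event for downstream validation."""
    detail = event.get("detail", {})
    prompt = detail["prompt"]
    # The provider API call is stubbed out in this sketch.
    response_text = f"[stubbed response to: {prompt}]"
    return {
        "statusCode": 200,
        "body": json.dumps({
            "prompt_id": detail.get("prompt_id"),
            "response": response_text,
        }),
    }
```

Keeping each provider behind its own small handler like this is what lets the platform add or swap LLM backends without touching the rest of the pipeline.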

Challenges we ran into

Business Challenges:

  1. Defining Comprehensive Healthcare-Specific Metrics: Establishing a universally accepted and clinically relevant set of metrics for LLM accuracy beyond generic NLP scores was a continuous challenge, requiring extensive collaboration with medical experts.

  2. Maintaining Data Privacy and Security: Handling sensitive patient data and clinical prompts necessitated rigorous adherence to data privacy regulations across all AWS Lambda microservices, S3, and DynamoDB.

  3. Ensuring Real-time Advisory Trust: Building trust among healthcare professionals for an AI-driven advisory system on LLM usage required consistent, transparent, and explainable evaluation results.

  4. Integration Complexity with Diverse LLMs: Each LLM (OpenAI, Gemini, Bedrock Nova, Perplexity) has unique API structures and rate limits, demanding flexible and robust integration strategies for the AWS Lambda functions responsible for invoking them.

  5. Scaling Evaluation Workloads: As the number of prompts and LLM interactions grew, ensuring the evaluation pipeline could scale efficiently without incurring prohibitive costs or latency became a significant business concern, directly impacting the design and optimization of AWS Lambda microservices.

Technical Challenges:

  1. Cold Starts and Latency for LLM API Calls:

    Challenge: Lambda cold starts significantly impacted the latency when making external API calls to various LLMs, especially for interactive evaluation loops or real-time advisory.

    Solution: We employed provisioned concurrency for critical Lambda functions (get_responses_from_*) and optimized handler code to minimize startup time, along with strategic use of Python's requests library and connection pooling.
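The warm-start reuse behind this solution can be sketched as follows. The expensive client here is a stand-in for something like a pooled requests.Session configured for an LLM API:

```python
# Module scope runs once per Lambda execution environment, so a
# client created here is reused across warm invocations instead of
# being rebuilt on every request.
_client = None

def _expensive_client():
    # Placeholder for e.g. a requests.Session with connection
    # pooling and retry adapters pointed at an LLM API.
    return object()

def get_client():
    """Return a cached client, paying the construction cost only
    on a cold start."""
    global _client
    if _client is None:
        _client = _expensive_client()
    return _client
```

Combined with provisioned concurrency, this keeps per-invocation latency down to the API round-trip itself.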

  2. Managing Lambda Execution Time and Memory for DeepEval:

    Challenge: Running complex NLP evaluation frameworks like deepeval (which can involve deep learning models) within Lambda's execution limits (memory and timeout) was demanding.

    Solution: We carefully optimized dependencies, used larger memory configurations for evaluation Lambdas, and explored using AWS Fargate for very heavy-duty, long-running evaluation tasks that might exceed Lambda's practical limits.

  3. Orchestrating Complex Workflows with EventBridge:

    Challenge: Building intricate event patterns and reliable delivery mechanisms via EventBridge to connect numerous Lambda functions for sequential and parallel processing (e.g., extract -> get_response -> validate -> evaluate) required meticulous design and testing.

    Solution: We extensively used EventBridge's custom event buses and rules, along with detailed CloudWatch logging and metrics, to monitor event flow and debug orchestration issues. State machines in AWS Step Functions were considered for more complex, stateful workflows, but EventBridge proved sufficient for most stateless event routing.
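An EventBridge rule pattern for this kind of routing might look like the fragment below; the source and detail-type values are illustrative, not the project's actual event schema:

```json
{
  "source": ["clinicalcompass.extraction"],
  "detail-type": ["PromptExtracted"]
}
```

A rule with this pattern on a custom event bus would fan a newly extracted prompt out to the response-retrieval Lambdas, with a dead-letter queue configured on each target for failed deliveries.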

  4. Handling Concurrency Limits and Throttling:

    Challenge: Uncontrolled invocation of Lambda functions by EventBridge or direct calls could quickly hit AWS account concurrency limits or cause throttling on external LLM APIs.

    Solution: We implemented proper queueing mechanisms (e.g., SQS integration for batching, or using EventBridge with target dead-letter queues), configured appropriate concurrency limits at the Lambda function level, and incorporated exponential backoff and retry logic for external API calls.
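The retry logic for throttled external calls can be sketched as a small helper; this is a simplified version of the exponential-backoff-with-jitter pattern, not the project's exact implementation:

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    """Retry fn() with exponential backoff and jitter, the standard
    pattern for absorbing throttling from external LLM APIs."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Exponential delay plus jitter avoids synchronized retries
            # across concurrent Lambda invocations.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, base_delay))
```

In production the caught exception should be narrowed to the provider's throttling error type rather than a bare Exception.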

  5. Dependency Management and Layering in Lambda:

    Challenge: Packaging large Python libraries (including their native components) within Lambda deployment packages was challenging due to size limits and compatibility issues.

    Solution: We extensively utilized Lambda Layers to manage common dependencies, which significantly reduced individual Lambda package sizes and streamlined updates. For particularly large or complex libraries that exceeded standard Lambda deployment package limits, we managed dependencies by uploading zipped packages to an S3 bucket. These packages were then downloaded to the Lambda function's /tmp directory and optionally unzipped at runtime.
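The runtime side of the S3 workaround can be sketched as below. The S3 download itself (boto3's download_file into /tmp) is omitted; this fragment covers only the extract-and-mount step, and the paths are illustrative:

```python
import sys
import zipfile

def mount_dependency_zip(zip_path, extract_dir="/tmp/deps"):
    """Extract a vendored dependency zip (previously downloaded from
    S3 into the Lambda's /tmp) and make its modules importable by
    prepending the extraction directory to sys.path."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(extract_dir)
    if extract_dir not in sys.path:
        sys.path.insert(0, extract_dir)
    return extract_dir
```

Because /tmp persists for the life of the execution environment, warm invocations skip the download and extraction entirely.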

Accomplishments that we're proud of

  1. Developed a Comprehensive, Multi-LLM Evaluation Framework: Successfully integrated and evaluated leading LLMs (OpenAI, Gemini, Perplexity, Bedrock Nova) against a diverse set of linguistic and domain-specific metrics (hallucination, faithfulness, relevance), a groundbreaking feat for clinical applications.

  2. Engineered a Scalable Serverless Microservices Architecture: Built a highly resilient and extensible system using AWS Lambda, EventBridge, S3, and DynamoDB, capable of handling vast volumes of telemetry data and performing complex evaluations at scale.

  3. Implemented Robust Data Grounding and Deep Evaluation: Leveraged the deepeval framework to conduct advanced, data-grounded evaluations, ensuring a higher level of trustworthiness and clinical relevance in our accuracy assessments.

  4. User-Friendly Streamlit Interface for Complex Analytics: Provided a powerful yet intuitive Streamlit dashboard on EC2, enabling clinical users to easily visualize complex evaluation data and understand LLM performance without deep technical expertise.

What we learned

Building ClinicalCompass reinforced several key lessons:

  1. Serverless Prowess and Pitfalls: AWS Lambda provides incredible scalability and cost-efficiency, but effective management of cold starts, concurrency, and complex dependency layers is crucial for performance-sensitive applications.

  2. Event-Driven Architectures are Key to AWS Lambda Microservices: EventBridge proved indispensable for decoupling services and orchestrating complex workflows, enhancing system flexibility and resilience.

  3. Data Integrity and Security are Non-Negotiable: In healthcare, a "security-first" mindset must permeate every architectural decision, from data storage to inter-service communication.

  4. User Experience Matters Immensely for Adoption: Presenting complex AI evaluation results in an intuitive and actionable manner via Streamlit was critical for gaining user trust and driving adoption among clinical professionals.

What's next for ClinicalCompass

The future of ClinicalCompass is bright, with several exciting developments planned:

  1. Expanded LLM Integration and Optimized Invocation: We plan to continuously integrate and evaluate new and emerging LLMs, including specialized medical models, ensuring adaptability to the evolving AI landscape.

  2. Secure EHR System Connectivity: Developing secure and compliant connectors for Electronic Health Record (EHR) systems is a key next step, enabling real-time, in-context evaluation directly from clinical data.

  3. Advanced Explainability Microservices: We will incorporate granular Explainable AI functionalities by developing new Lambda-driven microservices. These services will process and store deeper insights into evaluation scores, enhancing user understanding of LLM performance.

  4. Dynamic Customizable Evaluation Workflows: Empowering users to define and configure custom evaluation workflows and weighting for metrics is a priority. This will be implemented through flexible AWS Lambda functions dynamically triggered by AWS EventBridge, allowing for highly tailored assessments based on unique clinical needs and risk tolerance.

Built With

python, streamlit, aws-lambda, amazon-s3, amazon-eventbridge, amazon-dynamodb, amazon-ec2
