Inspiration

Our inspiration stemmed from a clear gap in the rapidly advancing world of generative AI. While many powerful LLMs are available, and next-generation networks such as 5G and IoT offer transformative capabilities, there are few objective tools for determining which LLM performs best within these connected environments. How does an LLM's latency or output quality vary when it is deployed over a low-latency 5G network for remote medical diagnostics, or on an edge device for personalized learning? We built "Verdict" to demystify this selection process, providing the empirical data needed to ensure that AI solutions remain effective and accessible when intertwined with modern telecom infrastructure.

What it does

Verdict - BedRock vs Gemini/Open AI/Perplexity offers an intuitive, interactive environment for comprehensive LLM evaluation:

  • Model Selection: Users can easily choose which LLMs to compare, including Amazon Nova Lite and other models on AWS Bedrock, Gemini, OpenAI, and Perplexity.

  • Custom Prompting: Input specific prompts relevant to real-world applications in healthcare (e.g., summarizing patient data), education (e.g., generating quiz questions), sustainability (e.g., analyzing environmental reports), or arts & culture (e.g., creative text generation).

  • Parallel Execution & Data Capture: The platform simultaneously sends prompts to the selected LLMs via their respective APIs, accurately recording key performance metrics.

  • Performance Metrics: It meticulously captures and displays response latency, token usage, and estimated API cost for each LLM, enabling direct quantitative comparison.

  • Output Visualization: The generated text outputs from each LLM are presented side-by-side, allowing for direct qualitative assessment of coherence, relevance, and overall quality.

By providing these insights, Verdict helps users determine the "best fit" LLM for their specific application requirements and connectivity constraints, leading to optimized, impactful solutions.
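The parallel execution and metric capture described above can be sketched roughly as follows. This is a minimal illustration, not the actual Verdict implementation: the provider callables and per-1K-token prices are placeholders standing in for real API clients and published pricing.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_benchmark(prompt, providers, price_per_1k_tokens):
    """Send one prompt to every provider in parallel and record
    latency, token usage, and an estimated API cost for each."""
    def call_one(name, client_fn):
        start = time.perf_counter()
        text, tokens = client_fn(prompt)  # each callable returns (output, token_count)
        latency = time.perf_counter() - start
        cost = tokens / 1000 * price_per_1k_tokens[name]
        return name, {"output": text, "latency_s": latency,
                      "tokens": tokens, "est_cost_usd": cost}

    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        futures = [pool.submit(call_one, n, fn) for n, fn in providers.items()]
        return dict(f.result() for f in futures)

if __name__ == "__main__":
    # Placeholder "providers" that transform the prompt instead of calling real APIs.
    fake = {"bedrock": lambda p: (p.upper(), 12), "openai": lambda p: (p[::-1], 10)}
    prices = {"bedrock": 0.00006, "openai": 0.00015}
    print(run_benchmark("hello", fake, prices))
```

With real clients plugged in, each entry of the returned dict feeds directly into the side-by-side metric and output displays.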

How we built it

"Verdict - BedRock vs Gemini/Open AI/Perplexity" was developed as a Python-based web application, built on the following stack:

  • Frontend & User Interface (Streamlit): The clean, intuitive, and highly interactive user interface was developed entirely using Streamlit. This allowed for rapid creation of dynamic input forms, real-time metric displays (charts, tables), and side-by-side text output comparisons, ensuring a seamless user experience.

  • Backend Logic & API Orchestration (Python): The core intelligence, responsible for managing user interactions, orchestrating simultaneous API calls to various LLM providers, collecting responses, and processing performance data, was implemented in Python.

  • LLM Integration:

    • For AWS Bedrock (including Amazon Nova Lite), we used the boto3 SDK, leveraging AWS's managed, secure generative AI APIs.
    • For Gemini (via the Google AI API), OpenAI, and Perplexity, we integrated directly with their respective RESTful APIs using Python's requests library, handling each provider's unique authentication, request/response formats, and rate limits.
  • Deployment (AWS EC2 Instance): The entire application, including the Streamlit frontend and Python backend, is deployed and hosted on an AWS EC2 instance. This provides a dedicated, scalable, and customizable cloud computing environment for execution.
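Because each provider returns a differently shaped payload, the integration layer needs per-provider adapters that map raw responses onto one common schema. The sketch below is illustrative only: the field names follow the general shapes of each provider's chat/generation responses as we understand them, but actual payloads vary by model and API version (Bedrock response bodies, in particular, differ across model families).

```python
import json

def normalize_bedrock(raw):
    """Map an illustrative Bedrock-style payload onto a common schema.
    (Real Bedrock response bodies vary by model family.)"""
    body = json.loads(raw["body"]) if isinstance(raw["body"], str) else raw["body"]
    return {"text": body["outputText"], "tokens": body.get("tokenCount", 0)}

def normalize_openai_style(raw):
    """Map an OpenAI/Perplexity-style chat-completion payload."""
    return {"text": raw["choices"][0]["message"]["content"],
            "tokens": raw.get("usage", {}).get("total_tokens", 0)}

def normalize_gemini(raw):
    """Map a Gemini-style generateContent payload."""
    return {"text": raw["candidates"][0]["content"]["parts"][0]["text"],
            "tokens": raw.get("usageMetadata", {}).get("totalTokenCount", 0)}

# One lookup table lets the benchmarking loop stay provider-agnostic.
NORMALIZERS = {"bedrock": normalize_bedrock,
               "openai": normalize_openai_style,
               "perplexity": normalize_openai_style,
               "gemini": normalize_gemini}
```

Keeping the provider-specific knowledge inside these small adapters is what lets the rest of the pipeline treat every model identically.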

Challenges we ran into

Developing "Verdict - BedRock vs Gemini/Open AI/Perplexity" involved overcoming both business and technical hurdles:

Business Challenges

  1. Defining "Real-World Impact" Metrics: Translating raw LLM performance into concrete, measurable real-world impact for diverse sectors required careful consideration and framework design.

  2. Managing Evolving LLM Ecosystem: The rapid pace of new LLM releases, API changes, and pricing model updates across providers necessitated continuous adaptation and maintenance.

  3. Cost Management for Benchmarking: Running extensive, parallel queries against multiple commercial LLM APIs could quickly become expensive, requiring efficient test design and resource allocation.

  4. Ensuring Ethical & Unbiased Evaluation: Developing evaluation criteria that minimize bias and ensure responsible AI outputs, particularly in sensitive domains like healthcare and education, was a continuous challenge.

  5. Achieving Long-Term Relevance: Designing the platform to remain valuable and adaptable as LLM technology and next-gen connectivity solutions continue to evolve.

Technical Challenges

  1. Heterogeneous API Interoperability: Integrating and normalizing data from vastly different LLM APIs (AWS Bedrock, Google AI, OpenAI, Perplexity) with varying parameters, authentication, and response structures.

  2. Accurate Real-Time Latency Measurement: Precisely measuring and attributing latency across various network paths (from EC2 to different LLM endpoints) was critical but complex to implement consistently.

  3. Standardized Prompt Engineering for Comparison: Crafting prompts that yield truly comparable and meaningful results across LLMs trained on different datasets and architectures was a significant challenge.

  4. Qualitative Output Assessment Automation: Developing automated or semi-automated methods for objectively scoring the qualitative aspects of diverse generative outputs (e.g., creativity, factual accuracy, coherence) beyond human review.

  5. Simulating Next-Gen Network Conditions: While deployed on EC2, accurately simulating granular 5G ultra-low latency, IoT device limitations, or edge computing processing capabilities for true "network-aware" insights required intricate technical considerations.
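One crude way to approximate challenge 5 without real 5G or edge hardware is to inject artificial round-trip delays around each provider call. The profile values below are illustrative assumptions, not measured figures; real 5G and IoT latencies vary widely with deployment.

```python
import time

# Illustrative extra round-trip delays in seconds (assumed, not measured).
NETWORK_PROFILES = {"5g_urban": 0.010, "4g_lte": 0.050, "constrained_iot": 0.300}

def with_network_profile(profile, call_fn, *args):
    """Run call_fn under an emulated network profile and return
    (result, total_latency_s) including the injected delay."""
    delay = NETWORK_PROFILES[profile]
    start = time.perf_counter()
    time.sleep(delay)            # emulate the request leg
    result = call_fn(*args)
    time.sleep(delay)            # emulate the response leg
    return result, time.perf_counter() - start
```

A sleep-based wrapper only shifts latency; it cannot capture jitter, packet loss, or bandwidth limits, which is why heavier emulation tooling remains on the roadmap.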

Accomplishments that we're proud of

  1. Unified Multi-LLM Comparison: Successfully created a single platform capable of performing direct, real-time comparisons across major LLM providers including AWS Bedrock (with Nova Lite), Gemini, OpenAI, and Perplexity.

  2. Intuitive & Clean Streamlit UI: Engineered a highly user-friendly and visually appealing web interface that makes complex LLM performance data accessible and actionable for a wide range of users.

  3. Actionable Performance Insights: The platform delivers concrete, measurable data (latency, token usage, cost) that directly translates into valuable insights for optimizing generative AI deployments for real-world impact.

  4. Addressing AI + Connectivity Gap: "Verdict" uniquely addresses the crucial problem of selecting optimal LLMs for applications leveraging next-generation connectivity, bridging a critical knowledge gap in the industry.

  5. Robust AWS Cloud Deployment: Successfully designed, built, and deployed a stable and technically sound application on an AWS EC2 instance, demonstrating proficient cloud engineering practices.

What we learned

Building "Verdict" provided us with invaluable insights:

  • Contextual LLM Selection is Paramount: There is no "one-size-fits-all" LLM; optimal choice hinges on specific application requirements, network conditions, and desired performance characteristics.

  • AI and Connectivity are Intertwined: Next-generation telecommunications networks fundamentally influence the design, performance, and accessibility of generative AI applications, especially at the edge.

  • Streamlit Accelerates Development: Streamlit proved to be an incredibly powerful tool for rapid prototyping and iterating on data-driven web applications, significantly accelerating our development cycle.

  • Complexity of Multi-API Management: Effectively integrating and maintaining consistency across diverse LLM APIs, each with unique specifications, requires a flexible and resilient architectural approach.

  • Value of Empirical Data: Providing objective, measurable data empowers organizations to make confident, evidence-based decisions when adopting and deploying generative AI solutions.

What's next for Verdict - BedRock vs Gemini/Open AI/Perplexity

Our future roadmap for "Verdict" includes exciting enhancements:

  • Advanced Network Emulation: Deeper integration of network emulation tools or exploring live testing with actual 5G/edge infrastructure to provide more granular performance insights under varying connectivity conditions.

  • Automated Qualitative Scoring: Implementing more sophisticated, automated modules for objectively assessing the quality, relevance, and safety of generated content, potentially using another LLM as an evaluator.

  • User Profiles & Persistent Storage: Adding user authentication and a robust database (e.g., AWS DynamoDB) to allow users to save comparison results, create custom test suites, and track historical performance.

  • Expanded Model Catalog & Versioning: Continuously integrating new LLMs, foundational models, and allowing comparisons across different versions of the same model as they become available.

  • Intelligent Recommendation Engine: Developing an AI-powered system that suggests optimal LLMs based on user-defined parameters such as budget, required latency, domain specificity, and desired output characteristics.
