Inspiration

As Large Language Models become more powerful, one major challenge remains: no single model runs efficiently on every hardware environment. A lightweight laptop may struggle with larger models, while a high-performance system running a small model leaves GPU and memory capacity unused. We wanted to build an adaptive agentic intelligence platform capable of:

  • understanding the hardware environment
  • dynamically selecting the most suitable LLM
  • automatically optimizing prompts based on system capabilities
  • orchestrating AI workflows intelligently
  • collecting telemetry to continuously improve routing decisions

The inspiration came from combining the following, using modern cloud-native and agentic AI technologies:

  • autonomous AI agents
  • observability platforms
  • real-time telemetry streaming
  • hardware-aware AI orchestration

What it does

Gemini_Elastic_Kafka is an adaptive multi-model AI orchestration platform powered by:

  • Gemini reasoning agents
  • MCP (Model Context Protocol) tool servers
  • Ollama local LLM runtime
  • Kafka telemetry streaming
  • Elastic observability
  • hardware-aware routing intelligence

The platform dynamically:

  1. Detects hardware configuration: CPU, RAM, GPU availability, available VRAM
  2. Selects the most appropriate LLM from Gemma, Phi3, Mistral, and Llama
  3. Optimizes prompts for the selected model
  4. Executes inference through Ollama
  5. Publishes telemetry events: latency, success/failure, tokens/sec, memory usage
  6. Streams telemetry into Kafka
  7. Stores analytics in Elastic for observability and future optimization
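Steps 4 to 6 above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes Ollama's default local endpoint and its non-streaming `/api/generate` response, which reports `total_duration`, `eval_count` and `eval_duration` (durations in nanoseconds). The `TelemetryEvent` fields and the helper names are hypothetical, and memory usage is left out because the Ollama response does not carry it.

```python
import json
import time
import urllib.request
from dataclasses import dataclass, asdict

# Ollama's default local endpoint; adjust if the runtime listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


@dataclass
class TelemetryEvent:
    # Hypothetical event shape, chosen for illustration.
    model: str
    success: bool
    latency_ms: float
    tokens_per_sec: float
    timestamp: float


def call_ollama(model: str, prompt: str, timeout: float = 120.0) -> dict:
    """Run one non-streaming generation and return the parsed JSON response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def to_telemetry(model: str, response: dict) -> TelemetryEvent:
    """Derive latency and throughput from the timing fields in the response."""
    eval_ns = response.get("eval_duration", 0)
    tokens = response.get("eval_count", 0)
    return TelemetryEvent(
        model=model,
        success=bool(response.get("done", False)),
        latency_ms=response.get("total_duration", 0) / 1e6,
        tokens_per_sec=tokens / (eval_ns / 1e9) if eval_ns else 0.0,
        timestamp=time.time(),
    )


def encode_event(event: TelemetryEvent) -> bytes:
    """Serialize the event as JSON bytes, usable as a Kafka message value."""
    return json.dumps(asdict(event)).encode("utf-8")
```

With a running Ollama instance, `encode_event(to_telemetry("phi3", call_ollama("phi3", "Hello")))` would produce the payload to hand to a Kafka producer; the model tag `phi3` is only an example.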

Gemini acts as the reasoning and orchestration layer, helping the system decide:

  • which model to use
  • whether fallback is needed
  • how prompts should be optimized
  • how the workflow should execute

The result is a self-optimizing adaptive AI system capable of intelligent multi-step orchestration.
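The fallback decision mentioned above could be handled with something as simple as the sketch below. It is an assumption about how fallback might work, not a description of the actual planner: the candidate order would come from the routing layer, and `run` is a hypothetical callable that executes one inference and raises on failure.

```python
from typing import Callable, Sequence


def run_with_fallback(
    prompt: str,
    candidates: Sequence[str],
    run: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in order; return (model, output) from the first success."""
    errors = []
    for model in candidates:
        try:
            return model, run(model, prompt)
        except Exception as exc:  # any failure moves on to the next candidate
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Collecting the per-model errors keeps the final exception useful for telemetry when every candidate fails.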

How we built it

We built the system using a modular microservice-style architecture.

Backend & Agent Layer

  • Java 21
  • Spring Boot WebFlux
  • REST APIs

AI Orchestration

  • Gemini API
  • Ollama local inference runtime
  • Adaptive routing engine

MCP Tool Server

A Python-based MCP server was created using:

  • FastAPI
  • psutil
  • pynvml

The MCP server exposes tools for:

  • hardware inspection
  • model discovery
  • inference execution
  • telemetry collection
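A hardware inspection tool of this kind might look roughly like the sketch below. To stay runnable without extra packages it uses only the standard library and the `nvidia-smi` command line tool, whereas the server described here uses psutil and pynvml. The `HardwareProfile` fields and the `inspect_hardware` name are assumptions made for illustration.

```python
import os
import shutil
import subprocess
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class HardwareProfile:
    cpu_cores: int
    total_ram_gb: Optional[float]  # None when the platform does not expose it
    gpu_available: bool
    free_vram_gb: Optional[float]  # None when no NVIDIA GPU is visible


def _total_ram_gb() -> Optional[float]:
    # os.sysconf is POSIX-only; on other platforms report "unknown".
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return round(pages * page_size / 1024**3, 2)


def _free_vram_gb() -> Optional[float]:
    # Ask nvidia-smi for free memory in MiB; any failure means "no usable GPU".
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=5, check=True,
        ).stdout
        values = [float(line) for line in out.splitlines() if line.strip()]
    except (subprocess.SubprocessError, OSError, ValueError):
        return None
    # With several GPUs, report the one with the most free memory.
    return round(max(values) / 1024, 2) if values else None


def inspect_hardware() -> dict:
    """Return a JSON-friendly snapshot that a tool endpoint could serve."""
    vram = _free_vram_gb()
    profile = HardwareProfile(
        cpu_cores=os.cpu_count() or 1,
        total_ram_gb=_total_ram_gb(),
        gpu_available=vram is not None,
        free_vram_gb=vram,
    )
    return asdict(profile)
```

Returning a plain dictionary keeps the result easy to serialize from a FastAPI route or an MCP tool handler.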

Streaming & Observability

  • Apache Kafka for telemetry streaming
  • Elastic for indexing and observability

Cloud & AI Integration

  • Gemini for planning and orchestration
  • BigQuery-compatible telemetry architecture
  • Cloud-ready deployment design

Workflow

User Request
↓
Spring Boot Agent
↓
Gemini Planner
↓
MCP Tool Server
↓
Hardware Detection
↓
Adaptive Model Selection
↓
Ollama Inference
↓
Kafka Telemetry
↓
Elastic Analytics

Challenges we ran into

One major challenge was orchestrating multiple AI and infrastructure components together reliably.

Some of the key challenges included:

Dynamic Model Routing

Choosing the correct model based on RAM, GPU availability, prompt complexity, and inference latency required careful orchestration logic.
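As a rough illustration of such routing logic, the sketch below picks a model from a small catalogue using the four inputs named above. Everything specific in it is an assumption: the model tags, the memory and quality figures, the thresholds, and the simplification that a model is loaded entirely into VRAM whenever a GPU is present.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelSpec:
    name: str             # hypothetical Ollama tag
    min_memory_gb: float  # assumed memory needed to load the model
    quality: int          # assumed relative quality rank, higher is better


# Illustrative catalogue; the numbers are placeholders, not measurements.
CATALOGUE = [
    ModelSpec("phi3", 4.0, 1),
    ModelSpec("gemma", 6.0, 2),
    ModelSpec("mistral", 8.0, 3),
    ModelSpec("llama3", 10.0, 4),
]


def select_model(
    ram_gb: float,
    free_vram_gb: Optional[float],
    prompt_tokens: int,
    latency_budget_s: float,
) -> ModelSpec:
    """Pick the best model that fits in memory and suits the request."""
    # Budget against VRAM when a GPU is present, otherwise against system RAM.
    memory = free_vram_gb if free_vram_gb is not None else ram_gb
    fits = [m for m in CATALOGUE if m.min_memory_gb <= memory]
    if not fits:
        # Nothing fits comfortably: fall back to the smallest model.
        return min(CATALOGUE, key=lambda m: m.min_memory_gb)
    # Short prompts with a tight latency budget get the smallest fitting model.
    if prompt_tokens < 200 and latency_budget_s < 2.0:
        return min(fits, key=lambda m: m.min_memory_gb)
    return max(fits, key=lambda m: m.quality)
```

For example, `select_model(16, None, 1000, 10)` picks the highest-ranked entry that fits in 16 GB of RAM, while a short prompt with a one-second budget on the same machine gets the smallest model.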

Integrating MCP Architecture

Building a functional MCP server that could expose tools dynamically while integrating with Spring Boot required custom interoperability design.

Telemetry Streaming

Capturing meaningful inference telemetry and streaming it reliably into Kafka while maintaining low latency was challenging.
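One common way to keep telemetry off the inference path is a bounded in-memory buffer drained by a background thread. The sketch below shows that pattern with the standard library only; it is an assumed design, not the project's implementation. The `send` callable is hypothetical and would wrap whatever Kafka producer client the service uses.

```python
import queue
import threading
from typing import Callable


class TelemetryBuffer:
    """Bounded buffer that hands events to a sender on a background thread."""

    def __init__(self, send: Callable[[bytes], None], max_size: int = 1000):
        self._queue = queue.Queue(maxsize=max_size)
        self._send = send
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def publish(self, payload: bytes) -> bool:
        # Never block the inference path: drop the event if the buffer is full.
        try:
            self._queue.put_nowait(payload)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self) -> None:
        while True:
            payload = self._queue.get()
            if payload is None:  # sentinel placed by close()
                return
            try:
                self._send(payload)
            except Exception:
                self.dropped += 1  # a failed send must not kill the worker

    def close(self) -> None:
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

The trade-off is explicit: under sustained back-pressure events are dropped and counted rather than delaying responses.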

Resource Constraints

Running local models efficiently on limited hardware while maintaining acceptable response times required prompt optimization and intelligent routing.
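The write-up does not spell out how prompts are optimized, so the following is only one plausible technique: trimming context to a size budget chosen per model. Word counts stand in for tokens here, and the function name and parameters are hypothetical.

```python
def fit_prompt(instruction: str, context_chunks: list[str], max_words: int) -> str:
    """Keep the instruction plus as many of the newest context chunks as fit."""
    budget = max_words - len(instruction.split())
    kept: list[str] = []
    # Walk from newest to oldest so recent context wins when space is short.
    for chunk in reversed(context_chunks):
        size = len(chunk.split())
        if size > budget:
            break
        kept.append(chunk)
        budget -= size
    kept.reverse()  # restore chronological order
    return "\n\n".join(kept + [instruction])
```

A smaller model would get a smaller `max_words`, so the same request shrinks automatically on constrained hardware.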

Multi-Service Coordination

Coordinating Gemini, Ollama, Kafka, Elastic, the MCP server, and the Spring Boot services required careful service orchestration and configuration management.

Accomplishments that we're proud of

We are especially proud of building:

Adaptive Hardware-Aware AI Routing

The platform can intelligently select different LLMs based on live hardware telemetry.

Agentic AI Workflow

Instead of simple chat completion, the system performs:

  • reasoning
  • planning
  • tool usage
  • orchestration
  • telemetry-driven execution

Full Observability Pipeline

We successfully integrated telemetry streaming, Kafka pipelines, and Elastic analytics to monitor model performance in real time.

MCP-Based Tool Architecture

We created a reusable MCP tool framework capable of extending the agent with:

  • hardware tools
  • GitLab integrations
  • MongoDB integrations
  • future cloud tools

Production-Style Architecture

The system follows scalable cloud-native architecture patterns and can be deployed to:

  • Cloud Run
  • Kubernetes
  • Vertex AI Agent environments

What we learned

This project taught us several important lessons about building agentic AI systems at scale.

AI Agents Need Observability

Telemetry and analytics are critical for improving AI routing decisions over time.

Hardware Awareness Matters

Model selection cannot be static. Adaptive routing improves efficiency and reliability by matching the model to the hardware that is actually available.

MCP Creates Powerful Extensibility

MCP-style tool servers make it much easier to extend agent capabilities without tightly coupling services.

Gemini Works Best as a Planner

Using Gemini as the reasoning and orchestration layer while using local models for execution creates a powerful hybrid architecture.

Distributed AI Systems Are Complex

Managing streaming systems, model runtimes, telemetry pipelines, and orchestration services requires strong distributed systems thinking.

What's next for Gemini_Elastic_Kafka

We plan to evolve the platform into a fully autonomous adaptive AI infrastructure layer.

Planned Enhancements

Autonomous Self-Optimization

Use telemetry history to continuously retrain routing strategies.
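A very simple starting point for this, well short of real retraining, would be to re-rank models from aggregated telemetry. The sketch below orders models by success rate and then by mean latency of successful runs. The event fields (`model`, `success`, `latency_ms`) and the `min_samples` cutoff are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def rank_models(events: list[dict], min_samples: int = 3) -> list[str]:
    """Rank models by success rate, then by lower mean latency of successes."""
    by_model: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        by_model[event["model"]].append(event)

    scored = []
    for model, runs in by_model.items():
        if len(runs) < min_samples:
            continue  # too little evidence to trust
        ok = [r for r in runs if r["success"]]
        success_rate = len(ok) / len(runs)
        avg_latency = mean(r["latency_ms"] for r in ok) if ok else float("inf")
        # Negate the rate so that an ascending sort puts the best model first.
        scored.append((-success_rate, avg_latency, model))
    return [model for _, _, model in sorted(scored)]
```

The resulting order could feed the candidate list used for routing and fallback, with a learned policy replacing it later.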

Semantic Memory

Integrate vector embeddings and RAG pipelines for long-term contextual memory.

Multi-Agent Collaboration

Allow multiple specialized agents to collaborate:

  • planner agents
  • coding agents
  • observability agents
  • deployment agents

Cloud-Native Deployment

Deploy fully on:

  • Google Cloud Run
  • Vertex AI Agent Builder
  • Pub/Sub
  • BigQuery

Real-Time Dashboards

Build advanced observability dashboards showing:

  • model efficiency
  • latency heatmaps
  • GPU utilization
  • routing accuracy
  • prompt effectiveness

Enterprise Integrations

Expand MCP integrations with:

  • GitLab
  • MongoDB
  • Elastic
  • Confluent Kafka
  • BigQuery
  • CI/CD systems

Autonomous AI Infrastructure

The long-term vision is to build a system where AI agents can:

  • monitor infrastructure
  • optimize inference strategies
  • select models autonomously
  • self-heal failures
  • continuously improve performance

The goal is a truly adaptive and intelligent AI orchestration platform.

Built With

  • elastic
  • elastic-cloud
  • fastapi
  • gemini-api
  • google-cloud-agent-builder
  • java
  • ollama
  • python
  • springboot