Inspiration

In enterprise banking environments, a single transaction failure often becomes a multi-day investigation. When a Financial Analyst (FA) logs an incident, the Production Support team must manually "hop" through silos—contacting developers, checking API gateways, and querying database logs—just to correlate a Client ID to a specific technical error. This manual "ping-pong" and log-sifting process typically takes 3 to 5 days to reach a conclusion, causing significant delays in customer resolution. POs, FAs, and the Production Support team know this pain well, and it is incredibly frustrating for the customer waiting for an answer.

What it does

TraceMind Search is an autonomous observability agent that bridges the gap between business identifiers and technical root causes. The moment an incident is logged in Jira, TraceMind:

  1. Discovers: Searches Elasticsearch to map the business-level Client ID to a technical Correlation ID.
  2. Retrieves: Reconstructs the complete multi-service journey (UI -> Auth -> API -> DB) across Elastic Data Streams.
  3. Reasons: Uses a local LLM to translate raw JSON traces into a plain-English "Incident Analysis."
  4. Resolves: Posts the summary and the original technical evidence directly back to Jira within seconds.

How I built it

  1. Elasticsearch & Kibana: Used as the central observability brain to harvest logs from Dockerized microservices via Elastic Agent.
  2. FastAPI: Serves as the "Controller" for the agent, handling Jira webhooks and orchestrating the discovery flow.
  3. Local AI (Ollama + Llama 3.2): Provides a privacy-first reasoning layer to summarize technical logs without sensitive data leaving the local environment.
  4. Jira REST API: Used to close the loop by providing automated feedback to the support team.
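
Below is a minimal sketch of the Discover and Retrieve steps, assuming the Python elasticsearch client, a generic logs-* index pattern, and field names like client_id and correlation_id; the actual data streams and mappings in TraceMind may differ.

```python
# discovery.py -- sketch of the Discover/Retrieve steps.
# The index pattern, field names, and local Elasticsearch URL are
# assumptions for illustration, not the project's real configuration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def discover_correlation_id(client_id: str) -> str | None:
    """Map a business-level Client ID to a technical Correlation ID."""
    resp = es.search(
        index="logs-*",
        query={"query_string": {"query": f'client_id:"{client_id}"'}},
        size=1,
    )
    hits = resp["hits"]["hits"]
    return hits[0]["_source"].get("correlation_id") if hits else None

def retrieve_journey(correlation_id: str) -> list[dict]:
    """Reconstruct the multi-service journey (UI -> Auth -> API -> DB) in time order."""
    resp = es.search(
        index="logs-*",
        query={"term": {"correlation_id": correlation_id}},
        sort=[{"@timestamp": "asc"}],
        size=500,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```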
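
And a sketch of how the controller ties the stack together end to end. The Jira custom field ID, service URLs, and credentials below are placeholders, and the real orchestration logic is more involved.

```python
# controller.py -- sketch of the agent "Controller": Jira webhook in,
# Elasticsearch correlation, local LLM summary, Jira comment out.
# URLs, the custom field ID, and credentials are illustrative placeholders.
import httpx
from fastapi import FastAPI, Request

from discovery import discover_correlation_id, retrieve_journey  # sketch above

app = FastAPI()

OLLAMA_URL = "http://localhost:11434/api/generate"      # default Ollama port
JIRA_BASE = "https://your-jira.example.com/rest/api/2"  # placeholder
JIRA_AUTH = ("svc-tracemind", "api-token")              # placeholder credentials

@app.post("/webhooks/jira")
async def handle_incident(request: Request):
    payload = await request.json()
    issue_key = payload["issue"]["key"]
    client_id = payload["issue"]["fields"].get("customfield_10050")  # hypothetical field

    # 1. Discover + 2. Retrieve
    correlation_id = discover_correlation_id(client_id)
    journey = retrieve_journey(correlation_id) if correlation_id else []

    # 3. Reason: ask the local LLM for a plain-English incident analysis
    prompt = (
        "Summarize this multi-service trace for a non-technical support analyst. "
        f"Client ID: {client_id}. Events: {journey}"
    )
    async with httpx.AsyncClient(timeout=120) as client:
        llm = await client.post(
            OLLAMA_URL,
            json={"model": "llama3.2", "prompt": prompt, "stream": False},
        )
        summary = llm.json().get("response", "No summary produced.")

        # 4. Resolve: post the analysis back to the originating Jira issue
        await client.post(
            f"{JIRA_BASE}/issue/{issue_key}/comment",
            json={"body": summary},
            auth=JIRA_AUTH,
        )
    return {"issue": issue_key, "correlation_id": correlation_id}
```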

Challenges I ran into

The biggest technical hurdle was Data Correlation in Unstructured Logs. In some cases, Docker log drivers ship JSON payloads as raw strings within a message field. I solved this by implementing a Regex-based Fallback Parser in the Elastic client to "reach inside" raw strings and extract IDs. I also faced "Cold Start" timeouts with the local LLM, which I mitigated by optimizing the model warm-up sequence and increasing middleware request timeouts.
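
The fallback parser follows roughly the idea sketched below; the field name ("message") and the ID format in the regex are assumptions for illustration rather than the exact patterns TraceMind uses.

```python
# fallback_parser.py -- sketch of the regex fallback for documents where the
# Docker log driver shipped the JSON payload as a raw string in "message".
import json
import re

CORRELATION_RE = re.compile(r'"?correlation_id"?\s*[:=]\s*"?([A-Za-z0-9\-]+)"?')

def extract_correlation_id(doc: dict) -> str | None:
    """Prefer the structured field; otherwise reach inside the raw string."""
    if doc.get("correlation_id"):
        return doc["correlation_id"]

    message = doc.get("message", "")
    try:
        # Sometimes the payload is valid JSON hiding inside the string field.
        nested = json.loads(message)
        if isinstance(nested, dict) and nested.get("correlation_id"):
            return nested["correlation_id"]
    except (ValueError, TypeError):
        pass

    # Last resort: regex over the raw string.
    match = CORRELATION_RE.search(message)
    return match.group(1) if match else None
```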
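
For the cold-start timeouts, the mitigation amounts to loading the model before the first real request and giving the LLM call more headroom. A small sketch follows; the keep_alive window and timeout value are illustrative, not the tuned production settings.

```python
# warmup.py -- sketch of the LLM warm-up mitigation. The Ollama generate API
# loads a model into memory when given an empty prompt; keep_alive keeps it
# resident so the first real analysis request doesn't hit a cold start.
import httpx

OLLAMA_URL = "http://localhost:11434/api/generate"

def warm_up_model(model: str = "llama3.2") -> None:
    httpx.post(
        OLLAMA_URL,
        json={"model": model, "prompt": "", "keep_alive": "30m"},
        timeout=120,  # generous timeout so the initial model load can finish
    )
```

Calling this once at application startup means the heavyweight model load happens before any incident arrives.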

Accomplishments that I am proud of

I am incredibly proud of achieving Zero-Manual-Touch correlation. Moving from a manual 5-day investigation to a 10-second automated summary is a massive efficiency gain. Additionally, I successfully integrated a local LLM into an observability pipeline, proving that advanced AI reasoning can be both fast and private.

What I learned

I learned that the most valuable data in a system is often the "linkage" between services. Building TraceMind taught me how to leverage Elastic's Query String capabilities to find needles in haystacks and how to structure AI prompts so that technical logs become intelligible to non-technical users.
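
As a hypothetical illustration of the prompt-structuring lesson (not the exact prompt TraceMind uses), the idea is to wrap the raw trace JSON in instructions aimed at a non-technical reader:

```python
# prompt.py -- sketch of framing raw trace events for a non-technical audience.
import json

def build_incident_prompt(client_id: str, events: list[dict]) -> str:
    return "\n".join([
        "You are a production-support assistant.",
        "In plain English and without jargon, explain what went wrong for this client,",
        "which service failed first, and what the support team should check next.",
        f"Client ID: {client_id}",
        "Trace events (oldest first):",
        json.dumps(events, indent=2, default=str),
    ])
```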

What's next for TraceMind Search

Elastic Agentic Search: Enterprise logs are deeply fragmented across microservices. We will deploy autonomous AI agents to dynamically traverse hybrid-cloud environments, hunting and uniting scattered log breadcrumbs into a single trace narrative.

Proactive Anomaly Detection: We will integrate Elastic Machine Learning to monitor data streams in real-time. By detecting error spikes autonomously, TraceMind will investigate and flag root causes before a Jira ticket is even created, shifting from reactive to predictive support.
