✅ Logging Storm Resolved – Vigil AI Back to Healthy State
After launch, I noticed the project's GKE cloud bill was skyrocketing — way beyond what a small demo cluster should cost. A quick dive into Cloud Billing, and then Logging, revealed the culprit: transactionhistory was stuck in a noisy fail loop, hammering Google’s IAM API for access tokens, failing every time, and dumping massive Java stack traces by the thousands.
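For anyone retracing this, a Cloud Logging (Logs Explorer) filter roughly like the following is how such a storm shows up — the container name comes from this post, but the exact payload field ("textPayload" vs a structured "jsonPayload") is an assumption about how the service logs:

```
resource.type="k8s_container"
resource.labels.container_name="transactionhistory"
severity>=ERROR
textPayload:"getAccessToken"
```

Sorting by count per hour makes the tight retry loop obvious at a glance.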
The Root Cause
It turned out to be a Workload Identity misconfiguration. The Kubernetes ServiceAccount annotation pointed at the wrong Google Service Account, so every token request was rejected. Worse, the service retried almost instantly after each failure, creating a tight loop of expensive, verbose logs.
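For context, the annotation in question lives on the Kubernetes ServiceAccount and must name the Google Service Account (GSA) that actually holds the needed IAM roles. A minimal sketch (the GSA and project names below are placeholders, not the project's real ones):

```yaml
# Kubernetes ServiceAccount for the workload.
# iam.gke.io/gcp-service-account must point at the *correct* GSA —
# a typo or stale name here is exactly the failure described above.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: transactionhistory
  namespace: default
  annotations:
    iam.gke.io/gcp-service-account: boa-gsa@my-project.iam.gserviceaccount.com
```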
The Fix
I rolled up my sleeves and:
-> Fixed the Workload Identity binding and updated the Kubernetes ServiceAccount to point at the correct GSA
-> Added the missing Cloud Monitoring permissions so the pod could authenticate cleanly
-> Redeployed and watched the logs… silence. Beautiful silence.
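The steps above boil down to a few commands. A sketch, assuming placeholder project, namespace, and account names (substitute your own):

```shell
# 1. Workload Identity binding: let the KSA impersonate the GSA.
gcloud iam service-accounts add-iam-policy-binding \
  boa-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[default/transactionhistory]"

# 2. Grant the Cloud Monitoring permissions the pod was missing.
gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:boa-gsa@my-project.iam.gserviceaccount.com" \
  --role roles/monitoring.metricWriter

# 3. Point the KSA at the correct GSA and redeploy.
kubectl annotate serviceaccount transactionhistory \
  iam.gke.io/gcp-service-account=boa-gsa@my-project.iam.gserviceaccount.com \
  --overwrite
kubectl rollout restart deployment transactionhistory
```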
The Result
--> Zero getAccessToken errors since the fix
--> Logging costs slashed (no more endless retries + stack traces) — I hope :), I'll keep monitoring this daily
--> transactionhistory pod is healthy again (READY 1/1)
--> Authentication now works properly via Workload Identity
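Verifying the result is a two-liner (the label selector is an assumption about how the deployment is labeled):

```shell
# Pod should report READY 1/1
kubectl get pod -l app=transactionhistory

# Error count over the last hour should be zero
kubectl logs deploy/transactionhistory --since=1h | grep -c getAccessToken
```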
In short: the app is now running clean, only throwing the occasional transient GCP blip — nothing that spams logs or eats budget. This was a good reminder that in Kubernetes, a single miswired service account can quietly drain your wallet until you notice the pattern.