Every platform architect running high-density Kubernetes clusters has faced the exact same nightmare at 2:00 AM. Your application’s tail latency (P99 or P99.9) suddenly spikes through the roof. Your ultra-low-latency Redis cache or critical payment microservice begins to choke. You dive into your observability dashboards, only to find that some sprawling, unthrottled batch-processing worker on the same node decided to spike its CPU usage. You are suffering from a textbook Noisy Neighbor attack. And by the time an alert triggers, a human opens a ticket, or the Kubernetes control plane attempts to evict and reschedule a pod, your SLA isn’t just broken; it’s completely shattered.
The root cause? The default Linux Completely Fair Scheduler (CFS) is fundamentally built to be, well, fair. It operates on a millisecond scale, trying to distribute CPU cycles equitably among competing processes. But in modern cloud-native architectures, some workloads are simply more equal than others.
Enter QUACK (Quick Unified AI Container Kernel). We decided to completely rewrite the rules of container scheduling. Instead of building another reactive dashboard that merely screams when things break, we built an intelligent, closed-loop system that observes application telemetry, decides priority using Splunk AI, and immediately acts by re-tuning the Linux kernel scheduler on the fly, all in under a second, with absolutely zero pod restarts or configuration changes.
QUACK solves this with three technologies working together:
- Splunk AI Toolkit: a trained RandomForest model that predicts which pods need priority based on 4 signals
- eBPF and sched_ext: a custom kernel scheduler that changes CPU allocation in real time, at the kernel level
- No pod restarts, no config changes: the kernel adapts automatically, in microseconds
The flow is: metrics are collected every 10 seconds from Kubernetes and sent to Splunk via HEC. The AI scores each pod on 4 signals. The winning pod's cgroup ID is written to a BPF map. The kernel scheduler reads that map on every scheduling decision, thousands of times per second, and gives that pod 4x more CPU time.
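The last two steps of that flow (write the winner's cgroup ID to a map, then consult the map on each scheduling decision) can be illustrated with a small user-space simulation. This is only a sketch, not the project's BPF code: the dictionary stands in for the BPF hash map, `BASE_SLICE_NS` is an assumed illustrative value, the names `promote` and `slice_for` are hypothetical, and it assumes the 4x boost is applied as a longer time slice.

```python
# User-space simulation of the "hash map lookup" idea. Everything here is
# hypothetical: in the real system the map lives in the kernel and is read
# by the sched_ext scheduler, not by Python.

BASE_SLICE_NS = 5_000_000  # assumed 5 ms base time slice, illustrative only
BOOST = 4                  # the "4x more CPU time" described in the text

# Stand-in for the BPF hash map: cgroup ID -> slice multiplier.
priority_map = {}


def promote(cgroup_id):
    """Control-plane side: record the single winning pod's cgroup ID."""
    priority_map.clear()  # top-1 only, so any previous winner is dropped
    priority_map[cgroup_id] = BOOST


def slice_for(cgroup_id):
    """Scheduler side: one map lookup per scheduling decision."""
    return BASE_SLICE_NS * priority_map.get(cgroup_id, 1)


promote(4211)            # hypothetical cgroup ID of the winning pod
print(slice_for(4211))   # boosted task: 20000000
print(slice_for(99))     # any other task: 5000000
```

Changing the winner is a single map update, which is why no pod restart or cgroup reconfiguration is needed in this model.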
The key insight is that the AI decision becomes a kernel-level scheduling change instantly. No restarts. No cgroup manipulation. Just a hash map lookup in the kernel.
AI Scoring Signals
Four critical signals feed our Machine Learning model and enable it to make decisions about CPU allocation. These signals fall into two categories:
First, we have Infrastructure Signals. We monitor the CPU Usage Ratio, which compares the container's current saturation to its preset maximum limit. Most importantly, we track Dispatch Latency, which measures the time a process spends waiting in the execution queue before receiving CPU cycles.
Second, we look at Application Signals, which are closer to the user experience. We look for Network Anomalies, specifically out-of-band traffic spikes. Additionally, we measure Request Latency to identify real-time response time degradation trends.
By combining these four real-time metrics, the QUACK model can instantly determine which workload is suffering the most and needs a priority boost.
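To make the four signals concrete, here is a minimal sketch of combining them into one score per pod. QUACK uses a trained RandomForest in Splunk for this step; the hand-picked weighted sum below is a deliberately simpler stand-in, and the weights, normalization caps, field names, and sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PodSignals:
    name: str
    cpu_usage_millicores: float
    cpu_limit_millicores: float
    dispatch_latency_ms: float
    network_anomaly: bool
    request_latency_ms: float


def cpu_usage_ratio(p):
    """Current CPU usage relative to the container's configured limit."""
    return p.cpu_usage_millicores / p.cpu_limit_millicores


def score(p):
    """Hand-weighted stand-in for the trained model (weights are assumed)."""
    return (
        0.2 * min(cpu_usage_ratio(p), 1.0)
        + 0.4 * min(p.dispatch_latency_ms / 10.0, 1.0)   # capped at 10 ms
        + 0.1 * (1.0 if p.network_anomaly else 0.0)
        + 0.3 * min(p.request_latency_ms / 100.0, 1.0)   # capped at 100 ms
    )


def pick_winner(pods):
    """Return the pod with the highest score (top-1, as in the text)."""
    return max(pods, key=score)


pods = [
    PodSignals("redis", 400, 1000, 8.0, True, 80.0),
    PodSignals("batch", 1900, 2000, 1.0, False, 0.0),
]
print(pick_winner(pods).name)  # redis
```

In this sample the batch worker is the one burning CPU, but the Redis pod scores higher because it is the one waiting in the run queue and serving slow requests.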
What's next for QUACK — Quick Unified AI Container Kernel Scheduler
Multi-pod priority (not just top-1). Right now, QUACK picks a single winner. Real clusters need tiered priorities — e.g., Redis gets 4x, the API gateway gets 2x, and batch jobs get 0.5x. The BPF map already supports multiple entries; the scoring engine just needs to output a ranked list with different weights.
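A ranked list with different weights could look something like the sketch below. It assumes a purely rank-based policy (best score gets 4x, second gets 2x, everything else 0.5x), which is only one way to realize the tiers in the example above; the function name, the tier table, and the cgroup IDs are hypothetical.

```python
# Hypothetical tiering policy: weights by rank, mirroring the 4x / 2x / 0.5x
# example in the text. A real policy might use score thresholds instead.
TIER_WEIGHTS = (4.0, 2.0)   # weight for rank 0 and rank 1
DEFAULT_WEIGHT = 0.5        # weight for every lower-ranked pod


def build_weight_map(scores):
    """Turn {cgroup_id: score} into {cgroup_id: weight}, ranked by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {
        cgroup_id: TIER_WEIGHTS[rank] if rank < len(TIER_WEIGHTS) else DEFAULT_WEIGHT
        for rank, cgroup_id in enumerate(ranked)
    }


# Illustrative cgroup IDs and scores, not real measurements.
weights = build_weight_map({101: 0.74, 202: 0.51, 303: 0.23})
print(weights)  # {101: 4.0, 202: 2.0, 303: 0.5}
```

Each entry of the resulting dictionary would correspond to one entry in the priority map, with the weight stored as the value instead of a fixed 4x.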
Real BPF stats map: Implement the dispatch latency and queue depth collection from the kernel. This gives you the full observability story — not just what the AI decided, but what the kernel actually did.
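As a rough illustration of what such a stats map could report, the sketch below derives dispatch latency from pairs of timestamps, assuming the kernel side records when a task was enqueued and when it started running. The event shape, function names, and sample values are assumptions, not the project's actual data format.

```python
from statistics import mean


def dispatch_latencies_ns(events):
    """events: iterable of (enqueue_ns, run_ns) pairs for one cgroup.

    Dispatch latency is the time spent waiting in the run queue, i.e. the
    gap between being enqueued and actually getting the CPU.
    """
    return [run_ns - enqueue_ns for enqueue_ns, run_ns in events]


def summarize(events):
    """Aggregate the samples into the numbers a dashboard would chart."""
    latencies = dispatch_latencies_ns(events)
    if not latencies:
        return {"samples": 0, "mean_ns": None, "max_ns": None}
    return {
        "samples": len(latencies),
        "mean_ns": mean(latencies),
        "max_ns": max(latencies),
    }


# Illustrative timestamps in nanoseconds.
stats = summarize([(1000, 1500), (2000, 2300), (3000, 4000)])
print(stats["mean_ns"], stats["max_ns"])  # 600 1000
```

Comparing these numbers before and after a boost is one way to check that the kernel actually did what the AI decided.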
More data, more data to train the model. :-)
Automated model retraining: Schedule periodic retraining in Splunk (e.g., weekly) as workload patterns change. The model drifts as new services are deployed, and auto-retraining keeps it accurate.
Built With
- amazon-web-services
- go
- javascript
- jupyter
- kubernetes
- shell
- splunk
- terraform
- yaml