Inspiration

The idea for this project started casually at a bar table. Our group, patos—all computer science students—had gathered when a friend showed us sched_ext, the new Linux scheduling technology. We were fascinated, but it felt unrealistic for a group of students to build a scheduler from scratch.

Months later, during a discussion about Kubernetes CPU management—specifically why cpu.limits is often discouraged and why cpu.shares is generally preferred—we began exploring how computational resources are allocated to containers. This led to a question: Is it possible to dynamically give more compute resources to a container, beyond the usual horizontal scaling or static allocation in manifests?

The idea was born: to attempt giving a container higher priority via a custom scheduler.

What it does

SCX_MUS: Mostly Unfair Scheduler is a custom Linux scheduler that prioritizes containers within Kubernetes by dynamically adjusting the CPU time allocated to them based on their priority. It is implemented with the sched_ext framework and gives certain containers priority over others, improving the performance of specific workloads even in resource-heavy or "noisy neighbor" environments.

The userspace component, written in Go, provides a simple CLI that lets users select which container to prioritize. The tool lists all currently running containers and prompts the user to choose one. Once a container is selected, the tool retrieves its cgroup ID and writes it to a BPF map shared with the scheduler.
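The cgroup-ID lookup can be sketched in Go as below. This is a hedged illustration, not the project's actual code: it relies on the fact that, on cgroup v2, a cgroup's ID (what `bpf_get_current_cgroup_id()` reports on the BPF side) equals the inode number of the cgroup's directory on the cgroup filesystem. The `cgroupID` helper and the path in `main` are hypothetical.

```go
// Sketch of the cgroup-ID lookup a CLI like this would perform before
// writing the ID into the BPF map. Assumes cgroup v2, where a cgroup's ID
// is the inode number of its directory on cgroupfs.
package main

import (
	"fmt"
	"os"
	"syscall"
)

// cgroupID returns the cgroup ID for a cgroup directory path by reading
// the directory's inode number.
func cgroupID(path string) (uint64, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	st, ok := fi.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("no unix stat info for %s", path)
	}
	return st.Ino, nil
}

func main() {
	// Hypothetical path; real code would resolve the container's cgroup
	// directory from the container runtime before calling cgroupID.
	id, err := cgroupID("/sys/fs/cgroup")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("cgroup ID:", id)
	// The CLI would then write `id` into the pinned BPF map, e.g. with
	// github.com/cilium/ebpf via Map.Put.
}
```

The map write itself is omitted here because it needs a loaded BPF object; the inode-based lookup is the portable part of the flow.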

How we built it

We spent the hackathon diving deep into the internals of the Linux scheduler ecosystem. We studied the kernel source code, from discovering that Linux actually runs multiple schedulers (not just CFS; FIFO and RR are also present) to examining functions like __schedule() inside kernel/sched/core.c.

We combined this understanding with Kubernetes API communication and, most importantly, hands-on eBPF development. Using sched_ext and struct_ops, we implemented a custom scheduling policy in eBPF that applies dynamic, cgroup-based priority to containers.
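To make the "mostly unfair" idea concrete, here is a toy userspace model in Go. This is an assumption-laden sketch, not the project's eBPF code (which runs in the kernel via struct_ops): it only illustrates the CFS-style weighting logic, where a prioritized cgroup's virtual runtime advances more slowly, so a min-vruntime pick hands it a larger share of CPU slices.

```go
// Toy model of weighted, CFS-style scheduling: the prioritized task
// accrues vruntime more slowly, so it is picked more often.
package main

import "fmt"

type task struct {
	name     string
	weight   uint64 // prioritized cgroups get a higher weight
	vruntime uint64 // virtual runtime, scaled down by weight
	runtime  uint64 // actual run time received
}

// pickNext returns the runnable task with the smallest vruntime.
func pickNext(tasks []*task) *task {
	next := tasks[0]
	for _, t := range tasks[1:] {
		if t.vruntime < next.vruntime {
			next = t
		}
	}
	return next
}

// simulate hands out fixed time slices, always to the min-vruntime task.
func simulate(tasks []*task, slices int) {
	const slice = 4000 // one scheduling slice, arbitrary units
	for i := 0; i < slices; i++ {
		t := pickNext(tasks)
		t.runtime += slice
		t.vruntime += slice / t.weight // higher weight => slower growth
	}
}

func main() {
	tasks := []*task{
		{name: "prioritized", weight: 4}, // container chosen via the CLI
		{name: "neighbor-a", weight: 1},
		{name: "neighbor-b", weight: 1},
	}
	simulate(tasks, 600)
	for _, t := range tasks {
		fmt.Printf("%-11s runtime=%d\n", t.name, t.runtime)
	}
}
```

With weight 4 against two weight-1 neighbors, the prioritized task ends up with roughly two thirds of the total runtime: unfair by design, but the neighbors still make progress.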

Challenges we ran into

One of the hardest parts was figuring out how to implement and run a scheduler using sched_ext_ops. There is very little documentation online, and few guides on how to load and execute your own custom scheduler—mostly just examples of people running the pre-made schedulers included with SCX.

Another challenge was our initial workflow: we wrote most of the code without compiling or testing it (a terrible practice, we know). Only after reaching a reasonable implementation did we start compiling the kernel with sched_ext support, setting up Kubernetes clusters, and designing the benchmark.

The debugging phase involved days of solving compilation mysteries and unexpected behaviors before everything finally worked.

Accomplishments that we're proud of

The goal of our custom scheduler wasn't to critique the Completely Fair Scheduler (CFS); ours is, in fact, a highly simplified version of it. Instead, the project was a two-fold endeavor:

  1. To see if students could successfully implement a scheduler from scratch.

  2. To explore different approaches to container scalability by dynamically adjusting a container's priority/resource share via a custom scheduler.

We consider the project a success in demonstrating both of these concepts.

What we learned

We learned about:

  • The Linux scheduler architecture and its multiple scheduling classes
  • Kernel internals and low-level scheduling paths
  • Kubernetes resource management and API communication
  • eBPF development and the sched_ext subsystem
  • Performance evaluation, benchmarking, and debugging complex systems

What's next for SCX_MUS: Mostly Unfair Scheduler

  • Building a more sophisticated control mechanism that uses the Kubernetes API to gather metrics and automatically adjusts a container's priority share based on workload (e.g. a hook that prioritizes a container when its netns handles more than a given number of packets)

  • Refining SCX_MUS by:

    • Adding multiple DSQs for multi-core environments
    • Developing more sophisticated migration heuristics and improving L2/L3 cache locality
