SoftOrchestrator

Inspiration

We were inspired to move beyond "one-size-fits-all" AI models. Our system dynamically selects the best AI models for user tasks by analyzing benchmark performance, task decomposition, and capability mapping. We also prioritized a safety-first architecture with a "Preview & Confirm" workflow, so no change is ever applied without explicit user oversight.

What it does

SoftOrchestrator is a professional VS Code extension that orchestrates specialized AI agents to safely modify your codebase.

Intelligent Task Decomposition: It breaks complex user queries (e.g., "Build a coding assistant") into smaller, executable subtasks.
Dynamic Model Selection: It maps subtasks to required capabilities (such as Engineering or Reasoning) and assigns the optimal model based on stored benchmark performance.
Multi-Model Orchestration: The system can select from top-tier models including Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet.
Safety-First Architecture: It features a "Preview Mode" where users review proposed changes. It uses atomic edits via VS Code's WorkspaceEdit API, meaning one Cmd+Z undoes everything.

How we built it

We constructed the system using a dual-layer architecture consisting of a VS Code Extension and a Backend Intelligence Pipeline.

1. The Intelligence Backend

We built a data-driven decision engine comprising three core JSON components:

Benchmark Knowledge Base: Defines the relationship between evaluation benchmarks and AI capability domains.

Model Performance Database: Stores benchmark scores used to optimize agent selection.

Task Decomposition Engine: Converts large instructions into manageable tasks with specific capability requirements.
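A rough sketch of what these three JSON components might look like as TypeScript types (field names and sample values are illustrative, not the project's actual schema, apart from the GPT-4o/HumanEval score cited later in this writeup):

```typescript
// benchmarks.json: maps each benchmark to the capability domain it measures.
interface BenchmarkEntry {
  benchmark: string;   // e.g. "HumanEval"
  capability: string;  // e.g. "Engineering"
}

// performance.json: per-model scores on each benchmark.
interface ModelScore {
  model: string;
  benchmark: string;
  score: number;       // normalized 0-100
}

// subtasks_<query_id>.json: the decomposition of one user query.
interface Subtask {
  id: number;
  description: string;
  requiredCapability: string;
}

const benchmarks: BenchmarkEntry[] = [
  { benchmark: "HumanEval", capability: "Engineering" },
];
const scores: ModelScore[] = [
  { model: "GPT-4o", benchmark: "HumanEval", score: 86 },
];
const subtasks: Subtask[] = [
  { id: 1, description: "Scaffold the extension", requiredCapability: "Engineering" },
];
```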

2. The VS Code Extension

Tech Stack: Built with TypeScript (Node.js extension host) and Vanilla HTML/CSS/JS for the Webview.

Smart Scanning: We implemented recursive scanning that intelligently filters out node_modules and hidden files while reading source code context.
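The filtering logic can be sketched over an in-memory tree (the real extension walks the workspace via VS Code's file-system APIs; names here are illustrative, and the depth cap reflects the limit described under "Challenges"):

```typescript
interface FsNode {
  name: string;
  children?: FsNode[]; // present for directories
  content?: string;    // present for files
}

const MAX_DEPTH = 3; // recursion limit to keep context manageable

function shouldSkip(name: string): boolean {
  // Filter out dependency folders and hidden files/directories.
  return name === "node_modules" || name.startsWith(".");
}

function scan(node: FsNode, depth = 0, out: string[] = []): string[] {
  if (depth > MAX_DEPTH || shouldSkip(node.name)) return out;
  if (node.children) {
    for (const child of node.children) scan(child, depth + 1, out);
  } else {
    out.push(node.name);
  }
  return out;
}
```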

Communication: We established a strict postMessage contract to handle data flow between the Frontend and Backend.
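One way to make such a contract strict is a discriminated union, so unhandled message types fail at compile time (the message and field names below are assumptions, not the project's actual protocol):

```typescript
// Messages flowing from the Webview to the extension host.
type WebviewToHost =
  | { type: "submitQuery"; query: string }
  | { type: "confirmChanges"; approved: boolean };

// Messages flowing back, including the preview payload.
type HostToWebview =
  | { type: "previewChanges"; files: { filePath: string; updatedContent: string }[] }
  | { type: "error"; message: string };

// Exhaustive switch: adding a new message type without handling it
// becomes a compile-time error.
function handle(msg: WebviewToHost): string {
  switch (msg.type) {
    case "submitQuery":
      return `decomposing: ${msg.query}`;
    case "confirmChanges":
      return msg.approved ? "applying edits" : "discarding edits";
  }
}
```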

The logical data flow follows this path:

$$ \text{User Query} \rightarrow \text{Subtask Generator} \rightarrow \text{Capability Mapper} \rightarrow \text{Performance Lookup} \rightarrow \text{Model Selection} $$
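The last two stages of this path (Performance Lookup and Model Selection) reduce to a lookup-and-argmax over the stored scores. A minimal sketch, assuming a static in-memory table (all scores except the GPT-4o/HumanEval figure cited below are illustrative placeholders):

```typescript
type Capability = "Engineering" | "Reasoning" | "Knowledge";

// capability -> (model -> benchmark score), distilled from performance.json
const performanceDb: Record<Capability, Record<string, number>> = {
  Engineering: { "GPT-4o": 86, "Claude 3.5 Sonnet": 84, "Gemini 1.5 Pro": 80 },
  Reasoning:   { "GPT-4o": 80, "Claude 3.5 Sonnet": 85, "Gemini 1.5 Pro": 82 },
  Knowledge:   { "GPT-4o": 83, "Claude 3.5 Sonnet": 82, "Gemini 1.5 Pro": 85 },
};

// Pick the highest-scoring model for a subtask's required capability.
function selectModel(capability: Capability): string {
  const scores = performanceDb[capability];
  return Object.entries(scores).sort((a, b) => b[1] - a[1])[0][0];
}
```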

Challenges we ran into

Reliability in Decomposition: LLMs occasionally fail to break down tasks correctly. To solve this, we implemented a Fallback Subtask System that provides predefined subtasks when the LLM generation fails.
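The fallback path can be sketched as a try/catch around the LLM call, assuming the call may throw or return an unusable decomposition (function names and the predefined subtasks are illustrative):

```typescript
interface Subtask { description: string; capability: string }

// Predefined subtasks used when LLM generation fails.
const FALLBACK_SUBTASKS: Subtask[] = [
  { description: "Analyze the request", capability: "Reasoning" },
  { description: "Implement the change", capability: "Engineering" },
];

async function decompose(
  query: string,
  llmDecompose: (q: string) => Promise<Subtask[]>,
): Promise<Subtask[]> {
  try {
    const subtasks = await llmDecompose(query);
    if (subtasks.length === 0) throw new Error("empty decomposition");
    return subtasks;
  } catch {
    return FALLBACK_SUBTASKS; // degrade gracefully instead of failing
  }
}
```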

Safe File Handling: Preventing data loss was critical. We had to ensure the extension never wrote to disk directly using fs. Instead, we utilized VS Code's WorkspaceEdit API to bundle changes into a single transaction.
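The all-or-nothing behavior can be illustrated with a small staging transaction over an in-memory file map; in the real extension this role is played by `vscode.WorkspaceEdit` plus `workspace.applyEdit` (file creation and the editor integration are omitted in this sketch):

```typescript
class EditTransaction {
  private staged = new Map<string, string>();

  replace(filePath: string, updatedContent: string): void {
    this.staged.set(filePath, updatedContent); // nothing written yet
  }

  // Validate every staged change first, then apply all of them -
  // a failure leaves the file map completely untouched.
  commit(files: Map<string, string>): boolean {
    for (const path of this.staged.keys()) {
      if (!files.has(path)) return false; // abort before any write
    }
    for (const [path, content] of this.staged) files.set(path, content);
    return true;
  }
}
```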

Context Management: Managing the context window was difficult. We implemented safety limits, such as restricting recursion depth to 3 levels and capping file sizes at 10KB, to ensure relevant context was passed without overflowing the model.

Accomplishments that we're proud of

Atomic Edits Architecture: We successfully built a system where file creation and modification are bundled, ensuring that partial failures don't leave the codebase in a broken state.

Data-Driven Orchestration: We moved beyond simple prompting to a system that uses real-world data to make decisions. For example, the system knows that GPT-4o has a high proficiency in Engineering benchmarks:

$$ \text{GPT-4o score on HumanEval} = 86\% $$

This allows the system to intelligently assign it to coding tasks.

Mock Mode: We designed a robust "Mock Mode" that allows the system to be demonstrated and tested without requiring active API keys, facilitating easier development and review.
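The switch can be as simple as selecting a client implementation based on whether an API key is configured. A sketch under that assumption (class names and the canned payload are illustrative; the live network call is deliberately omitted):

```typescript
interface CompletionClient {
  complete(prompt: string): Promise<string>;
}

// Mock Mode: deterministic canned responses, no API key required.
class MockClient implements CompletionClient {
  async complete(prompt: string): Promise<string> {
    return `[mock] response for: ${prompt}`;
  }
}

class LiveClient implements CompletionClient {
  constructor(private apiKey: string) {}
  async complete(prompt: string): Promise<string> {
    throw new Error("network call omitted in this sketch");
  }
}

function makeClient(apiKey?: string): CompletionClient {
  return apiKey ? new LiveClient(apiKey) : new MockClient();
}
```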

What we learned

The Power of Taxonomy: We learned that defining clear capability categories (Linguistic Core, Knowledge, Reasoning, Engineering, etc.) is essential for accurately mapping benchmarks to real-world tasks.

Separation of Concerns: Separating the "Static Knowledge Layer" (benchmarks.json, performance.json) from the "Dynamic Query Layer" (subtasks_<query_id>.json) allowed for a much more scalable storage architecture.

User Trust: We discovered that a "Preview & Confirm" workflow is vital for user trust. Users need to see the proposed updatedContent alongside the filePath before committing changes.

What's next for SoftOrchestrator

Moving beyond Experimental: The current version is 0.0.1 (Experimental). We plan to stabilize the prototype into a production-ready release.

Enhanced RAG Pipeline: We aim to further automate the performance.json generation process, improving how we retrieve and validate benchmark results from new research papers.

Expanded Caching: We plan to refine the query caching mechanisms to store previously generated subtasks, significantly improving performance for recurrent queries.

Line-by-Line Diff Comparison & Intelligent Integration:
We plan to introduce granular diff-based previews where users can view and selectively approve code modifications at the line level. Formally, instead of replacing the full file content $F$, we aim to generate a minimal edit set:

$$ \Delta F = \{ \delta_1, \delta_2, \dots, \delta_n \} $$

where each $\delta_i$ represents an insertion, deletion, or modification operation. This will allow safer integration, greater transparency, and more precise developer control over AI-generated changes.
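The edit set $\Delta F$ could be modeled as a discriminated union of operations over an array of lines; a minimal sketch (type names and the apply strategy are our illustration, not a committed design):

```typescript
type Delta =
  | { op: "insert"; line: number; text: string } // insert before index `line`
  | { op: "delete"; line: number }
  | { op: "modify"; line: number; text: string };

function applyDeltas(lines: string[], deltas: Delta[]): string[] {
  const out = [...lines];
  // Apply bottom-up so earlier line indices remain valid.
  for (const d of [...deltas].sort((a, b) => b.line - a.line)) {
    if (d.op === "insert") out.splice(d.line, 0, d.text);
    else if (d.op === "delete") out.splice(d.line, 1);
    else out[d.line] = d.text;
  }
  return out;
}
```

Selective approval then reduces to filtering which $\delta_i$ the user accepted before calling `applyDeltas`.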

Built With

TypeScript, Node.js, HTML/CSS/JS, VS Code Extension API
