LLM Agent Performance with Hierarchical Task Decomposition

💡 Inspiration

Inspired by compiler design, we explored hierarchical task decomposition. We hypothesized that using smaller, specialized models for sub-tasks, rather than one large, expensive model, could make AI agents more cost-efficient and capable.

⚙️ What It Does

We implemented and benchmarked three distinct AI agent architectures:

Task Decomposition Tree (Primary): A novel three-phase system (decompose, verify, synthesize) that recursively breaks down complex tasks.
Tree-of-Thought Agent: A Claude 3.5 Sonnet wrapper serving as our quality baseline.
Standard Agent: An adaptive system with dynamic model selection.

All three were evaluated on 75 standardized benchmark prompts across 3 difficulty levels, measuring success rate, time, and cost. Answer correctness was checked with LLM-based semantic verification.

🛠️ How We Built It

Framework: Built on the Strands AI Agent Framework and AWS Bedrock (using Claude 3.5 Sonnet, Claude 3 Haiku, and Amazon Nova Lite).
Key Design: We used small language models (SLMs) for specialized roles, similar to a microservices architecture.
Core Architecture (Task Decomposition Tree):
- Decomposer (Haiku): Recursively breaks tasks into sub-tasks.
- Verifier (Haiku): Validates decomposition quality.
- Solver (Nova Lite): Executes atomic leaf tasks.
- Synthesizer (Nova Lite): Merges results.
Evaluation: We built a benchmark system with 75 prompts and an LLM-based verification API (using Claude Sonnet) that checks answers for semantic equivalence rather than exact string match.
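
The Task Decomposition Tree roles listed above can be sketched as a short recursive function. This is a minimal illustration, not our actual implementation: the four roles are passed in as plain callables (hypothetical signatures) instead of real Bedrock model calls, the names `TaskNode` and `run_tree` are invented for the example, and the fallback to solving a task directly when verification fails is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical role signatures; in the real system each would wrap a model call.
Decomposer = Callable[[str], List[str]]        # task -> sub-tasks ([] means atomic)
Verifier = Callable[[str, List[str]], bool]    # is this decomposition acceptable?
Solver = Callable[[str], str]                  # solve one atomic task
Synthesizer = Callable[[str, List[str]], str]  # merge child results for a task


@dataclass
class TaskNode:
    task: str
    depth: int
    children: List["TaskNode"] = field(default_factory=list)
    result: str = ""


def run_tree(task: str, decompose: Decomposer, verify: Verifier,
             solve: Solver, synthesize: Synthesizer,
             max_depth: int = 3, depth: int = 0) -> TaskNode:
    node = TaskNode(task, depth)
    # The max-depth limit guarantees termination even if the decomposer
    # keeps proposing sub-tasks.
    subtasks = decompose(task) if depth < max_depth else []
    # Assumption: a rejected decomposition is treated like an atomic task.
    if not subtasks or not verify(task, subtasks):
        node.result = solve(task)
        return node
    # Leaves are processed sequentially here, one child at a time.
    node.children = [
        run_tree(s, decompose, verify, solve, synthesize, max_depth, depth + 1)
        for s in subtasks
    ]
    node.result = synthesize(task, [c.result for c in node.children])
    return node


# Toy stand-ins for the four roles, used only to exercise the control flow.
def demo_decompose(task: str) -> List[str]:
    parts = task.split(" and ")
    return parts if len(parts) > 1 else []


def demo_verify(task: str, subtasks: List[str]) -> bool:
    return len(subtasks) >= 2 and all(s.strip() for s in subtasks)


def demo_solve(task: str) -> str:
    return f"done({task})"


def demo_synthesize(task: str, results: List[str]) -> str:
    return "; ".join(results)
```

Calling `run_tree("a and b and c", demo_decompose, demo_verify, demo_solve, demo_synthesize)` builds a root with three leaf children, and passing `max_depth=0` forces the whole task to be solved as a single leaf.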

🚧 Challenges We Ran Into

Debugging framework-specific type handling for agent results.
Preventing infinite recursion during decomposition (solved with a max-depth limit).
A sequential execution bottleneck (leaf nodes don't yet run in parallel).
Accurately estimating costs across different models and pricing structures.
Implementing a complex LLM-based "judge" for semantic answer verification.
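
For the cost-estimation challenge above, one simple approach is to log token counts per model call and multiply by a per-model price table. The sketch below is illustrative only: the model names and prices are made-up placeholders, not real Bedrock rates, and `Pricing`, `estimate_cost`, and `total_cost` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Pricing:
    input_per_1k: float   # USD per 1,000 input tokens
    output_per_1k: float  # USD per 1,000 output tokens


# Placeholder prices for illustration; replace with the provider's current rates.
PRICING = {
    "model-a": Pricing(input_per_1k=0.010, output_per_1k=0.030),
    "model-b": Pricing(input_per_1k=0.001, output_per_1k=0.002),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call, with input and output tokens priced separately."""
    p = PRICING[model]
    return (input_tokens / 1000) * p.input_per_1k + (output_tokens / 1000) * p.output_per_1k


def total_cost(calls: Iterable[Tuple[str, int, int]]) -> float:
    """Sum the cost of many (model, input_tokens, output_tokens) records."""
    return sum(estimate_cost(m, i, o) for m, i, o in calls)
```

Keeping input and output prices separate matters because the two are typically billed at different rates, so a role that reads long prompts but writes short answers costs differently from one that does the opposite.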

🏆 Accomplishments

✅ Novel Architecture: Successfully implemented a production-ready hierarchical task decomposition system inspired by compiler design.
✅ Cost Optimization: Designed the system to cost an estimated 40-60% less than the pure Claude 3.5 Sonnet baseline by using SLMs for the decomposition, verification, solving, and synthesis roles.
✅ Comprehensive Benchmarking: Built a complete evaluation framework with 75 prompts, LLM verification, and automated metrics tracking.
✅ Production Quality: Implemented proper error handling, security, extensive documentation (~10,000 words), and test coverage.
✅ Complete System: All three agents are fully functional with an interactive chat interface.

🧠 What We Learned

Specialization > Generalization: Using SLMs for specific roles can match or exceed the performance of large models while dramatically reducing cost.
Compiler Design Principles Apply to AI: The decompose-verify-synthesize pipeline provides natural checkpoints and a clear separation of concerns.
Verification is Critical: An explicit verification phase acts like a "type checker," catching poor decompositions before expensive execution.
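Semantic Evaluation Matters: Exact string matching is inadequate for free-form model output, because a correct answer can be phrased in many ways. LLM-based verification, which judges meaning rather than wording, is essential for accurate evaluation.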
Hierarchical Reasoning: Breaking complex tasks into simpler sub-tasks mirrors human problem-solving and appears more robust than a single, monolithic call.
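
To make the contrast between exact matching and semantic verification concrete, here is a minimal sketch of a judge-based check. It assumes the judge is any callable that takes a prompt and returns the model's text reply; the prompt wording, the YES/NO protocol, and the function names are illustrative assumptions, not our verification API.

```python
from typing import Callable

Judge = Callable[[str], str]  # prompt -> raw text reply from a judge model

# Hypothetical prompt; the real judge prompt may differ.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {expected}\n"
    "Candidate answer: {actual}\n"
    "Does the candidate convey the same answer as the reference? "
    "Reply with YES or NO only."
)


def exact_match(expected: str, actual: str) -> bool:
    """Baseline check: passes only if the strings are identical after trimming."""
    return expected.strip() == actual.strip()


def semantic_match(question: str, expected: str, actual: str, judge: Judge) -> bool:
    """Ask the judge whether two answers are equivalent in meaning."""
    prompt = JUDGE_TEMPLATE.format(question=question, expected=expected, actual=actual)
    reply = judge(prompt).strip().upper()
    # Anything that does not start with YES counts as a failure, so a
    # malformed judge reply never produces a false pass.
    return reply.startswith("YES")
```

With an expected answer of `"4"` and a candidate of `"The answer is 4."`, `exact_match` fails, while `semantic_match` passes whenever the judge replies YES.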

🚀 What's Next

Immediate: Implement parallel leaf execution (for speed), dynamic depth adjustment, and result caching.
Advanced: Support dependent tasks, integrate real-time cost tracking, and build adaptive model selection.
Research: Complete the full benchmark analysis and explore hybrid approaches (e.g., combining task decomposition with tree-of-thought).
Scalability: Extend to multi-agent coordination, stream results progressively, and package the system as a reusable library.
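
As a rough sketch of the parallel leaf execution planned above, independent leaf tasks could be dispatched through a thread pool, which suits I/O-bound model calls. This assumes leaves have no dependencies on one another; `solve_leaves_parallel` and its parameters are hypothetical names, and `solve` stands in for whatever function calls the solver model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def solve_leaves_parallel(leaves: List[str], solve: Callable[[str], str],
                          max_workers: int = 4) -> List[str]:
    """Solve independent leaf tasks concurrently, returning results in input order."""
    if not leaves:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs,
        # so the synthesizer sees them in decomposition order.
        return list(pool.map(solve, leaves))
```

If any call to `solve` raises, the exception surfaces when its result is collected, so a production version would likely add per-leaf retries or error capture.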