Inspiration
Single AI models make mistakes confidently: they hallucinate non-existent APIs, introduce bugs, and miss edge cases. We realized that multiple specialized agents validating each other's work could catch these errors before they ship. Just like in software teams, having an architect, coder, reviewer, and documenter each focused on their specialty produces better results than one person doing everything.
W&B Weave made this vision possible by providing the observability layer we needed to track each agent's contribution and learn from mistakes.
What it does
Facilitair orchestrates specialized AI agents through a deterministic 5-stage sequential workflow that achieved a 100% success rate on our benchmark by validating each stage's output before the next begins:
- Architect - Designs system architecture and approach
- Coder - Implements code based on architecture
- Reviewer - Analyzes implementation for issues
- Refiner - Fixes problems (iterates up to 3x)
- Documenter - Creates comprehensive documentation
Unlike single-model approaches that attempt everything in one pass, this multi-stage verification eliminated hallucinations on our evaluation tasks and catches errors before they propagate.
Proven Results:
- 100% success rate vs 80% GPT-4 baseline
- 0% hallucinations vs 10% baseline
- +25% quality score improvement
- Complete W&B Weave observability
When to use: Multi-category tasks (needs architecture + code + review), high complexity, production-critical code, zero hallucination tolerance.
When NOT to use: Single-category focused tasks, low-medium complexity, speed/cost priority.
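The five stages above can be sketched as a simple sequential loop. This is an illustrative skeleton, not the actual implementation: the stage function is a placeholder for a role-specific LLM call, and all names are assumptions.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StageResult:
    stage: str
    output: str


async def run_stage(stage: str, task: str, context: dict) -> StageResult:
    # Placeholder for an LLM call; the real system dispatches each role
    # to a matched model (e.g. a premium model for "architect").
    return StageResult(stage=stage, output=f"[{stage}] result for: {task}")


async def collaborate(task: str) -> dict:
    """Run the Architect -> Coder -> Reviewer -> Refiner -> Documenter pipeline."""
    context: dict = {}
    for stage in ("architect", "coder", "reviewer", "refiner", "documenter"):
        result = await run_stage(stage, task, context)
        context[stage] = result.output  # each stage sees all prior outputs
    return context


print(asyncio.run(collaborate("build a rate limiter"))["documenter"])
```

Because each stage receives the accumulated context, later stages can validate earlier ones rather than starting from scratch.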
How we built it
Languages & Frameworks:
- Python 3.9+ (async/await for concurrent workflows)
- FastAPI (REST API with auto-generated OpenAPI docs)
- Click (professional CLI framework)
- pytest (comprehensive testing)
APIs & Services:
- W&B Weave - Complete experiment tracking and lineage
- Tavily API - Web search for research tasks (integrated, ready to activate)
- OpenRouter - unified access to 200+ models across providers
Architecture:
- Sequential orchestrator manages 5-stage pipeline with timeout budgets
- Async Python for parallel agent task execution
- Hallucination detector with pattern matching (fake APIs, impossible claims)
- Thompson Sampling for adaptive multi-model selection
- W&B Weave tracking at every stage for complete observability
- Refinement loop with 3-iteration cap to prevent infinite loops
- Mixed model tiers: premium (GPT-4o) for architecture, budget (GPT-3.5) for docs
No backend server needed - runs as CLI or lightweight FastAPI service locally.
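The Thompson Sampling selection mentioned above can be sketched as a Beta-Bernoulli bandit over candidate models. The model names, the uniform prior, and the binary success signal are illustrative assumptions.

```python
import random


class ThompsonModelSelector:
    """Pick a model by sampling from Beta posteriors over observed success rates."""

    def __init__(self, models):
        # [successes + 1, failures + 1] per model (uniform Beta(1, 1) prior)
        self.stats = {m: [1, 1] for m in models}

    def select(self) -> str:
        # Sample a plausible success rate for each model; pick the best draw.
        draws = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model: str, success: bool) -> None:
        self.stats[model][0 if success else 1] += 1


selector = ThompsonModelSelector(["gpt-4o", "gpt-3.5-turbo"])
model = selector.select()
# after the stage completes, feed back whether the review passed:
selector.update(model, success=True)
```

The sampling step naturally balances exploration (uncertain models get optimistic draws) against exploitation (proven models win most draws), which is how the selector adapts to task complexity over time.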
Challenges we ran into
1. Latency Reality Check
- Sequential makes 5-11 API calls (vs 1 baseline)
- Initially claimed "700x faster" - completely false
- Pivot: Honest positioning as "quality over speed"
- Solution: Clear guidance on multi-category vs single-category tasks
2. Hallucination Epidemic
- LLMs confidently generate non-existent APIs, impossible O(0) complexity
- Single-pass approaches have no validation mechanism
- Solution: Pattern-matching detector + multi-stage review validation
- Result: 0% hallucination rate on all evaluation tasks
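A pattern-matching detector like the one described can be sketched with a few regexes. The specific patterns below are illustrative examples, not the project's full rule set.

```python
import re

# Illustrative red-flag patterns: impossible complexity claims,
# absolute guarantees, and calls to APIs that do not exist.
PATTERNS = [
    (re.compile(r"O\(\s*0\s*\)"), "impossible O(0) complexity claim"),
    (re.compile(r"\b100%\s+guaranteed\b", re.IGNORECASE), "absolute guarantee"),
    (re.compile(r"\bos\.magic_sort\b"), "non-existent stdlib API"),
]


def detect_hallucinations(text: str) -> list:
    """Return a list of flagged issues found in a model's output."""
    return [label for pattern, label in PATTERNS if pattern.search(text)]


issues = detect_hallucinations("This runs in O(0) time using os.magic_sort().")
print(issues)  # flags both the complexity claim and the fake API
```

In the multi-stage pipeline, flags like these feed into the Reviewer stage, so a hallucination caught here triggers refinement instead of propagating downstream.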
3. Cost Explosion
- 5-11x more API calls = 5-11x higher cost per task
- Needed intelligent model selection to stay affordable
- Solution: Mixed model tiers (GPT-4o for critical stages, GPT-3.5 for simple stages)
- Result: Thompson Sampling adapts model selection based on task complexity
4. Refinement Loop Instability
- Initial implementation had infinite loops (review always finds "issues")
- Solution: 3-iteration cap + timeout budgets + quality score thresholds
- Learning: Refinement has diminishing returns after 2 iterations
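The capped refinement loop can be sketched as follows; the quality-scoring interface and the 0.8 acceptance threshold are illustrative assumptions, not the project's actual values.

```python
from typing import Callable

MAX_ITERATIONS = 3       # hard cap to prevent infinite refinement
QUALITY_THRESHOLD = 0.8  # assumed acceptance score; tune per deployment


def refine(code: str, review: Callable, fix: Callable) -> tuple:
    """Refine code until the review score passes or the iteration cap is hit."""
    for iteration in range(1, MAX_ITERATIONS + 1):
        score, issues = review(code)
        if score >= QUALITY_THRESHOLD or not issues:
            return code, iteration
        code = fix(code, issues)  # Refiner agent addresses the flagged issues
    return code, MAX_ITERATIONS  # stop at the cap: diminishing returns
```

Exiting on either condition (score passes, or no issues remain) is what breaks the "review always finds issues" cycle described above.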
Accomplishments that we're proud of
🎯 Perfect Benchmark Score:
- 100% success (10/10 tasks) vs 80% baseline (8/10)
- Zero hallucinations across all task types
- +25% quality score improvement
📊 Production-Grade Observability:
- Every stage tracked in W&B Weave with full lineage
- Per-stage metrics: quality, latency, token usage, cost
- Complete audit trail from request to final output
🚀 Developer-Friendly Interfaces:
- CLI: 6 commands (health, collaborate, evaluate, serve, init, config)
- REST API: 8 endpoints with auto-generated OpenAPI docs
- Comprehensive logging to facilitair_cli.log and facilitair_api.log
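The CLI surface can be sketched with Click's standard group/command pattern. Only two of the six commands are shown, and the command bodies and option names here are placeholders.

```python
import click


@click.group()
def cli():
    """Facilitair command-line interface (illustrative subset of commands)."""


@cli.command()
def health():
    """Check that the orchestrator and model providers are reachable."""
    click.echo("ok")


@cli.command()
@click.argument("task")
@click.option("--max-refinements", default=3, show_default=True,
              help="Cap on Refiner iterations.")
def collaborate(task, max_refinements):
    """Run the 5-stage pipeline on TASK."""
    click.echo(f"running 5-stage pipeline on: {task!r} (cap={max_refinements})")


if __name__ == "__main__":
    cli()
```

Click generates `--help` output and argument validation from the decorators, which keeps each of the six commands to a few lines.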
🔬 Honest Technical Communication:
- Corrected false "700x faster" marketing claim
- Documented real trade-offs (5-11x slower, 5-11x more expensive)
- Evidence-based guidance on when sequential beats single-model
⚡ Multi-Model Intelligence:
- Adaptive model selection per agent role
- 200+ models available via OpenAI, Anthropic, Google APIs
- Cost-optimized with mixed premium/budget tiers
What we learned
1. Multi-Stage Validation Eliminates Hallucinations
- Single LLM passes hallucinate ~10% of the time with high confidence
- Each subsequent stage acts as validator for previous stages
- Architect → Coder → Reviewer → Refiner creates 4 layers of verification
2. Specialization Beats Generalization
- Dedicated Architect agent outperforms "do everything" prompt
- Model matching matters: GPT-4o (architecture) > Qwen (coding) > GPT-3.5 (docs)
- Right agent for right task > best agent for all tasks
3. Observability is Non-Negotiable
- W&B Weave tracking made debugging and optimization possible
- Without stage-level metrics, can't identify bottlenecks
- Discovered refinement loop had diminishing returns after iteration 2
4. Honesty > Marketing Hype
- Initially oversold speed ("700x faster") - had to correct to truth
- Sequential is slower and more expensive, but MORE RELIABLE
- Users appreciate honest trade-off documentation
5. Async Python + FastAPI = Production AI
- Async/await handles concurrent agent execution elegantly
- FastAPI's automatic OpenAPI generation saved days of documentation
- Click framework made professional CLI development rapid
What's next for Facilitair
Immediate (1 week):
- Activate Tavily web search integration for research tasks
- Add streaming responses via WebSockets for real-time progress
- Cost dashboard showing per-task breakdown
Short-term (1 month):
- Daytona workspace integration for isolated execution environments
- Human-in-the-loop approval gates between stages
- Agent marketplace for custom agent definitions
Medium-term (3 months):
- Parallel execution where safe (Coder + Tester simultaneously)
- Automatic model routing based on learned task complexity patterns
- Multi-language support beyond Python (JavaScript, Rust, Go)
Long-term (6+ months):
- Self-improving: learn optimal stage sequences from past tasks
- Enterprise: team collaboration, audit logs, compliance
- Hosted SaaS with usage-based pricing
Many of the routing and task-execution methods predate the hackathon, but they were rebuilt by Claude for this competition with instructions to focus on specific features and areas of the platform. Happy to clarify if you're curious, but since I'm already 90 minutes past the submission deadline, I'll wait for that request.
Latency can be reduced by migrating parts of the system to Rust, as my earlier work on Facilitair has shown. Looking forward to rolling out the Facilitair beta platform soon - please add me on LinkedIn if you'd like to learn more!
Built With
- click
- fastapi
- pytest
- python