Inspiration
Single AI models make mistakes confidently: they hallucinate non-existent APIs, introduce bugs, and miss edge cases. We realized that multiple specialized agents validating each other's work could catch these errors before they ship. Just like in software teams, having an architect, coder, reviewer, and documenter each focused on their specialty produces better results than one person doing everything.
W&B Weave made this vision possible by providing the observability layer we needed to track each agent's contribution and learn from mistakes.
What it does
Facilitair orchestrates specialized AI agents through a deterministic 5-stage sequential workflow that achieved a 100% success rate on our benchmark by validating each stage's output before the next begins:
- Architect - Designs system architecture and approach
- Coder - Implements code based on architecture
- Reviewer - Analyzes implementation for issues
- Refiner - Fixes problems (iterates up to 3x)
- Documenter - Creates comprehensive documentation
Unlike single-model approaches that attempt everything in one pass, this multi-stage verification eliminated hallucinations on our evaluation tasks and catches errors before they propagate.
Proven Results:
- 100% success rate vs 80% GPT-4 baseline
- 0% hallucinations vs 10% baseline
- +25% quality score improvement
- Complete W&B Weave observability
When to use: Multi-category tasks (needs architecture + code + review), high complexity, production-critical code, zero hallucination tolerance.
When NOT to use: Single-category focused tasks, low-medium complexity, speed/cost priority.
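The five stages above can be sketched as a simple sequential loop. This is an illustrative skeleton, not the actual implementation: the stage function is a placeholder for a role-specific LLM call, and all names are assumptions.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StageResult:
    stage: str
    output: str


async def run_stage(stage: str, task: str, context: dict) -> StageResult:
    # Placeholder for an LLM call; the real system dispatches each role
    # to a matched model (e.g. a premium model for "architect").
    return StageResult(stage=stage, output=f"[{stage}] result for: {task}")


async def collaborate(task: str) -> dict:
    """Run the Architect -> Coder -> Reviewer -> Refiner -> Documenter pipeline."""
    context: dict = {}
    for stage in ("architect", "coder", "reviewer", "refiner", "documenter"):
        result = await run_stage(stage, task, context)
        context[stage] = result.output  # each stage sees all prior outputs
    return context


print(asyncio.run(collaborate("build a rate limiter"))["documenter"])
```

Because each stage receives the accumulated context, later stages can validate earlier ones rather than starting from scratch.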
How we built it
Languages & Frameworks:
- Python 3.9+ (async/await for concurrent workflows)
- FastAPI (REST API with auto-generated OpenAPI docs)
- Click (professional CLI framework)
- pytest (comprehensive testing)
APIs & Services:
- W&B Weave - Complete experiment tracking and lineage
- Tavily API - Web search for research tasks (integrated, ready to activate)
- OpenRouter - unified access to 200+ models across providers
Architecture:
- Sequential orchestrator manages 5-stage pipeline with timeout budgets
- Async Python for parallel agent task execution
- Hallucination detector with pattern matching (fake APIs, impossible claims)
- Thompson Sampling for adaptive multi-model selection
- W&B Weave tracking at every stage for complete observability
- Refinement loop with 3-iteration cap to prevent infinite loops
- Mixed model tiers: premium (GPT-4o) for architecture, budget (GPT-3.5) for docs
No backend server needed - runs as CLI or lightweight FastAPI service locally.
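The Thompson Sampling selection mentioned above can be sketched as a Beta-Bernoulli bandit over candidate models. The model names, the uniform prior, and the binary success signal are illustrative assumptions.

```python
import random


class ThompsonModelSelector:
    """Pick a model by sampling from Beta posteriors over observed success rates."""

    def __init__(self, models):
        # [successes + 1, failures + 1] per model (uniform Beta(1, 1) prior)
        self.stats = {m: [1, 1] for m in models}

    def select(self) -> str:
        # Sample a plausible success rate for each model; pick the best draw.
        draws = {m: random.betavariate(a, b) for m, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model: str, success: bool) -> None:
        self.stats[model][0 if success else 1] += 1


selector = ThompsonModelSelector(["gpt-4o", "gpt-3.5-turbo"])
model = selector.select()
# after the stage completes, feed back whether the review passed:
selector.update(model, success=True)
```

The sampling step naturally balances exploration (uncertain models get optimistic draws) against exploitation (proven models win most draws), which is how the selector adapts to task complexity over time.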
Challenges we ran into
1. Latency Reality Check
- Sequential makes 5-11 API calls (vs 1 baseline)
- Initially claimed "700x faster" - completely false
- Pivot: Honest positioning as "quality over speed"
- Solution: Clear guidance on multi-category vs single-category tasks
2. Hallucination Epidemic
- LLMs confidently generate non-existent APIs, impossible O(0) complexity
- Single-pass approaches have no validation mechanism
- Solution: Pattern-matching detector + multi-stage review validation
- Result: 0% hallucination rate on all evaluation tasks
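A pattern-matching detector like the one described can be sketched with a few regexes. The specific patterns below are illustrative examples, not the project's full rule set.

```python
import re

# Illustrative red-flag patterns: impossible complexity claims,
# absolute guarantees, and calls to APIs that do not exist.
PATTERNS = [
    (re.compile(r"O\(\s*0\s*\)"), "impossible O(0) complexity claim"),
    (re.compile(r"\b100%\s+guaranteed\b", re.IGNORECASE), "absolute guarantee"),
    (re.compile(r"\bos\.magic_sort\b"), "non-existent stdlib API"),
]


def detect_hallucinations(text: str) -> list:
    """Return a list of flagged issues found in a model's output."""
    return [label for pattern, label in PATTERNS if pattern.search(text)]


issues = detect_hallucinations("This runs in O(0) time using os.magic_sort().")
print(issues)  # flags both the complexity claim and the fake API
```

In the multi-stage pipeline, flags like these feed into the Reviewer stage, so a hallucination caught here triggers refinement instead of propagating downstream.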
3. Cost Explosion
- 5-11x more API calls = 5-11x higher cost per task
- Needed intelligent model selection to stay affordable
- Solution: Mixed model tiers (GPT-4o for critical stages, GPT-3.5 for simple stages)
- Result: Thompson Sampling adapts model selection based on task complexity
4. Refinement Loop Instability
- Initial implementation had infinite loops (review always finds "issues")
- Solution: 3-iteration cap + timeout budgets + quality score thresholds
- Learning: Refinement has diminishing returns after 2 iterations
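The capped refinement loop can be sketched as follows; the quality-scoring interface and the 0.8 acceptance threshold are illustrative assumptions, not the project's actual values.

```python
from typing import Callable

MAX_ITERATIONS = 3       # hard cap to prevent infinite refinement
QUALITY_THRESHOLD = 0.8  # assumed acceptance score; tune per deployment


def refine(code: str, review: Callable, fix: Callable) -> tuple:
    """Refine code until the review score passes or the iteration cap is hit."""
    for iteration in range(1, MAX_ITERATIONS + 1):
        score, issues = review(code)
        if score >= QUALITY_THRESHOLD or not issues:
            return code, iteration
        code = fix(code, issues)  # Refiner agent addresses the flagged issues
    return code, MAX_ITERATIONS  # stop at the cap: diminishing returns
```

Exiting on either condition (score passes, or no issues remain) is what breaks the "review always finds issues" cycle described above.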
Accomplishments that we're proud of
🎯 Perfect Benchmark Score:
- 100% success (10/10 tasks) vs 80% baseline (8/10)
- Zero hallucinations across all task types
- +25% quality score improvement
📊 Production-Grade Observability:
- Every stage tracked in W&B Weave with full lineage
- Per-stage metrics: quality, latency, token usage, cost
- Complete audit trail from request to final output
🚀 Developer-Friendly Interfaces:
- CLI: 6 commands (health, collaborate, evaluate, serve, init, config)
- REST API: 8 endpoints with auto-generated OpenAPI docs
- Comprehensive logging to facilitair_cli.log and facilitair_api.log
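The CLI surface can be sketched with Click's standard group/command pattern. Only two of the six commands are shown, and the command bodies and option names here are placeholders.

```python
import click


@click.group()
def cli():
    """Facilitair command-line interface (illustrative subset of commands)."""


@cli.command()
def health():
    """Check that the orchestrator and model providers are reachable."""
    click.echo("ok")


@cli.command()
@click.argument("task")
@click.option("--max-refinements", default=3, show_default=True,
              help="Cap on Refiner iterations.")
def collaborate(task, max_refinements):
    """Run the 5-stage pipeline on TASK."""
    click.echo(f"running 5-stage pipeline on: {task!r} (cap={max_refinements})")


if __name__ == "__main__":
    cli()
```

Click generates `--help` output and argument validation from the decorators, which keeps each of the six commands to a few lines.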
🔬 Honest Technical Communication:
- Corrected false "700x faster" marketing claim
- Documented real trade-offs (5-11x slower, 5-11x more expensive)
- Evidence-based guidance on when sequential beats single-model
⚡ Multi-Model Intelligence:
- Adaptive model selection per agent role
- 200+ models available via OpenAI, Anthropic, Google APIs
- Cost-optimized with mixed premium/budget tiers
What we learned
1. Multi-Stage Validation Eliminates Hallucinations
- Single LLM passes hallucinate ~10% of the time with high confidence
- Each subsequent stage acts as validator for previous stages
- Architect → Coder → Reviewer → Refiner creates 4 layers of verification
2. Specialization Beats Generalization
- Dedicated Architect agent outperforms "do everything" prompt
- Model matching matters: GPT-4o (architecture) > Qwen (coding) > GPT-3.5 (docs)
- Right agent for right task > best agent for all tasks
3. Observability is Non-Negotiable
- W&B Weave tracking made debugging and optimization possible
- Without stage-level metrics, can't identify bottlenecks
- Discovered refinement loop had diminishing returns after iteration 2
4. Honesty > Marketing Hype
- Initially oversold speed ("700x faster") - had to correct to truth
- Sequential is slower and more expensive, but MORE RELIABLE
- Users appreciate honest trade-off documentation
5. Async Python + FastAPI = Production AI
- Async/await handles concurrent agent execution elegantly
- FastAPI's automatic OpenAPI generation saved days of documentation
- Click framework made professional CLI development rapid
What's next for Facilitair
Immediate (1 week):
- Activate Tavily web search integration for research tasks
- Add streaming responses via WebSockets for real-time progress
- Cost dashboard showing per-task breakdown
Short-term (1 month):
- Daytona workspace integration for isolated execution environments
- Human-in-the-loop approval gates between stages
- Agent marketplace for custom agent definitions
Medium-term (3 months):
- Parallel execution where safe (Coder + Tester simultaneously)
- Automatic model routing based on learned task complexity patterns
- Multi-language support beyond Python (JavaScript, Rust, Go)
Long-term (6+ months):
- Self-improving: learn optimal stage sequences from past tasks
- Enterprise: team collaboration, audit logs, compliance
- Hosted SaaS with usage-based pricing
Many of the routing and task-execution methods predate the hackathon, but they were rebuilt by Claude for this competition with instructions to focus on specific features and areas of the platform. Happy to clarify if you're curious, but since I'm already 90 minutes past the submission deadline, I'll wait for that request.
Latency can be reduced by migrating parts of the system to Rust, as my earlier work on Facilitair has shown. Looking forward to rolling out the Facilitair beta platform soon - please add me on LinkedIn if you'd like to learn more!
Built With
- click
- fastapi
- pytest
- python