Voice Cloning Agent - Project Story
Inspiration
The inspiration for this project came from witnessing the gap between impressive AI voice cloning demos and production-ready systems that can actually scale. While open-source voice cloning models have matured significantly, deploying them at enterprise scale remains a complex challenge involving infrastructure management, security, session handling, and observability.
When Amazon launched Bedrock AgentCore, I saw an opportunity to bridge this gap. AgentCore's core primitives (Runtime, Memory, Gateway, Observability) provide exactly what's needed to transform a local voice cloning prototype into a production-grade service.
The addition of autonomous LLM reasoning with Amazon Nova Premier and Strands Agents framework takes this further - transforming the agent from a rule-based operation router into an intelligent system that interprets structured requests and makes autonomous decisions about tool execution.
What it does
Voice Cloning Agent is a production-ready, autonomous voice synthesis platform that lets users:
- Interact through a GraphQL API with structured mutations and queries
- Create voice profiles from uploaded audio, with automatic format conversion
- Clone voices by supplying a profile ID and the text to synthesize
- Manage profiles backed by secure S3 storage with presigned URLs
- Rely on autonomous backend processing, where the Amazon Nova Premier LLM interprets requests and selects the appropriate tools
- Store data securely with S3 encryption and presigned URLs
- Authenticate against the AWS AppSync GraphQL API with Cognito
- Access a globally deployed frontend hosted on AWS Amplify
The system features a Svelte UI with audio recording capabilities, OAuth authentication via Cognito, and integrates seamlessly with AWS services through AgentCore Runtime powered by Strands Agents framework and Amazon Bedrock Nova Premier.
How we built it
AI-Accelerated Development Workflow:
I leveraged cutting-edge AI development tools to dramatically accelerate the build process:
Research Phase (Amazon Quick Suite): Used Amazon Quick Suite's comprehensive research capabilities to generate a 500+ page technical analysis on voice cloning agents, AgentCore primitives, open-source model benchmarks, LLM reasoning frameworks, and deployment strategies. This deep research covered everything from model performance comparisons to autonomous agent architectures - completed in under an hour.
Specification Phase (Kiro): Fed the research document into Kiro to automatically generate detailed technical specifications, breaking down the implementation into structured tasks with clear requirements. Kiro helped architect the autonomous agent system, define LLM integration patterns, and plan the Strands framework integration.
Implementation Phase (Kiro 50% and Q CLI 50%): Used Kiro's autonomous coding capabilities to generate production-ready code across the entire stack - from Python backend with Strands integration to Svelte frontend with GraphQL API integration. Kiro handled file creation, code generation, and iterative refinements. Fixes, enhancements, LLM integration, and production hardening were further handled by Q Developer Pro CLI, which provided expert-level code reviews, caught critical GraphQL schema mismatches, and optimized the autonomous agent implementation.
This AI-assisted workflow compressed what would typically be weeks of research and development into days, while maintaining high code quality and architectural consistency.
Architecture:
I built the agent using AgentCore Runtime integrated with AWS managed services and autonomous LLM reasoning:
- Runtime: Containerized voice models (SpeechT5) deployed as serverless agent with Strands Agent framework
- LLM Reasoning: Amazon Bedrock Nova Premier for interpreting GraphQL requests and autonomous tool selection
- Tool Execution: Strands @tool decorators wrapping voice cloning operations (clone, create, list)
- API Layer: AppSync GraphQL API with a simplified executeAgent(prompt) mutation
- Lambda Resolver: Forwards structured prompts to AgentCore Runtime
- Observability: Built-in AgentCore tracing and CloudWatch monitoring
- Frontend: AWS Amplify hosting with Svelte UI
- Storage: S3 for voice profiles and audio (AgentCore Memory not used - direct S3 storage instead)
Autonomous Agent Implementation:
- Strands Agent: Initialized with BedrockModel (Nova Premier) and three tool functions
- Tool Decorators: @tool-decorated functions for clone_voice, create_profile, list_profiles
- Request Interpretation: Nova Premier interprets structured GraphQL prompts and autonomously selects tools
- Structured Returns: Tools return plain dicts (decorator handles wrapping automatically)
- Response Extraction: Multiple fallback paths for extracting tool results from LLM responses
- GraphQL Integration: AWSJSON type for flexible tool result data
Voice Models:
I integrated Microsoft's SpeechT5 from Hugging Face:
- SpeechT5: Text-to-speech synthesis with voice cloning capabilities
- SpeechBrain: Speaker encoder for voice embedding extraction
- HiFiGAN vocoder: High-quality audio generation
- Automatic text chunking (50 words) for long-form synthesis
- Audio concatenation with pydub for seamless output
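The 50-word chunking step can be sketched as follows (the function name is illustrative; in the actual pipeline each chunk is synthesized separately and the resulting segments are concatenated with pydub):

```python
def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into chunks of at most max_words words, so each
    chunk stays within the voice model's tensor size limits."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Keeping the split on word boundaries (rather than raw character counts) avoids cutting a word in half at a chunk edge, which would produce audible artifacts at the concatenation points.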
Infrastructure:
- TypeScript CDK: Single source of truth for all infrastructure (S3, Cognito, AppSync, Lambda, Amplify)
- Docker: Containerization for AgentCore Runtime with Strands framework
- S3: SSE-S3 encryption for voice profiles and generated audio with CORS configuration
- Lambda: GraphQL resolver with direct prompt forwarding to AgentCore
- AppSync: Simplified GraphQL schema with an executeAgent(prompt: String!) mutation
- Amplify: Production frontend hosting with automatic deployments
- Cognito: User authentication with OAuth2 and M2M clients
- Bedrock Permissions: IAM policies for Nova Premier model invocation
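The Lambda resolver's prompt forwarding can be sketched like this. The AppSync event shape (arguments under event["arguments"]) is standard for Lambda resolvers; the AgentCore invocation itself is stubbed out here, since in the real resolver it is a boto3 call to the deployed runtime:

```python
import json


def invoke_agentcore(prompt: str) -> dict:
    """Hypothetical placeholder for the AgentCore Runtime call;
    the real resolver invokes the deployed agent via boto3."""
    return {"result": {"echo": prompt}}  # illustrative stub


def handler(event: dict, context=None) -> str:
    # AppSync passes resolver arguments under event["arguments"];
    # the simplified schema exposes a single required prompt string.
    prompt = event["arguments"]["prompt"]
    agent_response = invoke_agentcore(prompt)
    # The mutation's return type is AWSJSON, so the resolver
    # hands back a JSON-encoded string.
    return json.dumps(agent_response)
```

Because the schema has exactly one mutation, the resolver stays a thin pass-through: all operation-specific logic lives behind the agent, not in Lambda.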
Frontend:
- Svelte 4: Reactive UI components with traditional form-based interface
- Vite 5: Lightning-fast development and optimized production builds
- Tailwind CSS 3: Responsive design system
- AWS Amplify SDK: GraphQL client with autonomous agent mutation
- Structured API Calls: UI uses convenience wrappers that internally leverage LLM reasoning
- AWSJSON Parsing: Type-safe handling of flexible tool result data
Deployment:
Fully automated two-command deployment:
- ./scripts/deploy_complete.sh - CDK IaC and AgentCore backend deployment with Strands
- ./scripts/deploy_amplify.sh - Frontend deployment to Amplify
CDK handles:
- All AWS resource creation and configuration
- Bedrock Nova Premier IAM permissions
- Simplified GraphQL schema deployment
- Lambda resolver with environment variables
- Amplify app and branch setup
- S3 CORS with dynamic Amplify URL
- Cognito pool and client configuration
Challenges we ran into
GraphQL Schema Mismatch:
The initial implementation used a nested input object, executeAgent(input: AgentInput!), while the UI expected a direct parameter, executeAgent(prompt: String!).
Solution: Simplified schema to direct prompt parameter, updated Lambda to access args['prompt'] directly, and updated UI component - caught through comprehensive end-to-end code review.
Lambda Reserved Environment Variables:
Attempted to set AWS_REGION as Lambda environment variable, which is reserved by Lambda runtime.
Solution: Removed from CDK environment variables and used Lambda's built-in AWS_REGION variable instead.
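The fix boils down to reading the runtime-provided variable instead of declaring it in CDK. A minimal sketch (the local fallback value is only for running outside Lambda):

```python
import os

# AWS_REGION is a reserved key: the Lambda runtime sets it
# automatically, so CDK must not define it in the function's
# environment block. The code just reads the built-in value.
region = os.environ.get("AWS_REGION", "us-east-1")  # fallback for local runs only
```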
Strands Tool Return Format:
Initially double-wrapped tool returns (manual wrapping + decorator wrapping) causing nested data structures.
Solution: Return plain dicts from tools - @tool decorator handles wrapping automatically per Strands documentation.
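The double-wrapping pitfall can be illustrated with a minimal stand-in decorator. The envelope shape below is purely illustrative (the real Strands @tool envelope format is internal to the framework); the point is that wrapping manually and letting the decorator wrap again nests the payload twice:

```python
import functools


def tool(func):
    """Stand-in for the Strands @tool decorator: it wraps whatever
    the function returns into a result envelope (shape illustrative)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return {"toolResult": {"content": func(*args, **kwargs)}}
    return wrapper


@tool
def list_profiles_plain():
    # Correct: return a plain dict and let the decorator wrap it once.
    return {"profiles": ["alice", "bob"]}


@tool
def list_profiles_wrapped():
    # Bug: manual wrapping plus decorator wrapping double-nests the data.
    return {"toolResult": {"content": {"profiles": ["alice", "bob"]}}}
```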
LLM Response Extraction:
Tool results appeared in varying response structures from Nova Premier.
Solution: Implemented multiple fallback paths checking llm_result.content for toolResult structures with different nesting levels.
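The fallback-path pattern can be sketched as a helper that walks the response content and returns the first toolResult it finds. The two nesting shapes checked below are illustrative of the pattern, not an exhaustive list of Nova Premier response formats:

```python
def extract_tool_result(llm_result):
    """Walk an LLM response and return the first toolResult payload
    found, checking several nesting shapes (shapes illustrative)."""
    content = getattr(llm_result, "content", None) or (
        llm_result.get("content") if isinstance(llm_result, dict) else None
    )
    if not content:
        return None
    for block in content:
        if not isinstance(block, dict):
            continue
        # Fallback 1: toolResult at the top level of a content block.
        if "toolResult" in block:
            return block["toolResult"]
        # Fallback 2: toolResult nested one level deeper under "content".
        inner = block.get("content")
        if isinstance(inner, list):
            for item in inner:
                if isinstance(item, dict) and "toolResult" in item:
                    return item["toolResult"]
    return None
```

Centralizing the extraction in one helper means that when a new response shape appears, only one function needs a new fallback branch.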
AWSJSON Type Handling: GraphQL AWSJSON type returned either string or already-parsed object depending on client. Solution: Added type checking in UI - parse if string, use directly if already object.
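The actual check lives in the Svelte UI; the equivalent logic, expressed in Python terms, is a single normalization step:

```python
import json


def parse_awsjson(value):
    """AWSJSON fields may arrive as a raw JSON string or as an
    already-parsed object depending on the GraphQL client, so
    normalize both cases to a plain Python object."""
    if isinstance(value, str):
        return json.loads(value)
    return value
```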
Lambda Response Size Limits: Initial implementation returned base64-encoded audio in Lambda responses, hitting the 6MB payload limit. Solution: Implemented presigned S3 URLs with 1-hour expiry for efficient audio delivery, reducing response size by 99%.
Audio Format Compatibility: Users uploaded various audio formats (MP4, M4A, MP3) that required conversion. Solution: Implemented automatic MP4 to WAV conversion using pydub's AudioSegment, handling format detection transparently.
Text Length Limitations: Long text inputs caused tensor size errors in voice models. Solution: Implemented automatic text chunking (50 words per chunk) with audio concatenation using pydub for seamless output.
Accomplishments that we're proud of
Autonomous LLM Reasoning: Successfully integrated Amazon Nova Premier with Strands Agents framework for autonomous decision-making - the backend agent interprets structured GraphQL requests and selects appropriate tools without rule-based routing, while maintaining a traditional UI/UX.
Production-Ready AI Agent: Built enterprise-grade autonomous agent meeting AWS AI agent qualifications with intelligent request interpretation, tool execution, structured outputs, and comprehensive error handling.
Simplified API Design:
Reduced GraphQL schema complexity from multiple operation-specific mutations to a single executeAgent(prompt) mutation - more intuitive and more extensible.
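A client request against the single mutation reduces to one standard GraphQL-over-HTTP body. The field name and String! type come from the post; the operation name and helper below are illustrative:

```python
import json

EXECUTE_AGENT = """
mutation ExecuteAgent($prompt: String!) {
  executeAgent(prompt: $prompt)
}
"""


def build_request(prompt: str) -> str:
    """Build the JSON body for a GraphQL-over-HTTP POST that calls
    the single executeAgent mutation with a prompt variable."""
    return json.dumps({
        "query": EXECUTE_AGENT,
        "variables": {"prompt": prompt},
    })
```

Every operation (create profile, clone voice, list profiles) goes through this one entry point; the distinction between operations lives in the prompt, which the backend LLM interprets.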
Strands Framework Integration: Correctly implemented Strands Agent with proper tool decorators, response extraction, and Nova Premier configuration - verified against official documentation with 95% confidence.
Zero-Manual-Step Deployment: Achieved fully automated deployment from infrastructure to frontend with just two commands. CDK creates Amplify app, configures Bedrock permissions, deployment scripts use CDK outputs - no manual AWS Console steps required.
Comprehensive Code Review: Conducted thorough end-to-end review catching critical issues (GraphQL schema mismatch, Lambda environment variables, tool return formats) before production deployment.
Efficient Audio Delivery: Solved Lambda payload limits with presigned S3 URLs, enabling unlimited audio file sizes while maintaining sub-500ms response times.
Responsive Modern UI: Created Svelte-based interface with compact profile cards, multi-format audio upload, browser recording, and mobile-first responsive design that maintains traditional form-based interaction while leveraging autonomous backend processing.
Clean Codebase: Maintained minimal, focused codebase with autonomous agent implementation, organized scripts folder, and comprehensive documentation.
What we learned
LLM Reasoning Transforms Agents: Adding Nova Premier LLM reasoning transforms rule-based agents into autonomous systems - the backend interprets structured GraphQL requests and decides execution paths intelligently, eliminating hardcoded operation routing while maintaining familiar API patterns.
Strands Framework Best Practices: Tool decorators automatically wrap returns - manual wrapping creates double-nested structures. Response extraction needs multiple fallback paths due to varying LLM response formats.
GraphQL Schema Design:
Simpler is better - a direct parameter (prompt: String!) is more intuitive than a nested input object (input: AgentInput!). The mismatch was caught through comprehensive code review.
AWSJSON Flexibility: GraphQL AWSJSON type provides flexibility for varying tool results but requires type checking in clients - may be string or already-parsed object.
AgentCore's Power: AgentCore eliminates infrastructure complexity - automatic session isolation, built-in observability, and managed runtime transformed weeks of work into days of integration.
CDK as Single Source of Truth: Using TypeScript CDK for all infrastructure (including Amplify and Bedrock permissions) eliminated configuration drift and made deployments reproducible.
Presigned URLs for Large Files: For large binary data (audio, video), presigned S3 URLs are far superior to base64 encoding in API responses - better performance, no size limits, and efficient browser caching.
AI-Assisted Development: Quick Suite + Kiro + Q CLI workflow compressed weeks of research and development into days while maintaining high code quality through automated reviews and optimizations.
Code Review Importance: Comprehensive end-to-end review caught critical issues (schema mismatches, environment variables, tool formats) that would have caused production failures.
What's next for Voice Cloning Agent
Enhanced LLM Capabilities:
- Multi-turn conversations with context retention
- Streaming LLM responses for real-time feedback
- Advanced prompt engineering for better tool selection
- Conversation history and session management
- Support for more complex multi-step operations
AgentCore Gateway Integration:
- Add AgentCore Gateway as unified API layer fronting AppSync GraphQL endpoint
- Use OpenAPI target to transform AppSync into MCP-compatible tools
- Leverage AgentCore's built-in OAuth and inbound/outbound authorization
- Enable tool discovery and semantic search capabilities
- Maintain existing AppSync infrastructure while adding Gateway benefits
Enhanced Model Support:
- Integrate CosyVoice2 for ultra-low latency synthesis (150ms)
- Add Fish Speech V1.5 for highest quality output
- Support XTTS-v2 for multilingual voice cloning (17 languages)
- Model quantization for faster inference
Advanced Features:
- Real-time streaming synthesis for conversational AI
- Voice mixing to blend multiple voice profiles
- Emotion and prosody control through structured parameters
- Background noise removal and audio enhancement
- Voice profile versioning and management
Performance Optimization:
- Response caching for repeated requests
- GPU optimization for better throughput
- Parallel synthesis for batch requests
- Model optimization and quantization
Enterprise Capabilities:
- Multi-region deployment for global latency optimization
- Custom model fine-tuning on user data
- Batch processing API for high-volume synthesis
- Advanced analytics and usage tracking
- Team collaboration features
Integration Ecosystem:
- AgentCore Browser integration for web scraping with voice
- Agent2Agent (A2A) protocol for multi-agent workflows
- MCP server for voice cloning as a tool for other agents
- Webhook support for event-driven synthesis
- REST API alongside GraphQL for broader compatibility
- Multi-agent orchestration with Strands framework