Voice Cloning Agent - Project Story
Inspiration
The inspiration for this project came from witnessing the gap between impressive AI voice cloning demos and production-ready systems that can actually scale. While open-source voice cloning models have matured significantly, deploying them at enterprise scale remains a complex challenge involving infrastructure management, security, session handling, and observability.
When Amazon launched Bedrock AgentCore, I saw an opportunity to bridge this gap. AgentCore's core primitives (Runtime, Memory, Gateway, Observability) provide exactly what's needed to transform a local voice cloning prototype into a production-grade service.
The addition of autonomous LLM reasoning with Amazon Nova Premier and Strands Agents framework takes this further - transforming the agent from a rule-based operation router into an intelligent system that interprets structured requests and makes autonomous decisions about tool execution.
What it does
Voice Cloning Agent is a production-ready, autonomous voice synthesis platform that lets users:
- Interact through a GraphQL API with structured mutations and queries
- Create voice profiles from uploaded audio, with automatic format conversion
- Clone voices by supplying a profile ID and the text to synthesize
- Manage profiles backed by secure S3 storage with presigned URLs
- Rely on autonomous backend processing, where the Amazon Nova Premier LLM interprets requests and selects the appropriate tools
- Store data securely with S3 encryption and presigned URLs
- Authenticate against the AWS AppSync GraphQL API with Cognito
- Access a globally deployed frontend hosted on AWS Amplify
The system features a Svelte UI with audio recording capabilities, OAuth authentication via Cognito, and integrates seamlessly with AWS services through AgentCore Runtime powered by Strands Agents framework and Amazon Bedrock Nova Premier.
How we built it
AI-Accelerated Development Workflow:
I leveraged cutting-edge AI development tools to dramatically accelerate the build process:
Research Phase (Amazon Quick Suite): Used Amazon Quick Suite's comprehensive research capabilities to generate a 500+ page technical analysis on voice cloning agents, AgentCore primitives, open-source model benchmarks, LLM reasoning frameworks, and deployment strategies. This deep research covered everything from model performance comparisons to autonomous agent architectures - completed in under an hour.
Specification Phase (Kiro): Fed the research document into Kiro to automatically generate detailed technical specifications, breaking down the implementation into structured tasks with clear requirements. Kiro helped architect the autonomous agent system, define LLM integration patterns, and plan the Strands framework integration.
Implementation Phase (Kiro 50% and Q CLI 50%): Used Kiro's autonomous coding capabilities to generate production-ready code across the entire stack - from Python backend with Strands integration to Svelte frontend with GraphQL API integration. Kiro handled file creation, code generation, and iterative refinements. Fixes, enhancements, LLM integration, and production hardening were further handled by Q Developer Pro CLI, which provided expert-level code reviews, caught critical GraphQL schema mismatches, and optimized the autonomous agent implementation.
This AI-assisted workflow compressed what would typically be weeks of research and development into days, while maintaining high code quality and architectural consistency.
Architecture:
I built the agent using AgentCore Runtime integrated with AWS managed services and autonomous LLM reasoning:
- Runtime: Containerized voice models (SpeechT5) deployed as serverless agent with Strands Agent framework
- LLM Reasoning: Amazon Bedrock Nova Premier for interpreting GraphQL requests and autonomous tool selection
- Tool Execution: Strands @tool decorators wrapping voice cloning operations (clone, create, list)
- API Layer: AppSync GraphQL API with a simplified executeAgent(prompt) mutation
- Lambda Resolver: Forwards structured prompts to AgentCore Runtime
- Observability: Built-in AgentCore tracing and CloudWatch monitoring
- Frontend: AWS Amplify hosting with Svelte UI
- Storage: S3 for voice profiles and audio (AgentCore Memory not used - direct S3 storage instead)
Autonomous Agent Implementation:
- Strands Agent: Initialized with BedrockModel (Nova Premier) and three tool functions
- Tool Decorators: @tool-decorated functions for clone_voice, create_profile, list_profiles
- Request Interpretation: Nova Premier interprets structured GraphQL prompts and autonomously selects tools
- Structured Returns: Tools return plain dicts (decorator handles wrapping automatically)
- Response Extraction: Multiple fallback paths for extracting tool results from LLM responses
- GraphQL Integration: AWSJSON type for flexible tool result data
Voice Models:
I integrated Microsoft's SpeechT5 from Hugging Face:
- SpeechT5: Text-to-speech synthesis with voice cloning capabilities
- SpeechBrain: Speaker encoder for voice embedding extraction
- HiFiGAN vocoder: High-quality audio generation
- Automatic text chunking (50 words) for long-form synthesis
- Audio concatenation with pydub for seamless output
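The 50-word chunking step can be sketched as follows (the function name is illustrative; in the actual pipeline each chunk is synthesized separately and the resulting segments are concatenated with pydub):

```python
def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into chunks of at most max_words words, so each
    chunk stays within the voice model's tensor size limits."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Keeping the split on word boundaries (rather than raw character counts) avoids cutting a word in half at a chunk edge, which would produce audible artifacts at the concatenation points.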
Infrastructure:
- TypeScript CDK: Single source of truth for all infrastructure (S3, Cognito, AppSync, Lambda, Amplify)
- Docker: Containerization for AgentCore Runtime with Strands framework
- S3: SSE-S3 encryption for voice profiles and generated audio with CORS configuration
- Lambda: GraphQL resolver with direct prompt forwarding to AgentCore
- AppSync: Simplified GraphQL schema with an executeAgent(prompt: String!) mutation
- Amplify: Production frontend hosting with automatic deployments
- Cognito: User authentication with OAuth2 and M2M clients
- Bedrock Permissions: IAM policies for Nova Premier model invocation
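The Lambda resolver's prompt forwarding can be sketched like this. The AppSync event shape (arguments under event["arguments"]) is standard for Lambda resolvers; the AgentCore invocation itself is stubbed out here, since in the real resolver it is a boto3 call to the deployed runtime:

```python
import json


def invoke_agentcore(prompt: str) -> dict:
    """Hypothetical placeholder for the AgentCore Runtime call;
    the real resolver invokes the deployed agent via boto3."""
    return {"result": {"echo": prompt}}  # illustrative stub


def handler(event: dict, context=None) -> str:
    # AppSync passes resolver arguments under event["arguments"];
    # the simplified schema exposes a single required prompt string.
    prompt = event["arguments"]["prompt"]
    agent_response = invoke_agentcore(prompt)
    # The mutation's return type is AWSJSON, so the resolver
    # hands back a JSON-encoded string.
    return json.dumps(agent_response)
```

Because the schema has exactly one mutation, the resolver stays a thin pass-through: all operation-specific logic lives behind the agent, not in Lambda.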
Frontend:
- Svelte 4: Reactive UI components with traditional form-based interface
- Vite 5: Lightning-fast development and optimized production builds
- Tailwind CSS 3: Responsive design system
- AWS Amplify SDK: GraphQL client with autonomous agent mutation
- Structured API Calls: UI uses convenience wrappers that internally leverage LLM reasoning
- AWSJSON Parsing: Type-safe handling of flexible tool result data
Deployment:
Fully automated two-command deployment:
- ./scripts/deploy_complete.sh - CDK IaC and AgentCore backend deployment with Strands
- ./scripts/deploy_amplify.sh - Frontend deployment to Amplify
CDK handles:
- All AWS resource creation and configuration
- Bedrock Nova Premier IAM permissions
- Simplified GraphQL schema deployment
- Lambda resolver with environment variables
- Amplify app and branch setup
- S3 CORS with dynamic Amplify URL
- Cognito pool and client configuration
Challenges we ran into
GraphQL Schema Mismatch:
The initial implementation used a nested input object, executeAgent(input: AgentInput!), while the UI expected a direct parameter, executeAgent(prompt: String!).
Solution: Simplified schema to direct prompt parameter, updated Lambda to access args['prompt'] directly, and updated UI component - caught through comprehensive end-to-end code review.
Lambda Reserved Environment Variables:
Attempted to set AWS_REGION as Lambda environment variable, which is reserved by Lambda runtime.
Solution: Removed from CDK environment variables and used Lambda's built-in AWS_REGION variable instead.
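The fix boils down to reading the runtime-provided variable instead of declaring it in CDK. A minimal sketch (the local fallback value is only for running outside Lambda):

```python
import os

# AWS_REGION is a reserved key: the Lambda runtime sets it
# automatically, so CDK must not define it in the function's
# environment block. The code just reads the built-in value.
region = os.environ.get("AWS_REGION", "us-east-1")  # fallback for local runs only
```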
Strands Tool Return Format:
Initially double-wrapped tool returns (manual wrapping + decorator wrapping) causing nested data structures.
Solution: Return plain dicts from tools - @tool decorator handles wrapping automatically per Strands documentation.
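The double-wrapping pitfall can be illustrated with a minimal stand-in decorator. The envelope shape below is purely illustrative (the real Strands @tool envelope format is internal to the framework); the point is that wrapping manually and letting the decorator wrap again nests the payload twice:

```python
import functools


def tool(func):
    """Stand-in for the Strands @tool decorator: it wraps whatever
    the function returns into a result envelope (shape illustrative)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return {"toolResult": {"content": func(*args, **kwargs)}}
    return wrapper


@tool
def list_profiles_plain():
    # Correct: return a plain dict and let the decorator wrap it once.
    return {"profiles": ["alice", "bob"]}


@tool
def list_profiles_wrapped():
    # Bug: manual wrapping plus decorator wrapping double-nests the data.
    return {"toolResult": {"content": {"profiles": ["alice", "bob"]}}}
```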
LLM Response Extraction:
Tool results appeared in varying response structures from Nova Premier.
Solution: Implemented multiple fallback paths checking llm_result.content for toolResult structures with different nesting levels.
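The fallback-path pattern can be sketched as a helper that walks the response content and returns the first toolResult it finds. The two nesting shapes checked below are illustrative of the pattern, not an exhaustive list of Nova Premier response formats:

```python
def extract_tool_result(llm_result):
    """Walk an LLM response and return the first toolResult payload
    found, checking several nesting shapes (shapes illustrative)."""
    content = getattr(llm_result, "content", None) or (
        llm_result.get("content") if isinstance(llm_result, dict) else None
    )
    if not content:
        return None
    for block in content:
        if not isinstance(block, dict):
            continue
        # Fallback 1: toolResult at the top level of a content block.
        if "toolResult" in block:
            return block["toolResult"]
        # Fallback 2: toolResult nested one level deeper under "content".
        inner = block.get("content")
        if isinstance(inner, list):
            for item in inner:
                if isinstance(item, dict) and "toolResult" in item:
                    return item["toolResult"]
    return None
```

Centralizing the extraction in one helper means that when a new response shape appears, only one function needs a new fallback branch.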
AWSJSON Type Handling: GraphQL AWSJSON type returned either string or already-parsed object depending on client. Solution: Added type checking in UI - parse if string, use directly if already object.
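The actual check lives in the Svelte UI; the equivalent logic, expressed in Python terms, is a single normalization step:

```python
import json


def parse_awsjson(value):
    """AWSJSON fields may arrive as a raw JSON string or as an
    already-parsed object depending on the GraphQL client, so
    normalize both cases to a plain Python object."""
    if isinstance(value, str):
        return json.loads(value)
    return value
```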
Lambda Response Size Limits: Initial implementation returned base64-encoded audio in Lambda responses, hitting the 6MB payload limit. Solution: Implemented presigned S3 URLs with 1-hour expiry for efficient audio delivery, reducing response size by 99%.
Audio Format Compatibility: Users uploaded various audio formats (MP4, M4A, MP3) that required conversion. Solution: Implemented automatic MP4 to WAV conversion using pydub's AudioSegment, handling format detection transparently.
Text Length Limitations: Long text inputs caused tensor size errors in voice models. Solution: Implemented automatic text chunking (50 words per chunk) with audio concatenation using pydub for seamless output.
Accomplishments that we're proud of
Autonomous LLM Reasoning: Successfully integrated Amazon Nova Premier with Strands Agents framework for autonomous decision-making - the backend agent interprets structured GraphQL requests and selects appropriate tools without rule-based routing, while maintaining a traditional UI/UX.
Production-Ready AI Agent: Built enterprise-grade autonomous agent meeting AWS AI agent qualifications with intelligent request interpretation, tool execution, structured outputs, and comprehensive error handling.
Simplified API Design:
Reduced GraphQL schema complexity from multiple operation-specific mutations to a single executeAgent(prompt) mutation - more intuitive and more extensible.
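A client request against the single mutation reduces to one standard GraphQL-over-HTTP body. The field name and String! type come from the post; the operation name and helper below are illustrative:

```python
import json

EXECUTE_AGENT = """
mutation ExecuteAgent($prompt: String!) {
  executeAgent(prompt: $prompt)
}
"""


def build_request(prompt: str) -> str:
    """Build the JSON body for a GraphQL-over-HTTP POST that calls
    the single executeAgent mutation with a prompt variable."""
    return json.dumps({
        "query": EXECUTE_AGENT,
        "variables": {"prompt": prompt},
    })
```

Every operation (create profile, clone voice, list profiles) goes through this one entry point; the distinction between operations lives in the prompt, which the backend LLM interprets.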
Strands Framework Integration: Correctly implemented Strands Agent with proper tool decorators, response extraction, and Nova Premier configuration - verified against official documentation with 95% confidence.
Zero-Manual-Step Deployment: Achieved fully automated deployment from infrastructure to frontend with just two commands. CDK creates Amplify app, configures Bedrock permissions, deployment scripts use CDK outputs - no manual AWS Console steps required.
Comprehensive Code Review: Conducted thorough end-to-end review catching critical issues (GraphQL schema mismatch, Lambda environment variables, tool return formats) before production deployment.
Efficient Audio Delivery: Solved Lambda payload limits with presigned S3 URLs, enabling unlimited audio file sizes while maintaining sub-500ms response times.
Responsive Modern UI: Created Svelte-based interface with compact profile cards, multi-format audio upload, browser recording, and mobile-first responsive design that maintains traditional form-based interaction while leveraging autonomous backend processing.
Clean Codebase: Maintained minimal, focused codebase with autonomous agent implementation, organized scripts folder, and comprehensive documentation.
What we learned
LLM Reasoning Transforms Agents: Adding Nova Premier LLM reasoning transforms rule-based agents into autonomous systems - the backend interprets structured GraphQL requests and decides execution paths intelligently, eliminating hardcoded operation routing while maintaining familiar API patterns.
Strands Framework Best Practices: Tool decorators automatically wrap returns - manual wrapping creates double-nested structures. Response extraction needs multiple fallback paths due to varying LLM response formats.
GraphQL Schema Design:
Simpler is better - a direct parameter (prompt: String!) is more intuitive than a nested input object (input: AgentInput!). The mismatch was caught through comprehensive code review.
AWSJSON Flexibility: GraphQL AWSJSON type provides flexibility for varying tool results but requires type checking in clients - may be string or already-parsed object.
AgentCore's Power: AgentCore eliminates infrastructure complexity - automatic session isolation, built-in observability, and managed runtime transformed weeks of work into days of integration.
CDK as Single Source of Truth: Using TypeScript CDK for all infrastructure (including Amplify and Bedrock permissions) eliminated configuration drift and made deployments reproducible.
Presigned URLs for Large Files: For large binary data (audio, video), presigned S3 URLs are far superior to base64 encoding in API responses - better performance, no size limits, and efficient browser caching.
AI-Assisted Development: Quick Suite + Kiro + Q CLI workflow compressed weeks of research and development into days while maintaining high code quality through automated reviews and optimizations.
Code Review Importance: Comprehensive end-to-end review caught critical issues (schema mismatches, environment variables, tool formats) that would have caused production failures.
What's next for Voice Cloning Agent
Enhanced LLM Capabilities:
- Multi-turn conversations with context retention
- Streaming LLM responses for real-time feedback
- Advanced prompt engineering for better tool selection
- Conversation history and session management
- Support for more complex multi-step operations
AgentCore Gateway Integration:
- Add AgentCore Gateway as unified API layer fronting AppSync GraphQL endpoint
- Use OpenAPI target to transform AppSync into MCP-compatible tools
- Leverage AgentCore's built-in OAuth and inbound/outbound authorization
- Enable tool discovery and semantic search capabilities
- Maintain existing AppSync infrastructure while adding Gateway benefits
Enhanced Model Support:
- Integrate CosyVoice2 for ultra-low latency synthesis (150ms)
- Add Fish Speech V1.5 for highest quality output
- Support XTTS-v2 for multilingual voice cloning (17 languages)
- Model quantization for faster inference
Advanced Features:
- Real-time streaming synthesis for conversational AI
- Voice mixing to blend multiple voice profiles
- Emotion and prosody control through structured parameters
- Background noise removal and audio enhancement
- Voice profile versioning and management
Performance Optimization:
- Response caching for repeated requests
- GPU optimization for better throughput
- Parallel synthesis for batch requests
- Model optimization and quantization
Enterprise Capabilities:
- Multi-region deployment for global latency optimization
- Custom model fine-tuning on user data
- Batch processing API for high-volume synthesis
- Advanced analytics and usage tracking
- Team collaboration features
Integration Ecosystem:
- AgentCore Browser integration for web scraping with voice
- Agent2Agent (A2A) protocol for multi-agent workflows
- MCP server for voice cloning as a tool for other agents
- Webhook support for event-driven synthesis
- REST API alongside GraphQL for broader compatibility
- Multi-agent orchestration with Strands framework