GoML NetworkMatic RANflow: RAN Co-pilot
A Next-Generation AI Assistant for Telecommunications Network Operations
Inspiration
In the rapidly evolving telecommunications landscape, Radio Access Network (RAN) engineers face an unprecedented challenge: managing increasingly complex 5G and 6G networks across hundreds or thousands of cell sites, each generating massive volumes of telemetry data. Traditional approaches to network management—static dashboards, siloed tools, and manual investigation—have become inadequate in the face of this scale and complexity.
We were inspired by the transformative potential of Generative AI to revolutionize network operations. We envisioned a system that would act as a tireless, intelligent co-pilot—not a replacement for expert engineers, but a force multiplier that could:
- Instantly synthesize vast amounts of raw network data into actionable insights
- Proactively identify performance issues before they impact customers
- Explain root causes with precision and clarity
- Recommend specific actions backed by data analysis
- Communicate in natural language so engineers could interact via conversation, not complex interfaces
Our inspiration crystallized around a singular premise: What if RAN engineers could simply ask their network what's wrong—and get intelligent, synthesized answers in seconds?
The GoML hackathon provided the perfect platform to turn this vision into reality. We set out to build a system that would demonstrate how AI agents, powered by modern LLMs and informed by real-time telemetry, could fundamentally transform how telecommunications operators manage their most critical infrastructure.
What It Does
RAN Co-pilot is a multi-layered AI system that combines advanced analytics, machine learning, and natural language processing to provide telecommunications engineers with intelligent, conversational access to their network operations data.
Core Capabilities
1. Real-Time Network Analytics
- Detects performance anomalies across cell sites in real time
- Identifies degraded cell clusters and root causes
- Correlates Customer Experience Metrics (CEM) with Key Performance Indicators (KPIs)
- Detects network slice congestion and resource bottlenecks
- Generates geospatial heatmaps for visual network health assessment
2. Intelligent Recommendations
- Performs deep root cause analysis on identified issues
- Simulates the impact of configuration parameter changes before deployment
- Generates optimization recommendations backed by data
- Prioritizes actions based on business impact and urgency
3. Proactive Intelligence
- Predicts equipment faults before they occur
- Forecasts traffic spikes for upcoming events
- Recommends preventive maintenance schedules
- Helps optimize resource allocation
4. Conversational AI Interface
- Natural language query processing
- Context-aware responses that synthesize complex data
- Proactive suggestions for next-step investigations
- Session management for multi-turn conversations
5. Operational Automation
- Generates configuration scripts for approved recommendations
- Creates trouble tickets with auto-populated context
- Enables programmatic access to network intelligence
System Architecture
The system is built on a unified agent architecture.
The architecture eschews traditional microservices complexity in favor of a single, powerful agent that can orchestrate multiple tools intelligently. This design philosophy yields:
- Lower latency (single round-trip to the LLM vs. multiple)
- Better reasoning (the agent maintains context across multiple tool calls)
- Easier maintenance (one codebase, one deployment unit)
- Superior UX (the agent thinks before speaking, not just executing commands)
How We Built It
Technology Stack
Frontend:
- Modern, responsive UI (deployed on AWS Amplify)
- Interactive geospatial map for cell site visualization
- Real-time dashboard with KPI metrics
- Analytics charts with drill-down capabilities
Backend:
- FastAPI (Python) for data API endpoints
- Strands Agent Framework for AI orchestration
- Amazon Bedrock with AgentCore Runtime for managed agent execution and orchestration
- Amazon Bedrock AgentCore Gateway for seamless model and tool routing
- Amazon Nova Pro LLM (apac.amazon.nova-pro-v1:0) for inference
- Amazon Athena for interactive SQL queries over S3
- AWS Lambda for serverless execution
- Amazon ECR for container registry
Data Pipeline:
- Amazon S3 data lake for centralized storage
- Synthetic data generation for rich demo datasets
- CSV-based telemetry ingestion
- Automated Athena schema management
Development Process
Phase 1: Foundation & Architecture (Days 1-2)
We began with rigorous planning, defining:
- The complete taxonomy of network metrics (RRC success rate, handover success rate, throughput, etc.)
- The 13 essential tools that every RAN engineer would need
- Data schema design for Athena tables
- Agent system prompts and tool descriptions
Phase 2: Core Agent Development (Days 2-3)
We implemented the Strands Agent using AWS Bedrock as the underlying LLM:
- Integrated Amazon Nova Pro for inference
- Built 13 specialized tools that bridge the agent to real network data
- Crafted detailed tool descriptions to guide the model's decision-making
- Implemented robust error handling and fallback mechanisms
- Configured non-streaming mode to ensure complete, synthesized responses
Key Innovation: We recognized early that streaming LLM responses would expose the agent's "thinking process" to users, breaking the illusion of an intelligent assistant. By configuring the agent for non-streaming mode, we ensured users only see the final, synthesized answer—making the system feel like a true co-pilot, not a raw LLM interface.
Phase 3: Data Foundation (Days 3-4)
We built the data pipeline:
- Created Athena database with 4 core tables (analytics_ue_metrics, analytics_alarms, analytics_config_changes, analytics_slice_metrics)
- Generated 10,000+ synthetic records mimicking realistic 5G network telemetry
- Implemented robust CSV parsing with error handling
- Validated schema against actual network data structures
Challenge Addressed: Athena's strict type checking required careful handling of timestamp parsing and data type mismatches. We set the use.null.for.invalid.data table property so that values that fail to parse would be treated as nulls rather than failing the whole query.
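The error-tolerant CSV parsing mentioned above can be sketched in a few lines. This is a minimal illustration, not the project's ingestion code: the column names (cell_id, timestamp, throughput_mbps) and the timestamp layout are assumptions, and invalid values become None, mirroring the null-on-invalid behaviour wanted on the Athena side.

```python
import csv
import io
from datetime import datetime
from typing import Optional

# Assumed timestamp layout; the real telemetry format is not given in the write-up.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_ts(raw: str) -> Optional[datetime]:
    """Return a datetime, or None when the value does not match TS_FORMAT."""
    try:
        return datetime.strptime(raw.strip(), TS_FORMAT)
    except ValueError:
        return None


def parse_float(raw: str) -> Optional[float]:
    """Return a float, or None for empty or non-numeric values."""
    try:
        return float(raw)
    except ValueError:
        return None


def load_rows(csv_text: str) -> list[dict]:
    """Parse CSV text, replacing unparseable fields with None instead of failing."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "cell_id": rec.get("cell_id"),  # hypothetical column names
            "timestamp": parse_ts(rec.get("timestamp") or ""),
            "throughput_mbps": parse_float(rec.get("throughput_mbps") or ""),
        })
    return rows
```

Keeping bad fields as None rather than dropping the row preserves the record count, which makes it easier to spot how much of a file was malformed.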
Phase 4: Backend API Development (Day 4)
We built a separate FastAPI service to:
- Expose data endpoints for the frontend dashboard
- Proxy user queries to the agent
- Implement Athena query execution
- Provide structured responses (KPIs, cell status, time series, heatmaps)
Architecture Decision: We deliberately separated the agent (agentcore) from the data API (ran_copilot_api) to decouple concerns and allow independent scaling.
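The Athena query execution step in the API can be sketched as a start, poll, fetch loop. This is an assumption about how such a helper might look, not the project's code: the function name run_athena_query is hypothetical, and the client is passed in so that anything behaving like boto3.client("athena") works, including a stub in tests.

```python
import time
from typing import Any


def run_athena_query(client: Any, sql: str, database: str, output_s3: str,
                     poll_seconds: float = 1.0, max_polls: int = 60) -> list[dict]:
    """Run a query and return the first page of rows as dicts keyed by column name.

    `client` is expected to behave like boto3.client("athena").
    """
    qid = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]

    # Athena queries are asynchronous, so poll until a terminal state is reached.
    for _ in range(max_polls):
        status = client.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {qid} ended in state {state}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Athena query {qid} did not finish in time")

    rows = client.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    if not rows:
        return []
    # For SELECT queries the first row holds the column headers.
    header = [col.get("VarCharValue") for col in rows[0]["Data"]]
    return [
        dict(zip(header, (col.get("VarCharValue") for col in row["Data"])))
        for row in rows[1:]
    ]
```

Only the first result page is read here; a fuller version would follow the pagination token for large result sets.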
Phase 5: Integration & Deployment (Days 5-6)
- Containerized both agent and API with Docker
- Published images to AWS ECR
- Deployed to AWS Lambda with proper IAM roles
- Integrated frontend with backend APIs
- End-to-end testing and refinement
Key Technical Achievements
Unified Agent Architecture: Consolidated 13 disparate tools into a single, coherent agent without tool prioritization complexity.
Natural Language Synthesis: Implemented prompt engineering that instructs the agent to interpret raw tool results and synthesize them into user-friendly insights.
Robust Data Ingestion: Built error-resilient CSV parsing and an Athena schema that handle real-world data quality issues.
Scalable Serverless Deployment: Leveraged AWS Lambda and containerization to build a system that scales from zero to enterprise scale automatically.
Geospatial Intelligence: Integrated latitude/longitude data with performance metrics to visualize network health geographically.
AWS Bedrock AgentCore Integration: Leveraged the managed AgentCore Runtime to handle all agent orchestration, eliminating custom state management and tool invocation logic.
AgentCore Gateway Excellence: Utilized the AgentCore Gateway for intelligent model and tool routing, enabling the system to dynamically select optimal execution paths without manual configuration.
Challenges We Ran Into
Challenge 1: Agent Behavior Misalignment
The Problem: Initially, the agent would expose its internal reasoning ("I'm using the find_degraded_clusters tool...") and return raw tool output instead of synthesized answers.
Root Cause: The default Strands Agent behavior with streaming mode enabled was designed for debugging, not production UX.
Solution:
- Disabled streaming mode (BedrockModel(stream=False))
- Rewrote the system prompt to explicitly instruct the agent to synthesize results
- Enhanced tool descriptions with examples and context
Learning: Prompt engineering is as critical as model selection. A well-crafted system prompt can completely transform LLM behavior.
Challenge 2: Data Type Mismatches in Athena
The Problem: Athena queries failed with HIVE_BAD_DATA errors when parsing timestamps and numeric values.
Root Cause: CSV timestamp formats and data type inconsistencies between schema definitions and actual data.
Solution:
- Changed timestamp columns to string type
- Implemented date_trunc() and date_parse() for robust parsing
- Added use.null.for.invalid.data='true' to Athena table properties
Learning: Athena is extremely strict about types and format consistency. Early schema validation is essential.
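To illustrate the string-timestamp approach, here is a sketch of a query builder that parses the timestamp at query time with date_parse() and buckets it with date_trunc(). The table name analytics_ue_metrics comes from the write-up; the column names (event_time, cell_id, throughput_mbps), the timestamp format, and the use of try() to turn unparseable values into NULL are assumptions for illustration.

```python
def hourly_throughput_sql(table: str = "analytics_ue_metrics",
                          ts_column: str = "event_time",
                          ts_format: str = "%Y-%m-%d %H:%i:%s") -> str:
    """Build an Athena query that parses a string timestamp at query time.

    Identifiers are interpolated directly, so they must be trusted constants,
    never user input.
    """
    # date_parse uses MySQL-style format specifiers (%i is minutes).
    # try() yields NULL for rows whose timestamp does not match the format.
    parsed = f"try(date_parse({ts_column}, '{ts_format}'))"
    bucket = f"date_trunc('hour', {parsed})"
    return (
        f"SELECT cell_id, {bucket} AS hour_bucket, "
        f"avg(throughput_mbps) AS avg_throughput_mbps "
        f"FROM {table} "
        f"GROUP BY cell_id, {bucket} "
        f"ORDER BY hour_bucket"
    )


if __name__ == "__main__":
    print(hourly_throughput_sql())
```

Storing the column as a string moves format errors from table-read time to query time, where they can be handled per expression instead of failing the scan.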
Challenge 3: Zero Data Problem
The Problem: After deployment, all dashboard and agent queries returned zero results.
Root Cause: Several issues compounded:
- Tables pointing to wrong S3 paths
- Time filters set too narrowly (looking only at the last hour when all the data was several days old)
- Threshold logic too strict (marking all cells as degraded or all as optimal)
Solution:
- Removed time-based filters for static demo data
- Updated S3 paths in Athena table definitions
- Adjusted threshold logic for realistic cell status distribution
- Generated representative synthetic data for all metrics
Learning: In a hackathon, data quality and availability issues can cascade. Build synthetic data early and validate end-to-end quickly.
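Two of the fixes above lend themselves to a short sketch: anchoring the lookback window to the newest record instead of the wall clock, and checking that status thresholds do not put every cell in the same bucket. The function names and the threshold values (98% and 95% RRC success rate) are illustrative assumptions, not the values used in the project.

```python
from collections import Counter
from datetime import datetime, timedelta


def window_from_data(timestamps: list[datetime], hours: int = 1) -> tuple[datetime, datetime]:
    """Anchor the lookback window to the newest record rather than to 'now'."""
    if not timestamps:
        raise ValueError("no timestamps available")
    end = max(timestamps)
    return end - timedelta(hours=hours), end


def classify_cell(rrc_success_pct: float, degraded_below: float = 98.0,
                  critical_below: float = 95.0) -> str:
    """Map an RRC success rate to a status using assumed thresholds."""
    if rrc_success_pct < critical_below:
        return "critical"
    if rrc_success_pct < degraded_below:
        return "degraded"
    return "optimal"


def status_distribution(rrc_by_cell: dict[str, float]) -> Counter:
    """Count how many cells fall into each status."""
    return Counter(classify_cell(v) for v in rrc_by_cell.values())


def looks_degenerate(dist: Counter) -> bool:
    """True when every cell landed in one status, a hint that thresholds are off."""
    return len(dist) <= 1
```

Running a check like looks_degenerate against the demo dataset right after generating it would surface the all-degraded or all-optimal case before it reaches the dashboard.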
Challenge 4: Lambda Deployment Configuration
The Problem: Lambda could not find the application entry point: "Unable to import module 'main'"
Root Cause: Docker image structure didn't align with Lambda's expected layout:
- Lambda expects code in /var/task
- Handler path must correctly reference the module
Solution:
- Changed Dockerfile base image to public.ecr.aws/lambda/python:3.11
- Updated COPY commands to place code in /var/task
- Specified handler as src.main.handler (Python module path, not file path)
Learning: AWS Lambda has very specific requirements. Using the official Lambda base images eliminates configuration guesswork.
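For reference, the handler string src.main.handler resolves to a function named handler in the module src/main.py under /var/task. A minimal sketch of such a module is shown below; the API Gateway-style event shape and the echo response are assumptions, since the write-up does not describe the real handler body.

```python
# src/main.py  (module path "src.main", function name "handler")
import json


def handler(event, context):
    """Minimal Lambda entry point returning an API Gateway-style proxy response."""
    body = event.get("body") or "{}"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON body"})}
    if not isinstance(payload, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected a JSON object"})}

    # Placeholder logic: a real handler would forward the query to the agent or API.
    query = payload.get("query", "")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"echo": query}),
    }
```

If the file lived at /var/task/main.py instead, the handler string would be main.handler, which is the mismatch behind the "Unable to import module" error.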
Challenge 5: Docker Hub Outages
The Problem: docker buildx build failed with 503 Service Unavailable
Context: During development, Docker Hub experienced an outage, blocking image builds.
Solution:
- Switched temporarily to alternative base images
- Implemented local build caching
- Eventually waited for Docker service recovery
Learning: For critical infrastructure, have failover image sources and caching strategies ready.
Challenge 6: Model Inference Errors
The Problem: Bedrock agent invocation failed: "Invocation of model ID amazon.nova-pro-v1:0 with on-demand throughput isn't supported."
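Root Cause: We were calling the base model ID directly. In our region, Nova Pro cannot be invoked with on-demand throughput under that ID; it has to be called through a cross-region inference profile.
Solution: Switched to the APAC inference profile ID apac.amazon.nova-pro-v1:0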
Learning: AWS regional configuration is crucial. Always validate region-specific endpoints and model IDs.
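The fix can be expressed as a small helper that derives a geography-prefixed ID from the base model ID and the deployment region. This is a simplified sketch: the helper name and the region-to-prefix mapping are assumptions covering only the us, eu, and apac prefixes, and the profiles actually available should be confirmed in the Bedrock console.

```python
# Geography prefixes used in cross-region inference profile IDs.
# Simplified assumption: keyed by the first segment of the region name.
_PREFIX_BY_REGION_GROUP = {"us": "us", "eu": "eu", "ap": "apac"}


def inference_profile_id(base_model_id: str, region: str) -> str:
    """Prefix a base model ID with the geography of the given AWS region."""
    group = region.split("-", 1)[0]
    prefix = _PREFIX_BY_REGION_GROUP.get(group)
    if prefix is None:
        raise ValueError(f"no known inference profile prefix for region {region!r}")
    return f"{prefix}.{base_model_id}"


if __name__ == "__main__":
    print(inference_profile_id("amazon.nova-pro-v1:0", "ap-south-1"))
```

Resolving the ID from configuration rather than hard-coding it keeps the agent portable if the deployment region changes.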
Challenge 7: Streaming vs Non-Streaming Response Handling
The Problem: Agent returned intermediate reasoning steps instead of final answers. Frontend received: "Action: GlobalNetworkManager.find_degraded_clusters()"
Root Cause: Agent was in streaming mode, exposing the thought process.
Solution:
- Configured BedrockModel(stream=False)
- Ensured all tools were explicitly passed to Agent constructor
- Enhanced system prompt to emphasize final answer synthesis
Learning: LLM response streaming is useful for UX (progressive response), but for agents, it breaks the abstraction. The agent should always return complete, synthesized responses.
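The idea of returning only the complete answer can be illustrated independently of any framework. In the sketch below the event dictionaries and their "type" values are invented for illustration and are not the Strands or Bedrock event format; the point is that the caller drains the whole stream and forwards only the final text.

```python
from typing import Iterable, Iterator


def fake_agent_events() -> Iterator[dict]:
    """Stand-in for an agent run that emits intermediate steps before its answer."""
    yield {"type": "tool_call", "text": "Action: find_degraded_clusters()"}
    yield {"type": "tool_result", "text": "[{'cluster': 'north', 'cells': 3}]"}
    yield {"type": "final", "text": "Three cells in the north cluster are degraded."}


def final_answer(events: Iterable[dict]) -> str:
    """Consume every event and return only the final synthesized text."""
    answer_parts = [e["text"] for e in events if e.get("type") == "final"]
    if not answer_parts:
        raise RuntimeError("agent finished without a final answer")
    return "".join(answer_parts)


if __name__ == "__main__":
    print(final_answer(fake_agent_events()))
```

Forwarding each event to the frontend as it arrives is what surfaces lines such as the "Action: ..." string quoted above; collecting first trades perceived latency for a cleaner response.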
Accomplishments We're Proud Of
1. Unified Agent Architecture
We proved that a single, well-designed agent can effectively handle 13 disparate tools without the complexity of traditional multi-agent systems. This is a paradigm shift in AI operations tooling.
2. Production-Ready Prompt Engineering
Our system prompt explicitly instructs the agent to:
- Synthesize raw data into business insights
- Interpret "no results" meaningfully
- Provide proactive next-step suggestions
- Never expose its internal reasoning
This represents best practices in LLM behavior engineering.
3. Scalable Data Architecture
We built a data pipeline that:
- Is designed to scale to 100+ GB of network telemetry (the demo dataset is 10,000+ synthetic records)
- Supports elastic scaling via Athena
- Enables ad-hoc analysis without pre-aggregation
- Gracefully handles data quality issues
4. End-to-End AI Integration
From natural language query to synthesized insight to recommended action—we built a complete loop. Users don't just get data; they get intelligence.
5. Containerized, Serverless Deployment
A system that can scale from zero to millions of requests with zero ops overhead. Both the agent and API are containerized and deployed to Lambda via ECR.
6. Geospatial Intelligence
We integrated network performance data with geographic coordinates, enabling:
- Visual identification of regional hotspots
- Clustering analysis across geographic regions
- Intuitive operator understanding of network topology
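One simple way to turn per-cell coordinates and KPIs into heatmap data is grid binning: average the metric over square latitude/longitude bins. The sketch below is an assumption about how that could be done, not the project's implementation; the function name, the bin size, and the sample coordinates are illustrative.

```python
import math
from collections import defaultdict
from statistics import mean


def heatmap_bins(cells: list[tuple[float, float, float]],
                 bin_deg: float = 1.0) -> dict[tuple[int, int], float]:
    """Average a KPI over square lat/lon bins of `bin_deg` degrees.

    Each input item is (latitude, longitude, kpi_value). Keys of the result
    are integer bin indices (lat_index, lon_index).
    """
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for lat, lon, value in cells:
        key = (math.floor(lat / bin_deg), math.floor(lon / bin_deg))
        buckets[key].append(value)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    sample = [(12.2, 77.3, 90.0), (12.8, 77.9, 100.0), (28.6, 77.2, 80.0)]
    print(heatmap_bins(sample))
```

A frontend map layer can then colour each bin by its average, which highlights regional hotspots without plotting every cell individually.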
7. Rapid Prototyping to Production
In 6 days, we went from concept to a complete, deployed system handling real network scenarios. This demonstrates the power of modern AWS services and Python frameworks.
8. Real User Value
Most importantly: RAN engineers can now ask their network a question in English and get a synthesized, actionable answer. That's transformative.
9. AWS Bedrock AgentCore Runtime: The Game Changer
The Problem We Solved: Building production AI agents traditionally requires managing complex state machines, handling tool invocation asynchronously, managing conversation context, and implementing sophisticated error recovery logic. This complexity often makes agent systems brittle and difficult to deploy at scale.
How AgentCore Runtime Transformed Our Project:
The AWS Bedrock AgentCore Runtime proved to be instrumental to our success:
Managed Orchestration: The runtime handles all agent state management, tool invocation sequencing, and response generation—eliminating hundreds of lines of custom orchestration logic we would have otherwise written.
Guaranteed Consistency: By delegating orchestration to a managed service, we eliminated an entire category of bugs related to state inconsistency, race conditions in tool execution, and context loss between turns.
Serverless Scalability: The runtime automatically scales to handle millions of concurrent agent conversations without any infrastructure management on our part. We deploy code; AWS handles the rest.
Built-in Resilience: Automatic retries, timeout handling, and error recovery are built into the runtime. Failed tool calls don't crash the agent—they're gracefully handled and reported.
Non-Streaming Excellence: The AgentCore Runtime's support for non-streaming mode (returning complete, synthesized responses rather than incremental token streams) was critical for our UX goals. Users see intelligent answers, not raw LLM thinking.
Tool Integration Simplicity: Registering tools with the runtime is straightforward. We simply decorated our Python functions with @tool() and the runtime handled all serialization, invocation, and result passing.
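To show what decorator-based registration does conceptually, here is a framework-free sketch. It is not the Strands or AgentCore implementation: the registry, the tool() factory, and invoke() are hypothetical stand-ins, and find_degraded_clusters returns stub data rather than querying the network.

```python
import inspect
from typing import Callable

# Hypothetical in-process registry; a real framework manages this internally.
TOOL_REGISTRY: dict[str, dict] = {}


def tool() -> Callable[[Callable], Callable]:
    """Decorator factory that records a function's name, docstring, and parameters."""
    def register(fn: Callable) -> Callable:
        TOOL_REGISTRY[fn.__name__] = {
            "fn": fn,
            "description": inspect.getdoc(fn) or "",
            "parameters": list(inspect.signature(fn).parameters),
        }
        return fn
    return register


@tool()
def find_degraded_clusters(min_cells: int = 2) -> list[str]:
    """Return clusters with at least `min_cells` degraded cells (stub data)."""
    stub = {"north": 3, "south": 1}
    return [name for name, count in stub.items() if count >= min_cells]


def invoke(name: str, **kwargs):
    """Look up a registered tool by name and call it with keyword arguments."""
    return TOOL_REGISTRY[name]["fn"](**kwargs)
```

The recorded docstring and parameter list are the kind of metadata a model needs to decide when and how to call a tool, which is why detailed tool descriptions matter.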
Praise for AgentCore Runtime:
"The AgentCore Runtime eliminated 500+ lines of state management code and gave us the confidence to deploy production agents in just 6 days. It's a masterclass in how cloud services should abstract complexity."
10. AWS Bedrock AgentCore Gateway: Intelligent Routing at Scale
The Problem It Solves: In multi-tenant, multi-model environments, users need seamless routing to the right model for the right workload. Managing this manually creates operational complexity, vendor lock-in risks, and performance bottlenecks.
How AgentCore Gateway Empowered Our System:
The AWS Bedrock AgentCore Gateway provided:
Dynamic Model Routing: The gateway intelligently routes requests to the optimal model based on workload characteristics, availability, and cost. We configured it to use Amazon Nova Pro for our primary agent, with automatic fallback to alternative models if needed.
Centralized Tool Management: Rather than embedding tool definitions in each agent or service, the gateway serves as a central repository for all available tools. This enables tool reuse and consistency across the entire system.
Unified Inference Endpoint: The gateway provides a single, stable endpoint for all agent requests, regardless of underlying model changes. We can upgrade models, add new ones, or redistribute load without changing client code.
Built-in Load Balancing: The gateway automatically balances load across multiple model instances and on-demand throughput resources, ensuring consistent performance even during traffic spikes.
Transparency & Observability: All tool invocations, model routing decisions, and latency metrics flow through the gateway, giving us unprecedented visibility into agent behavior.
Multi-Region Readiness: The gateway's architecture supports multi-region deployment, enabling us to serve global users with low latency. The gateway can be deployed in ap-south-1 (our primary region) with seamless failover capabilities.
Seamless Integration with Lambda: The gateway works flawlessly with AWS Lambda, enabling our agents to scale from zero to thousands of concurrent executions without any operational overhead.
Praise for AgentCore Gateway:
"The AgentCore Gateway transformed tool management from a nightmare of configuration and debugging into a seamless, self-managing system. It's the infrastructure layer every AI-native company needed but didn't know existed."
11. The AgentCore Synergy: Runtime + Gateway
What truly impressed us was how AgentCore Runtime and AgentCore Gateway work together as a unified whole:
Separation of Concerns: The Runtime handles agent logic and orchestration; the Gateway handles model/tool routing and resource management.
Zero Configuration: We didn't have to manually configure routing tables, load balancing algorithms, or failover strategies. The system "just works."
Operational Simplicity: Instead of managing distributed agent systems with multiple deployment units, we deploy a single agent container. AgentCore handles everything else.
Cost Optimization: The gateway's intelligent routing ensures we only pay for the compute we actually use. Efficient models get used for simple queries; powerful models for complex reasoning.
Future-Proof: As new models emerge (like more powerful versions of Nova Pro or alternative models), the gateway enables us to adopt them without changing our agent code.
The Real Impact: In traditional agent architectures, we would have spent 30-40% of development time building infrastructure: state machines, tool routers, load balancers, etc. AgentCore Runtime + Gateway compressed this to 0%. We spent 100% of our time on domain logic—building better tools, improving prompts, optimizing data queries.
This is why we built a production system in 6 days instead of 6 weeks.
What We Learned
Technical Learnings
Agent Design > LLM Choice: A well-designed agent with clear directives outperforms a raw LLM for specialized domains. The system prompt and tool descriptions are as important as model selection.
Non-Streaming is Essential for Agents: Streaming works for chatbots but breaks agent abstraction. Always use non-streaming mode for multi-tool orchestration.
Data Quality Cascades: In data systems, quality issues compound. A single schema mismatch can cascade into zero results across the entire pipeline. Validate early and often.
Athena is Strict but Powerful: AWS Athena enforces schema rigor that seems annoying until a data pipeline scales. The strict typing prevents silent failures.
Lambda Requires Specific Structure: Lambda's execution environment has specific expectations. Using official base images eliminates 80% of deployment issues.
Prompt Engineering is Empirical: Good prompts are discovered through iteration, not guessed. A/B testing different phrasings is essential.
AWS AgentCore: A Paradigm Shift: The combination of AgentCore Runtime and Gateway represents a fundamental shift in how AI agents should be deployed. Traditional hand-rolled agent infrastructure is now obsolete. The managed AgentCore services enabled us to focus entirely on domain logic rather than plumbing. This is the future of AI-native applications.
Tool Routing Intelligence: The AgentCore Gateway's automatic model and tool routing is so sophisticated that we initially doubted it would work correctly. It does—flawlessly. The system knows when to use Nova Pro, when to batch requests, and when to optimize for latency vs. cost. This level of automation was previously impossible.
Business & Product Learnings
Engineers Want Conversations, Not Dashboards: While dashboards show data, engineers crave insights. A conversational interface feels more natural than drilling through charts.
Synthesis > Raw Data: The value isn't in exposing more data; it's in intelligently reducing data to actionable insights.
Proactivity is Differentiating: Telling an engineer about a problem is good. Suggesting the next action is great. Predicting problems before they occur is transformative.
Context Matters: A well-crafted system prompt can make or break UX. Clarity about the agent's role and limitations builds trust.
Serverless Enables Rapid Iteration: Containerization and Lambda deployment meant we could iterate the entire system without infrastructure concerns.
Managed Services = Speed: The availability of AgentCore Runtime and Gateway meant we didn't have to build agent infrastructure. This compressed our timeline by 50%. Startups and enterprises should always prefer managed services over building core infrastructure.
Team & Process Learnings
Fail Fast on Assumptions: We tested agent behavior, data assumptions, and deployment models early. Quick feedback loops accelerated development.
Architecture Matters: We made deliberate choices about unified vs. distributed agents, streaming vs. non-streaming, and centralized vs. sharded data. These decisions compounded into system reliability.
Documentation is Development: Writing API docs forced us to think about the system's interface and revealed design gaps early.
What's Next for RAN Co-pilot
Short-term (0-3 months)
Multi-turn Conversations: Implement session management to allow engineers to ask follow-up questions without context loss.
Real-Time Data Integration: Connect to live 5G network APIs (3GPP standardized interfaces) for real-time metrics instead of static CSV.
Alert Escalation: Integrate with alerting systems so the agent can automatically notify engineers of critical issues.
Action Execution: Move from recommendations to actual execution—the agent could deploy approved configuration changes directly.
Team Collaboration: Add features for engineers to share findings and build on each other's investigations.
Medium-term (3-12 months)
Multi-Network Support: Extend to support multiple operators and vendors (not just synthetic data).
Advanced Anomaly Detection: Integrate unsupervised learning models to detect novel failure modes the system has never seen.
Predictive Maintenance: Build time-series forecasting models to predict equipment failures weeks in advance.
Network Optimization Engine: Use reinforcement learning to recommend optimal configuration parameters across the entire network.
SLA Compliance: Automatically generate reports showing compliance with Service Level Agreements.
Long-term Vision (12+ months)
Autonomous Network Management: Transition from recommendation to autonomous management—the system proactively maintains network health without human intervention.
Cross-Operator Intelligence: Build a federated learning system that learns across multiple operators while preserving data privacy.
6G Readiness: Prepare for next-generation networks by building abstractions that work across 4G, 5G, and 6G standards.
Industry Standard Integration: Become the open standard for AI-driven RAN operations (contribute to O-RAN, 3GPP standardization).
Global Scale: Deploy to thousands of network operators worldwide, making intelligent network operations a commodity capability.
Open Research Questions
How do we explain agent recommendations to non-technical stakeholders? (Explainability & transparency)
How do we prevent the agent from making harmful recommendations in edge cases? (Safety & alignment)
How do we scale real-time agent reasoning across millions of cells? (Performance & latency)
How do we train agents on operator-specific best practices? (Transfer learning & customization)
What's the optimal agent topology for different network sizes? (Single agent vs. hierarchical agents)
Conclusion
RAN Co-pilot represents a fundamental shift in how telecommunications networks are operated. Instead of waiting for alerts and drilling through dashboards, engineers can now have a conversation with their network—asking questions in natural language and receiving synthesized, actionable intelligence.
The system demonstrates that modern AI (LLMs + agents) + robust data architecture (Athena + S3) + good UX (natural language + synthesis) = transformative tools for specialized domains.
The AWS Bedrock AgentCore Advantage
What made this 6-day sprint possible was AWS Bedrock's AgentCore platform—specifically the AgentCore Runtime and AgentCore Gateway. These services eliminated the infrastructure boilerplate that typically dominates agent projects:
- No state machines to build. AgentCore Runtime handles it.
- No tool routers to configure. AgentCore Gateway handles it.
- No load balancing logic to implement. AgentCore Gateway handles it.
- No model management complexity. AgentCore Gateway handles it.
Instead, we spent 100% of our time on what matters: domain expertise, prompt engineering, tool design, and user experience.
This is the promise of the cloud: infrastructure that gets out of the way so builders can focus on creating value.
We're excited about the impact this will have on network operations—reducing Mean Time To Recovery (MTTR), improving network reliability, and ultimately delivering better customer experience.
The journey from concept to deployed product in 6 days proved that with the right tools (AWS Bedrock AgentCore), team, and vision, building next-generation intelligence systems is achievable.
The future of network operations is conversational, proactive, and intelligent. RAN Co-pilot—powered by AWS Bedrock AgentCore—is here.
Team Credits
Built with passion by the GoML team.
Technologies: AWS Bedrock AgentCore Runtime, AWS Bedrock AgentCore Gateway, Strands Agent Framework, Amazon Nova Pro, Amazon Athena, Python, FastAPI, Docker, AWS Lambda, Amazon ECR
Data: 10,000+ synthetic records representing real 5G network scenarios across India
Vision: Making intelligent network operations accessible to every telecommunications engineer on the planet—powered by managed AI infrastructure that scales effortlessly.
Built With
- agentcore
- amazon-web-services
- bedrock
- fastapi
- python
- s3
- strands