Inspiration

Deaf and hard-of-hearing users struggle to access remote meetings on Zoom, Teams, and other platforms. Professional ASL interpreters are expensive, introduce privacy concerns, and are often unavailable at short notice.

We wanted to build an accessible, affordable real-time translation system powered by a scalable cloud-native architecture. Our goal was to use AWS to orchestrate real-time AI pipelines, paired with ultra-low-latency edge inference, to make remote communication truly inclusive.


What it does

NIMBUS captures American Sign Language (ASL) through a webcam and translates it into fluent, natural language in real time.

It delivers:

  • Live captions overlaid directly on the video feed
  • Natural speech synthesis using emotion-aware text-to-speech
  • Emotion detection to dynamically adjust tone, pitch, and pacing
  • Multi-participant sessions with WebRTC-based routing
  • Speaker and gallery views for flexible collaboration
  • Full transcript history stored and queryable in real time
  • Global language output (English, Spanish, French, Japanese, etc.)

The system uses an edge + cloud architecture: fast inference in the browser combined with a fully managed AWS backend for real-time processing, translation, and delivery.


How we built it

Frontend (React + Vite)

  • Real-Time Communication: Persistent WebSocket connections to Amazon API Gateway
  • Media Routing: WebRTC peer connections with Mediasoup SFU
  • On-Device ML: MediaPipe extracts 55 keypoints per frame, processed by an ONNX model in a Web Worker (~15ms latency)
  • Dynamic UI: Real-time captions, speaker switching, and gallery layouts
  • Authentication: Secure login via Amazon Cognito OAuth
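
Before the ONNX model runs, the 55 MediaPipe keypoints per frame have to be flattened and normalized into a feature vector. The layout and normalization below are illustrative assumptions (origin-relative coordinates scaled to [-1, 1]), not NIMBUS's exact preprocessing:

```python
def keypoints_to_features(keypoints):
    """Flatten 55 (x, y) keypoints into a 110-float vector for the ONNX model.

    Assumed scheme: coordinates are made relative to the first keypoint,
    then scaled by the largest extent so every value lands in [-1, 1].
    """
    if len(keypoints) != 55:
        raise ValueError("expected 55 keypoints per frame")
    ox, oy = keypoints[0]
    rel = [(x - ox, y - oy) for x, y in keypoints]
    scale = max(max(abs(x), abs(y)) for x, y in rel) or 1.0
    feats = []
    for x, y in rel:
        feats.extend((x / scale, y / scale))
    return feats
```

In the browser this runs inside the Web Worker alongside onnx-runtime-web; the Python version here just makes the transform concrete.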

Backend (AWS Serverless Architecture — Core Focus)

Our backend is a fully event-driven, serverless system on AWS, designed for low latency and massive scalability.

  • Amazon API Gateway (WebSockets):
    Maintains persistent, bidirectional connections for streaming ASL gloss tokens and system events in real time

  • AWS Lambda (Microservices Architecture):
    9+ Lambda functions orchestrate the pipeline:

    • process_gloss_stream → buffers incoming tokens
    • nlp_transform → sends structured prompts to Bedrock
    • emotion_pipeline → processes Rekognition outputs
    • tts_dispatch → generates and distributes audio
    • Additional Lambdas handle signaling, retries, session lifecycle, and cleanup

Each function scales independently and executes in sub-100ms windows.
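
A minimal sketch of what process_gloss_stream might look like. The WebSocket message shape and buffer schema are assumptions for illustration, and an in-memory dict stands in for the DynamoDB gloss buffer:

```python
import json
import time

# In-memory stand-in for the DynamoDB gloss buffer (hypothetical schema).
_buffers = {}

def process_gloss_stream(event, context=None):
    """Buffer one incoming ASL gloss token for a session.

    Assumed event shape: an API Gateway WebSocket message whose body is
    JSON like {"session": "...", "gloss": "..."}.
    """
    body = json.loads(event["body"])
    session = body["session"]
    buf = _buffers.setdefault(session, {"tokens": [], "updated": 0.0})
    buf["tokens"].append(body["gloss"])
    buf["updated"] = time.time()
    return {"statusCode": 200,
            "body": json.dumps({"buffered": len(buf["tokens"])})}
```

In the real pipeline the buffer write would be an atomic DynamoDB update so concurrent Lambda invocations don't clobber each other.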

  • Amazon Bedrock (Claude):
    Converts ASL gloss (topic-comment structure) into fluent, grammatically correct language using context-aware prompting
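
The gloss-to-fluent step boils down to building a context-aware request for Claude. A sketch of the request body, using the Bedrock messages format; the prompt wording and three-sentence context window are illustrative, not NIMBUS's exact prompt:

```python
import json

def build_gloss_prompt(gloss_tokens, context_sentences):
    """Build a Bedrock InvokeModel body asking Claude to render ASL gloss
    as fluent English. Prompt text and context size are assumptions."""
    prompt = (
        "Convert this ASL gloss (topic-comment order) into one fluent, "
        "grammatically correct English sentence. "
        "Recent context: " + " ".join(context_sentences[-3:]) + "\n"
        "Gloss: " + " ".join(gloss_tokens)
    )
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "messages": [{"role": "user", "content": prompt}],
    })
```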

  • Amazon Translate:
    Enables real-time multilingual output

  • Amazon Rekognition:
    Detects facial emotion from sampled frames
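
Rekognition's DetectFaces response carries a ranked Emotions list per face; the emotion_pipeline Lambda only needs the dominant one. A sketch of that reduction (the confidence threshold is an illustrative choice):

```python
def dominant_emotion(face_details, threshold=50.0):
    """Pick the highest-confidence emotion from DetectFaces FaceDetails.

    Falls back to NEUTRAL when no emotion clears the threshold, so a
    low-confidence frame never distorts the speech output.
    """
    best = ("NEUTRAL", 0.0)
    for face in face_details:
        for emo in face.get("Emotions", []):
            if emo["Confidence"] > best[1]:
                best = (emo["Type"], emo["Confidence"])
    return best[0] if best[1] >= threshold else "NEUTRAL"
```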

  • Amazon Polly:
    Generates expressive speech using SSML <prosody> tags
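
Wrapping the translated text in <prosody> tags is a small mapping from detected emotion to rate/pitch/volume. The table below is an illustrative mapping, not NIMBUS's tuned values:

```python
# Illustrative emotion-to-prosody values; the production mapping may differ.
PROSODY = {
    "HAPPY": {"rate": "110%", "pitch": "+10%"},
    "SAD":   {"rate": "90%",  "pitch": "-10%"},
    "ANGRY": {"rate": "105%", "pitch": "+5%", "volume": "loud"},
}

def to_ssml(text, emotion):
    """Wrap text in SSML <prosody> tags keyed on the detected emotion.
    Unknown or neutral emotions fall back to plain speech."""
    attrs = PROSODY.get(emotion)
    if not attrs:
        return f"<speak>{text}</speak>"
    attr_str = " ".join(f'{k}="{v}"' for k, v in attrs.items())
    return f"<speak><prosody {attr_str}>{text}</prosody></speak>"
```

The resulting SSML string is what tts_dispatch would hand to Polly's SynthesizeSpeech call.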


Infrastructure (AWS Backbone)

  • Amazon DynamoDB:
    Stores session state, gloss buffers, and transcript history with TTL-based auto-cleanup
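
TTL-based auto-cleanup just means every session item carries an epoch-seconds expiry attribute that DynamoDB is configured to reap. A sketch of the item shape (key names and the 24-hour window are assumptions):

```python
import time

def session_item(session_id, transcript, ttl_hours=24):
    """Build a hypothetical DynamoDB session item with a TTL attribute.

    The `ttl` attribute must be registered as the table's TTL field;
    DynamoDB then deletes the item automatically after it expires.
    """
    return {
        "pk": f"SESSION#{session_id}",
        "transcript": transcript,
        "ttl": int(time.time()) + ttl_hours * 3600,
    }
```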

  • Amazon S3:
    Temporary storage for TTS audio with presigned URLs for efficient delivery

  • Amazon EC2 (Mediasoup SFU):
    Dedicated media routing layer, decoupled from signaling

  • AWS CloudFormation (SAM):
    Full Infrastructure-as-Code enabling reproducible deployments


Data Flow

Webcam → MediaPipe (keypoints) → ONNX (edge inference) → API Gateway → Lambda → DynamoDB (buffer) → Bedrock (NLP) → Translate → Rekognition (emotion) → Polly (TTS) → S3 → Broadcast to clients


Accomplishments that we're proud of

  • Real-Time Serverless AI Pipeline
    Achieved end-to-end latency under 1.5 seconds using AWS services

  • Edge + Cloud Hybrid Optimization
    Reduced inference latency from ~800ms (cloud) to ~15ms (edge ONNX)

  • Advanced Sentence Boundary Detection
    Dynamic triggers (token count, elapsed time, idle detection, [EOS]) produce natural sentence flow
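
The four triggers above can be sketched as one predicate; the thresholds here (12 tokens, 4 s elapsed, 1.2 s idle) are illustrative, not the tuned values:

```python
import time

def should_flush(tokens, started, last_token, now=None,
                 max_tokens=12, max_elapsed=4.0, idle=1.2):
    """Decide whether the buffered gloss tokens form a sentence.

    Flush on an explicit [EOS] marker, on token count, on total elapsed
    time since the sentence began, or on an idle gap after the last token.
    """
    now = time.time() if now is None else now
    if tokens and tokens[-1] == "[EOS]":
        return True
    if len(tokens) >= max_tokens:
        return True
    if tokens and now - started >= max_elapsed:
        return True
    if tokens and now - last_token >= idle:
        return True
    return False
```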

  • Fault-Tolerant Distributed System
    Every service has a graceful fallback:

    • Bedrock fails → raw gloss displayed
    • Polly fails → captions still shown
    • No silent failures

  • Emotion-Aware Speech Pipeline
    Rekognition + Polly + SSML produces expressive, human-like output

  • Scalable Multi-User Architecture
    WebRTC + EC2 SFU + WebSocket signaling enables real-time collaboration

  • Fully Serverless + Cost Efficient
    Pay-per-use infrastructure (~$1–2/day per active room) with automatic scaling


What we learned

  • AWS Enables Rapid System Design at Scale
    Combining API Gateway, Lambda, DynamoDB, Bedrock, Rekognition, and Polly allows complex real-time systems to be built quickly

  • Event-Driven Architectures Are Powerful
    Decoupling each stage of the pipeline improves scalability and reliability

  • Latency is Critical
    Even small delays compound—forcing optimization of cold starts, payload size, and execution paths

  • ASL Requires Contextual Intelligence
    Gloss tokens alone are insufficient; LLMs are necessary for fluency

  • Serverless State Management is Challenging
    DynamoDB requires careful handling of atomic updates and TTL cleanup

  • Emotion Improves UX
    Expressive speech significantly increases realism and engagement


What's next for NIMBUS

  • Deep AWS Optimization (SageMaker Integration)
    Deploy a full ASL transformer model on SageMaker endpoints

  • Zoom & Teams Integration
    Inject captions directly into native CC pipelines and route audio seamlessly

  • Vocabulary Scaling
    Expand from 100 → 2,000+ ASL signs (WLASL dataset)

  • Multilingual Sign Language Support
    Extend to BSL, LSF, and other global sign languages

  • Speech-to-ASL Translation
    Build a reverse pipeline using 3D avatars for full bidirectional communication

Built With

  • amazon-web-services
  • aws-api-gateway-v2
  • aws-bedrock
  • aws-cloudformation
  • aws-cloudwatch
  • aws-cognito
  • aws-dynamodb
  • aws-ec2
  • aws-lambda
  • aws-lambda-powertools
  • aws-polly
  • aws-rekognition
  • aws-sam
  • boto3
  • docker
  • github-actions
  • mediapipe
  • mediasoup
  • node.js
  • onnx-runtime-web
  • opencv
  • pydantic
  • pyjwt
  • python
  • react-19
  • react-router
  • sagemaker
  • stun/turn
  • tailwind-css
  • tgcn
  • typescript
  • vite
  • web-speech-api
  • webrtc-api
  • websocket-api
  • websockets
  • wlasl-2000