Inspiration
As international students and aspiring IT professionals, we've witnessed a recurring pain point in our local community: the "Bamboo Ceiling." Brilliant engineers and talented individuals often struggle to communicate their ideas confidently due to pronunciation insecurities and accent barriers. Traditional language apps are too generic, and human tutors are expensive and intimidating.
We wanted to build a zero-judgment, highly personalized, and accessible safe space. Our inspiration was to leverage the absolute bleeding-edge of AI across multiple cloud ecosystems to create a "Pronunciation Co-Pilot"—a tool that doesn't just score you, but understands your specific phonetic weaknesses and dynamically creates tailored exercises to help you break through communication barriers in the workplace.
What it does
Hajimi is a hyper-personalized AI English pronunciation coach. Users can read provided texts, tackle tongue twisters, or practice free-form speech on specific topics (e.g., "Job Interview").
Instead of just giving a generic "good job," Hajimi breaks the audio down into granular, phoneme-level scores. It identifies exactly which phonemes the user struggles with (e.g., confusing /θ/ and /s/). It then tracks this history and uses a Large Language Model to dynamically generate custom tongue twisters targeting those exact weak spots. Finally, it uses Neural TTS to read the sentence back, giving the user a clean audio reference to mimic and improve against.
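To make that loop concrete, here is a minimal sketch of the score-track-generate cycle. It is illustrative only: the helper names, the 60-point threshold, and the prompt wording are our assumptions, not Hajimi's production code.

```python
from collections import Counter

def update_weakness_profile(profile: Counter, phoneme_scores: dict,
                            threshold: float = 60.0) -> Counter:
    """Count every phoneme whose accuracy score fell below the threshold."""
    for phoneme, score in phoneme_scores.items():
        if score < threshold:
            profile[phoneme] += 1
    return profile

def build_twister_prompt(profile: Counter, top_n: int = 3) -> str:
    """Turn the user's most frequent weak phonemes into an LLM prompt."""
    weakest = [p for p, _ in profile.most_common(top_n)]
    return ("Write one short English tongue twister that repeatedly "
            f"exercises these phonemes: {', '.join(weakest)}.")

# One assessed utterance: /θ/ and /r/ are weak, /s/ is fine.
profile = update_weakness_profile(Counter(), {"θ": 42.0, "s": 88.0, "r": 55.0})
print(build_twister_prompt(profile))
```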
How we built it
We architected a Production-Ready, Multi-Cloud Serverless application, adopting a "Best-of-Breed" cloud strategy rather than locking into a single vendor:
Frontend (Edge Accelerated): Built with React.js and RecordRTC, globally distributed via Amazon CloudFront and hosted on an Amazon S3 static website for sub-second, secure (HTTPS) edge delivery.
Backend (AWS Serverless Core): An Amazon API Gateway routes RESTful requests to 5 highly decoupled AWS Lambda microservices (written in Python). This ensures zero server maintenance, high scalability, and strict rate-limiting for cost protection.
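For flavor, here is a minimal sketch of what one of these handlers can look like under API Gateway's Lambda proxy integration; the field names and the permissive CORS header are illustrative assumptions:

```python
import json

def lambda_handler(event, context):
    """Entry point invoked by API Gateway (Lambda proxy integration)."""
    body = json.loads(event.get("body") or "{}")
    user_id = body.get("userId")
    if not user_id:
        return _response(400, {"error": "userId is required"})
    # ... invoke the relevant cognitive engine here ...
    return _response(200, {"userId": user_id, "status": "ok"})

def _response(status: int, payload: dict) -> dict:
    """JSON response with CORS headers for the CloudFront-hosted frontend."""
    return {
        "statusCode": status,
        "headers": {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": "*",  # tighten to the CDN domain in production
        },
        "body": json.dumps(payload),
    }
```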
Data & Storage: We utilized Amazon DynamoDB for sub-millisecond NoSQL tracking of users' historical phonetic weaknesses. Audio streams are securely handled via S3 Presigned URLs, allowing the frontend to upload directly to storage (Zero-Compute Uploads), drastically reducing Lambda execution time and latency.
Multi-Cloud Cognitive Engines (The Brains), sketched in code after this list:
Microsoft Azure Speech Service: Handles the heavy lifting of granular, phoneme-level pronunciation assessment.
Google Gemini Pro (LLM): Acts as the logical brain, analyzing DynamoDB historical weaknesses to generate dynamic, context-aware coaching content.
ElevenLabs API: Delivers ultra-fast, high-fidelity voice cloning for the AI tutor's audio feedback.
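Here is the sketch promised above: the three engines called from one Python module. The Azure region, Gemini model name, voice ID, and environment-variable names are illustrative assumptions:

```python
import os
import requests
import azure.cognitiveservices.speech as speechsdk
import google.generativeai as genai

def assess_pronunciation(wav_path: str, reference_text: str) -> str:
    """Azure Speech: phoneme-granularity pronunciation assessment.
    Returns the raw nested JSON with per-phoneme accuracy scores."""
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["AZURE_SPEECH_KEY"], region="eastus")  # region assumed
    pron_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=speechsdk.audio.AudioConfig(filename=wav_path))
    pron_config.apply_to(recognizer)
    result = recognizer.recognize_once()
    return result.properties.get_property(
        speechsdk.PropertyId.SpeechServiceResponse_JsonResult)

def generate_twister(weak_phonemes: list) -> str:
    """Gemini: turn tracked weaknesses into a custom tongue twister."""
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-pro")
    return model.generate_content(
        "Write one short tongue twister drilling these phonemes: "
        + ", ".join(weak_phonemes)).text

def speak(text: str) -> bytes:
    """ElevenLabs: render reference audio for the user to mimic."""
    resp = requests.post(
        "https://api.elevenlabs.io/v1/text-to-speech/YOUR_VOICE_ID",  # placeholder voice
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        json={"text": text})
    resp.raise_for_status()
    return resp.content  # mp3 bytes
```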
Challenges we ran into
Orchestrating three major cloud providers (AWS, Azure, Google) synchronously within a serverless environment was a massive challenge. Initially, pushing heavy audio streams through API Gateway hit its payload limits and caused timeout errors (504s).
We solved this by implementing an asynchronous S3 Presigned URL architecture, bypassing the API Gateway bottleneck for file uploads. Additionally, securing our public-facing CDN required a deep dive into strict CORS policies and IAM roles to protect our cloud architecture from unauthorized API abuse. Finally, mapping Azure's deeply nested phonetic JSON arrays into actionable, structured prompts for Gemini required complex data wrangling.
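As an example of that wrangling, here is a sketch that walks the NBest → Words → Phonemes hierarchy Azure returns and collapses low-scoring phonemes into a compact prompt; the 60-point threshold and prompt wording are our own assumptions:

```python
import json

def summarize_weak_phonemes(azure_json: str, threshold: float = 60.0) -> list:
    """Flatten Azure's nested assessment JSON into 'phoneme (word, score)' strings."""
    weak = []
    result = json.loads(azure_json)
    for word in result["NBest"][0]["Words"]:
        for ph in word.get("Phonemes", []):
            score = ph["PronunciationAssessment"]["AccuracyScore"]
            if score < threshold:
                weak.append(f'{ph["Phoneme"]} (in "{word["Word"]}", {score:.0f}/100)')
    return weak

def to_gemini_prompt(weak: list) -> str:
    """Structured, actionable prompt instead of a raw JSON dump."""
    return ("The learner mispronounced these phonemes:\n- "
            + "\n- ".join(weak)
            + "\nWrite one tongue twister that drills exactly these sounds.")
```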
Accomplishments that we're proud of
We are incredibly proud of shipping an enterprise-grade, Multi-Cloud Serverless architecture within a 24-hour hackathon timeframe. We didn't just build a frontend that calls a single API; we built a highly decoupled, scalable backend that safely choreographs AWS, Azure, and Google Cloud.
Beyond the tech, we are proud to have built a tool that carries real social impact—providing a tangible solution to help non-native speakers in our local community gain confidence and thrive in English-speaking professional environments.
What we learned
This project was a masterclass in Cloud Architecture and AI integration. We learned the immense value of the "Multi-Cloud" approach—leveraging Azure for acoustic precision, Gemini for logical reasoning, and AWS for unbreakable infrastructure. We also leveled up our skills in serverless security, state management via DynamoDB, and prompt engineering for educational contexts.
What's next for hajimi
Our vision for Hajimi is to evolve it into a real-time, interruptible conversational agent. We plan to:
Integrate WebSockets for real-time, ultra-low-latency mock interview simulations.
Implement a gamified progression system (streaks, badges) based on DynamoDB historical data to keep users motivated.
Develop browser extensions to provide real-time pronunciation feedback during live Zoom/Teams meetings.
Built With
- amazon-dynamodb
- amazon-web-services
- azure
- cpp
- elevenlabs
- gemini
- lambda
- python
- react
- ts