Fluent helps people improve their public speaking and clean up their audio recordings by using AI and machine learning to remove filler sounds and words.
What it does
Fluent is a platform for creating refined speech recordings with AI. It also provides speech-quality and video analytics, backed by natural language processing, to help users improve their public speaking.
On the upload page, the user uploads an audio file in MP3 or WAV format. Our Python backend processes it with the Google Cloud Speech-to-Text API and outputs an automatically edited clip with filler words like 'uhhh' removed. The page also gives speech insights on the clip, such as pace, eloquence, word choice, pronunciation, and intonation, along with an overall score. Lastly, it gives NLP insights that detect active vs. passive voice.
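As a rough illustration of the filler-removal step, the sketch below turns a word-level transcript into the list of time ranges to keep. It assumes the transcript is already available as `(text, start_sec, end_sec)` tuples and that fillers appear in it as words. The `FILLERS` set and the `keep_segments` name are hypothetical, not taken from the Fluent codebase.

```python
# Hypothetical filler list; the words Fluent actually filters are not documented here.
FILLERS = {"uh", "uhh", "uhhh", "um", "umm", "er", "ah"}


def keep_segments(words, duration, fillers=FILLERS):
    """Return (start, end) ranges of the clip that contain no filler words.

    words: list of (text, start_sec, end_sec) tuples sorted by start time.
    duration: total clip length in seconds.
    """
    segments = []
    cursor = 0.0  # start of the range currently being kept
    for text, start, end in words:
        if text.lower().strip(".,!?") in fillers:
            # Close the kept range just before the filler begins.
            if start > cursor:
                segments.append((cursor, start))
            # Resume keeping audio once the filler has ended.
            cursor = max(cursor, end)
    if duration > cursor:
        segments.append((cursor, duration))
    return segments
```

For example, a clip transcribed as "So uhhh welcome um everyone" would yield three kept ranges, one on each side of the two fillers.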
On the realtime analytics page, the user can record audio in real time. The recording goes through the same Python backend and Google Cloud Speech-to-Text API, which output an automatically edited clip with filler words like 'uhhh' removed. The page also gives speech insights on the clip, such as pace, eloquence, word choice, pronunciation, and intonation, along with an overall score, plus NLP insights that detect active vs. passive voice. Users can also get realtime video analysis of their body posture and hand gestures via PoseNet and TensorFlow.
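The pace insight could be computed in several ways. One simple option, shown here as a sketch rather than Fluent's actual formula, is words per minute over the span from the first spoken word to the last. The `(text, start_sec, end_sec)` input shape and the `words_per_minute` name are assumptions made for illustration.

```python
def words_per_minute(words):
    """Estimate speaking pace from word-level timestamps.

    words: list of (text, start_sec, end_sec) tuples sorted by start time.
    Returns 0.0 when there is nothing to measure.
    """
    if not words:
        return 0.0
    # Measure from the start of the first word to the end of the last one.
    spoken_seconds = words[-1][2] - words[0][1]
    if spoken_seconds <= 0:
        return 0.0
    return len(words) * 60.0 / spoken_seconds
```

Measuring from the first word to the last keeps leading and trailing silence from lowering the result. Pauses between words still count toward the total time.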
Lastly, a statistics page uses Chart.js to show interactive graphs and visualizations of all the collected metrics, so users can gauge their progress over time.
How we built it
- Google Cloud Speech-to-Text API for transcribing speech and finding important keywords
- FFmpeg for removing the filler sounds based on the Google Cloud data
- Amazon EC2 for backend server hosting and functions/endpoints
- Amazon S3 for hosting the React web app
- Python + Flask for backend functions
- React.js for the frontend
- TensorFlow.js + PoseNet for live camera integration and video analysis
- Google Cloud Functions (serverless) for the initial login/register endpoints
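To illustrate how FFmpeg can cut filler sounds out of a recording, here is a minimal sketch that builds an FFmpeg command from a list of `(start, end)` ranges to keep. It uses FFmpeg's `atrim`, `asetpts`, and `concat` filters. The function name and file names are hypothetical, and the exact command Fluent runs may differ.

```python
def build_ffmpeg_command(src, dst, segments):
    """Build an ffmpeg argument list that keeps only the given audio ranges.

    segments: non-empty list of (start_sec, end_sec) tuples to keep, in order.
    """
    if not segments:
        raise ValueError("at least one segment is required")
    parts = []
    labels = []
    for i, (start, end) in enumerate(segments):
        label = f"[a{i}]"
        # Trim one range and reset its timestamps so the pieces join cleanly.
        parts.append(
            f"[0:a]atrim=start={start}:end={end},asetpts=PTS-STARTPTS{label}"
        )
        labels.append(label)
    # Concatenate all trimmed pieces into a single audio stream.
    parts.append(f"{''.join(labels)}concat=n={len(segments)}:v=0:a=1[out]")
    return [
        "ffmpeg", "-y", "-i", src,
        "-filter_complex", ";".join(parts),
        "-map", "[out]", dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine with FFmpeg installed. Building the command as a list rather than a shell string avoids quoting problems with file names.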
Challenges we ran into
- Because we depended on tools such as FFmpeg, we switched from a serverless architecture to a full-fledged Ubuntu server on Amazon EC2, which was a challenging transition
- Hosting our backend on EC2 and serving our endpoints from there
- Integrating PoseNet with our live webcam stream
- Getting FFmpeg to work seamlessly with the audio integration
- Getting the React frontend to be responsive
- Tuning filtering algorithms to make the models more accurate
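The writeup does not say which filtering algorithms were used. One common way to steady noisy pose estimates, shown here purely as an assumed example, is to ignore low-confidence keypoints and apply an exponential moving average to the rest. The sketch is in Python for consistency with the backend, although PoseNet itself runs in the browser. The data shapes, function name, and default thresholds are all hypothetical.

```python
def smooth_keypoints(prev, current, min_score=0.5, alpha=0.3):
    """Blend the latest keypoints with the previous smoothed positions.

    prev: dict mapping part name to the last smoothed (x, y).
    current: dict mapping part name to (x, y, score) from the pose model.
    min_score: keypoints below this confidence are treated as unreliable.
    alpha: weight of the new observation (higher reacts faster, smooths less).
    """
    smoothed = {}
    for part, (x, y, score) in current.items():
        if score < min_score:
            # Unreliable detection: hold the last known position, if any.
            if part in prev:
                smoothed[part] = prev[part]
            continue
        if part in prev:
            px, py = prev[part]
            smoothed[part] = (
                alpha * x + (1 - alpha) * px,
                alpha * y + (1 - alpha) * py,
            )
        else:
            # First reliable sighting of this part: take it as-is.
            smoothed[part] = (x, y)
    return smoothed
```

Calling this once per frame and feeding each result back in as `prev` gives a running smoothed pose.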
Accomplishments that we're proud of
- Successfully transitioning from the Google Cloud Functions serverless architecture to an Amazon EC2 server
- Getting everything integrated
- Getting everything hosted on EC2 and working seamlessly
- Getting PoseNet to work and give accurate insights
- Making the UI responsive and clean
- Getting the audio cropping to work
What we learned
- How to use TensorFlow.js and PoseNet with a live webcam
- Google Cloud Speech-to-Text and audio processing
- Amazon EC2 and S3
What's next for Fluent
- Improving the speech models to make them more efficient and refined
- Improving the PoseNet insights
- Making the NLP models more rigorous