What it does
"ImpromPPTX" uses your computer's microphone to listen while you talk. Based on what you are saying, it generates content that appears on screen in your presentation in real time. It can retrieve images and graphs, generate relevant titles, and summarize your words into bullet points.
How we built it
Our project is composed of many interconnected components, which we detail below:
We use Google’s speech-to-text API to process audio from the laptop’s microphone. Mobile phones currently do not support the continuous audio implementation of the spec, so we process audio on the presenter’s laptop instead. Speech is transcribed whenever the user holds down a clicker button; when they let go, the aggregated text is sent to the server over WebSockets to be processed.
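The hold-to-talk behavior described above can be sketched as a small buffer that collects transcript fragments while a button is held and emits one message on release. This is only an illustration: the class name, the method names, and the JSON fields (`button`, `text`) are assumptions, not the project's actual message format, and the WebSocket send itself is left out.

```python
import json


class HoldToTalkBuffer:
    """Collects transcript fragments while a clicker button is held.

    Hypothetical helper: names and message schema are illustrative only.
    """

    def __init__(self):
        self._fragments = []
        self._held_button = None  # name of the button currently held, if any

    def press(self, button):
        # Start a fresh capture for this button.
        self._held_button = button
        self._fragments = []

    def add_fragment(self, text):
        # Ignore recognizer output that arrives while no button is held.
        if self._held_button is not None and text.strip():
            self._fragments.append(text.strip())

    def release(self):
        """Return the JSON payload to send, or None if nothing was captured."""
        button, self._held_button = self._held_button, None
        if button is None or not self._fragments:
            return None
        return json.dumps({"button": button, "text": " ".join(self._fragments)})
```

In a setup like the one described, the string returned by `release()` would be what gets sent to the server over the open WebSocket connection.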
Fundamentally, we needed a way to determine whether a given sentence included a request for an image. We gathered sample sentences from BBC news articles as “no” examples and manually curated a list of “yes” examples. We then used Facebook’s text classification library, FastText, to train a custom NN that could perform this classification.
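As a rough sketch of what training such a classifier could look like, the code below builds a training file in fastText's supervised format (one example per line, prefixed with `__label__<name>`) and then trains on it. The label names `yes`/`no`, the helper names, and the file path are assumptions for illustration; the project's actual data and hyperparameters are not given in this writeup.

```python
def to_fasttext_line(sentence, wants_image):
    """Format one example the way fastText's supervised mode expects."""
    label = "__label__yes" if wants_image else "__label__no"
    # fastText reads one example per line, so collapse internal whitespace.
    return f"{label} {' '.join(sentence.split())}"


def build_training_text(yes_sentences, no_sentences):
    """Return the full contents of a fastText training file."""
    lines = [to_fasttext_line(s, True) for s in yes_sentences]
    lines += [to_fasttext_line(s, False) for s in no_sentences]
    return "\n".join(lines) + "\n"


def train_and_predict(training_text, sentence, path="image_request.train"):
    """Train on the given text and classify one sentence (path is hypothetical)."""
    # Imported lazily so the helpers above work without fastText installed.
    import fasttext

    with open(path, "w", encoding="utf-8") as f:
        f.write(training_text)
    model = fasttext.train_supervised(input=path)
    labels, probabilities = model.predict(sentence)
    return labels[0], float(probabilities[0])
```

A real training set would need many more examples per class than a toy call would show; the two helper functions are only meant to make the file format concrete.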
Once we have a sentence that the NN classifies as a request for an image, such as “and here you can see a picture of a golden retriever”, we use part-of-speech tagging and a few tree-based rules to extract the subject, “golden retriever”, and scrape Bing for pictures of the golden animal. The image URLs are then sent over WebSockets to be rendered on screen.
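To make the subject-extraction step concrete, here is a deliberately simplified sketch. It assumes the sentence has already been tagged with Penn Treebank part-of-speech tags (supplied by hand here rather than by a tagger), and it uses a single flat rule, "cue noun, then `of`, then a noun phrase", in place of the tree rules the project used. The cue-word list and the function name are assumptions.

```python
CUE_NOUNS = {"picture", "image", "photo"}  # assumed cue words, not from the project


def extract_image_subject(tagged):
    """Return the noun phrase after '<cue noun> of', or None if there is none.

    `tagged` is a list of (word, Penn Treebank tag) pairs.
    """
    for i in range(len(tagged) - 1):
        word, _ = tagged[i]
        next_word, next_tag = tagged[i + 1]
        if word.lower() in CUE_NOUNS and next_tag == "IN" and next_word.lower() == "of":
            phrase = []
            for w, tag in tagged[i + 2:]:
                if tag == "DT" and not phrase:
                    continue  # skip a leading determiner such as "a" or "the"
                if tag.startswith("JJ") or tag.startswith("NN"):
                    phrase.append((w, tag))
                else:
                    break
            # Only accept the phrase if it contains at least one noun.
            if any(tag.startswith("NN") for _, tag in phrase):
                return " ".join(w for w, _ in phrase)
    return None
```

For the example sentence above, with `golden` tagged `JJ` and `retriever` tagged `NN`, the rule yields `golden retriever`, which would then serve as the image search query.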
When the backend detects that the user wants a graph to demonstrate their point, we use matplotlib to programmatically generate a graph that matches the request. The graph is then added to the presentation in real time.
The text we receive back from the Google speech-to-text API does not include periods where we pause in our speech. This can give more conventional NLP analysis (like part-of-speech analysis) some trouble, because the text is grammatically incorrect. We took a sequence-to-sequence (seq2seq) transformer architecture and used transfer learning to train a new head capable of classifying the boundaries between sentences. This let us add punctuation back into the text before the rest of the processing pipeline.
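The model itself is out of scope for a short example, but the final step, turning per-token boundary predictions back into punctuated text, can be sketched as follows. The boundary flags are hard-coded stand-ins for the classifier head's output, and the function name is hypothetical.

```python
def restore_sentences(tokens, ends_sentence):
    """Rebuild punctuated text from tokens and per-token boundary flags.

    `ends_sentence[i]` is truthy when token i is predicted to end a sentence.
    """
    if len(tokens) != len(ends_sentence):
        raise ValueError("need exactly one boundary flag per token")
    sentences, current = [], []
    for token, is_end in zip(tokens, ends_sentence):
        current.append(token)
        if is_end:
            sentences.append(current)
            current = []
    if current:  # treat trailing words as a final sentence
        sentences.append(current)
    rebuilt = []
    for words in sentences:
        text = " ".join(words)
        # Uppercase only the first character so casing inside words is kept.
        rebuilt.append(text[:1].upper() + text[1:] + ".")
    return " ".join(rebuilt)
```

With the tokens of "our sales grew last year we hired ten engineers" and a single boundary flagged after "year", this produces two capitalized, period-terminated sentences that downstream part-of-speech analysis can handle.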
Using part-of-speech analysis, we determine which parts of a sentence (or sentences) would best serve as the title of a new slide. We do this by searching through sentence dependency trees to find short sub-phrases (ideally 1-5 words) that contain important words and verbs. If the user signals with the clicker that they need a new slide, this function is run on their text until a suitable sub-phrase is found. When it is, a new slide is created with that sub-phrase as its title.
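One way to picture the title search is as a scan over every subtree of the dependency parse, keeping the short ones and scoring them by how many content words they contain. The sketch below assumes the parse is already available as parallel lists (words, coarse POS tags, and head indices, with `-1` marking the root); the scoring rule and the set of "important" POS tags are assumptions, not the project's actual heuristics.

```python
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}  # assumed notion of "important" words


def subtree_indices(heads, root):
    """Return the sorted token indices in the subtree rooted at `root`."""
    children = {}
    for i, head in enumerate(heads):
        if head != i:  # guard against parsers that mark the root as its own head
            children.setdefault(head, []).append(i)
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return sorted(found)


def pick_title(words, pos_tags, heads, max_words=5):
    """Pick the short subtree with the most content words, or None."""
    best, best_key = None, None
    for root in range(len(words)):
        span = subtree_indices(heads, root)
        if len(span) > max_words:
            continue
        content = sum(pos_tags[i] in CONTENT_POS for i in span)
        key = (content, -len(span))  # more content words first, then brevity
        if content > 0 and (best_key is None or key > best_key):
            best, best_key = span, key
    if best is None:
        return None
    return " ".join(words[i] for i in best).title()
```

For "today I want to talk about renewable energy storage", the subtree rooted at "storage" has three content words in three tokens, so it beats both the single-word candidates and the longer "about renewable energy storage".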
When the user is talking “normally” and not signalling for a new slide, image, or graph, we attempt to summarize their speech into bullet points that can be displayed on screen. This summarization is performed using custom part-of-speech analysis, which starts at verbs with many dependencies and works its way outward in the dependency tree, pruning branches of the sentence that are superfluous.
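The pruning idea can be illustrated with a toy version: start at the verb with the most direct dependents, walk down the dependency tree, and follow only relations judged essential, so that a pruned child takes its whole branch with it. The parse is supplied by hand as parallel lists, and the set of kept relations is an assumption chosen for this example rather than the project's actual rule set.

```python
KEEP_DEPS = {"nsubj", "dobj", "amod", "compound", "poss"}  # assumed "essential" relations


def bullet_from_parse(words, pos_tags, heads, deps):
    """Prune a dependency parse down to a short bullet, or return None.

    `heads[i]` is the index of token i's head (-1 for the root) and
    `deps[i]` is the label of the relation between token i and its head.
    """
    children = {}
    for i, head in enumerate(heads):
        if head != i and head >= 0:
            children.setdefault(head, []).append(i)
    verbs = [i for i, pos in enumerate(pos_tags) if pos == "VERB"]
    if not verbs:
        return None
    # Start at the verb with the most direct dependents.
    start = max(verbs, key=lambda i: len(children.get(i, [])))
    kept, stack = [], [start]
    while stack:
        node = stack.pop()
        kept.append(node)
        # Descend only through essential relations; everything under a
        # pruned child is dropped along with it.
        stack.extend(c for c in children.get(node, []) if deps[c] in KEEP_DEPS)
    text = " ".join(words[i] for i in sorted(kept))
    return text[:1].upper() + text[1:]
```

Given a parse of "basically our team quickly built a working prototype over the weekend", the adverbs, determiners, and the prepositional phrase are pruned, leaving "Our team built working prototype".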
Since it is convenient to have a clicker you can use while moving around during your presentation, we decided to turn your mobile device into one. After you log into the website on your phone, we send you to a clicker page that communicates with the server when you press the “New Slide” or “New Element” buttons. Pressing and holding one of these buttons activates the microphone on your laptop; the resulting text is analyzed on the server, and the results are sent back in real time. This real-time communication is accomplished using WebSockets.
Internal Socket Communication
Challenges we ran into
Text summarization is extremely difficult: while there are many powerful algorithms for turning articles into paragraph summaries, there is essentially nothing on shortening sentences into bullet points. We ended up having to develop a custom pipeline for bullet-point generation based on part-of-speech and dependency analysis.
The Web Speech API is not supported across all browsers, and even though it is "supported" on Android, Android devices are incapable of continuous streaming. Because of this, we had to move the recording segment of our code from the phone to the laptop.
Accomplishments that we're proud of
- Making a multi-faceted application with a variety of machine learning and non-machine learning techniques
- Working on an unsolved machine learning problem (sentence simplification)
- Connecting a mobile device to the laptop browser’s mic using WebSockets
- Analyzing text in real time to determine new elements
What's next for ImpromPPTX
- Predicting what the user intends to say next
- Scraping primary sources to automatically add citations and definitions
- Improving text summarization with word reordering and synonym analysis