Both of us were tired of making presentations for almost every class. We were also inspired by the 'in-game advertising' episode of the show Silicon Valley (S06E02).
What it does
InstaPresent uses your computer's microphone to generate presentation content on your screen in real time. It can retrieve images and graphs and summarize your words into bullet points.
How we built it
We use Google's Speech-to-Text API to transcribe audio from the laptop's microphone. Transcription runs whenever the user holds down the record button, and when they release it, the aggregated text is sent to the server over WebSockets for processing.
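The client-to-server handoff can be sketched as follows. This is a minimal illustration of aggregating transcript chunks into one payload; the message schema (field names like "type" and "text") is our assumption, not the project's actual wire format.

```python
import json

def build_transcript_message(chunks):
    """Join recognized text chunks into a single WebSocket payload.

    The {"type": ..., "text": ...} schema here is hypothetical; the
    real server may expect different field names.
    """
    text = " ".join(chunk.strip() for chunk in chunks if chunk.strip())
    return json.dumps({"type": "transcript", "text": text})

msg = build_transcript_message(["and here you can see", "a picture of a dachshund"])
```

With a library such as `websockets`, this string would then be passed to the connection's `send` method.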
Fundamentally, we needed a way to determine whether a given sentence included a request for an image. We gathered a repository of sample sentences from news articles as "no" examples and manually curated a list of "yes" examples. We then used FastText, Facebook's deep-learning text-classification library, to train a custom neural network for this classification task.
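FastText's supervised mode expects one example per line, prefixed with `__label__<class>`. A minimal sketch of preparing the training data in that format (the label names are our own choice, not from the original project):

```python
# Prepare training lines in FastText's supervised format:
# "__label__<class> <sentence>". Label names here are illustrative.
def to_fasttext_lines(yes_examples, no_examples):
    lines = [f"__label__image {s}" for s in yes_examples]
    lines += [f"__label__not_image {s}" for s in no_examples]
    return lines

train_lines = to_fasttext_lines(
    ["and here you can see a picture of a dachshund"],
    ["the senate passed the bill on tuesday"],
)

# With the fasttext package installed, training the classifier is then
# a single call on a file containing these lines:
#   model = fasttext.train_supervised(input="train.txt")
```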
Once the neural network classifies a sentence as a request for an image, such as "and here you can see a picture of a dachshund", we use part-of-speech tagging and some tree-theory rules to extract the subject, "dachshund", and scrape Bing for pictures of the wiener dog. These image URLs are then rendered on screen.
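A highly simplified stand-in for the subject extraction: given (word, POS-tag) pairs from a tagger, take the first noun that follows "of". The real pipeline walks the dependency tree, but this conveys the idea.

```python
# Toy subject extraction from POS-tagged tokens (Penn Treebank tags).
# This is a simplification of the dependency-tree rules described above.
def extract_subject(tagged):
    for i, (word, _) in enumerate(tagged):
        if word == "of":
            for w, t in tagged[i + 1:]:
                if t.startswith("NN"):  # noun (NN, NNS, NNP, ...)
                    return w
    return None

tagged = [("a", "DT"), ("picture", "NN"), ("of", "IN"),
          ("a", "DT"), ("dachshund", "NN")]
```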
Once the backend detects that the user wants a graph to demonstrate their point, we use matplotlib to generate one. These graphs are then added to the presentation in real time.
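Server-side rendering of such a graph might look like the sketch below: a headless matplotlib figure rendered straight to PNG bytes that can be pushed to the client. The function name and signature are our own, not the project's.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: no display needed on a server
import matplotlib.pyplot as plt

def render_graph(xs, ys, title):
    """Render a simple line graph to PNG bytes (illustrative sketch)."""
    fig, ax = plt.subplots()
    ax.plot(xs, ys)
    ax.set_title(title)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # free the figure to avoid leaking memory
    return buf.getvalue()

png = render_graph([0, 1, 2, 3], [0, 1, 4, 9], "demo graph")
```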
The text we receive back from the Google Speech-to-Text API doesn't include periods where we pause in our speech. This can trip up conventional NLP analysis (like part-of-speech tagging) because the text is grammatically incomplete. We took a sequence-to-sequence (seq2seq) transformer architecture and transfer-learned a new head capable of classifying sentence boundaries, which lets us restore punctuation to the text before the rest of the processing pipeline runs.
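The inference step can be illustrated with a toy stand-in for the model: given per-token probabilities that a sentence ends after that token (which the trained head would produce), reinsert periods above a threshold.

```python
# Toy stand-in for the seq2seq boundary classifier: the real model
# predicts boundary_probs; here they are supplied directly.
def restore_punctuation(tokens, boundary_probs, threshold=0.5):
    out = []
    for tok, p in zip(tokens, boundary_probs):
        out.append(tok + ("." if p >= threshold else ""))
    return " ".join(out)

restored = restore_punctuation(
    ["hello", "world", "how", "are", "you"],
    [0.1, 0.9, 0.1, 0.1, 0.95],
)
```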
Using part-of-speech analysis, we determine which parts of a sentence (or sentences) would best serve as the title of a new slide. We do this by searching sentence dependency trees for short sub-phrases (1-5 words, optimally) that contain important words and verbs. If the user signals with the clicker that they need a new slide, this function runs on their text until a suitable sub-phrase is found, and a new slide is created with that sub-phrase as its title.
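A simplified sketch of the selection criterion: among candidate sub-phrases (here pre-extracted as (word, POS-tag) lists rather than pulled from a live dependency parse), pick the shortest one of 1-5 words that contains both a noun and a verb.

```python
# Simplified title picker; the real version walks dependency trees to
# generate candidates, which are assumed given here.
def pick_title(candidates):
    def suitable(phrase):
        has_noun = any(t.startswith("NN") for _, t in phrase)
        has_verb = any(t.startswith("VB") for _, t in phrase)
        return 1 <= len(phrase) <= 5 and has_noun and has_verb

    valid = [p for p in candidates if suitable(p)]
    if not valid:
        return None
    return " ".join(w for w, _ in min(valid, key=len))

candidates = [
    [("the", "DT"), ("results", "NNS")],                    # no verb: rejected
    [("training", "VBG"), ("the", "DT"), ("model", "NN")],  # suitable
]
```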
When the user is talking "normally" and not signaling for a new slide, image, or graph, we attempt to summarize their speech into bullet points. This summarization uses custom part-of-speech analysis that starts at verbs with many dependencies and works outward through the dependency tree, pruning superfluous branches of the sentence.
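The pruning idea can be sketched as follows: keep a verb and its core dependents (subject, object) and drop modifier branches. The tokens are flattened (word, dependency-label) pairs and the set of "core" labels is our assumption; the real pipeline operates on full dependency trees.

```python
# Simplified bullet-point pruning over (word, dependency-label) pairs.
# CORE_DEPS is an assumed set of "core" relations to keep.
CORE_DEPS = {"ROOT", "nsubj", "dobj"}

def to_bullet(tokens):
    """Keep only the core of the sentence, dropping modifier branches."""
    return " ".join(w for w, dep in tokens if dep in CORE_DEPS)

tokens = [("the", "det"), ("model", "nsubj"), ("quickly", "advmod"),
          ("classifies", "ROOT"), ("each", "det"), ("sentence", "dobj")]
bullet = to_bullet(tokens)
```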
Challenges we ran into
Text summarization is very difficult: while there are powerful algorithms for turning articles into paragraph summaries, there is essentially nothing for shortening sentences into bullet points. We ended up developing a custom pipeline for bullet-point generation based on part-of-speech and dependency analysis. We couldn't explore the APIs of other services like Auth0 and Twilio, and we had also planned an Android app but couldn't build it given our limited team size and time constraints. Despite these challenges, we enjoyed the opportunity and are grateful for it.
Accomplishments that we're proud of
- Building a web application that combines machine learning and non-machine-learning techniques
- Working on an unsolved machine learning problem (sentence simplification)
- Real-time text analysis to determine new presentation elements
What's next for InstaPresent
- Predicting what the user intends to say next
- Improving text summarization with word reordering