Syntax highlighting from programming languages
Parsey McParseface by Google for automated sentence structure analysis
What it does
Colorful syntax highlighting for natural languages inside HTML documents
How we built it
First, we extract the plain text from a given HTML document.
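The text extraction step could look roughly like the sketch below. It uses only Python's standard html.parser module; the names TextExtractor and extract_text are made up for illustration and are not taken from our code, and the choice to skip script and style contents is an assumption about what counts as "plain text".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these never hold readable text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling extract_text on a page string returns one long string that can then be split into sentences and handed to the parser.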
The Parsey McParseface API from Google builds a sentence graph for each sentence. Each graph has the words as nodes, which also carry information about the word type, e.g. "verb". The edges describe the grammatical structure between the words, e.g. how deeply "nested" they are inside the sentence. From this we compute a score for the semantic importance of every single word. In addition, very common words get lower scores. The output is another HTML document in which the most important words are highlighted.
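As a rough illustration of the scoring idea, here is a minimal sketch. It does not call Parsey McParseface; it assumes the parse has already been turned into a list of tokens with a word type and a parent index. The Token class, the POS_WEIGHT and COMMON_WORDS tables, the formula, and the threshold are all hypothetical choices for this example, not the values we actually used.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Token:
    word: str
    pos: str   # coarse word type, e.g. "VERB"
    head: int  # index of the parent token in the sentence, -1 for the root


# Hypothetical weights per word type; unknown types fall back to 0.2.
POS_WEIGHT = {"VERB": 1.0, "NOUN": 0.9, "PROPN": 0.9, "ADJ": 0.6, "ADV": 0.5}
# Hypothetical stand-in for a list of very common words.
COMMON_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}


def depth(tokens, i):
    """Number of edges from token i up to the root (assumes a well-formed tree)."""
    d = 0
    while tokens[i].head != -1:
        i = tokens[i].head
        d += 1
    return d


def score(tokens, i):
    """Higher for important word types, lower for deeply nested or common words."""
    tok = tokens[i]
    s = POS_WEIGHT.get(tok.pos, 0.2) / (1 + depth(tokens, i))
    if tok.word.lower() in COMMON_WORDS:
        s *= 0.1
    return s


def highlight(tokens, threshold=0.4):
    """Wrap every token whose score reaches the threshold in a <mark> element."""
    parts = []
    for i, tok in enumerate(tokens):
        text = escape(tok.word)
        if score(tokens, i) >= threshold:
            text = "<mark>" + text + "</mark>"
        parts.append(text)
    return " ".join(parts)
```

For the sentence "The parser highlights important words" with "highlights" as the root, this sketch marks "parser", "highlights" and "words" and leaves "The" and "important" plain.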
We also used the SAP press articles as raw test data.
Challenges we ran into
At first we tried to analyse resumes for job applications and make predictions about them by means of RNNs. We only settled on our final project topic around midnight.
Installing Python 2.7 on an external server after midnight
Lack of sleep -.-
Accomplishments that we're proud of
Installation of Parsey McParseface without root permissions
What we learned
It's hard to get more than 500 CVs in one day because crawlers soon get blocked.
How to spoof with as many proxies as possible for resume crawling
What's next for NLP highlight
Implementation as a Chrome plugin with in-place text highlighting