Syntax highlighting from programming languages

Parsey McParseface by Google for automated sentence structure analysis

What it does

Colorful syntax highlighting for natural languages inside HTML documents

How we built it

First, we extract the plain text from a given HTML document.
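A minimal sketch of this extraction step, using only Python's standard library. The class name `TextExtractor`, the function `extract_text`, and the choice of skipped and block-level tags are assumptions made for illustration, not the code we actually ran.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}  # assumed: content never shown to readers
    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}  # assumed subset

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            # Block-level tags separate words even without whitespace in the source.
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the visible text of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

For example, `extract_text("<p>Hello <b>world</b>.</p>")` returns the string `Hello world.`, which can then be split into sentences and passed to the parser.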

Parsey McParseface from Google builds a sentence graph for each sentence. The nodes of these graphs are the words, annotated with information such as the word type, e.g. "verb". The edges describe the grammatical structure between the words, e.g. how deeply "nested" they are inside the sentence. From this graph we compute a score for the semantic importance of each word. Very common words get lower scores. The output is another HTML document in which the most important words are highlighted.
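The scoring and highlighting idea can be sketched as follows. This is an illustration under stated assumptions, not our actual formula: it assumes the parser output has already been converted into `(word, tag, head)` tuples, and the `Token` type, the tag weights, the common-word list, the penalty factor, the threshold, and the use of `<mark>` are all made up for the example.

```python
from collections import namedtuple
from html import escape

# head is the index of the parent word in the sentence, or -1 for the root.
Token = namedtuple("Token", ["word", "tag", "head"])

TAG_WEIGHT = {"NOUN": 1.0, "VERB": 0.9, "ADJ": 0.6}  # assumed weights per word type
DEFAULT_TAG_WEIGHT = 0.3                              # assumed weight for other types
COMMON_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}  # assumed list
COMMON_WORD_PENALTY = 0.1                             # assumed factor


def depth(tokens, index):
    """Number of edges between a word and the root of the sentence graph."""
    steps = 0
    while tokens[index].head != -1:
        index = tokens[index].head
        steps += 1
        if steps > len(tokens):
            raise ValueError("head indices contain a cycle")
    return steps


def score_tokens(tokens):
    """Deeper nesting lowers the score; very common words are penalized."""
    scores = []
    for i, token in enumerate(tokens):
        score = TAG_WEIGHT.get(token.tag, DEFAULT_TAG_WEIGHT) / (1 + depth(tokens, i))
        if token.word.lower() in COMMON_WORDS:
            score *= COMMON_WORD_PENALTY
        scores.append(score)
    return scores


def render_html(tokens, scores, threshold=0.4):
    """Wrap every word whose score reaches the threshold in a <mark> element."""
    parts = []
    for token, score in zip(tokens, scores):
        word = escape(token.word)
        parts.append("<mark>%s</mark>" % word if score >= threshold else word)
    return " ".join(parts)


sentence = [
    Token("The", "DET", 1),
    Token("dog", "NOUN", 2),
    Token("chased", "VERB", -1),
    Token("a", "DET", 4),
    Token("cat", "NOUN", 2),
]
highlighted = render_html(sentence, score_tokens(sentence))
```

With these assumed weights, the root verb and the two nouns pass the threshold while the articles do not. A real implementation would likely derive the common-word penalty from corpus frequencies instead of a fixed list.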

We also used the SAP press articles as raw test data.


Challenges we ran into

At first we tried to analyse resumes for job applications and make predictions from them by means of RNNs. We only settled on our final project topic around midnight.

Installing Python 2.7 on an external server after midnight

Lack of sleep -.-

Accomplishments that we're proud of

Installing Parsey McParseface without root permissions

What we learned

It's hard to get more than 500 CVs in one day because crawlers soon get blocked.

How to spoof with as many proxies as possible for resume crawling



Web crawling

Proxy services

Python 2.7

What's next for NLP highlight

Implementation as a Chrome plugin with in-place text highlighting

Built With
