Inspiration

I had always been fascinated by the potential of machine learning; going into this hackathon, I knew I wanted to explore the boundary of those possibilities. While brainstorming with respect to NLP, I noticed many applications followed a very cookie-cutter process: some input channel would be replaced by speech, that speech would then be parsed, and the app would continue working as usual. This often doesn't feel impactful - for example, as a user of a fitness app, I don't really care whether I enter my daily info into a spreadsheet or by talking to my phone.

When I asked myself what NLP and Wit.ai are really all about, I thought of scanning and tokenization: not only does the model detect the intent of a sentence, it also detects and classifies keywords. Oddly enough, compilers do the same thing, except with strict syntax rules on how 'sentences' (blocks of code) can be formed. Following the connection, I realized that if I could replace this process with NLP, I could scan and parse input without requiring strict syntax rules; in other words, I could create a language with no syntax.

What it does

NLPy is, in and of itself, a coding language. Users write 'code' either by talking to the application or by typing the sentence they wish to send; these queries can be as simple as "set variable to five" or "add fifteen to temp". What makes NLPy special is that there's no syntax: just give it instructions in plain old English and it should work.

Currently, NLPy supports:

  • IDs (i.e. variables) and integers
  • Arithmetic (+, -, *, /, %)
  • Assignment (=)
  • Basic comparisons (<, ==, >)
  • If statements
  • For loops
  • Std. Output
  • 'function' calls

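To give a feel for what a session looks like, here is a hypothetical exchange: each plain-English query (shown as a comment) maps to one generated line of Python. The exact variable names and wording are illustrative, not output captured from the real tool.

```python
# A hypothetical NLPy session; each comment is the spoken/typed query,
# each line below it is the Python that query would transpile to.

# "set total to zero"
total = 0
# "set count to five"
count = 5
# "add count to total"
total = total + count
# "print total"
print(total)
```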
How I built it

There are really two integral parts to NLPy: the code generator, and the real brains of the language, the NLP model.

To give a brief overview, the Wit.ai model receives a sentence and tries to identify the type of instruction it represents; it also tries to identify the specific tokens inside the sentence - for example, "set variable to five" would be recognized as an assignment with variable as the left value and 5 as the right value. Currently, it recognizes all the commands listed under What it does, and then some. I've trained it to associate certain keywords with certain intents and entities, but the bulk of the training is invested in free-text lookup strategies, since keywords are risky and can seriously mess things up when used out of context (e.g. as a variable name).
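In code, consuming such a result looks roughly like the sketch below. The response dictionary is a simplified stand-in for what Wit.ai's `/message` endpoint returns; the entity keys (`target:target`, the built-in `wit$number:number`) and the intent name are assumptions for illustration, not my actual model's schema.

```python
# Simplified sketch of handling a Wit.ai-style response for
# "set variable to five". Field names loosely follow Wit.ai's
# /message endpoint; the specific intent/entity names are invented.

response = {
    "text": "set variable to five",
    "intents": [{"name": "assignment", "confidence": 0.98}],
    "entities": {
        "target:target": [{"body": "variable"}],
        "wit$number:number": [{"value": 5}],
    },
}

# Pick the top intent and pull out the tokens it needs.
intent = response["intents"][0]["name"]
target = response["entities"]["target:target"][0]["body"]
value = response["entities"]["wit$number:number"][0]["value"]

line = f"{target} = {value}"
print(line)  # variable = 5
```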

These results are returned to the code generator, which takes the intents and tokens and emits the actual code - for the previous example, "variable = 5" would be generated. Beyond that, it's just a matter of stringing APIs together and cobbling an interface together.
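The generator really is that thin - essentially a dispatch table from intent name to a template, as sketched below. The intent names and token fields here are illustrative placeholders, not the real project's identifiers.

```python
# Minimal sketch of the code-generator half: a dispatch table mapping
# an intent name to a function that renders one line of Python.

def gen_assignment(tokens):
    return f"{tokens['target']} = {tokens['value']}"

def gen_arithmetic(tokens):
    # e.g. "add fifteen to temp" -> temp = temp + 15
    return f"{tokens['target']} = {tokens['target']} {tokens['op']} {tokens['value']}"

def gen_output(tokens):
    return f"print({tokens['target']})"

GENERATORS = {
    "assignment": gen_assignment,
    "arithmetic": gen_arithmetic,
    "output": gen_output,
}

def generate(intent, tokens):
    return GENERATORS[intent](tokens)

print(generate("assignment", {"target": "variable", "value": 5}))
# variable = 5
```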

I cheated a little bit: NLPy actually transpiles to Python instead of generating actual machine code. The generated Python is exposed to the user for debugging purposes, but it could be abstracted away as well.

Challenges I ran into

The design of the Wit.ai model was extremely challenging: what intents should I recognize, and what kinds of entities do I want? My initial models struggled to classify many of my entities because they were either too generic or too specific; this made training a nightmare, as I'd find myself retraining the model on previous data with extremely minor differences.

Furthermore, nested expressions would have been a nightmare to handle without being extremely inefficient. For example, recognizing the query "if a isn't equal to b..." as an if statement wouldn't be difficult, but training the model to parse the actual content into a, not equal, and b would be very inefficient, as I'd essentially have to retrain everything if I wanted to implement a while statement - and then the model might confuse ifs for whiles, because it might associate the conditional with one over the other. I overcame this by identifying "a isn't equal to b" as an entirely separate entity, recursively sending that substring back into my model, and training the model to recognize specific comparison patterns.
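The recursive strategy can be sketched like this: the top-level pass only flags "a isn't equal to b" as an opaque condition entity, and a second, pattern-specific pass turns that substring into a comparison. Here a regex table stands in for the trained comparison model - the actual project does this with Wit.ai, not regexes.

```python
import re

# Regex stand-ins for the trained comparison patterns.
COMPARISONS = [
    (re.compile(r"(\w+) (?:isn't|is not) equal to (\w+)"), "!="),
    (re.compile(r"(\w+) is equal to (\w+)"), "=="),
    (re.compile(r"(\w+) is less than (\w+)"), "<"),
]

def parse_condition(text):
    """Second pass: re-parse the condition substring on its own."""
    for pattern, op in COMPARISONS:
        m = pattern.fullmatch(text)
        if m:
            return f"{m.group(1)} {op} {m.group(2)}"
    raise ValueError(f"unrecognised condition: {text!r}")

def gen_if(condition_text):
    # First pass found an if-intent plus an opaque condition entity;
    # recurse on the condition, then wrap it in an if header.
    return f"if {parse_condition(condition_text)}:"

print(gen_if("a isn't equal to b"))  # if a != b:
```

The payoff of the split is reuse: a future while statement can call the same `parse_condition` pass without any retraining of the condition patterns.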

Similarly, calculating nested arithmetic expressions was near impossible; training aside, the real deal-breaker was the ambiguity of expressions. A query such as "let x equal three plus two times two" is completely ambiguous, as "x = 3 + 2 * 2" and "x = (3 + 2) * 2" are both valid interpretations. Consequently, I only allowed one arithmetic operation per instruction to avoid this headache; funnily enough, this meant NLPy felt a lot like assembly at times, where I'd have to use temporary variables (scratch registers) to evaluate expressions without modifying the original variables.
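Concretely, that restriction means the ambiguous query above has to be spelled out as two unambiguous queries, each compiling to a single operation. The generated Python would look roughly like this (variable names illustrative):

```python
# One operation per instruction: "x = 3 + 2 * 2" becomes two queries,
# with temp acting as the scratch register.

# "set temp to two times two"
temp = 2 * 2
# "set x to three plus temp"
x = 3 + temp

print(x)  # 7
```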

Accomplishments that I'm proud of

I can't begin to describe the sense of achievement when I successfully compiled my first program. Using NLPy, I cobbled together a quick script to calculate and return all primes below 100; a speculation that had originally started as _"What if I could do this..."_ had turned into _"Oh shit, it actually works"_.
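For a sense of scale, here is a guess at roughly what the generated Python for that primes script could look like given NLPy's feature set (for loops, if statements, %, one operation per line). This is a reconstruction for illustration, not the actual output.

```python
# Trial division, written in the flat, one-operation-per-line style
# that NLPy's restrictions would produce.

primes = []
for n in range(2, 100):
    composite = 0
    for d in range(2, n):
        r = n % d
        if r == 0:
            composite = 1
    if composite == 0:
        primes.append(n)

print(primes)
```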

I'm amazed at how cleanly everything comes together; the logic behind the compiler itself is essentially one glorified switch statement. Despite its simplicity, NLPy can already handle everything listed under What it does, and then some.

What I learned

I learnt a lot about training models effectively and efficiently; going in, I thought 'training is training, what does it really matter what I use?'. After my first couple of models crashed and burned, I realized I had to adapt my set of intents and entities to better fit the training data I could generate, so that the model could be trained efficiently.

What's next for NLPy

As it stands, NLPy currently operates like a syntax-free assembly language with some Python cheats baked in (lists and function calls). This is nice and all, but assembly is very low-level and detail-oriented; I want to extend NLPy to handle instructions at a much higher level - after all, the entire point is to offer a more human and intuitive approach to programming. Ultimately, I want NLPy to feel just like pseudo-code, to the point where someone describes an algorithm as naturally as they can and NLPy just compiles it. This would be an incredibly powerful tool: not only would it be very practical, but it would be extremely beginner-friendly, as new programmers wouldn't have to bog themselves down in syntactical details.
