About the Project

NazAR is an educational tool that automatically creates interactive visualizations of math word problems in AR, requiring nothing more than an iPhone.

Behind the Name

Nazar means “vision” in Arabic, which symbolizes the driving goal behind our app – not only do we visualize math problems for students, but we also strive to represent a vision for a more inclusive, accessible and tech-friendly future for education. And, it ends with AR, hence NazAR :)

Inspiration

The inspiration for this project came from each of our own unique experiences with interactive learning. As an example, we want to highlight the experiences of two team members, Mohamed and Rayan.

Mohamed Musa moved to the US when he was 12 from a village in Sudan, where he grew up and received his primary education. He did not speak English and struggled until a teacher transformed his learning through experiential and interactive methods. Applying those principles from then on, Mohamed picked up English fluently within a few months and reached the top of his class in both science and mathematics. Rayan Ansari had worked with many Syrian refugee students on a catch-up curriculum. One of his students, a 15-year-old named Jamal, had not received schooling since kindergarten and did not understand arithmetic or the abstractions used to represent it. Intuitively, the only way Rayan felt he could effectively teach Jamal and bridge that gap was through physical examples that Jamal could envision or interact with.

From the team’s diverse experiences, it was glaringly clear that accessible, flexible interactive-learning software would be invaluable in bringing this sort of transformative experience to any student’s work. We were determined to develop a platform that could achieve this goal without pre-curated questions and without requiring a teacher, tutor, or parent to provide this kind of time-intensive educational experience.

What it does

Upon opening the app, the student is presented with a camera view, and can press the snapshot button on the screen to scan a homework problem. Our computer vision model then uses neural network-based text detection to process the scanned question, and passes the extracted text to our NLP model.

Our NLP text-processing model runs as a Python script fully integrated into Swift. From the question, it extracts the set of characters to create in AR, along with the objects they hold and their quantities, which together represent the initial problem setup. For example, for the question “Sally has twelve apples and John has three. If Sally gives five of her apples to John, how many apples does John have now?”, our model identifies that two characters should be drawn, Sally and John, and that the setup should show them with twelve and three apples, respectively.
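
To make the handoff concrete, here is a minimal sketch (with hypothetical type and field names, not our exact schema) of the kind of structured setup the NLP step produces for the example above:

```swift
// Hypothetical schema for the structured setup handed to the AR layer.
struct ProblemCharacter {
    let name: String       // e.g. "Sally"
    let object: String     // the kind of item they hold, e.g. "apple"
    let quantity: Int      // how many they start with
}

// Parsed setup for the Sally/John example question.
let initialSetup: [ProblemCharacter] = [
    ProblemCharacter(name: "Sally", object: "apple", quantity: 12),
    ProblemCharacter(name: "John",  object: "apple", quantity: 3),
]
```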

The app then draws this setup using Apple’s RealityKit framework, overlaying the characters and objects described in the problem. The setup is interactive: the user can move the objects around the screen, reassigning them between characters. When the state of the environment reflects the correct answer, the app verifies it, congratulates the student, and moves on to the next question. Additionally, the characters are dynamic and expressive, displaying idle movement and reactions rather than appearing frozen in the AR environment.
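
As an illustration of the interaction layer, the sketch below shows roughly how an object can be made draggable in RealityKit and how the answer check might count objects near a character; the function and parameter names are illustrative rather than our exact implementation:

```swift
import RealityKit
import simd

// Make a model draggable with RealityKit's built-in gestures.
func makeDraggable(_ model: ModelEntity, in arView: ARView) {
    model.generateCollisionShapes(recursive: true)       // gestures require collision shapes
    arView.installGestures([.translation], for: model)   // built-in drag gesture
}

// After each drag, count how many apples sit within `radius` of a character's
// position to check whether the environment now reflects the correct answer.
func applesNear(_ position: SIMD3<Float>, apples: [ModelEntity], radius: Float = 0.05) -> Int {
    apples.filter { simd_distance($0.position(relativeTo: nil), position) < radius }.count
}
```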

How we built it

Our app relies on three main components, each of which we built from the ground up to best tackle the task at hand: a computer vision (CV) component that processes the camera feed into text; an NLP model that extracts and organizes information about the initial problem setup; and an augmented-reality (AR) component that creates an interactive, immersive environment for the student to solve the problem.

We implemented the computer vision component to perform image-to-text conversion using Apple’s Vision framework, whose text recognition is backed by a convolutional neural network trained on hundreds of thousands of data points. We tailor the user experience with a snapshot button: the student positions their device in front of a question and presses the button to capture an image, which is then converted to a string and passed off to the NLP model.
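
For reference, a condensed sketch of what this image-to-text step looks like with Apple’s Vision framework (error handling trimmed for brevity):

```swift
import Vision

// Recognize printed text in a captured image and return it as a single string.
func recognizeText(in image: CGImage, completion: @escaping (String) -> Void) {
    let request = VNRecognizeTextRequest { request, _ in
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        // Join the top candidate from each detected line of text.
        let text = observations
            .compactMap { $0.topCandidates(1).first?.string }
            .joined(separator: " ")
        completion(text)
    }
    request.recognitionLevel = .accurate                  // favor accuracy over speed
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try? handler.perform([request])
}
```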

Our NLP model, which we developed completely from scratch for this app, runs as a Python script and is integrated into Swift using a version of PythonKit we custom-modified to work on iOS. It first tokenizes and lemmatizes the text using spaCy, then uses numeric terms as pivot points for a prioritized search that relies on English grammatical rules to match each numeric term to a character, an object, and a verb (action). The model successfully matches objects to characters even when they aren’t explicitly specified (e.g. for Sally in “Ralph has four melons, and Sally has six”) and, by using the nearest preceding verb of each numeric term as the basis for an inclusion-exclusion criterion, it also accounts for extraneous information, such as statements about characters receiving or giving objects, which shouldn’t be included in the initial setup. The model also handles characters that do not possess any objects to begin with but who should still be drawn in the environment, since they may receive objects as part of the solution to the question. It directly returns the filenames of the assets that the AR code should load.
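
The snippet below is a trimmed sketch of the tokenize-and-lemmatize step as called from Swift; it assumes our custom-modified PythonKit with a bundled interpreter and an on-device spaCy model, which is not the stock setup, and the function name is illustrative:

```swift
import PythonKit

// Tokenize and lemmatize the question text with spaCy, flagging numeric terms
// so they can serve as pivot points for the downstream search.
func lemmatizedTokens(for question: String) -> [(lemma: String, isNumber: Bool)] {
    let spacy = Python.import("spacy")
    let nlp = spacy.load("en_core_web_sm")   // small English pipeline
    let doc = nlp(question)
    return doc.map { token in
        (lemma: String(token.lemma_) ?? "", isNumber: Bool(token.like_num) ?? false)
    }
}
```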

Our AR component takes over from the moment a homework problem is read. Using Apple’s RealityKit environment, the software determines the plane of the paper on which we anchor our interactive learning space. The NLP model passes in the objects of interest, which correspond to particular USDZ assets in our library, along with a vibrant background terrain. In our testing, we tried multiple approaches to hand tracking and gesture classification, including a Core ML model, a custom SDK for gesture classification, a TensorFlow model, and our own gesture-processing class paired with Apple’s hand pose detection library. For the purposes of TreeHacks, we decided it was most reasonable to stick with touchscreen manipulation, especially for our demo, which runs on the iPhone itself without a separate wearable accessory. We also found this to provide better ease of use when interacting with the environment and to be the most accessible option given our hardware constraints (we had neither a HoloKit accessory nor the upcoming Apple AR glasses).
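
The following is a simplified sketch of how a scene like ours can be anchored to the page and populated with USDZ assets in RealityKit; the asset names and spacing are illustrative:

```swift
import RealityKit

// Anchor the learning space on a nearby horizontal plane (the homework page or
// desk) and load the USDZ assets the NLP step selected.
func buildScene(in arView: ARView, assetNames: [String]) {
    let anchor = AnchorEntity(plane: .horizontal, minimumBounds: [0.2, 0.2])
    for (index, name) in assetNames.enumerated() {
        if let model = try? Entity.loadModel(named: name) {   // e.g. "apple.usdz"
            model.position = [Float(index) * 0.1, 0, 0]       // space models 10 cm apart
            anchor.addChild(model)
        }
    }
    arView.scene.addAnchor(anchor)
}
```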

Challenges we ran into

We ran into several challenges while implementing our project, which was somewhat expected given the considerable number of components we had, as well as the novelty of our implementation.

One of the first challenges was a lack of access to wearable hardware, such as HoloKits or HoloLenses. Based on this, and on our desire to make the app as accessible and scalable as possible without requiring users to purchase expensive equipment, we decided to build the experience around a standalone iPhone so we can reach as many people who need it as possible.

Another issue we ran into was hand gesture classification. Very little work has been done on this in Swift environments, and there was little to no documentation on hand tracking available to us. As a result, we wrote and experimented with several different models, including training our own deep learning model to identify gestures, which took a toll on our laptops’ resources. In the end we got it working, but we are not using it in our demo because it currently experiences some lag. In the future, we aim to run our own gesture-tracking model in the cloud, trained on over 24,000 images, in order to provide lag-free hand tracking.
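
For context, the sketch below shows the kind of hand pose detection we experimented with using Apple’s Vision framework; the gesture classification we layered on top of these joint positions is our own logic and is omitted here:

```swift
import Vision

// Detect one hand in a camera frame and return its confidently detected joints
// as normalized image coordinates.
func detectHandJoints(in pixelBuffer: CVPixelBuffer) -> [VNHumanHandPoseObservation.JointName: CGPoint] {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 1
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try? handler.perform([request])
    guard let observation = request.results?.first,
          let points = try? observation.recognizedPoints(.all) else { return [:] }
    // Drop low-confidence joints and keep only their locations.
    return points.filter { $0.value.confidence > 0.3 }
                 .mapValues { $0.location }
}
```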

The final major issue we encountered was the lack of interoperability between Apple’s iOS development environment and other systems, for example when running our NLP code, which requires input from the computer vision model and has to pass the extracted data on to the AR algorithm. We have been continually working to overcome this challenge, including by modifying the PythonKit package to bundle a Python interpreter alongside the other application assets so that Python scripts can run successfully on the end device. We also used text files for input and output so that our Python NLP script could interact more easily with the Swift code.
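
A simplified sketch of that text-file handoff on the Swift side (file names and the sample question are illustrative):

```swift
import Foundation

// Illustrative file locations for the Swift <-> Python handoff.
let workDir = FileManager.default.temporaryDirectory
let questionURL = workDir.appendingPathComponent("question.txt")
let setupURL = workDir.appendingPathComponent("setup.txt")

// Swift writes the OCR'd question for the Python NLP script to read...
try? "Sally has twelve apples and John has three..."
    .write(to: questionURL, atomically: true, encoding: .utf8)

// ...and later reads back the asset list the script produced (one name per line).
let assetNames = (try? String(contentsOf: setupURL, encoding: .utf8))?
    .split(separator: "\n").map(String.init) ?? []
```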

Accomplishments we're proud of

We built our computer vision and NLP models completely from the ground up during the hackathon, and we also developed multiple hand-tracking models on our own, overcoming the lack of documentation for hand detection in Swift.

Additionally, we’re proud of the novelty of our design. Existing tools that provide interactive problem visualization rely either on custom QR codes embedded alongside the questions that load pre-written environments, or on a set of pre-curated models; and Photomath, the only major app that takes a real-time image-to-text approach, lacks support for word problems. In contrast, our app integrates directly with existing math problems and doesn’t require any additional work on the part of students, teachers, or textbook writers in order to function.

Additionally, by relying only on an iPhone and an optional HoloKit accessory for hand tracking that is not vital to the application (and which, at a retail price of $129, is far more scalable than VR headsets that typically cost thousands of dollars), we maximize the accessibility of our platform not only in the US but around the world, where it has the potential to complement instructional efforts in developing countries whose educational systems lack the resources to give students enough one-on-one support. We’re eager for NazAR to make a global impact on students’ comfort and experience with math in the coming years.

What we learned

  • We learned a lot from building the hand-tracking models, which have rarely been attempted on iOS and for which there is practically no Swift documentation.
  • We are truly operating on a new frontier, as there is little to no prior work in the area we are exploring.
  • We will have to build many architectures ourselves, since a lot of the technologies related to our project are not yet open source. We’ve already been making progress on this front, and plan to do far more in the coming weeks as we work towards a stable release of our app.

What's next for NazAR

  • Having the app animate the correct answer (e.g. Bob handing apples one at a time to Sally)
  • Animating algorithmic approaches and code solutions for data structures and algorithms classes
  • Being able to automatically produce additional practice problems similar to those provided by the user
  • Using cosine similarity to automatically make terrains mirror the problem description (e.g. show an orchard if the question is about apple picking, or a savannah if giraffes are involved); a toy sketch of this idea follows the list
  • And more!
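
For the terrain-matching bullet above, here is a toy sketch of the cosine-similarity idea; the embedding vectors are placeholders and would come from a word-embedding model in practice:

```swift
// Cosine similarity between two embedding vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map(*).reduce(0, +)
    let magA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let magB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return dot / (magA * magB)
}

// Pick the terrain whose embedding best matches the problem's embedding.
func bestTerrain(for problem: [Float], among terrains: [String: [Float]]) -> String? {
    terrains.max { cosineSimilarity(problem, $0.value) < cosineSimilarity(problem, $1.value) }?.key
}
```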

Built With

Swift, Python, RealityKit, Apple Vision, Core ML, PythonKit, spaCy, TensorFlow, USDZ
