Dee - the DeepLens Educating Entertainer

Dee being tested by my three-year-old son
Dee - DeepLens, speaker, and eight picture cards

Inspiration

Young children, and some older ones with special learning needs, can struggle to interact with electronic devices. They may not be able to read a tablet screen, or use a computer keyboard, or speak clearly enough for voice recognition. But with video recognition, this can change. Technology can now understand the child's world, and discover when they do something, such as pick up an object or perform an action. And that leads to whole new ways of interaction.

DeepLens is particularly appealing for children's interactions because it can run its deep learning models offline. This means the device can work anywhere, with no additional costs and no privacy concerns over children's data.

What it does

Dee (the DeepLens Educating Entertainer) asks questions, by speaking. Her questions ask the participant to show something. The questions (in a JSON file and easily extended) have answers that are one of four animals (bird, cow, horse and sheep) or four forms of transport (aeroplane, bicycle, bus and motorbike). Some questions have just one right answer (e.g. "What says moo?") and some can have several (e.g. "What has wheels?"). Right answers are praised and wrong ones are given not-very-subtle hints to get it right. (This is about interaction and positive reinforcement, rather than being a challenging quiz.)
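To give a feel for how questions map to answers, here is a minimal sketch in Python. The layout is an assumption for illustration only: the field names "text" and "answers", and the helpers load_questions and is_right, are hypothetical and may not match the JSON file in the repo.

```python
import json

# Hypothetical layout for the questions file. The field names "text" and
# "answers" are assumptions for illustration, not the repo's actual schema.
QUESTIONS_JSON = """
[
  {"text": "What says moo?", "answers": ["cow"]},
  {"text": "What has wheels?",
   "answers": ["aeroplane", "bicycle", "bus", "motorbike"]}
]
"""

# The eight objects named in the write-up.
KNOWN_OBJECTS = {
    "bird", "cow", "horse", "sheep",
    "aeroplane", "bicycle", "bus", "motorbike",
}


def load_questions(raw):
    """Parse the questions and reject any answer outside the eight objects."""
    questions = json.loads(raw)
    for question in questions:
        unknown = set(question["answers"]) - KNOWN_OBJECTS
        if unknown:
            raise ValueError(
                f"Unknown answers in {question['text']!r}: {sorted(unknown)}"
            )
    return questions


def is_right(question, shown):
    """A question may accept one answer or several."""
    return shown in question["answers"]


questions = load_questions(QUESTIONS_JSON)
```

With a layout like this, extending Dee is just a matter of appending another entry to the list, as long as its answers are among the objects the model can detect.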

The participant answers the questions by showing Dee a picture of the relevant object. The GitHub repo includes a PDF of pictures that can be printed out for this.

How I built it

The predefined DeepLens model deeplens-object-detection worked well for this, and so creating a new one was not required. This meant that more time could be focussed on the logic in the Lambda.

A Lambda function, running on the DeepLens device (via Greengrass, of course), handles interaction. It picks a question at random, speaks it, and then analyses the model response to see how the user answered. Plenty of messages such as "Let's do more!" and "Good choice!" help the participant feel positive about the experience and stay engaged.
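The core of that logic can be sketched roughly as follows. This is not the actual Lambda code: the detection format (a list of dicts with "label" and "prob" keys), the 0.5 confidence threshold, and the function names are all assumptions made for illustration, and the camera and model calls are left out entirely.

```python
import random

# The eight objects named in the write-up.
KNOWN_OBJECTS = {
    "bird", "cow", "horse", "sheep",
    "aeroplane", "bicycle", "bus", "motorbike",
}

# Assumed cut-off; the real value would need tuning on the device.
CONFIDENCE_THRESHOLD = 0.5


def pick_question(questions, rng=random):
    """Pick the next question at random."""
    return rng.choice(questions)


def shown_object(detections, threshold=CONFIDENCE_THRESHOLD):
    """Return the most confident known label at or above the threshold, or None."""
    candidates = [
        d for d in detections
        if d["label"] in KNOWN_OBJECTS and d["prob"] >= threshold
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["prob"])["label"]


def respond(question, detections):
    """Decide whether to keep waiting, praise, or give a hint."""
    shown = shown_object(detections)
    if shown is None:
        return "waiting"  # nothing relevant in view yet
    if shown in question["answers"]:
        return "praise"   # e.g. play "Good choice!"
    return "hint"         # play a not-very-subtle hint


question = {"text": "What says moo?", "answers": ["cow"]}
outcome = respond(question, [{"label": "cow", "prob": 0.9}])
```

Keeping the decision in a plain function like respond, separate from the camera loop, makes it possible to test the question logic on a laptop without a DeepLens attached.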

Dee is designed not to require Wi-Fi access (to ensure there are no connection, cost, or privacy concerns). This was tricky when it came to speech, because Amazon Polly is a cloud service. To overcome this, a script was made to capture all the required phrases in advance and store them locally. As a result, the Lambda includes 69 MP3 files.
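A script of that kind could look something like the sketch below. It is an illustration under stated assumptions rather than the script from the repo: the voice ("Joanna"), the hash-based file naming, and the function names are all hypothetical. The only real API used is boto3's Polly client and its synthesize_speech call, which needs AWS credentials and a network connection at build time, but never on the device.

```python
import hashlib
from pathlib import Path


def phrase_filename(phrase):
    """Derive a stable file name from the phrase text (assumed naming scheme)."""
    digest = hashlib.sha256(phrase.encode("utf-8")).hexdigest()[:12]
    return f"{digest}.mp3"


def polly_synthesize(phrase, voice_id="Joanna"):
    """Fetch MP3 bytes from Amazon Polly. The voice is an assumption."""
    import boto3  # imported lazily: only needed at build time, not on the device

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=phrase, OutputFormat="mp3", VoiceId=voice_id
    )
    return response["AudioStream"].read()


def cache_phrases(phrases, out_dir, synthesize=polly_synthesize):
    """Write one MP3 per phrase, skipping phrases that are already cached."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for phrase in phrases:
        target = out / phrase_filename(phrase)
        if not target.exists():
            target.write_bytes(synthesize(phrase))
        paths.append(target)
    return paths
```

At runtime the Lambda would then only need phrase_filename to find the right file to play. Passing the synthesizer in as a parameter also means the caching logic can be exercised with a stand-in function, without calling Polly.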

Challenges I ran into

My initial hope was that, rather than pictures, the child could show toys to Dee. Picking up, for example, a toy plane or cuddly sheep, would be more exciting than a piece of paper. But in testing, the object detection model did not see toys as being the same as their real counterparts. A toy plane is just too different from a real plane, it seems. Training a model to work on toys would fix this, of course, but I was unable to find a good and large enough training data set. This is something to work on.

Accomplishments that I'm proud of

I'm impressed with how this form of interaction really works. As you'll see from the YouTube video, we tried Dee with my three-year-old son, and he loved it. He's asking to play with it again. This may be a prototype but it's good enough for him to use.

I'm also excited by how the positive reinforcement aspects can help children with autism or Asperger's.

What I learned

This project has brought me up to speed with deep learning concepts, and with AWS's approach to managing and running deep learning models (through SageMaker and Greengrass).

Away from the tech, I've also learned how much potential there is in technology that increasingly understands the human world. Intelligent video recognition allows all kinds of new ways to play games and learn new things.

What's next for Dee - the DeepLens Educating Entertainer

The potential for Dee is huge. If she could recognise a wider range of things, a much more varied set of questions could be asked. Consider:

"Can you hold up three fingers?" (to test counting skills)
"Show me your biggest smile!"
"Can you do a star jump?"
"Which one is the letter A?"
"Can you show me your favourite toy?"

Of course, training new models will be a key part of this. And with services such as SageMaker making training more straightforward, it becomes possible for end users to train their own models. A teacher could, for example, train Dee to recognise certain objects in the classroom. Or a carer could train Dee to respond to specific objects that are important to someone with autism.

Finally, there are plenty more improvements that could be made to the logic too. Could Dee track your progress over time, and report on how well you're learning, say, the alphabet? Could she recognise different people and set them different challenges? The possibilities are endless.