Imagine a world where ...

You can turn any idea into reality just by telling it to an AI, just as you would tell another person. Pretty awesome, right?

We call this the ideation economy, and we're quite certain this is the economy of the future.

So we resolved to create it.

Cool, but how?

We're turning your computer into your personal employee. 1 of 1, just like you.

We believe that the optimal experience for human-computer interaction is interacting with your computer just as you interact with another person. If you want something done, you tell it to the computer. If it doesn't know, then you teach it.

That's the experience we built.

Meet Ditto 👋🏼 It is your AI desktop assistant that you teach to automate the tasks you hate doing.

You teach it just as you would teach another person and then it executes tasks just like a person would.

1) You give it instructions on the task

2) You show it the task a few times

3) You answer its questions

4) You let it automate tasks by clicking and typing on your computer

How does it work?

Ditto starts by observing your computer screen just as a person would, by feeding screenshots into a computer vision model. From that model, it learns to identify and categorize the text and UI elements on the screen, along with their pixel locations. Using that, it builds a relational knowledge structure that captures spatial relations, plus any concepts or tasks the user teaches it.
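To make that concrete, here's a minimal sketch of what such a relational knowledge structure could look like (using networkx; the node names, attributes, and relations are our own illustrative inventions, not Ditto's actual schema):

```python
import networkx as nx

# Hypothetical knowledge structure: nodes are detected UI elements,
# edges capture spatial relations and taught concepts.
screen = nx.DiGraph()

# Each detected element stores its class label and pixel bounding box.
screen.add_node("compose_button", kind="button", box=(24, 96, 132, 130))
screen.add_node("inbox_label", kind="text", box=(24, 160, 90, 178))

# A spatial relation inferred from the two bounding boxes.
screen.add_edge("compose_button", "inbox_label", relation="above")

# A concept the user taught: "this button starts a new email".
screen.nodes["compose_button"]["concept"] = "start_new_email"

print(screen.nodes["compose_button"])
```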

As Ditto learns through demonstrations and conversations, what it is actually doing is augmenting that knowledge structure with concepts. Any ambiguity, say about why the user performed a certain action, is resolved by asking the user questions.

Once it has learned a task, it automates it by clicking and typing on the UI.
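At the lowest level, "clicking and typing" is just synthesized mouse and keyboard events. A minimal sketch in Python with pyautogui (the coordinates and text are placeholders, not values Ditto actually uses):

```python
import pyautogui

def click_element(box):
    """Click the center of a detected element's bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    pyautogui.click((x1 + x2) // 2, (y1 + y2) // 2)

# Replay a learned step: click the compose button, type a subject, send.
click_element((24, 96, 132, 130))        # placeholder coordinates
pyautogui.write("Weekly status update")  # type into the focused field
pyautogui.hotkey("command", "enter")     # Gmail's send shortcut on macOS
```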

How we built it

We built this application using the following stack:

Frontend:

An Electron application built with React JS, CSS, and HTML.

Backend:

The backend is built with Python, PyTorch, and FastAPI.
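As a rough sketch of how the Electron frontend might talk to that Python backend (the route name and payload shape here are our assumptions, not Ditto's real API):

```python
from fastapi import FastAPI, UploadFile
from pydantic import BaseModel

app = FastAPI()

class Detection(BaseModel):
    label: str
    box: tuple[int, int, int, int]  # x1, y1, x2, y2 in pixels

@app.post("/analyze_screen")
async def analyze_screen(screenshot: UploadFile) -> list[Detection]:
    image_bytes = await screenshot.read()
    # In the real backend, this is where the OCR + YOLOv5 models
    # described below would run over the screenshot.
    detections: list[Detection] = []
    return detections
```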

The backend consists of several core components: an NLP/OCR pipeline for named entity recognition and localization of text on the screen, and a YOLOv5 model in PyTorch for icon, button, and GUI component recognition.
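A sketch of that detection step, assuming an Ultralytics YOLOv5 checkpoint fine-tuned on GUI components (the weights path is a placeholder) and pytesseract for the OCR half:

```python
import torch
import pytesseract
from PIL import Image

# GUI-component detector: YOLOv5 fine-tuned on hand-labeled screenshots
# (the weights path here is a placeholder).
model = torch.hub.load("ultralytics/yolov5", "custom", path="gui_components.pt")

def detect_components(image_path: str):
    image = Image.open(image_path)

    # YOLOv5: bounding boxes + class labels for icons, buttons, etc.
    boxes = model(image).pandas().xyxy[0]  # xmin..ymax, confidence, name

    # OCR: on-screen text with pixel locations, feeding NER downstream.
    words = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    return boxes, words
```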

On top of that sits an action queue based on a knowledge graph: embeddings represent the extracted textual components and the GUI hierarchy, and each node maps to its associated actions.

E.g. Chrome -> Gmail -> compose_button -> recipient_address -> subject_header -> email_body -> press_hotkey("cmd", "enter").
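In code, a chain like that can be as simple as a queue of graph nodes that each resolve to a primitive action (a sketch; the action names and stub executor are hypothetical stand-ins):

```python
from collections import deque

# A learned task as an ordered chain of knowledge-graph nodes.
task = deque([
    ("focus_app", "Chrome"),
    ("navigate", "Gmail"),
    ("click", "compose_button"),
    ("type_into", "recipient_address"),
    ("type_into", "subject_header"),
    ("type_into", "email_body"),
    ("hotkey", ("command", "enter")),
])

def run_task(queue, executor):
    """Pop each step off the queue and dispatch it to a UI primitive."""
    while queue:
        action, target = queue.popleft()
        executor(action, target)  # a real executor resolves target's box, then clicks/types

run_task(task, executor=print)  # stub executor just logs the steps
```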

This is a hard problem

Being able to learn and automate (eventually) ALL computer tasks? Sounds like a tall order. We love challenges, and this is about as difficult and rewarding as it gets. Here's how we're going about it.

Ditto's awesome user experience is powered by a system of interconnected AI components, including a computer vision system to see what's going on and a conversational natural language processing model. By stitching together these different modalities, we create a comprehensive system that is more robust and performant.

Each of these individual components carries failure risk, and chaining them together drastically increases the risk of the system as a whole.

Not a trivial problem by any means. But we are crafty engineers.

Let's go into some of the challenges we overcame.

*Cold Start Problem: Lack of an available dataset*

To train AI models you need data. For the vast majority of AI models, there are great, publicly available datasets to work with. Unfortunately for us, that wasn't the case here. We needed a dataset of GUI components and of what people actually do on their computers: interaction data that simply isn't publicly available.

*Solution: Restricting the production distribution and hand labeling*

The more variance your model will see in production, the more data you need to train it. So, by constraining the production data distribution to the tasks we specify, we drastically reduce our data needs.

To overcome the lack of publicly available data, we hand-labeled the dataset ourselves. Hey, someone had to do it 🤷‍♂️

*Problem: Computational heaviness of deep nets*

Deep learning models are hefty to run. So hefty, in fact, that an entire market of GPUs has been catapulted by their compute needs. Running these models on a regular laptop is taxing and, done naively, could compromise the integrity of the user experience.

We had to find a way around it.

*Solution: Hierarchical segmentation and classification*

The YOLOv5 model outputs bounding boxes and class labels for the components detected on a screen. A secondary network then logs and models the sequence of actions the user takes on those components (clicks, key inputs, etc.). That sequence forms our action ontologies, which are augmented with the spatial relationships of the GUI components (proximity is a good heuristic for "relatedness" in terms of actions).
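The proximity heuristic itself is easy to sketch: treat two detected components as related when their box centers fall within some pixel radius (a minimal illustration; the radius and coordinates below are arbitrary, not our tuned values):

```python
import math

def center(box):
    """Center point of a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def related(box_a, box_b, radius=150):
    """Heuristic: components whose centers are close likely share an action."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return math.hypot(ax - bx, ay - by) <= radius

# A text label sitting right next to its input field counts as related.
print(related((24, 96, 132, 130), (140, 96, 300, 130)))  # True
```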

*Bonus Problem: Task representation*

To execute tasks on the UI, our AI has to learn tasks. Many AI folks would immediately reach for an end-to-end deep model to represent the task. In theory, sure; in practice, not so much.

All modern deep learning techniques are completely unviable here given the amount of data they need. Imagine your Ditto saying "Hey, can you show that to me 100k more times" 🥴 ... Yuck

We had to find a way for Ditto to learn a task with just 1-2 demonstrations.

*Solution: Come talk to us in private 🤫*

We found a way, but it's kind of a secret 🙊

What a great learning experience!

There is no better way to learn than trial by fire. By diving in, we learned some key things that will stick with us for a while:

*Narrowing the scope*

For any open-ended problem, it is tempting to create a solution that adapts to all possible edge cases. But when starting out, that drastically increases the complexity of the problem and the time to ship. Narrowing the scope is vital.

*Context sharing*

It's easy to get siloed and branch off on tangents during a hackathon. We learned to do code hand-offs and documentation, diagram our systems regularly, and just keep communicating with teammates.

*Superpower-based task allocation*

While it sounds obvious, getting people to do what they're best at, at all times, is quite hard. It's easy to start bike-shedding, picking up unimportant tasks, or offloading important ones. Learning to relentlessly prioritize and work only on the immediate S-tier problems was a huge lesson in team efficacy.

Aspire's Future 🔮

We are on a mission to manifest the ideation economy by automating away meaningless tasks. Spend more time doing what matters. This is just the beginning 💫. Come learn more, you'll like what you hear.
