Inspiration


Scientists consider vision one of the drivers of the intelligence that has evolved since the Cambrian era. Computer vision algorithms have been around since the advent of computing, with images and videos being captured, processed, and transmitted across different channels.

Today, domains ranging from remote sensing and security surveillance down to pathology rely on human visual intelligence for decision making.

The field of computer vision and deep learning is growing at an accelerating pace. Hundreds of new research papers are published every day, focusing on cloud porting, edge deployment, faster training, and algorithms drawn from interdisciplinary fields.

Currently, developers and researchers use popular deep learning frameworks like TensorFlow, PyTorch, and MXNet. Each requires learning a different implementation style, and every framework has its own set of pros and cons.

Computer vision is an intrinsically complex domain that demands a blend of skills from programming and mathematics. This intrinsic complexity inevitably makes the frameworks complex too, with steep learning curves. The pace of advancement, combined with non-standardized ways of presenting research, creates chaos. Some of the challenges that inspired us to build MonkAI --

  • For a junior developer, intern, or even a freelancer, carrying a full understanding of even one framework is a tough ask, and polyglot programmers are a rarity in this domain. A student pursuing MOOCs or in-person certification courses in deep learning is also taught the concepts using one of the widely used frameworks. The industry, however, is chaotic in its understanding of and expectations from the technology, and teams usually have to work with a variety of tools before becoming comfortable with a process. This causes delays and burns a lot of time and money. This was the first reason we realized the need for a syntax-invariant tool that lets developers transition easily across frameworks -- for example, switching from TensorFlow to PyTorch and being able to leverage all the benefits that PyTorch offers over TensorFlow.

  • There has been a recent rise in competitive coding challenges on platforms such as Kaggle, HackerEarth, and CodaLab. Developers get access to state-of-the-art labeled datasets via these competitions. Since the challenges usually come with a deadline, developers are always on the lookout for quick prototyping tools that give them an edge in the shortest period. Winners more often than not have a standard set of tools that they apply and evaluate before increasing the complexity of their solution, and setting up these standardized workflows takes years to master. Using concepts and practical advice from Kaggle grandmasters and other hackathon winners, we built a Jupyter-notebook-based low-code tool to give participants a better starting ground in these competitions.

  • For consumers, teams, and businesses looking to build a computer vision application, the absence of a single place to compare and find the best-suited solution increases the time and cost required to validate their products. For example, an application in need of a vehicle detection algorithm may wish to benchmark the original versions of Detectron and MediaPipe Objectron on its dataset; since these are written in two different libraries, the user has to know both TensorFlow and PyTorch. This initial challenge of understanding the multiple heuristics of application building inspired us to create a standard workflow. Using a low-code workflow across the widely used state-of-the-art algorithms in object detection, classification, and image segmentation, we started building one large, constantly growing pool of algorithms to implement, modify, and compare against.

  • The explosion in research paper submissions, especially in deep learning, has created a challenge for several conferences and journals, with the reproducibility of research being questioned. With no standard language or workflow, publications built using different frameworks and programming languages increase the review period and complexity. A toolkit for creating projects in a standard manner that supports easy reproducibility and portability again made sense as a solution to this problem.

  • Most researchers today reuse pieces of previous implementations and build on top of earlier research. The standard process involves setting up a legacy implementation, with major challenges in preparing the codebase to ingest custom datasets. In the absence of standard ways to implement solutions, readability and understanding the code become a challenge. For example, the original YOLOv3 implementation and the latest YOLOv3-SPP3 vary drastically, so understanding each takes a different amount of time. This inspired us to create standardized low-code wrappers around state-of-the-art implementations and networks that are also cross-compatible across frameworks.

  • With the lucrative opportunities available in this industry, traditional software developers are looking to switch their expertise from languages like Java and C++ to Python and Scala. Along with this shift, they also need to build core skills in mathematics, algorithms, machine learning, the syntax of frameworks, and so on. This inspired us to build a suite of no-code tools that help beginners transition to an advanced level of theoretical and practical understanding of this field. This later took shape as our GUI tool, used in industry today by small and medium enterprises for creating applications.

  • Across frameworks, certain algorithms are always missing. For example, while PyTorch has the AdamW optimizer, the original TensorFlow doesn't; the Mish activation is in TensorFlow but not in the original PyTorch, though it is available in the Echo AI package.
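When an activation such as Mish is missing from a framework, it can be filled in from its published definition. The sketch below is a plain-Python illustration of the formula, mish(x) = x * tanh(softplus(x)), not the Echo AI or any framework's implementation:

```python
import math

def softplus(x: float) -> float:
    """Softplus: ln(1 + e^x), written to avoid overflow for large x."""
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

print(mish(0.0))  # 0.0 by construction: the factor x is zero
```

In a real framework the same few lines are applied elementwise to tensors; the point is that a missing activation is a small gap a wrapper can paper over uniformly for every backend.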

Our vision is to standardize development across different frameworks while aggregating a whole host of features for data, deep learning, and deployment to provide a one-stop shop for computer vision.

What it does


We are building a one-stop shop for computer vision. Our platform Monk aims to standardize computer vision development and deployment pipelines. For the past 6 months, we have been creating a syntax-invariant unified wrapper for widely used deep learning frameworks like Keras, PyTorch, and MXNet. The open-source toolkit comprises three libraries.

  • Monk Image Classification - A low-code programming environment that reduces the cognitive load faced by entry-level programmers while catering to the needs of expert deep learning engineers. Our core focus is the intersection of computer vision and deep learning algorithms.

    • For Beginners
      • A low-code Python syntax for practicing and understanding deep learning. Switching between frameworks takes a single line of code, so learners can explore the features of each library without having to master its nuances, reducing the cognitive load of learning how to build applications
    • For Pro-Developers
      • Create, manage, compare, port, and reproduce prototypes and finished experiments, exploiting the advanced features of backend frameworks through standardized workflows
    • For Researchers
      • Access to popular research datasets and models for easy comparison and benchmarking; custom data curation and ingestion scripts to test research implementations; prototype and experiment management with easy-to-use programming and GUI access
    • For Competition and Hackathon participants
      • A Jupyter-notebook-based programming interface with off-the-shelf deep learning models and easy-to-use experiment comparison and analysis. It helps participants keep track of their entire journey through a competition and saves a lot of time by removing the setup and customization process
  • Monk Object Detection - Our take on assembling state-of-the-art object detection, image segmentation, and pose estimation algorithms in one place, making them low-code and easily configurable on any machine.

    • For Beginners
      • Easily set up and learn different state-of-the-art object detection algorithms in a single place
      • Access to parameter tuning in a low-code environment, with support for custom datasets
    • For Developers
      • A benchmarking tool for selecting the best-performing algorithm on metrics like runtime and accuracy, making it easy to package and port work across different platforms and devices
    • For Researchers
      • An easy starting point for creating and building prototypes, with support for logging and reproducing research work, along with source code, when publishing
    • For Competition and Hackathon participants
      • The hassle-free setup makes prototyping faster and easier. Our toolkit's default workflow recommendations act as a starting point for choosing the right approach while prototyping multiple experiments
  • Monk GUI - An interface over these low code tools for non-coders.

    • Helps non-coders transition into the field of deep learning and computer vision
    • Works as a quick prototyping playground while keeping a log of all your previous experiments
    • Provides access to state-of-the-art computer vision models and algorithms in an easy-to-use, point-and-click interface
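The single-line framework switch described above can be illustrated with a minimal sketch. All names here are hypothetical, not Monk's actual API; the point is the pattern: every backend exposes the same interface, so user code never changes except one line.

```python
# Hypothetical sketch of a syntax-invariant wrapper (not Monk's actual API):
# each backend implements the same interface, so the training code below is
# identical regardless of the underlying framework.

class PyTorchBackend:
    name = "pytorch"
    def train(self, dataset: str) -> str:
        return f"trained '{dataset}' using {self.name}"

class KerasBackend:
    name = "keras"
    def train(self, dataset: str) -> str:
        return f"trained '{dataset}' using {self.name}"

BACKENDS = {"pytorch": PyTorchBackend, "keras": KerasBackend}

def prototype(backend: str):
    """Return an experiment object bound to the chosen framework."""
    return BACKENDS[backend]()

exp = prototype("pytorch")  # the only line that changes between frameworks
print(exp.train("cats-vs-dogs"))
```

Because the interface is shared, comparing the same experiment across frameworks reduces to re-running the script with a different backend string.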

How We built it


Tessellate Imaging is a team of engineers solving computer vision bottlenecks using deep learning algorithms. We design, develop and deploy Computer Vision applications. We have worked on a broad spectrum of imaging modalities with clients across the globe.

As deep learning engineers, we faced the hassle of building and testing solutions across frameworks. A major chunk of our time was spent setting up custom workflows on state-of-the-art research, usually preceded by a steep learning curve.

For a developer entering this domain, creating a deep learning project is a huge undertaking even in the native frameworks. A low-code syntax around common patterns from deep learning workflows across the industry helps programmers and students from different domains analyze and process visual data.

  • Our development process started with creating standard workflows for simple transfer learning applications while we were participating in Kaggle-style competitions. We had three different transfer learning codebases: one in PyTorch, one in Keras, and another in mxnet-gluoncv. Each of these pipelines had common patterns, which we observed and explored. We later added a project management system along with a low-code, syntax-invariant interface to transition easily across backend frameworks. Creating this base took us 2 months: it involved building a common base for deep learning layers, activation functions, dataloaders, transforms, optimizers -- every element involved in transfer learning.

  • Once we had the project management system, it was easy to add features like comparing multiple experiments across metrics such as accuracy and loss. Adding hyperparameter finders for choosing the best system state was the next step.

  • We open-sourced a stable version of this toolkit and started applying it in competitions and demo applications. The community reached out to us asking for similar standardization across complex algorithms like object detection, image segmentation, and pose estimation.

  • After analyzing different object detection pipelines and understanding the common patterns in each tested workflow, we exploited those patterns to create the Monk Object Detection library with many state-of-the-art pipelines. At this stage, contributors from the team started showcasing how easy it was to build applications using the Monk libraries.

  • The last few weeks were spent creating a custom neural network builder that produces graphs portable across backend frameworks. The next part involved adding tutorials and proper documentation.

  • While we were working on these core libraries, non-developers approached us with the need for a GUI. We wrapped our already low-code, structured project into a point-and-click interface; a basic version of the GUI took about a month.
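Two of the ideas above -- the common base across frameworks and the portable network graph -- can be sketched together. The registry and builder below are hypothetical illustrations of the pattern, not Monk's actual internals; the dotted backend paths, however, are real classes in each library:

```python
import json

# 1) A canonical-name registry maps one unified vocabulary to
#    backend-specific identifiers (registry shape is hypothetical).
# 2) A network is plain data built from canonical names, so it serializes
#    and ports across backends without change.

LAYER_MAP = {
    "conv2d": {"pytorch": "torch.nn.Conv2d", "keras": "keras.layers.Conv2D",
               "gluon": "mxnet.gluon.nn.Conv2D"},
    "relu":   {"pytorch": "torch.nn.ReLU", "keras": "keras.layers.ReLU",
               "gluon": "mxnet.gluon.nn.Activation"},  # Activation('relu')
}

def resolve(canonical: str, backend: str) -> str:
    """Map a canonical layer name to the backend-specific identifier."""
    return LAYER_MAP[canonical][backend]

def build_graph() -> list:
    """A tiny network as plain data: framework-neutral by construction."""
    return [
        {"type": "conv2d", "filters": 32, "kernel": 3},
        {"type": "relu"},
    ]

blob = json.dumps(build_graph())  # portable, framework-neutral representation
layers = [resolve(spec["type"], "keras") for spec in json.loads(blob)]
print(layers)
```

Keeping the graph as plain data is what makes single-line backend switching possible: only the final resolution step differs per framework.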

Our toolkit Monk acts as a unified wrapper on top of PyTorch, Keras, and MXNet, with TensorFlow integration in the works. Our mission is to reduce the cognitive load faced by an amateur programmer entering this field while also offering mature features that cater to the demands of cutting-edge researchers.

Creating, training, and comparing state-of-the-art computer vision implementations is now just a few clicks away. We have created a simple point-and-click interface that abstracts away the programming and setup hassles, making application building satisfying.

We have already open-sourced our library and are now releasing features every week. MonkAI is open-sourced with the purpose of democratizing and standardizing computer vision and deep learning development. We are in talks with multiple AI education providers to include MonkAI in their curricula and use it as a base tool for teaching deep learning and computer vision courses and evaluating projects.

We are cumulatively at 160K lines of code and are connected to some of the most vibrant Deep Learning and Computer Vision communities across different social media channels.

Challenges we ran into


Building an SDK that resolves issues for a broad set of people and is inclusive of essential features for multiple tasks is a long and arduous journey.

  • The very first challenge was to completely understand the syntax and workings of every base framework: PyTorch, Keras, TensorFlow, MXNet, GluonCV, and Chainer.
  • Finding intersections and repeating patterns in development and creating a unified programming syntax also required us to uncover challenges across imaging modalities ranging from satellite imagery down to medical imagery.
  • The library has been designed carefully with modularity and portability in mind, ensuring that adding new features and backend libraries keeps getting faster.

We understood the hassles of unclear requirements and vague expectations from business leaders and stakeholders, along with the pains of a developer trying to manage an ever-increasing number of experiments and prototypes.

We stayed open to the fact that this tool is not just for beginners and students but also needs to cater to expert developers and research scholars. We had to integrate functionality for everyone, the biggest challenge being to make the tool comprehensive yet simple to use.

Some key examples of our experiences turning into solutions can be seen in features such as the following.

With our object detection library, the major challenges involve understanding the native code, resolving bugs in it to support different versions of the OS, Python, and other libraries, and finally creating a low-code wrapper on top of it.

These are challenges we still face and are looking for ways to overcome:

  • Versioning in native libraries and loss of backward compatibility, as in the TF 2.0 transition
  • Changes in project structure leading to import errors and inconsistencies
  • Multi-OS incompatibilities
  • Python and CUDA version incompatibilities
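These incompatibilities force a wrapper to check its environment defensively before touching any backend. A minimal sketch of the idea, with hypothetical helper names rather than Monk's actual setup code:

```python
import sys

# Hypothetical environment checks: verify interpreter and dependency
# versions up front and fail with an actionable message, instead of letting
# a version mismatch surface later as a cryptic import or CUDA error.

MIN_PYTHON = (3, 6)  # illustrative floor, not Monk's stated requirement

def check_python() -> None:
    """Fail fast if the interpreter is too old."""
    if sys.version_info < MIN_PYTHON:
        raise RuntimeError(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required")

def check_version(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, so '2.10' >= '2.2'."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(minimum)

check_python()
print(check_version("2.10.0", "2.2.0"))  # numeric compare, not string compare
```

The numeric comparison matters: naive string comparison would rank "2.10" below "2.2", exactly the kind of subtle bug version churn produces.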

Lastly, there is the breakneck pace of development in this industry. With hundreds of new research papers being submitted to conferences and journals, it becomes extremely difficult to single out a good solution for a critical problem.

Accomplishments that we are proud of


Along the way, we have garnered immense appreciation and feedback from the community. Getting connected with some of the greatest leaders in artificial intelligence and discussing the challenges of computer vision has been an honor.

By hosting webinars and tutorials and creating copious amounts of learning content and walkthroughs, we got the chance to reach developers and students facing the bottlenecks of this domain, professionally or academically. Some key accomplishments we are proud of:

The constantly increasing activity across all MonkAI offerings has kept us going:

  • questions and doubts being raised,
  • increasing demand for a variety of content and tutorials, and
  • the number of issues periodically raised on our GitHub repos

We have become one of the primary go-to tools for participants in Zindi Africa and HackerEarth competitions, who reach the top percentile with minimal effort using our tools.

Our growing presence in popular deep learning and computer vision communities on Facebook, LinkedIn, and Instagram has helped us reach a plethora of enthusiasts who review and discuss our published content.

Our blogs have been accepted by renowned Medium publications like Towards Data Science and The Startup, giving our reach a further boost.

Our contributors have even received job offers based on what they learned working through MonkAI.

What we learned


The absence of standard protocols and workflows impedes the accelerated growth of this domain. We have come to appreciate the benefits of open-source contributions to programming, specifically in computer vision and deep learning.

Standardization is the need of the hour, with low-code and no-code tools reducing entry barriers and improving accessibility. Such comprehensive tools can only be developed with support and guidance from the community.

Some major roadblocks along the way are patenting and data-usage restrictions. Information overload driven by heavy marketing creates expectation gaps between what is available and what can be developed, and the surplus of training and teaching material creates chaos for beginners trying to enter this domain.

Students are piling up certifications with little or no exposure to the challenges of working on actual business problems. This leads to a shortage of talent and expensive hires for businesses.

For businesses with little to no deep learning expertise, validating an idea can take 6-12 months and millions of dollars in expenses for building and managing teams.

What is next for MonkAI


Deploying scalable computer vision applications is a hard task today. Some of the most brilliant minds are working on efficient deep learning solutions that do not require intensive computing power.

In our attempt to advance research in this domain, we are actively working toward our future roadmap, with broad features such as:

  • One-click deployment to cloud/embedded environments
  • Segmentation, recognition, tracking, re-identification, and video processing algorithms
  • Data loading utilities for different imaging modalities
  • Access to state-of-the-art research implementations wrapped in easy-to-use workflows

Some algorithm-specific features and capabilities we are looking to develop along the way:

Monk Image Classification
  • Adding support for multi-label image processing
  • Adding support for cloud deployment and edge/GPU acceleration libraries
  • Adding novel algorithms not present in native backends
  • Adding support for new backends like TF 2.0 and Chainer
Monk object detection
  • Adding new object detection pipelines
  • Adding support for image segmentation, text detection, pose estimation, etc
  • Adding direct video processing feature
Monk GUI
  • Reworking the GUI to cover every feature available in the Monk libraries
  • Adding cross-platform usage support
