I personally use Jira and Confluence at work on a daily basis, and I’m aware of some of their limitations when it comes to search. It also happens that I’m obsessed with Natural Language Understanding; specifically, I have been playing around with Question Answering (QA) for more than two years. Initially I implemented this add-on with a simple CQL query and QA. Then I asked myself whether I would install it. The answer was an honest no. It would be useful, but not enough to justify a place in my add-on list. At some point, I recalled Elon Musk saying this in an interview:
"If you're entering anything where there's an existing marketplace, against large, entrenched competitors, then your product or service needs to be much better than theirs ... It can't just be slightly better. It's got to be a lot better."
In the other corner of my brain, Eric Schmidt goes:
“You often hear people talk about search as a solved problem. But we are nowhere near close.”
My only intention was to incorporate QA into Confluence, but since I’m already here I thought I might as well take a stab at search itself. Search is a hard problem. It is an incredibly hard problem at World Wide Web scale, but less so in a safe, contained, well-structured and well-intentioned environment like Confluence. And that’s what all this is about.
What it does
It indexes information from Confluence and Jira and makes it easily accessible through various forms of search:
- Question Answering - Extracts an answer from raw web pages for any given question
- People Also Ask - Finds questions similar to the user’s
- Federated Search across multiple Confluence and Jira instances
- Access Control System for Atlassian’s Permissions and Restrictions
- Custom stop words and synonyms
- Spell checking and typo tolerance
- Real-time Search
- Optical character recognition
- Image labelling
- Reverse image search
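As a rough illustration of how typo tolerance and permission filtering can be combined in a single Elasticsearch request, the sketch below builds a request body as a plain Python dict. The field names (`title`, `body`, `allowed_groups`) and the helper `build_search_body` are hypothetical and not the project's actual schema; `multi_match`, `fuzziness` and a `terms` filter inside a `bool` query are standard Elasticsearch query DSL.

```python
from typing import Any, Dict, List


def build_search_body(text: str, user_groups: List[str], size: int = 10) -> Dict[str, Any]:
    """Build an Elasticsearch request body (field names are hypothetical)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text match that tolerates small typos via edit distance.
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": ["title^2", "body"],
                            "fuzziness": "AUTO",
                        }
                    }
                ],
                # Non-scoring filter: keep only documents that at least one
                # of the user's groups is allowed to view.
                "filter": [{"terms": {"allowed_groups": user_groups}}],
            }
        },
    }


if __name__ == "__main__":
    import json

    # "relese" is a deliberate typo to show where fuzziness would apply.
    print(json.dumps(build_search_body("relese notes", ["engineering"]), indent=2))
```

Putting the permission check in `filter` rather than `must` keeps it out of relevance scoring, so access rules restrict the result set without changing how the remaining hits are ranked.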
How we built it
We spent more than a year intensively researching the Question Answering system, which clearly came in handy. Years of prior experience studying and researching state-of-the-art machine learning models also helped us quickly deploy models for the People Also Ask and Reverse Image Search features.
We use Elasticsearch as our primary search engine, with MongoDB and Node.js with Atlassian Connect to glue the various microservices together. We use TensorFlow extensively to train and deploy models. It goes without saying that we primarily use Python for all our ML workloads. We use a messy combination of gRPC and REST for inter-service communication, Redis for caching and jQuery for the frontend. I chose jQuery instead of something like React because learning React would have slowed me down even further; I already had a lot of things to learn in great, painful detail.
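One common way a cache sits in front of an expensive model call is the cache-aside pattern, sketched below. This is only an assumption about how a cache could be used here: what the project actually stores in Redis is not stated. `InMemoryCache` is a stand-in with `get`/`set` methods so the example runs without a Redis server, and `cached_answer` and the `qa:` key prefix are hypothetical names.

```python
from typing import Callable, Dict, Optional


class InMemoryCache:
    """Dict-backed stand-in for a cache client exposing get/set."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


def cached_answer(question: str, cache: InMemoryCache,
                  answer_fn: Callable[[str], str]) -> str:
    """Return a cached answer if present, otherwise compute and store it."""
    # Normalize so trivially different spellings share one cache entry.
    key = "qa:" + question.strip().lower()
    hit = cache.get(key)
    if hit is not None:
        return hit
    # Cache miss: call the (expensive) model, then remember the result.
    answer = answer_fn(question)
    cache.set(key, answer)
    return answer
```

With this shape, repeated questions never reach the model service, which matters most when inference runs on GPU instances.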
We run all our workloads inside a single Kubernetes cluster on Google Cloud. Kubernetes allows us to dynamically scale ridiculously expensive GPU instances down to zero when they are not in use. On top of that, we use preemptible instances to reduce our operating costs even further. We mostly use TPUs for training and GPUs for inference.
Challenges we ran into
- Familiarizing ourselves with the Atlassian ecosystem and developer toolkits
- Implementing the access control system
- Coordinating communication between a fairly large number of microservices
Accomplishments that we're proud of
We worked on training a machine learning model that automatically builds a Knowledge Graph from raw text. It extracts relationships between the entities in a paragraph. To our surprise, it performed relatively well! For example, given the Wikipedia page for Google as input, the model can generate subject, predicate, object triples like these:
Google, subsidiaryOf, Alphabet
Google, foundedOn, September 4, 1998
Google, foundedBy, Larry Page
Google, foundedBy, Sergey Brin
We never got it interfaced with the rest of our system in time to feature in our demos, but I’m super excited about it!
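To show how triples like the Google example could be held in memory and queried, here is a minimal sketch that groups them by subject and predicate. It is not the model or the storage layer we built; `build_graph` and `lookup` are hypothetical helper names, and the data is just the example output listed above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

TRIPLES: List[Triple] = [
    ("Google", "subsidiaryOf", "Alphabet"),
    ("Google", "foundedOn", "September 4, 1998"),
    ("Google", "foundedBy", "Larry Page"),
    ("Google", "foundedBy", "Sergey Brin"),
]


def build_graph(triples: List[Triple]) -> Dict[str, Dict[str, List[str]]]:
    """Group triples as subject -> predicate -> list of objects."""
    graph: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        graph[subject][predicate].append(obj)
    # Convert to plain dicts so missing keys do not get created on lookup.
    return {s: dict(preds) for s, preds in graph.items()}


def lookup(graph: Dict[str, Dict[str, List[str]]],
           subject: str, predicate: str) -> List[str]:
    """Return all objects for (subject, predicate), or an empty list."""
    return graph.get(subject, {}).get(predicate, [])


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    print(lookup(graph, "Google", "foundedBy"))
```

A structure like this would let a question such as "Who founded Google?" be answered by a direct lookup instead of a passage search, which is roughly where the Knowledge Graph would plug into the rest of the system.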
What we learned
The whole is greater than the sum of its parts. Each of these features can seem incremental on its own, but put together they truly are impressive. And hopefully useful.
What's next for Semantica
- Finish the Knowledge Graph generating model
- Improve model performance. In particular, we could do a lot better at Spelling Correction
- Infrastructure cost optimization. GPUs account for a large share of our operating expenses even in our current setup (preemptible & scale to zero)
- Analytics and data collection for insights
- Public Alpha - we’d really love to hear from others how we can improve!
And if there's any actual commercial interest:
- Public Beta on Atlassian Marketplace
- Consider offering PageBrain as an on-prem solution