Inspiration
We started building on Arweave at the start of the Russia-Ukraine crisis. Our bots stored lots of data on Arweave, but we had very poor ways to track what we were uploading. We started aggregating our manifest files into large indexes we could manually search through whenever we wanted to find our previous artifacts. We always had the idea of building better discovery mechanisms onchain, and we decided to start by indexing ANS-110 artifacts and using NLP and ElasticSearch to help search through them.
What it does
It's a search engine for ANS-110 content.
How we built it
We used the Goldsky GraphQL endpoint to query for ANS-110 content. For content with more easily parsed data, such as JSON and HTML files, we extracted the text from the file itself, removing elements like HTML tags. From this data we built an index (still growing, because the GraphQL scraper is still live) for use with ElasticSearch. On the frontend we take the user's query and pass it through a query parser that extracts additional insights about the query and expands the search terms to allow for more potential matches. There are more details in the GitHub repo, but we spent a lot of time tweaking the query parser and implementing NLP techniques to improve the quality of results. The final results are then retrieved and returned by ElasticSearch.
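A rough sketch of the query-expansion step and the kind of ElasticSearch query it produces. The synonym map, field name, and weights here are illustrative assumptions, not our production parser:

```python
# Illustrative sketch: expand user query terms, then build an
# ElasticSearch bool/should query over the expanded term list.
# SYNONYMS and the "content" field name are placeholder assumptions.

SYNONYMS = {
    "article": ["post", "blog"],
    "image": ["picture", "photo"],
}

def expand_terms(query: str) -> list[str]:
    """Split the query and append synonyms so more documents can match."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

def build_es_query(query: str) -> dict:
    """Build a bool query where any expanded term can contribute a match."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"content": term}} for term in expand_terms(query)
                ],
                "minimum_should_match": 1,
            }
        }
    }
```

The resulting dict can be passed to any ElasticSearch client's search call; the real parser does considerably more analysis than a static synonym table.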
Challenges we ran into
We identified many challenges during the course of the hackathon that we did not have time to address, but plan to tackle after the event is over. The biggest ones are listed below:
Uncleaned Dataset: There is a lot of redundant data on Arweave: every time someone posts an updated article, all the previous versions still remain onchain. While some protocols like Permapages give us ways to correlate version releases, many others do not. We did not have time to clean this up. We need to develop a generalized approach to identifying redundant data entries and combining them into one result in the search engine. If we can do that, we can show the history of an entity evolving over time and remove duplication from our search results. We experimented with an approach that used the all-mpnet-base-v2 sentence transformer to embed the data of an article and check it using cosine similarity against suspected matches, but we weren't able to operationalize it in time.
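The similarity check we prototyped boils down to the following. In the experiment the vectors came from the all-mpnet-base-v2 sentence transformer; the toy vectors below and the 0.9 threshold are placeholder assumptions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_duplicate(emb_a: list[float], emb_b: list[float],
                 threshold: float = 0.9) -> bool:
    """Flag two articles as versions of the same entity when their
    embeddings are nearly parallel. The threshold is an assumed value
    that would need tuning against real duplicates."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```

With real embeddings, each new artifact would only be compared against suspected matches (e.g. same author or title) rather than the whole index.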
Renderers: We did not have time to build renderers for every data type that came up in our index. Instead, we just redirected our results to the Arweave address to use the default Arweave renderers. We need to spend some time creating renderers for different file types so results are more palatable for a regular user; clicking a link and seeing a raw JSON file on a black screen would turn most people away.
Identifying quality: Quality measures were very difficult to identify. You'll notice that there are no quality signals in our engine at the moment; we're forced to rely on a TF-IDF term-matching method to find matches. Looking at the output, some entries ranked lower in our search results are actually much more valuable. We decided to implement analytics for click-through rate plus an upvote/downvote system in the interim to start understanding some quality metrics, but their impact on the search algorithm is minor. We need better ways to parse the artifacts themselves: one example is to understand if an HTML file is a well-formatted page and to identify if it has sufficient text and length to be useful to the reader.
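One way to picture the interim signals is as a small multiplicative nudge on top of the term-matching score. The blending function and weight below are a sketch of the idea, not our deployed formula:

```python
def blended_score(tfidf_score: float, ctr: float,
                  upvotes: int, downvotes: int,
                  weight: float = 0.1) -> float:
    """Nudge a relevance score with click-through rate and vote balance.

    `weight` is deliberately small so term matching still dominates,
    mirroring the "minor impact" of these signals described above.
    All parameter values here are illustrative assumptions.
    """
    votes = upvotes + downvotes
    # Neutral prior of 0.5 when an entry has no votes yet.
    vote_ratio = upvotes / votes if votes else 0.5
    quality = 0.5 * ctr + 0.5 * vote_ratio
    return tfidf_score * (1.0 + weight * quality)
```

An entry with a strong click-through rate and mostly upvotes gets a slightly higher final score than an otherwise identical entry with no engagement.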
Images: With our current approach to returning search results, images cannot be utilized. There isn't enough text to return a result, so images were excluded from the search engine. Thanks to BERT, Pix2Struct, and other AI tools that can handle image-to-text translation, we are planning to use a model to enrich the information of on-chain images with much more informative descriptions that can be used to return images as results.
DevOps/infrastructure: Kubernetes gave us a ton of problems. ElasticSearch has a lot of documentation and many settings that need to be configured to get it up and running, and setting it up to work with our Kubernetes infrastructure took a lot of debugging.
Accomplishments that we're proud of
We got a search engine working in two weeks, which we didn't think was possible at the start. There are a lot of limitations but we have many plans to improve our product in the coming months!
What we learned
We learned a ton about modern NLP tech and search infrastructure. It was a very fun experience, and hopefully we can build out the remaining components and massively improve the quality of discovery on the permaweb.
What's next for Capsl8 Search Engine
Many of our problems have potential solutions that we have prototyped but have not gotten the chance to implement into a live product. We will likely spend the next few months optimizing and improving the quality of data, results, infrastructure, and the kind of information we can return.
We also have the foundations of SEO and Google AdWords in place, so we can likely implement an advertising model once our engine is capable of producing higher-quality results.
We want the engine to be open to the public and non-developers, so there is a lot of work we plan to do in user experience including utilizing Arweavekit + Othent to improve the login + user flow for onboarding.
Decentralization is also a huge goal moving forward. Some components of our infrastructure would be very difficult to fully decentralize, but moving the databases onto Arweave is straightforward and should be a near-term goal; after that we will turn our attention to the remaining components.
Built With
- angular.js
- digitalocean
- docker
- elasticsearch
- github-devops
- kubernetes
- mongodb
- nestjs
- node.js