Application link: http://ec2-18-221-166-245.us-east-2.compute.amazonaws.com:5000/search
We were inspired by the search engine challenge because we had recently taken a course on information retrieval and wanted to try our hand at a real-world dataset. The Walmart challenge gave us one of the best opportunities to do that.
Task I -: This was one of the most laborious and frustrating tasks, as we ran into many Forbidden errors and CAPTCHAs. It was still a great learning step for us: we came across a number of workarounds and learned what problems to expect when scraping data and processing it.
Task II -: We built a search engine to look up products listed on Walmart.com, developed as part of the TAMU Datathon 2020. We started by trying different retrieval models such as Boolean retrieval, and different ranking methods such as the vector space model, BM25, and TF-IDF scores. We ended up using Boolean retrieval to match relevant products and BM25 to rank them. We chose Boolean retrieval because we are working with e-commerce data: when searching for products, users usually want exact matches for the terms in their query, and Boolean retrieval does exactly that, and quickly.
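The match-then-rank pipeline described above can be sketched in a few lines of plain Python. This is only an illustration, not the code we ran: the toy catalog, the whitespace tokenizer, the AND-only query semantics, and the BM25 parameters `K1` and `B` (common defaults) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy catalog standing in for the scraped product data.
PRODUCTS = {
    1: "apple iphone 11 case clear",
    2: "samsung galaxy phone case",
    3: "apple macbook air laptop",
    4: "clear phone case for apple iphone",
}

K1, B = 1.5, 0.75  # common BM25 defaults, assumed for this sketch


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_index(products):
    """Build an inverted index plus the per-document statistics BM25 needs."""
    postings = defaultdict(set)
    term_freqs = {}
    for doc_id, text in products.items():
        tokens = tokenize(text)
        term_freqs[doc_id] = Counter(tokens)
        for tok in tokens:
            postings[tok].add(doc_id)
    avg_len = sum(sum(tf.values()) for tf in term_freqs.values()) / len(term_freqs)
    return postings, term_freqs, avg_len


def boolean_and(query, postings):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(postings.get(terms[0], set()))
    for term in terms[1:]:
        result &= postings.get(term, set())
    return result


def bm25(query, doc_id, postings, term_freqs, avg_len):
    """Score one document against the query with the BM25 formula."""
    n_docs = len(term_freqs)
    tf = term_freqs[doc_id]
    doc_len = sum(tf.values())
    score = 0.0
    for term in tokenize(query):
        df = len(postings.get(term, ()))
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        f = tf[term]
        score += idf * f * (K1 + 1) / (f + K1 * (1 - B + B * doc_len / avg_len))
    return score


def search(query, products=PRODUCTS):
    """Boolean AND decides what matches; BM25 decides the order."""
    postings, term_freqs, avg_len = build_index(products)
    matches = boolean_and(query, postings)
    return sorted(
        matches,
        key=lambda d: bm25(query, d, postings, term_freqs, avg_len),
        reverse=True,
    )


print(search("apple iphone case"))
```

With this toy catalog, only products 1 and 4 contain all three query terms, and the shorter title (product 1) ranks first because BM25 normalizes by document length.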
Task III -: Initially we tried mean shift clustering, but it did not give good results. We tried OPTICS as well. In the end, we felt hierarchical clustering made the most sense given the structure of the data. We implemented agglomerative hierarchical clustering using Manhattan distance as the metric, with a distance threshold of 20. Clustering was done on 100-dimensional Word2Vec embeddings.
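To make the clustering step concrete, here is a minimal pure-Python sketch of agglomerative clustering with Manhattan distance and a distance threshold. It is an illustration under stated assumptions, not our actual code: average linkage is assumed (the linkage is not specified above), and the toy 3-dimensional vectors stand in for the 100-dimensional Word2Vec embeddings.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_linkage(c1, c2, vectors):
    """Mean pairwise Manhattan distance between two clusters (assumed linkage)."""
    total = sum(manhattan(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(vectors, distance_threshold):
    """Merge the closest pair of clusters until no pair is below the threshold.

    Returns a list of clusters, each a list of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= distance_threshold:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical toy embeddings: two obvious groups.
toy_vectors = [
    (0, 0, 0),
    (1, 2, 1),
    (2, 1, 0),
    (30, 30, 30),
    (31, 29, 30),
]

print(agglomerative(toy_vectors, distance_threshold=20))
```

Using a distance threshold instead of a fixed number of clusters lets the data decide how many groups there are, which is convenient when the number of product categories is not known in advance. This naive version recomputes all pairwise linkages on every pass, so it is only suitable for small inputs.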
Task IV -: Finally, we developed a web page using Flask and kept the UI simple due to lack of time. We fall back on the clustering results when our Boolean retrieval method returns a low confidence score. After much analysis, we arrived at a threshold of 0.35 as giving the best results.
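The fallback logic can be sketched as below. This is a hypothetical illustration rather than our exact implementation: the function name, the shape of the inputs, the way confidence is attached to each result, and the choice to pad results with products from the top hit's cluster are all assumptions. Only the 0.35 threshold comes from the description above.

```python
CONFIDENCE_THRESHOLD = 0.35  # threshold reported above


def results_with_fallback(ranked, clusters, threshold=CONFIDENCE_THRESHOLD, limit=10):
    """Return product ids, padding with cluster neighbours on low confidence.

    ranked:   list of (product_id, confidence) pairs, best first (assumed shape)
    clusters: dict mapping product_id -> cluster label (assumed shape)
    """
    if not ranked:
        return []
    ids = [pid for pid, _ in ranked]
    top_id, top_conf = ranked[0]
    if top_conf >= threshold:
        return ids[:limit]  # retrieval is confident enough on its own

    label = clusters.get(top_id)
    if label is None:
        return ids[:limit]  # top hit was never clustered, nothing to add

    # Low confidence: append other products from the top hit's cluster.
    seen = set(ids)
    neighbours = [
        pid for pid, lab in clusters.items() if lab == label and pid not in seen
    ]
    return (ids + neighbours)[:limit]


# Hypothetical example data.
example_clusters = {1: "phone-cases", 2: "laptops", 3: "phone-cases", 4: "tablets"}
print(results_with_fallback([(1, 0.2), (2, 0.1)], example_clusters))
print(results_with_fallback([(1, 0.9), (2, 0.1)], example_clusters))
```

In the first call the top confidence is below 0.35, so product 3 from the same cluster is appended; in the second call the retrieval results are returned unchanged.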