Walmart-Product-Search-Engine

Application link: http://ec2-18-221-166-245.us-east-2.compute.amazonaws.com:5000/search

We had recently taken a course on information retrieval and wanted to try our hand at a real-world dataset. The Walmart search engine challenge gave us one of the best opportunities to do that.

Task I -: Collecting the data was one of the most laborious and frustrating tasks, as we ran into many Forbidden (HTTP 403) responses and CAPTCHAs. It was also a great learning step: we came across a lot of workarounds and learned what to expect when scraping data and processing it.

Task II -: A search engine to look up products listed on Walmart.com, developed as part of the TAMU Datathon 2020. We experimented with matching models such as Boolean retrieval and with ranking methods such as the vector space model, TF-IDF scores, and BM25. We ended up using Boolean retrieval to match relevant products and BM25 to rank them. We chose Boolean retrieval because we are working with e-commerce data: when searching for products, we usually want exact matches for the terms in the query, and Boolean retrieval does exactly that, quickly.
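The match-then-rank pipeline described above can be sketched as follows. This is a minimal illustration, not our submitted code: the product titles, the whitespace tokenizer, the function names, and the BM25 parameters `k1=1.5` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical product titles standing in for the scraped catalog.
PRODUCTS = [
    "apple iphone 11 64gb black",
    "apple iphone case clear",
    "samsung galaxy s20 case",
    "black leather phone case",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in tokenize(doc):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Boolean AND retrieval: ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


def bm25_rank(docs, index, query, candidates, k1=1.5, b=0.75):
    """Score only the Boolean matches with BM25 and sort best first."""
    n_docs = len(docs)
    lengths = [len(tokenize(d)) for d in docs]
    avg_len = sum(lengths) / n_docs
    scored = []
    for doc_id in candidates:
        tf = Counter(tokenize(docs[doc_id]))
        score = 0.0
        for term in tokenize(query):
            df = len(index.get(term, ()))
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((docs[doc_id], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def search(docs, query):
    """Boolean AND for matching, BM25 for ordering the matches."""
    index = build_index(docs)
    return bm25_rank(docs, index, query, boolean_and(index, query))
```

Calling `search(PRODUCTS, "iphone case")` returns only the title that contains both terms. A query with a term that appears in no title returns an empty list, which is the situation where a fallback becomes useful.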

Task III -: Initially we tried mean shift clustering, but it did not give good results. We tried OPTICS as well. In the end, we felt hierarchical clustering would make sense because of the structure of the data. We implemented agglomerative hierarchical clustering with Manhattan distance as the metric and a distance threshold of 20. Clustering was done on 100-dimensional Word2Vec embeddings.
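To show what a distance threshold means in agglomerative clustering, here is a small pure-Python sketch. It assumes average linkage, since the linkage is not stated above, and it uses toy 2-dimensional vectors in place of the 100-dimensional Word2Vec embeddings. The function names are hypothetical; in practice a library implementation would be the more likely choice.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def average_linkage(c1, c2, vectors):
    """Mean pairwise Manhattan distance between two clusters of indices."""
    total = sum(manhattan(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(vectors, distance_threshold):
    """Merge the closest pair of clusters until none is closer than the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Stop once the closest pair is at or above the threshold.
        if best[0] >= distance_threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy vectors (assumption): two tight groups and one outlier.
VECTORS = [[0, 0], [1, 2], [2, 1], [30, 30], [31, 32], [100, 0]]
CLUSTERS = agglomerative(VECTORS, distance_threshold=20)
```

With a threshold of 20, the number of clusters is not fixed in advance: merging simply stops when the remaining clusters are too far apart, so the outlier stays in a cluster of its own.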

Task IV -: Finally, we developed a web page using Flask; we kept the UI simple due to lack of time. We fall back to the clustering method when our Boolean retrieval method shows a low confidence score. After much analysis, we came across a threshold of 0.35 for the best result.
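The fallback logic can be sketched roughly as below. How the confidence score is computed is not described here, so the sketch takes it as an input. The centroid vectors, cluster names, and function names are hypothetical, and treating scores strictly below 0.35 as "low" is an assumption.

```python
import math

CONFIDENCE_THRESHOLD = 0.35  # threshold reported above


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_cluster(query_vec, centroids):
    """Id of the cluster whose centroid is most similar to the query vector."""
    return max(centroids, key=lambda cid: cosine(query_vec, centroids[cid]))


def answer(confidence, ranked_products, query_vec, centroids, members):
    """Return ranked matches, or the nearest cluster's products on low confidence."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ranked_products
    return members[nearest_cluster(query_vec, centroids)]


# Hypothetical 2-dimensional centroids and cluster contents for illustration.
CENTROIDS = {"phones": [1.0, 0.0], "shoes": [0.0, 1.0]}
MEMBERS = {"phones": ["iphone 11", "galaxy s20"], "shoes": ["running shoe"]}
```

For example, `answer(0.1, [], [0.1, 0.9], CENTROIDS, MEMBERS)` ignores the empty match list and returns the products of the `shoes` cluster, because the query vector points mostly along that centroid.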

Built With

jupyter-notebook

Submitted to

TAMU Datathon 2020
- Winner 2nd Place Walmart Challenge

Created by

I wrote the crawler program and did data preprocessing. I also made the UI for search.

Mohinish Chatterjee
I worked on the search module of this project, which included trying out the vector space model and Boolean retrieval and exploring knowledge graphs.

Purav Zumkhawala
I worked on clustering the data. I used agglomerative hierarchical clustering on 100-dimensional Word2Vec embeddings, then used cosine similarity on the query string to assign it to the respective cluster.

Vignesh Babu Manjunath Gandudi
Snehasish Sen

Updates

Mohinish Chatterjee started this project — Oct 18, 2020 10:24 AM EDT
