Hi! We are undergrads from A&M and Berkeley, and we are both majoring in CS. Trevor will be graduating this coming spring, and Danny will be graduating in the spring of 2022.
We built this product to compete in Walmart's challenge of building a search engine.
What it does
Basic search engine capabilities: you type in a search query, and Waloogle returns a list of the most relevant items that we scraped from Walmart's catalogue.
How we built it
We first had to scrape data from Walmart's website, which turned out to be the most time-consuming part of the whole Datathon. Once we had the data, we fed it into Gensim, a Python library that trains a Latent Semantic Indexing model on our data. We can then run queries against the model to return the entries most relevant to the search. We deployed this model on a Flask API server hosted on Heroku. On the front end, we built a basic search page using React and Next.js. This page makes calls against our search API and renders the list of most relevant items in a clean, minimalistic format. We also include a relevancy metric showing how relevant a particular entry is to the search query. Finally, we hosted our front end on Vercel, where it is available to the world!
Challenges we ran into
The biggest challenge we ran into was the process of scraping data from Walmart's website. Initially we tried Puppeteer.js, which drives an instance of Chromium, but there were a couple of problems. First, it took about 2 seconds to fetch information about a single entry, so fetching 20,000 entries would take far too long. Second, Walmart's website started throwing reCAPTCHAs at us from time to time, which naturally broke the flow of our simple web scraper. Eventually this got so bad that we started looking at different methods. After a ton of iteration and trying out different web-scraping libraries, we looked at Walmart's robots.txt file and found sitemaps containing hundreds of thousands of products. Using these sitemap files and simple fetch requests, we gathered information on 27,000 different products. With the data in hand, we were finally able to run it through our ML models and build the search engine functionality.
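The sitemap route needs nothing beyond the standard library. A sketch of the idea, where the robots.txt and sitemap snippets are simplified stand-ins (Walmart's actual file names and contents differ):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a fetched robots.txt; the real one lists Walmart's sitemaps.
robots_txt = """\
User-agent: *
Disallow: /account/
Sitemap: https://www.walmart.com/sitemap_a.xml
Sitemap: https://www.walmart.com/sitemap_b.xml
"""

# robots.txt advertises sitemaps as "Sitemap: <url>" lines.
sitemap_urls = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("sitemap:")
]

# Simplified stand-in for one fetched sitemap file.
sitemap_xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.walmart.com/ip/111</loc></url>
  <url><loc>https://www.walmart.com/ip/222</loc></url>
</urlset>
"""

# Each <loc> element holds one product-page URL to fetch directly,
# sidestepping Chromium, reCAPTCHAs, and link traversal entirely.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
product_urls = [
    loc.text for loc in ET.fromstring(sitemap_xml).findall(".//sm:loc", ns)
]
```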
The second main challenge pertains to hosting the search API server. We were up against the clock to submit our project while attempting to host the API on Heroku. At the last second, in a flurry of issues with git, Python virtual environments, and the ML breaking, we were unable to actually host the search API. We were, however, able to get the front end hosted on Vercel, so we simply hard-coded the search results for "iphone case" into the front end. Thus, if you visit our site, it returns the same data for every request. Clearly this is suboptimal, but at the very least it demonstrates what the site would look like with both the frontend and backend working.
Accomplishments that we're proud of
We're happy that we figured out how to get through Walmart.com's layers of defense against bots and collect the data we needed. We're also happy that we got the different parts of the project working, even though they weren't working together smoothly at the end.
What we learned
We learned about Latent Semantic Analysis, how to scrape the web, and how to put a simple search engine together from the bottom up.
Walmart Challenge Tasks
Task 1: Scraping Walmart.com. We had two approaches to this problem:
HTTPS requests: Our second approach (after the Puppeteer.js attempt described under Challenges above) was using simple HTTPS requests and parsing the HTML responses. This method ended up being extremely fast, as we could issue many asynchronous requests to walmart.com.
But this approach also had a few issues. First, we could not get the similar products, since those are dynamically loaded into the page and therefore not in the HTML response. Another problem was that our ISPs were throttling our download speeds because we were sending so much outbound traffic.
To solve the first problem, we used the ip sitemap from robots.txt. This file tells a web crawler the proper etiquette for crawling the site. We followed all of its rules for which pages to crawl: only visiting https://walmart.com/ip//
This sitemap gave us all the links we needed to gather information on 20,000+ products, instead of the typical approach of traversing the links on each page.
The second problem was solved with cloud computing. Both of us launched AWS EC2 instances to take advantage of the cloud's networking capabilities, since bandwidth was our biggest bottleneck. After getting everything set up, it took less than an hour to collect 27,000 items. Our huge JSON file of all the data can be seen here
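The fetch loop itself was just many HTTPS GETs in flight at once. A sketch of the pattern using only the standard library; the worker count and timeout here are illustrative guesses, not our exact settings:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url, timeout=10):
    # Plain HTTPS GET; product fields are then parsed out of the returned HTML.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def fetch_all(urls, fetch_fn=fetch, workers=32):
    # A thread pool keeps many requests in flight at once, which is what made
    # network bandwidth the limiting factor rather than per-request latency.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_fn, urls))
```

Running this kind of loop from EC2, where bandwidth is plentiful, is what brought the 27,000-item collection in under an hour.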
Task 2/3: For the machine learning behind the search engine, I decided to focus on Latent Semantic Analysis/Indexing. There was a lot to learn here! This is an approach that uses context clues to define words. I used a Python package called Gensim to handle the complex ML for us.
After getting our model trained and ready, I tried to create a Flask server to host the search-engine backend that our user interface would call. This was ultimately unsuccessful, as I could not figure out how to deploy the small app to Heroku before running out of time. We instead simulated the backend by hard-coding the results for an example query and serving them directly from the frontend, in order to demonstrate the web app's functionality.
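For reference, the Flask wrapper itself is only a few lines. A hypothetical sketch of the endpoint we were trying to deploy; the route name and response shape are assumptions, and `run_query` is a stand-in for the real Gensim lookup:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_query(query, top_n=10):
    # Stand-in for querying the trained LSI model (hypothetical placeholder).
    return [{"title": f"placeholder for {query!r}", "relevance": 1.0}][:top_n]

@app.route("/search")
def search():
    # The React frontend would call e.g. /search?q=iphone+case.
    query = request.args.get("q", "")
    return jsonify(results=run_query(query))
```

Hard-coding the "iphone case" results into the frontend effectively froze the output of this endpoint for one query.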
Task 4: For the frontend of this project, I wrote a web app using React and Next.js. The web app uses Static Site Generation for quick load times, and we hosted it on Vercel. Vercel actually created Next.js, so their platform is optimized for it. The functionality of the frontend is quite basic: there's a search bar, and you can use it to search the dataset we collected for the products most relevant to your query.