About us

Hi! We are undergrads from A&M and Berkeley, and we are both majoring in CS. Trevor will be graduating this coming spring, and Danny will be graduating in the spring of 2022.

Inspiration

We built Waloogle to compete in Walmart's challenge of building a search engine.

What it does

Basic search engine capabilities: you type in a search query, and Waloogle returns a list of the most relevant items scraped from Walmart's catalogue.

How we built it

We first had to scrape data from Walmart's website, which turned out to be the most time-consuming part of the whole Datathon. Once we had the data, we fed it into Gensim, a Python module that trains a Latent Semantic Indexing model on our corpus. We can then run queries against that model to return the entries most relevant to a search. We served the model behind a Flask API, which we attempted to host on Heroku. On the front end, we built a basic search page using React and Next.js. This page makes calls against our search API and returns the list of most relevant items in a clean, minimalistic format, along with a relevancy metric showing how closely each result matches the query. Finally, we hosted the front end on Vercel, where it is available to the world!
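As a rough illustration, the Flask side is essentially a thin wrapper around a Gensim similarity query. The sketch below is not our exact code: the file names, the /search route, and the 20-result cutoff are placeholders, and it assumes the dictionary, LSI model, and similarity index were saved to disk after training.

```python
import json

from flask import Flask, jsonify, request
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

app = Flask(__name__)

# Load the artifacts produced during training (hypothetical file names).
dictionary = corpora.Dictionary.load("waloogle.dict")
lsi = models.LsiModel.load("waloogle.lsi")
index = similarities.MatrixSimilarity.load("waloogle.index")
with open("products.json") as f:
    products = json.load(f)  # the scraped catalogue, in the same order as the index

@app.route("/search")
def search():
    query = request.args.get("q", "")
    # Project the query into LSI space and rank every product by cosine similarity.
    bow = dictionary.doc2bow(simple_preprocess(query))
    scores = index[lsi[bow]]
    top = sorted(enumerate(scores), key=lambda pair: -pair[1])[:20]
    return jsonify([
        {"product": products[i], "relevancy": float(score)}
        for i, score in top
    ])
```

In this sketch, the relevancy metric shown in the UI is simply the cosine similarity score that Gensim's MatrixSimilarity returns for each product.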

Challenges we ran into

The biggest challenge we ran into was the process of scraping data from Walmart's website. Initially we tried to use Puppeteer.js, which drives an instance of Chromium, but there were a couple of problems. First, it took about 2 seconds to fetch information about a single entry, so fetching 20,000 entries would have taken far too long. The second main issue was that Walmart's website started throwing reCAPTCHAs at us from time to time, which naturally broke the flow of our simple web scraper. Eventually this got so bad that we started looking at different methods. After a ton of iteration and trying out different web scraping libraries, we looked at Walmart's robots.txt file and found sitemaps that contained hundreds of thousands of products. We were able to use these sitemap files and simple fetch requests to get information on 27,000 different products. Once we had the data, we were finally able to run it through ML models and build the search engine functionality.

The second main challenge pertains to hosting the search API server. We were up against the clock to submit our project and were attempting to host the API on Heroku. At the last second, in a flurry of issues with git, Python virtual environments, and breaking ML dependencies, we were unable to actually host the search API. We were, however, able to get the front end hosted on Vercel, so we simply hard-coded the search results for "iphone case" into the front end. If you visit our site, it will therefore return the same data for every request. This is clearly suboptimal, but at the very least it demonstrates what the site would look like if both the front end and back end were working.

Accomplishments that we're proud of

We're happy that we were able to figure out how to get through Walmart.com's layers of defense against bots and collect the data we needed. We're also happy that we got the different parts of the project working, even though they weren't working together smoothly at the end.

What we learned

We learned about Latent Semantic Analysis, how to scrape the web, and how to put a simple search engine together from the bottom up.

Walmart Challenge Tasks

Task 1: Scraping Walmart.com. We had two approaches to this problem:

  1. Puppeteer.js: In our first attempt to crawl Walmart's website, we used Puppeteer.js, a JavaScript library that provides an API to control Chromium. We wanted Puppeteer to visit a particular product page, gather information about the product from specific useful tags we had found, and then visit each page for the products Walmart displayed under the "Customers also viewed" heading. Our goal was to build a graph of similar products that we could use to index and search through our data. In doing this, we immediately ran into a couple of problems. The first was that visiting webpages with Puppeteer is incredibly slow: it took roughly two seconds to visit each product page and collect the necessary data. Because we were supposed to collect information about 20,000 products, this would have taken far too long. The second problem was even worse. After a short amount of time, Walmart started throwing reCAPTCHAs at our web crawler, which naturally brought it to a complete halt. Because we weren't logged into anything, we thought the reCAPTCHAs might go away if we simply refreshed the page, but as might be expected, they popped up more and more as we continued to crawl Walmart's site. The most unfortunate part is that we spent around 5 hours getting Puppeteer to work, and the reCAPTCHAs only started appearing after we had attempted to fetch information about hundreds of products. We eventually realized that Puppeteer simply would not work, and we began trying other web scraping methods.

  2. HTTPS requests: Our second approach was to use simple HTTPS requests and parse the HTML responses. This method ended up being extremely fast, since we could issue many asynchronous requests to walmart.com.

But this approach also had a few issues. First, we could not get the similar products, since those are dynamically loaded into the page and therefore are not in the HTML response. Another problem was that our ISPs were throttling our download speeds because we were sending so much outbound traffic.

To solve the first problem, we used the ip sitemap listed in Walmart's robots.txt, the file that tells a web crawler the proper etiquette for crawling the site. We followed all of its rules about which pages to crawl, only visiting https://walmart.com/ip//

This sitemap gave us all the links we needed to gather information on 20,000+ products, instead of the typical approach of traversing the links on each page.
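To give a flavor of this step, here is a small Python sketch of reading product links out of one sitemap file. We did the real fetching with plain fetch requests, and the sitemap URL below is only a placeholder for the ones listed in robots.txt:

```python
# Pull every product URL out of a single sitemap file.
import xml.etree.ElementTree as ET

import requests

SITEMAP_URL = "https://www.walmart.com/sitemap_example.xml"  # placeholder URL
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}   # standard sitemap namespace

def product_urls(sitemap_url):
    """Return every <loc> entry in the sitemap, i.e. one URL per /ip/ product page."""
    response = requests.get(sitemap_url, timeout=30)
    root = ET.fromstring(response.content)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

urls = product_urls(SITEMAP_URL)
print(f"found {len(urls)} product links")
```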

The second problem was solved with cloud computing. Both of us ended up launching AWS EC2 instances to take advantage of the cloud's networking capabilities, since bandwidth was our biggest bottleneck. After getting everything set up, it took less than an hour to collect 27,000 items. Our huge JSON file of all the data can be seen here
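For illustration, the bulk download can be sketched as plain HTTPS requests fanned out over a thread pool. The page <title> stands in for the richer product fields we actually parsed, and the header and worker count are arbitrary:

```python
# Sketch of the bulk download: many concurrent plain HTTPS requests, no headless browser.
import json
import re
from concurrent.futures import ThreadPoolExecutor

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch_product(url):
    """Fetch one product page and pull out a minimal record (just the page title here)."""
    try:
        html = requests.get(url, headers=HEADERS, timeout=15).text
    except requests.RequestException:
        return None
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    return {"url": url, "title": match.group(1).strip() if match else None}

def scrape(urls, workers=64):
    # Fan the requests out across a thread pool and keep only successful fetches.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        records = [r for r in pool.map(fetch_product, urls) if r]
    with open("products.json", "w") as f:
        json.dump(records, f)
    return records
```

Running something like `scrape(urls)` from an EC2 instance is the shape of what we did; the instance's bandwidth removed the throttling bottleneck.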

Task 2/3: For the machine learning behind the search engine, I decided to focus on Latent Semantic Analysis/Indexing. There was a lot to learn here! This approach uses the contexts in which words appear to infer their meaning, so a query can match relevant products even without exact keyword overlap. I used a Python package called Gensim to handle the complex ML for us.
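A minimal sketch of that pipeline with Gensim is below; indexing only each product's title and using 200 LSI topics are illustrative assumptions rather than our exact settings:

```python
# Train an LSI model over the scraped catalogue and build a similarity index.
import json

from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

with open("products.json") as f:
    products = json.load(f)

# Tokenize whatever text field we index; the title alone is assumed here.
docs = [simple_preprocess(p.get("title") or "") for p in products]

dictionary = corpora.Dictionary(docs)                 # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]    # bag-of-words vectors
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=200)
index = similarities.MatrixSimilarity(lsi[corpus])    # cosine-similarity index

# Persist the artifacts so a separate search API process can load them at startup.
dictionary.save("waloogle.dict")
lsi.save("waloogle.lsi")
index.save("waloogle.index")
```

Saving the dictionary, model, and index at the end is what would let the Flask server sketched earlier load them when it starts up.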

After getting our model trained and ready, I tried to create a Flask server to host our search engine backend for the user interface to call. This was ultimately unsuccessful, as I could not figure out how to deploy the small app to Heroku before running out of time. We instead simulated the backend by hard-coding the data for an example query and serving it directly from the frontend, in order to demonstrate our web app's functionality.

Task 4: For the frontend of this project, I wrote a web app using React and Next.js. The web app uses Static Site Generation for quick load times, and we hosted it on Vercel. Vercel actually created Next.js, so their platform is optimized for it. The functionality of the frontend is quite basic: there's a search bar, and you can use it to search the dataset we collected for the products most relevant to your query.

Built With

Python, Gensim, Flask, React, Next.js, Vercel, AWS EC2, Puppeteer.js
