We first saw this project among the Challenges for this year's Datathon and thought it would be a really fun and exciting one to tackle.
What it does
There are two parts to our program. The first is a set of web crawlers working together to scrape over 21,000 products across 13 Departments. The second intelligently ranks and returns these scraped products when a user enters a query, using a combination of unsupervised clustering and semantic analysis to surface the most relevant products.
How we built it
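The ranking step described above combines unsupervised clustering with semantic analysis. As a minimal, pure-Python sketch of the semantic half, products can be represented as TF-IDF vectors and scored by cosine similarity against the query; the three-item catalog and the function names below are illustrative stand-ins, not our actual index of ~21,000 products:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # +1 keeps common terms nonzero
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, titles):
    """Rank product titles by cosine similarity to the query, dropping zero scores."""
    docs = [tokenize(t) for t in titles]
    vectors, idf = tfidf_vectors(docs)
    q = Counter(tokenize(query))
    qvec = {t: q[t] * idf.get(t, 0.0) for t in q}
    scored = sorted(zip(titles, (cosine(qvec, v) for v in vectors)),
                    key=lambda pair: pair[1], reverse=True)
    return [title for title, score in scored if score > 0]

catalog = ["Stainless Steel Kitchen Knife Set",
           "Kids Bicycle 16 inch",
           "Chef Knife Sharpener"]
results = rank("kitchen knife", catalog)
```

In practice a library vectorizer such as scikit-learn's TfidfVectorizer would replace the hand-rolled one, and the clustering step would group similar products before ranking.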
Challenges we ran into
The biggest challenge we ran into was scraping the data off walmart.com. With our initial method, we were visiting only about one page every five seconds and constantly ran into CAPTCHA errors. Finally, we realized that if we sacrificed a little of the data we wanted to collect, we could feasibly complete the scrape before morning.
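A common way to cope with blocking like this is to throttle requests and retry with exponential backoff. The sketch below shows that pattern; the constants, function names, and headers are illustrative assumptions, not our crawler's actual code:

```python
import random
import time
from urllib.request import Request, urlopen

USER_AGENT = "Mozilla/5.0"  # placeholder UA string

def backoff_delays(base=1.0, factor=2.0, retries=4):
    """Exponential backoff with jitter, for when a page blocks the crawler."""
    return [base * factor ** i + random.uniform(0.0, 0.5) for i in range(retries)]

def fetch(url, fetcher=None, delays=None):
    """Fetch one page, retrying with backoff on failure.

    `fetcher` is injectable so the retry loop can be exercised without
    network I/O; by default it performs a plain urllib GET.
    """
    if fetcher is None:
        fetcher = lambda u: urlopen(
            Request(u, headers={"User-Agent": USER_AGENT})).read()
    if delays is None:
        delays = backoff_delays()
    for delay in [0.0] + delays:
        time.sleep(delay)
        try:
            return fetcher(url)
        except Exception:
            continue
    return None
```

Spacing out requests this way trades speed for reliability, which matches the trade-off we faced: scrape less per page, but finish the crawl.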
Accomplishments that we're proud of
Believe it or not, our proudest accomplishment was finishing the data scrape. It took several mind-numbing hours, but when it was finally finished we were ecstatic and ready to move on in our project.
What we learned
During this project, we learned about search engine implementations and how relevance, clustering, and knowledge models work together in data science. We also picked up a thing or two about writing web scrapers.
What's next for SearchMart
Reducing our search latency and further improving our search success rate.