We first saw this project on the Challenges page for this year's Datathon and thought it would be a really fun and exciting one to tackle.

What it does

There are two parts to our program. The first is a set of web crawlers working together to scrape over 21,000 products across 13 departments. The second is intelligently ranking and returning these scraped products when a user enters a query, using a combination of unsupervised clustering and semantic analysis to return the most relevant products to the user.
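As a rough illustration of the ranking step, here is a minimal sketch of semantic matching using bag-of-words cosine similarity. The `rank_products` function and the sample product titles are hypothetical stand-ins; our actual system combined this kind of semantic scoring with unsupervised clustering over the full 21,000-product index.

```python
import math
from collections import Counter

def rank_products(query, products):
    """Rank product titles by cosine similarity to the query.

    A simplified stand-in for the semantic-analysis step; products
    sharing no terms with the query are filtered out entirely.
    """
    def vectorize(text):
        # Bag-of-words term counts, case-insensitive.
        return Counter(text.lower().split())

    def cosine(a, b):
        common = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in common)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    q = vectorize(query)
    scored = [(cosine(q, vectorize(p)), p) for p in products]
    return [p for score, p in sorted(scored, reverse=True) if score > 0]

products = ["red running shoes", "blue denim jeans", "trail running shoes"]
print(rank_products("running shoes", products))
```

A real deployment would swap the bag-of-words vectors for embeddings or TF-IDF weights, but the rank-by-similarity shape stays the same.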

How we built it

We first built our web crawlers in Python. Next, we used Jupyter notebooks (also Python) to organize, format, and clean our data. Finally, we built a Python REST API capable of returning search results for queries, plus some HTML, CSS, and JavaScript to present an intuitive GUI.
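To give a sense of the API layer, here is a minimal sketch of a JSON search endpoint using only the Python standard library. The `PRODUCTS` list, `search` function, and naive keyword matching are illustrative assumptions; our actual API sat in front of the full scraped dataset and the ranking algorithm described above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical in-memory index standing in for the scraped products.
PRODUCTS = ["red running shoes", "blue denim jeans", "trail running shoes"]

def search(query):
    """Naive all-terms keyword match; the real API ranked results
    with clustering and semantic analysis instead."""
    terms = query.lower().split()
    return [p for p in PRODUCTS if all(t in p.lower() for t in terms)]

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Parse ?q=... from the request path and return JSON results.
        qs = parse_qs(urlparse(self.path).query)
        results = search(qs.get("q", [""])[0])
        body = json.dumps({"results": results}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("localhost", 8000), SearchHandler).serve_forever()
```

The frontend then only needs a `fetch("/search?q=...")` call to populate the results page.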

Challenges we ran into

The biggest challenge we ran into was scraping the data. With our initial method of scraping, we were visiting only about one page every 5 seconds, and we constantly ran into CAPTCHA errors. Finally, we realized that if we sacrificed a little of the data we wanted to collect, we could feasibly complete the scrape before morning.
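One common way to cope with CAPTCHA-style blocks is to retry with exponential backoff rather than hammering the site at a fixed rate. Below is a small, generic sketch of that pattern; the `fetch_with_backoff` helper and its parameters are assumptions for illustration, not our exact crawler code.

```python
import random
import time

def fetch_with_backoff(fetch, url, max_tries=5, base_delay=1.0):
    """Retry a flaky page fetch with exponential backoff plus jitter.

    `fetch` is any callable that returns page text or raises when the
    site blocks the request (e.g. serves a CAPTCHA). Backing off like
    this reduces block rates, at the cost of a slower overall scrape.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries; let the caller handle it
            # Wait base_delay * 1, 2, 4, ... seconds, plus jitter.
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)
```

In practice this trade-off (fewer blocks, longer runtime) is exactly why we ended up trimming the data we collected to finish the scrape overnight.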

Accomplishments that we're proud of

Believe it or not, our proudest accomplishment was finishing the data scrape. It took several mind-numbing hours, but when it was finally finished we were ecstatic and ready to move on in our project.

What we learned

During this project, we learned about search engine implementations, and also how relevance, clustering, and knowledge models work together in data science. We also picked up a thing or two about writing web scrapers.

What's next for SearchMart

Reducing our search time and improving our search success rate even further.
