GitHub: https://github.com/Palmirouze/HiddenPouteam
Website: http://meanvalue.net

Mean Value

Second Hand Objects Price Estimator

Inspiration

Have you ever needed to buy or sell something on the second-hand market? Postings are messy, prices vary widely, and comparing them is a tiresome process. We want to fix that by aggregating data from well-known websites and computing statistics based on average prices, location, time, and item popularity.

For ConUHacks, we focused on the smartphone market as a proof of concept: it offers a wide variety of brands and models, and smartphones are extremely popular items on the second-hand market.

What it does

From Kijiji, we scrape the price, time, and location of each posting related to the searched item. Users can search for a phone model, and MeanValue displays its price and other relevant statistics. Users can also select a model to see all the Kijiji listings for that specific model.

How we built it

The Data: Scraping and Filtering

To ensure coherent and useful price data, we needed to scrape as much data as possible while eliminating spam ads. To build a robust scraper, we used the Scrapy framework. Built on Python, it allows very flexible scraping as well as modular data filtering. Its main advantage is asynchronous scraping. Fetching a page and uploading an object to a database are time-consuming, blocking tasks. Having an engine schedule all of these actions and process them concurrently yields large performance improvements over a simple solution built with basic Python libraries.
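The benefit of not blocking on each request can be illustrated with a small, self-contained sketch. It uses Python's standard `asyncio` library purely for illustration (Scrapy itself is built on Twisted), and `fetch_page` is a hypothetical stand-in that sleeps instead of performing real network I/O.

```python
import asyncio
import time


async def fetch_page(url: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a network request: sleeps instead of real I/O."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_sequentially(urls):
    # Each request waits for the previous one to finish.
    return [await fetch_page(url) for url in urls]


async def fetch_concurrently(urls):
    # All requests are in flight at once, so the waits overlap.
    return await asyncio.gather(*(fetch_page(url) for url in urls))


def timed(coro_fn, urls):
    """Run one of the fetchers above and return (pages, elapsed seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(coro_fn(urls))
    return pages, time.perf_counter() - start
```

Timing both fetchers on the same list of URLs should show the concurrent version finishing in roughly the time of a single request, while the sequential version takes roughly the sum of all delays.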

In Scrapy, Spider objects generate requests and parse the responses from the website. So far we have only built a spider for Kijiji, but it can easily be extended to other websites.

The Scrapy engine gathers all requests generated by the spider and schedules them. This allows for extremely fast scraping while letting us customize wait times and the maximum number of concurrent connections.

The responses are sent back to the spider, which extracts the relevant data and generates items to be inserted into the database. The items are then sent down a pipeline to be processed. Pipeline components can easily be added as needed from the configuration file.
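The throttling and pipeline options described above map onto standard Scrapy settings. The fragment below is a sketch of what such a configuration file might look like: the setting names are Scrapy's own, but every value, the module path, and the class names are assumptions, not the project's actual configuration.

```python
# settings.py (sketch): all values below are illustrative assumptions.

# Upper bound on requests Scrapy keeps in flight at once.
CONCURRENT_REQUESTS = 16
# Upper bound on simultaneous requests to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Seconds to wait between consecutive requests to the same website.
DOWNLOAD_DELAY = 0.5

# Pipeline components and their order; lower numbers run first.
# The module path and class names are hypothetical.
ITEM_PIPELINES = {
    "meanvalue.pipelines.ValidItemFilter": 100,
    "meanvalue.pipelines.CapacityParser": 200,
    "meanvalue.pipelines.MongoUpsertPipeline": 300,
}
```

Adding or removing a processing step then amounts to editing one entry in `ITEM_PIPELINES`.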

Our current pipeline has the following components:

* A filter that eliminates invalid objects (for example, objects with missing price data).
* A filter that parses the phone's storage capacity from the ad description and adds it to the object.
* A MongoDB component that upserts the objects into the database (updating an existing document, or inserting a new one if none exists).
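The first two components could be sketched roughly as follows. Scrapy pipeline components are plain classes exposing a `process_item(item, spider)` method, and raising `DropItem` discards an item. The class names, the item fields (`price`, `description`, `capacity_gb`), and the regular expression are assumptions made for illustration, not the project's actual code.

```python
import re

try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs without Scrapy installed.
    class DropItem(Exception):
        """Stand-in for scrapy.exceptions.DropItem."""


# Matches capacities such as "64GB", "64 gb" or "64 Go" (French abbreviation).
CAPACITY_RE = re.compile(r"(?<!\d)(\d{1,4})\s*(?:gb|go)\b", re.IGNORECASE)


class ValidItemFilter:
    """Drops items that have no usable price (assumed numeric field)."""

    def process_item(self, item, spider):
        price = item.get("price")
        if not isinstance(price, (int, float)) or price <= 0:
            raise DropItem("missing or invalid price")
        return item


class CapacityParser:
    """Extracts the storage capacity from the ad description, if present."""

    def process_item(self, item, spider):
        match = CAPACITY_RE.search(item.get("description", ""))
        item["capacity_gb"] = int(match.group(1)) if match else None
        return item
```

A final MongoDB component would then persist each surviving item. With pymongo, for instance, `Collection.update_one(filter, update, upsert=True)` performs an upsert; which field identifies a listing uniquely is left open here.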

Backend

We wrote the website backend in Go. It interacts with MongoDB and generates the pages from templates. The only library used is mgo, which queries the MongoDB database hosted on mlab.com. We used mLab so that we could easily share the database among team members.

The Go server also generates statistics on startup, such as the number of listings, the brands, and other information.
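As a rough illustration of the kind of startup statistics described above, here is a minimal sketch. It is written in Python rather than Go, only to stay consistent with the scraping examples, and it is not the server's actual code; the function name and the listing fields (`brand`, `model`, `price`) are assumptions.

```python
from collections import Counter, defaultdict
from statistics import mean


def startup_stats(listings):
    """Summarize a list of listing dicts (field names are assumed)."""
    listings_per_brand = Counter(listing["brand"] for listing in listings)

    prices_by_model = defaultdict(list)
    for listing in listings:
        prices_by_model[listing["model"]].append(listing["price"])

    return {
        "total_listings": len(listings),
        "listings_per_brand": dict(listings_per_brand),
        "mean_price_per_model": {
            model: mean(prices) for model, prices in prices_by_model.items()
        },
    }
```

Computing these figures once at startup keeps page rendering cheap, at the cost of the numbers being only as fresh as the last restart.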

We used D3 to generate charts and display data.

Challenges we ran into

Accomplishments that we're proud of

IT FUCKING WORKS

What we learned

This hackathon taught us a lot about:

* Scraping and building efficient data pipelines
* Building websites in Go

What's next for MeanValue

We hope to further refine the data parsing process to better filter out outliers and unrealistic prices, and to expand the site to cover products other than phones and cities other than Montreal. We also plan to support other second-hand marketplace websites in order to gather more consistent information.
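One simple way such outlier filtering could work is the interquartile-range (IQR) rule, sketched below. This is only one possible approach and not something the project currently implements; the function name and the `k = 1.5` threshold are assumptions.

```python
from statistics import quantiles


def drop_price_outliers(prices, k=1.5):
    """Keep prices within k * IQR of the first and third quartiles."""
    if len(prices) < 4:
        # Too few points to estimate quartiles meaningfully.
        return list(prices)
    q1, _, q3 = quantiles(prices, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [price for price in prices if low <= price <= high]
```

Applied per phone model, a rule like this would discard both placeholder prices (a few dollars) and implausibly high ones before averages are computed.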

The advantage of a flexible scraping framework like Scrapy is that we can very easily add more filters and more data sources. One thing we unfortunately could not build in 24 hours is a machine-learning-based filter. A Support Vector Machine classifier could be well suited to this task. Once a sizeable dataset has been gathered, we could train it to detect spammy or faulty ads.
