Metropia brought GPS data from Tucson drivers. What lies in store?
What it does
Our visualization merges two interactive views of the data. First, a spatio-temporal view: we lay out the drivers and their routes over the city of Tucson and filter by time and location. Second, an inter-user similarity view, which shows the most closely related groups of users as an undirected graph. This view also filters the spatio-temporal view, helping answer questions about how similar users behave. It also lets us evaluate a given similarity measure by checking it against the semantic similarity of the users' paths.
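To make the similarity view concrete, here is a minimal sketch of one way such an undirected graph could be built. The measure (Jaccard overlap of visited grid cells), the cell size, the threshold, and every function name are assumptions for illustration; the writeup does not say which similarity measure Tucson Flux uses.

```python
from itertools import combinations

def grid_cells(points, cell_deg=0.01):
    """Map (lat, lon) points to the set of grid cells they fall in.

    cell_deg is an assumed cell size in degrees, not a project value.
    """
    return {(int(lat // cell_deg), int(lon // cell_deg)) for lat, lon in points}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similarity_edges(user_points, threshold=0.5, cell_deg=0.01):
    """Return undirected edges (u, v, score) for user pairs at or above threshold."""
    cells = {user: grid_cells(pts, cell_deg) for user, pts in user_points.items()}
    edges = []
    for u, v in combinations(sorted(cells), 2):
        score = jaccard(cells[u], cells[v])
        if score >= threshold:
            edges.append((u, v, score))
    return edges

# Hypothetical users: "a" and "b" drive the same route, "c" drives elsewhere.
demo_users = {
    "a": [(32.225, -110.975), (32.235, -110.975)],
    "b": [(32.225, -110.975), (32.235, -110.975)],
    "c": [(32.405, -111.105)],
}
demo_edges = similarity_edges(demo_users)
```

Selecting a node or edge in such a graph gives a set of user ids, which is what the spatio-temporal view would need in order to filter down to those users' routes.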
How we built it
A series of scripts takes the large, raw GPS data from the Metropia data set and produces a smaller, cleaner version suitable for the web. We drop points that don't add much to the visualization: a 10-mile straight journey on the highway needs only a few points, but a meandering trip around the neighborhood requires more fidelity.
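The point-dropping step can be sketched with a standard line-simplification routine. The example below uses the Ramer-Douglas-Peucker algorithm as a stand-in; the writeup does not name the method actually used, and the function names, the tolerance value, and the planar treatment of coordinates are all assumptions.

```python
import math

def _perp_dist(p, a, b):
    """Distance from point p to the line through a and b (planar approximation)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def simplify(points, tolerance):
    """Keep the endpoints plus any point farther than tolerance from the chord.

    Iterative Ramer-Douglas-Peucker, so long trips cannot hit the recursion limit.
    """
    n = len(points)
    if n < 3:
        return list(points)
    keep = [False] * n
    keep[0] = keep[-1] = True
    stack = [(0, n - 1)]
    while stack:
        lo, hi = stack.pop()
        idx, dmax = -1, 0.0
        for i in range(lo + 1, hi):
            d = _perp_dist(points[i], points[lo], points[hi])
            if d > dmax:
                idx, dmax = i, d
        if idx != -1 and dmax > tolerance:
            keep[idx] = True
            stack.append((lo, idx))
            stack.append((idx, hi))
    return [p for p, k in zip(points, keep) if k]

# A straight run collapses to its endpoints; a sharp bend is preserved.
straight = simplify([(0, 0), (1, 0), (2, 0), (3, 0)], tolerance=0.1)
bend = simplify([(0, 0), (1, 1), (2, 0)], tolerance=0.1)
```

With this kind of rule, the highway trip and the neighborhood trip are handled by the same tolerance: the straight path loses its interior points, while the meandering one keeps most of them.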
Next, we ship the data to a web client. This is where things get tricky: the raw data set consists of 12 million points and nearly 800 megabytes, and we need to fit it into the browser's memory and compute budgets. Additional filtering where possible and a good choice of data structures let us present an interactive experience.
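As an illustration of how a data-structure choice affects the memory budget, the sketch below packs coordinates into a flat buffer of 32-bit integers on the pre-processing side. The layout, the `SCALE` constant, and the function names are assumptions, not the project's actual format. At 8 bytes per point, 12 million points would occupy about 96 MB before any points are dropped, ignoring timestamps and user ids.

```python
import struct

# Assumed precision: 1e-5 degrees, roughly one metre of latitude.
SCALE = 100_000

def pack_path(points):
    """Quantize (lat, lon) pairs and pack them as little-endian int32 values."""
    flat = []
    for lat, lon in points:
        flat.append(round(lat * SCALE))
        flat.append(round(lon * SCALE))
    return struct.pack(f"<{len(flat)}i", *flat)

def unpack_path(buf):
    """Inverse of pack_path: recover (lat, lon) pairs from the byte buffer."""
    n = len(buf) // 4
    flat = struct.unpack(f"<{n}i", buf)
    return [(flat[i] / SCALE, flat[i + 1] / SCALE) for i in range(0, n, 2)]

# Hypothetical coordinates near Tucson.
demo_path = [(32.22174, -110.92648), (32.25, -110.9)]
demo_buf = pack_path(demo_path)
demo_back = unpack_path(demo_buf)
```

A buffer laid out this way could be read on the client as a JavaScript `Int32Array` without parsing text, which is one reason a flat binary layout tends to suit the browser better than nested JSON objects.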
Challenges we ran into
Not enough data! Our similarity clustering algorithm, along with many of our failed experiments, would have been more successful with a full year of data.
Too much data! The demands on the browser are almost higher than we can manage without some serious work on our visualization.
Accomplishments that we're proud of
We've got a good-looking user similarity graph, which will help inform decisions about ride-sharing.
We put 12 million data points in a web browser.
What we learned
- Vectorization is hard.
- Splitting a problem into pre-processing and client-side processing can help speed up the final product, but more importantly it engages team members who are not familiar with the specific web techniques you are using.
- Sleep is important.
What's next for Tucson Flux
Speed it up a bit, and then attack the overplotting, streaming, and latency problems that will show up with increasing data sizes.