The Cambridge Analytica scandal was a reminder that any data we have or generate is available for collection. As part of its response to the scandal, Facebook has allowed users to download their data to see what Facebook knows about them. As a group of loyal followers of /r/dataisbeautiful, we took advantage of the opportunity to analyze ourselves.
What it does:
There are five graphs available to you upon the upload of your precious personal lives.
- Time Series: the number of messages you've sent over one year (by month).
  - Option to overlay multiple years.
- Force Graph: each node is a conversation between you and a person.
  - An undirected edge is plotted any time at least two unique people (neither of whom is you) are in a conversation with you.
  - The size of each node is scaled to the number of messages sent in that conversation.
  - This results in a graph of group chats (between you and at least three other people) in the center of the plot. Surrounding it is an empty graph (a collection of free-floating nodes), where each node is a conversation between you and one other person.
- Word Cloud: the most commonly used words in your conversations (excluding stop words).
- 3 Year Heatmap: the number of messages sent per day over the most recent three years' worth of messages.
- Bar Chart: the frequency of messages per hour per group.
- Under the plotted Time Series, there are four carded statistics:
  - the peak number of messages in a month,
  - total number of messages,
  - average number of messages seen per month, and
  - how many years the user has been with Facebook.
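
The four card statistics above need nothing more than message timestamps. Here is a minimal sketch in Python, not the project's actual Node code. It assumes timestamps are Unix milliseconds in UTC. The function names, the choice to average only over months that contain at least one message, and the use of the calendar-year span of messages as a stand-in for "years with Facebook" are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_counts(timestamps_ms):
    """Count messages per (year, month) from millisecond Unix timestamps."""
    counts = Counter()
    for ts in timestamps_ms:
        dt = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
        counts[(dt.year, dt.month)] += 1
    return counts


def card_stats(timestamps_ms):
    """Compute the four card statistics from message timestamps."""
    counts = monthly_counts(timestamps_ms)
    if not counts:
        return {"peak_monthly": 0, "total": 0,
                "average_monthly": 0.0, "years_active": 0}
    years = [year for year, _month in counts]
    total = sum(counts.values())
    return {
        "peak_monthly": max(counts.values()),
        "total": total,
        # Assumption: average over months with at least one message.
        "average_monthly": total / len(counts),
        # Assumption: span of calendar years that contain messages.
        "years_active": max(years) - min(years) + 1,
    }


def ms(year, month, day):
    """Helper for building sample data: a UTC date as Unix milliseconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp() * 1000)


# Made-up sample data, for illustration only.
sample = [ms(2017, 1, 5), ms(2017, 1, 20), ms(2017, 3, 2), ms(2018, 3, 9)]
stats = card_stats(sample)
```

The same `monthly_counts` result could feed the Time Series chart directly: filter the keys by year and plot one line per year to get the overlay.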
How we built it:
- Node Data Analyzer
- Web API
Challenges we ran into:
- Routing for Angular was a pain. And then we switched to Webpack.
- ...And then we switched to Bootstrap.
- The first theme we tried with Webpack was glitchy, and scrolling did not work. When we tried to plot multiple charts on a single page, the offset was not handled correctly and the result was visually unusable. After about an hour of trying various (hokey) workarounds, we scrapped the theme and now use the current one.
- Struggling to accept the fact that we didn't have enough time to do all the fun things we wanted to do. See What's next for CitrusAnalytica.
- Our scrum master (Will) passed out from 4:00-6:30 AM. /s
- Finding and tweaking available Stop Word Lists into meaningful ones.
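
The stop word tweaking mentioned above comes down to filtering tokens before counting them for the word cloud. The sketch below shows the idea in Python. The tiny `STOP_WORDS` set, the `top_words` function, and the tokenizing regex are all hypothetical stand-ins; a usable list would be far longer and tuned to chat slang.

```python
import re
from collections import Counter

# Illustrative stop word list only; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "i", "you", "it", "is", "lol"}


def top_words(messages, stop_words=STOP_WORDS, limit=10):
    """Return the most common non-stop words across message texts."""
    counts = Counter()
    for text in messages:
        # Lowercase, then keep runs of letters with an optional apostrophe part.
        for word in re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()):
            if word not in stop_words:
                counts[word] += 1
    return counts.most_common(limit)


# Made-up sample messages, for illustration only.
sample = ["The heatmap is done", "lol the heatmap broke", "Mongo is slow"]
```

Passing a different `stop_words` set makes it easy to compare lists and see which one leaves a more meaningful cloud.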
Accomplishments that we're proud of:
- Jerry is a Map/Reduce legend.
- Will eats Node for breakfast after it eats him.
- Justin figured out routing for Angular (even though we didn't use it).
- Audrey can shotgun chart ideas faster than Will can say, "MongoDB is a very good non-relational database."
What we learned:
- Sunk cost. Sometimes it's easier to start over than force a glitchy old thing to work.
- The true power of Map/Reduce (and Jerry).
  - "It's actually really slow in Mongo." -Jerry
- Having a server at your disposal is very liberating. (But Will knew that already.)
- Mapping out ideas (flowcharts, lists, etc.) during the planning process is a pretty good idea.
- Force graphs may not plot well with particularly dense graphs, or if Will's name is hardcoded into the client-side application.
  - "It was just for testing..." -Will
- Heatmaps look even more beautiful when you suffer through making them yourself.
What's next for CitrusAnalytica:
- Host user information temporarily.
- Add user accounts (with appropriate security and privacy measures; despite our name, we're not going to repeat our namesake's shenanigans).
- Support charts for algorithms that are more time-consuming to implement, like Matrix Profiling.
- Improve user design.
- A color theme worthy of the data visualization it's used for.
- A better data solution; MongoDB is not the best for Map/Reduce. Perhaps Hadoop.
- Image processing and feature generation (to estimate correlations with conversation topics).
- More robust visualization systems like D3.js.
  - "For the weird intersection of math majors and Web Devs." -Jerry
  - "Yeah, computational geometry, Jerry." -Will
- Sentiment analysis and all its NLP magic.
- Manage people [with force graphs] like Audrey while still making every other normal person's force graph look normal.
For the judges:
- For the best domain name category/prize, our domain name is citrusanalytica.com.
Some extra stats and fun facts:
- Over 180 MB worth of uploaded message data between the four of us.
- We spent the most time not on something data science related, but trying to get the Webpack theme to behave like any self-respecting UI.
- We had to lower the repulsion on the Force graph just for Audrey because she's too popular.
- "Having 543 groups is a little not normal." -Jerry, a paraphrase
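
To show why a user with hundreds of groups makes the force graph so dense, here is one possible reading of the Force Graph rules from "What it does", sketched in Python. It is an assumption-heavy illustration, not the project's Node code: it treats each other person as a node, links every pair of people who share a conversation with you, and sizes a node by the total messages in the conversations that person appears in. The record shape (`participants`, `message_count`) and the name `build_force_graph` are hypothetical.

```python
from itertools import combinations


def build_force_graph(conversations, me):
    """Build node and link lists for a force-directed layout."""
    sizes = {}     # person -> total messages in conversations shared with `me`
    links = set()  # undirected edges stored as sorted (a, b) pairs
    for convo in conversations:
        others = sorted(set(convo["participants"]) - {me})
        for person in others:
            sizes[person] = sizes.get(person, 0) + convo["message_count"]
        # Two or more other people in one conversation: link every pair.
        for a, b in combinations(others, 2):
            links.add((a, b))
    return {
        "nodes": [{"id": p, "size": n} for p, n in sorted(sizes.items())],
        "links": [{"source": a, "target": b} for a, b in sorted(links)],
    }


# Made-up conversations, for illustration only.
sample = [
    {"participants": ["Me", "Ann"], "message_count": 120},
    {"participants": ["Me", "Bo", "Cy"], "message_count": 40},
    {"participants": ["Me", "Dee"], "message_count": 5},
]
graph = build_force_graph(sample, me="Me")
```

Under this reading a group chat with n other people adds n(n-1)/2 links, so a few large groups are enough to crowd the center of the plot, while one-on-one conversations stay as free-floating nodes.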