
The project was split into two halves: a set of data extractors and a browser for the extracted data. The extractors were written in Ruby to parse the JSON file that Google Takeout produces (Hangouts.json). JavaScript engines refused to parse it as the file grew: one of our developers had a ~400MB Hangouts.json, which resulted in a 5GB object in memory. A similar extractor was written for Facebook, whose export is an HTML document with many sibling elements; parsing that was easier and required less memory. Both extractors wrote the same output format of three CSVs: one for conversations, mapping conversation IDs to participant user IDs; one mapping user IDs to names; and one containing all the messages with message IDs, conversation IDs, and sender IDs.
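To make the shared output format concrete, here is a minimal Ruby sketch of a writer that both extractors could call. The file names, column names, the `text` column, and the `write_extract` helper are all assumptions for illustration; the update does not give the actual schema.

```ruby
require "csv"
require "tmpdir"

# Hypothetical writer for the shared three-CSV format.
# File names and column names are assumptions, not the project's real schema.
def write_extract(dir, users:, conversations:, messages:)
  # users.csv: maps each user ID to a display name.
  CSV.open(File.join(dir, "users.csv"), "w") do |csv|
    csv << %w[user_id name]
    users.each { |id, name| csv << [id, name] }
  end

  # conversations.csv: one row per (conversation, participant) pair.
  CSV.open(File.join(dir, "conversations.csv"), "w") do |csv|
    csv << %w[conversation_id user_id]
    conversations.each do |conv_id, participant_ids|
      participant_ids.each { |user_id| csv << [conv_id, user_id] }
    end
  end

  # messages.csv: message ID, conversation ID, sender ID, and (assumed) text.
  CSV.open(File.join(dir, "messages.csv"), "w") do |csv|
    csv << %w[message_id conversation_id sender_id text]
    messages.each do |m|
      csv << m.values_at(:id, :conversation_id, :sender_id, :text)
    end
  end
end

# Example usage with made-up data, written to a temporary directory.
Dir.mktmpdir do |dir|
  write_extract(
    dir,
    users: { "u1" => "Ada", "u2" => "Grace" },
    conversations: { "c1" => %w[u1 u2] },
    messages: [{ id: "m1", conversation_id: "c1", sender_id: "u1", text: "hello" }]
  )
  puts File.read(File.join(dir, "conversations.csv"))
end
```

Storing one row per conversation and participant pair keeps conversations.csv flat, so a browser for the data can join the three files on IDs without parsing nested fields.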
