Inspiration: We were inspired by the steady stream of companies exposed to hacking and data breaches. Many customers fear that their data is being stolen and sold, and we wanted a way to restore their sense of safety and trust.
What it does: Our goal is simple: generate synthetic data sets that are statistically similar to the real data, while ensuring that the original records cannot be traced back from the synthetic ones.
How we built it: We built this in a three-step process. The first step splits the data set into its separate columns, so income, spending, age, etc. each become their own column. We record each column's mean, standard deviation, and range, and build a correlation matrix between the columns. Step 2 adds noise: we perturb these metrics using Laplace noise. Finally, we generate a chosen number of new rows from a normal distribution using the noisy metrics. Then, using matrix decomposition, specifically the Cholesky decomposition, we re-correlate this completely new data set so that it matches the correlation structure of the original data, making it as statistically similar as possible.
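The write-up stops at the description, so here is a minimal Python sketch of what that three-step pipeline could look like for a purely numerical CSV. The function name `synthesize`, the `epsilon` parameter, and the crude sensitivity estimate are illustrative assumptions rather than our exact implementation, and the Cholesky step assumes the correlation matrix is positive definite.

```python
import numpy as np
import pandas as pd

def synthesize(df: pd.DataFrame, n_rows: int, epsilon: float = 1.0) -> pd.DataFrame:
    # Step 1: per-column summary statistics and the correlation matrix.
    means = df.mean()
    stds = df.std()
    corr = df.corr().to_numpy()

    # Step 2: perturb the summary statistics with Laplace noise.
    # Noise scale b = sensitivity / epsilon; range / n is a rough sensitivity
    # estimate for the mean (an assumption made for this sketch).
    sensitivity = (df.max() - df.min()) / len(df)
    noisy_means = means + np.random.laplace(0.0, sensitivity / epsilon)
    noisy_stds = (stds + np.random.laplace(0.0, sensitivity / epsilon)).abs()

    # Step 3: draw independent standard-normal columns, re-correlate them with
    # the Cholesky factor L of the correlation matrix (corr = L @ L.T), then
    # rescale each column to its noisy mean and standard deviation.
    z = np.random.standard_normal((n_rows, len(df.columns)))
    L = np.linalg.cholesky(corr)
    correlated = z @ L.T
    synthetic = correlated * noisy_stds.to_numpy() + noisy_means.to_numpy()
    return pd.DataFrame(synthetic, columns=df.columns)

# Example usage (assuming a numeric-only CSV):
# synthetic = synthesize(pd.read_csv("customers.csv"), n_rows=1000)
# synthetic.to_csv("synthetic_customers.csv", index=False)
```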
Challenges we ran into: We realized that data isn't always numerical; it can also be categorical. That meant figuring out how to apply noise to categorical traits, not just numbers. We also had to work out how to take the new, uncorrelated synthetic data and make it correlated again. The biggest challenge of this hackathon, though, was idea generation. Our whole team was new to hackathons, and we didn't have a clear sense of what kinds of ideas and problems are tackled at these events, so we lost a lot of time brainstorming and pivoting.
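The write-up doesn't say which mechanism we settled on for categorical columns; purely as an illustration, here is a sketch of randomized response, one standard way to add differentially private noise to categories. The function name, the `epsilon` parameter, and the column name in the usage line are hypothetical.

```python
import numpy as np
import pandas as pd

def randomized_response(col: pd.Series, epsilon: float = 1.0) -> pd.Series:
    """Simplified k-ary randomized response over a categorical column."""
    categories = col.unique()
    k = len(categories)
    # Keep the true category with probability p = e^eps / (e^eps + k - 1);
    # otherwise replace it with a category drawn uniformly at random.
    p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    keep = np.random.random(len(col)) < p
    replacement = np.random.choice(categories, size=len(col))
    return pd.Series(np.where(keep, col.to_numpy(), replacement), index=col.index)

# Example (hypothetical column name):
# noisy_plan = randomized_response(df["subscription_plan"])
```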
Accomplishments that we're proud of: We are proud of how we fought through the challenges and eventually got a working product. Not only was much of the coding difficult, but we also had to learn mathematical and statistical concepts we were unfamiliar with to get the project running. This experience tested our patience and perseverance, and we are satisfied with how we handled the pressure and obstacles.
What we learned: We learned a lot. First, we learned how to add noise to data. There are many ways to do this, but we specifically learned about Laplace noise, which has an edge over other kinds of noise because it provides differential privacy controlled by a single parameter, the privacy budget ε (the noise scale is the query's sensitivity divided by ε). Another thing we learned was the Cholesky decomposition, which feels like statistical magic: it can transform an uncorrelated data set into one with a chosen correlation structure. We also got an introduction to the world of data privacy and how companies create synthetic data, including existing approaches such as the Gaussian Copula and their setbacks. Finally, we learned countless little things along the way, whether from debugging or from researching how to write a certain line of code.
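To make the "statistical magic" concrete, here is a small standalone sketch of the Cholesky trick: start from independent standard-normal columns and impose a chosen correlation structure. The target matrix is an arbitrary example, not one taken from our data.

```python
import numpy as np

# Arbitrary example target correlation matrix (must be positive definite).
target = np.array([[1.0, 0.7, 0.2],
                   [0.7, 1.0, 0.4],
                   [0.2, 0.4, 1.0]])

z = np.random.standard_normal((100_000, 3))   # independent (uncorrelated) columns
L = np.linalg.cholesky(target)                # target = L @ L.T
x = z @ L.T                                   # correlated columns

# The empirical correlation of x is close to the target, up to sampling noise.
print(np.round(np.corrcoef(x, rowvar=False), 2))
```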
What's next for EchoData: EchoData is a small project. We built it to take in a simple data set, a CSV file, and produce a new CSV file. But companies don't hold data in simple files; they hold data across databases, APIs, cloud warehouses, and interconnected systems. We want to explore how to understand where their data actually lives, so that we can support more realistic forms of synthetic data creation.