Inspiration

Banking data from disparate systems is critical for decision-making: visualization, data analytics, and predictive analytics all depend on it to support fast financial decisions. Making data available in the right format at the right time is essential.

What it does

Our solution addresses this problem by automating the ingestion process: parameters passed through APIs automatically generate the required configurations. Once generated, these configurations drive the movement of data from different sources into the target data lake (e.g. Hive).
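The parameter-driven configuration generation described above could be sketched as follows. The parameter names, function name, and target layout here are illustrative assumptions, not the project's actual API:

```python
import json


def generate_ingestion_config(source_type, source_location, target_db, target_table):
    """Build an ingestion configuration from API parameters (hypothetical schema)."""
    return {
        "source": {"type": source_type, "location": source_location},
        "target": {"system": "hive", "database": target_db, "table": target_table},
    }


# Example: configure ingestion from a Kafka topic into a Hive table.
config = generate_ingestion_config("kafka", "transactions-topic", "banking", "transactions")
print(json.dumps(config, indent=2))
```

In a setup like this, the caller never writes configuration files by hand; the API call produces a config that downstream ingestion jobs can consume directly.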

How we built it

The ingestion framework automatically detects the structure of incoming data and generates the corresponding configuration, reducing the manual work needed to produce streamlined data. It is built on Spark Streaming with Kafka and SQL-based configurations.
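Structure detection of the kind described above might work roughly like this: sample records are scanned to infer column names and types, from which a Hive DDL statement can be generated. This is a minimal stdlib-only sketch under assumed type mappings, not the framework's actual implementation:

```python
def infer_type(value):
    """Map a sample Python value to a Hive column type (simplified assumption)."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"


def infer_schema(records):
    """Detect column names and types from a sample of records."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, infer_type(value))
    return schema


def hive_ddl(table, schema):
    """Render the inferred schema as a Hive CREATE TABLE statement."""
    columns = ", ".join(f"{name} {dtype}" for name, dtype in schema.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({columns})"


sample = [{"txn_id": 1, "amount": 250.75, "account": "ACC-001"}]
print(hive_ddl("banking.transactions", infer_schema(sample)))
```

A production framework would also have to handle nested structures, nulls, and conflicting types across records, but the core idea of deriving configuration from observed data is the same.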

Challenges we ran into

Real-time data arrives at a very high rate, and the sources vary widely: sensors, cloud-based applications, files, and more.

Accomplishments that we're proud of

Creating an automation pipeline that handles real-time data ingestion effectively and efficiently.

What we learned

We learned which business data are required for visualization and analytics.

What's next for Automation of Data Ingestion Pipeline for Data Lake

We will complete the overall flow and add support for multiple sources and multiple targets, including cloud data ingestion with various data formats (JSON, YAML, XML, etc.). This framework can expedite the cloud migration journey with minimal development effort.
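Supporting multiple configuration formats could look like the following normalizer, which parses different formats into one common dict. The field names are illustrative; YAML support would require an extra dependency (e.g. PyYAML), so only JSON and XML are shown here using the standard library:

```python
import json
import xml.etree.ElementTree as ET


def load_config(text, fmt):
    """Parse an ingestion config from JSON or XML into a flat dict (illustrative)."""
    if fmt == "json":
        return json.loads(text)
    if fmt == "xml":
        root = ET.fromstring(text)
        return {child.tag: child.text for child in root}
    raise ValueError(f"unsupported format: {fmt}")


json_cfg = load_config('{"source": "kafka", "target": "hive"}', "json")
xml_cfg = load_config("<config><source>kafka</source><target>hive</target></config>", "xml")
assert json_cfg == xml_cfg  # both formats normalize to the same dict
```

Normalizing every format to one internal representation keeps the rest of the pipeline format-agnostic, so adding a new source format only means adding one parser branch.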
