
Inspiration

With billions of records generated daily, having a robust system that can manage such an influx of data is essential. Open-source technology makes such a process possible, but the question arises: how can this also be done at an enterprise level for business users?

Our answer is a solution that streams data in real time to open-source technology like Kafka, does the same in Microsoft Fabric, and builds an LLM solution on top that helps businesses make quick, informed decisions.

What it does

A real-time data stream collects data from a producer and stores it in an Azure PostgreSQL database. An LLM-based RAG solution is then built over the stored data to respond to user queries.

How we built it

Part 1

The solution applies a change data capture (CDC) approach to a database where data is generated by a producer (a mobile application) and sent to the Azure PostgreSQL database via a connection string. CDC is performed with Apache Kafka and Debezium to capture real-time changes in Azure PostgreSQL and publish them to Kafka topics.
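A Debezium PostgreSQL connector registration along these lines would drive that capture. This is a minimal sketch: the hostname, credentials, table list, and slot name are placeholders, not the project's actual values.

```python
import json

# Sketch of a Debezium PostgreSQL source connector registration payload.
# All connection values below are placeholders.
connector_config = {
    "name": "postgres-cdc-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                 # logical decoding plugin
        "database.hostname": "<server>.postgres.database.azure.com",
        "database.port": "5432",
        "database.user": "<user>",
        "database.password": "<password>",
        "database.dbname": "<database>",
        "topic.prefix": "appdb",                   # prefix for Kafka topic names
        "table.include.list": "public.events",     # tables to capture (placeholder)
        "slot.name": "debezium_slot",              # replication slot (see Challenges)
    },
}

# This JSON would be POSTed to the Kafka Connect REST API at /connectors.
payload = json.dumps(connector_config, indent=2)
print(payload)
```

The `slot.name` property matters here: each logical-decoding consumer needs its own replication slot, which is exactly where the conflict described under Challenges arose.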

A consumer script captures the data from the Kafka topics and inserts it into an Elasticsearch index for storage, with Kibana for visualization. The same consumer script also writes the captured records in JSON format to an Azure Storage Account (ADLS).

Part 2:

Using the CDC feature available in Microsoft Fabric Eventstream, we connected to the Azure PostgreSQL database and consumed data from it in real time. A KQL database was created for querying and standardizing the streaming data, and we then built a near-real-time dashboard with Power BI on top of the KQL database queries.

A Fabric pipeline was created to incrementally load data from the Azure Storage Account into the Fabric Lakehouse Files folder. These files are used in our LLM creation.
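Incremental loading boils down to copying only files newer than the previous run's watermark. A minimal sketch of that selection logic, with a hypothetical listing shape (`name` and `last_modified` keys) standing in for the storage-account file listing:

```python
from datetime import datetime, timezone

def files_to_copy(files: list[dict], last_run: datetime) -> list[str]:
    """Select only files modified since the previous pipeline run."""
    return [f["name"] for f in files if f["last_modified"] > last_run]

# Hypothetical listing of JSON files landed by the consumer script
listing = [
    {"name": "events/2024-05-01.json",
     "last_modified": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"name": "events/2024-05-02.json",
     "last_modified": datetime(2024, 5, 2, tzinfo=timezone.utc)},
]

new_files = files_to_copy(listing, datetime(2024, 5, 1, 12, tzinfo=timezone.utc))
print(new_files)  # ['events/2024-05-02.json']
```

In the actual pipeline this filter is configured in the Fabric copy activity rather than written by hand; the sketch only illustrates the watermark idea.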

Part 3

Data transformation was performed on the JSON data written to the Lakehouse Files folder: it was flattened and converted to a Spark DataFrame, which was further cleansed and standardized before being written out as a Delta table.
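The flattening step turns nested JSON objects into a single level of columns. In the pipeline this is done with PySpark, but the pure-Python sketch below shows the same transformation on an illustrative record:

```python
def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Flatten nested JSON into a single level of column names."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, joining key paths with the separator
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

# Hypothetical event with one nested object
event = {"id": 7, "device": {"os": "android", "version": "14"}}
print(flatten(event))  # {'id': 7, 'device_os': 'android', 'device_version': '14'}
```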

We created an LLM (RAG) solution that reads the Delta table and prompted the LLM to generate SQL queries based on the data in the Delta table, so it can answer any user question about the table.
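The RAG step amounts to grounding the LLM in the Delta table's schema so it answers with SQL rather than free text. A hypothetical sketch of the prompt construction (table and column names are illustrative, and the model call itself is omitted):

```python
def build_sql_prompt(question: str, table: str, columns: list[str]) -> str:
    """Compose a schema-grounded prompt so the LLM answers with SQL."""
    schema = ", ".join(columns)
    return (
        f"You are a SQL assistant. The table `{table}` has columns: {schema}.\n"
        f"Write a single SQL query that answers: {question}"
    )

prompt = build_sql_prompt(
    "How many active users were there yesterday?",
    table="events",                                   # illustrative table name
    columns=["id", "user_id", "status", "event_time"],  # illustrative schema
)
print(prompt)
```

The generated SQL would then be executed against the Delta table and the result returned to the user.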

Challenges we ran into

  • There was a conflict between the Debezium CDC and the Microsoft Fabric CDC. We found that our PostgreSQL database did not support multiple replication slots.

The replication slot already in use by the Debezium connector interfered with the connection and settings for Fabric CDC for PostgreSQL, so we had to provision a new server to resolve the issue.

Accomplishments that we're proud of

  • We were able to create a real-time CDC pipeline streaming data from a mobile application to both open-source technology and an enterprise Fabric solution.

  • A near-real-time analytics dashboard and an LLM were created from the streaming data for semantic-mining purposes.

What we learned

  • How to perform CDC with both open-source technology and enterprise solutions.
  • How to integrate an LLM solution with an existing Delta table to derive answers from it.

What's next for CDC LLM Data Architecture with OpenSource and Fabric

The next step is to scale the solution for production use by running it in Docker containers orchestrated with Kubernetes, and by moving from a local Kafka infrastructure to Confluent Kafka in the cloud to better streamline the process.

Built With
