🔦 Inspiration
The Story of Delta-Buddy: Empowering Delta Lake Data with Privacy. 🐍
Once upon a time, there was a passionate explorer in the realm of the data lakehouse. He was on a quest to democratize the use of Databricks and empower every Delta Lake user in a company to unlock the true potential of their data, by giving them a simple way to ask questions about all the features of Delta Lake, Databricks, and their data. Inspired by privacy and safety, he embarked on a mission to build a GPT-like chatbot on the Dolly Databricks model so that everyone in a company could quickly ask questions about datasets, the data catalog, and product features.
💡 He had in mind for every Delta Lake user to:
- Find all the catalogs, tables, Delta Lake tables, and dashboards... easily.
- Know the features and releases of each Delta Lake library.
- Provide a UI to answer questions locally and on Databricks environments with Dolly.
- Add all the information found on a Databricks account to help everyone access the right data at the right moment.
- Finally, open all the gates to make the Delta-Buddy integration easy for every use case (via a web API, locally, with langchain, or in Databricks notebooks...).
Dolly was envisioned as an intelligent companion equipped with the knowledge and expertise to guide users through the vast landscape of Databricks and Delta Lake. He knew that access to accurate information was vital to fostering a culture of data-driven decision-making.
To ensure Dolly was an oracle of wisdom, he delved deep into the world of Delta. He ingested every codebase of Delta Lake, examined the comprehensive documentation, and assimilated all the information available within their Databricks account. With this extensive knowledge, Delta-Buddy became an unparalleled resource for anyone seeking answers about their data.
Users could now rely on Dolly to reveal the secrets of their data's whereabouts, guide them on accessing it securely, and even keep them informed about the latest releases and updates. With Delta-buddy as their ally, users felt confident and empowered to explore their data in a privacy-focused environment.
Everything was quickly set up locally, in Docker, behind an API, in a web UI, and even in Databricks notebooks.
However, despite its remarkable capabilities, Delta-Buddy had its limitations. The chatbot's knowledge was based solely on the metadata of the Databricks account and simple documentation, making it reliant on the information available within that scope. As Databricks evolved, new features and functionalities might not be immediately accessible to Delta-Buddy.
And so, the story of Delta-Buddy serves as a testament to the power of AI and the quest for accessible and private data exploration. With each interaction, Delta-Buddy empowers users to unleash the true potential of their data, transforming them into data superheroes within the realm of Lakehouse.
⚡️ What it does
- A chatbot that lets you ask Dolly questions about your documents and datasets.
- Ingest documents into a ChromaDB database locally, via the API, or from a Databricks notebook.
- Provide a web UI for asking questions and receiving answers.
- Provide an API with FastAPI to receive questions from everywhere.
- Run locally or on Databricks while keeping your data safe.
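The ingestion step above can be sketched in a few lines: documents are split into overlapping chunks before being stored in the vector database. The real project uses langchain's text splitters and ChromaDB; this stdlib-only version, with illustrative function and parameter names, only shows the idea.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks (illustrative splitter)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk already covers the tail of the text
    return chunks

# Toy document standing in for Delta Lake docs or Databricks metadata.
doc = "Delta Lake is an open-source storage framework. " * 20
chunks = chunk_text(doc, chunk_size=200, overlap=50)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.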
🏗️ How I built it
- Ingest all the documents with langchain.
- Use the Chroma vector database to store data safely.
- Use chainlit for the chatbot UI.
- Use Dolly as the LLM.
- Use databricks-sdk-py and databricks-sql-python to collect all the Databricks metadata.
- Use Databricks notebooks and clusters to prepare and launch Databricks delta-buddy jobs.
- Offer two ways of asking questions: locally, or remotely on Databricks clusters.
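The question-answering flow behind the steps above can be sketched as: embed the chunks, find the one nearest to the question, and hand it to the LLM as context. The real project uses Chroma embeddings and Dolly; here a bag-of-words vector and cosine similarity stand in for both so the sketch runs with the stdlib alone, and all names are illustrative.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy 'embedding': lower-cased word counts (not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str]) -> str:
    """Return the chunk most similar to the question (ChromaDB's role here)."""
    q = embed(question)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

chunks = [
    "Delta Lake supports ACID transactions on data lakes.",
    "chainlit provides a chat UI for LLM applications.",
    "Dolly is an instruction-following LLM from Databricks.",
]
context = retrieve("Which model does Databricks provide?", chunks)
# The retrieved chunk is then stuffed into the LLM's prompt:
prompt = f"Answer using this context:\n{context}\n\nQuestion: ..."
```

Whether this runs locally or on a Databricks cluster only changes where the embedding lookup and the model call execute; the retrieval-then-prompt shape stays the same.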
🔥 Challenges I ran into
- Good documentation for Delta Lake and Databricks is scarce; I wish I had easy access to better-structured documentation, such as a PDF book on Delta Lake.
- The project depends on the user's metadata on Databricks (catalog, tables, and so on...), so it was hard to test without a fully documented environment.
- Integration with other Delta Lake libraries was not tested for lack of time, but I would like to integrate delta-rs pandas DataFrames to ask questions about them.
- Not yet capable of translating questions to SQL with an open-source LLM while respecting data privacy (Dolly would need to be trained for this).
- The 2048-token context limit is hard to manage with chat history; answers rest entirely on the document vectors in ChromaDB, i.e. on the data provided (documentation and Databricks metadata).
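One common way around the 2048-token limit mentioned above is to keep only the most recent chat turns that fit a fixed budget, reserving the rest of the window for the retrieved context. This is a sketch of that idea, not what Delta-Buddy currently does; the word-count "tokenizer" is a rough stand-in, since Dolly's real tokenizer would give different counts.

```python
def trim_history(turns: list[str], budget: int) -> list[str]:
    """Keep the newest turns whose combined (approximate) token count fits."""
    kept, used = [], 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = len(turn.split())      # crude token estimate (word count)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))       # restore chronological order

history = [
    "user: what is delta lake",
    "bot: an open table format with ACID transactions",
    "user: which catalogs do we have",
    "bot: the main and analytics catalogs",
]
trimmed = trim_history(history, budget=12)
```

Dropping the oldest turns first loses long-range context, but it guarantees the prompt always fits, which matters more for a retrieval-backed bot where each answer mostly depends on the freshly retrieved chunks.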
🏆 Accomplishments that I am proud of
- Within this timeframe, everything is accessible and easy to reproduce, laying the groundwork for an excellent future for Delta-Buddy.
- It works locally, in a Databricks notebook, via the API... and it's easily extendable.
- Delta-Buddy ships with many helpers and is extensible to every use case, making it easy to test Dolly on many kinds of documents (PDF, HTML, email, GitHub...).
- It already launches via chat, API, and notebooks... easy to change and adapt for everyone!
📚 What I learned
Well, a lot... This was my first investigation of an LLM. I could only set it up after understanding how data preparation, tokenization, and model serving work! My best advice: train or prepare the model on clean, well-curated datasets with a clear purpose and focus.
⚙️ What's next for Delta-Buddy
- Integrate with delta-rs to ask questions about DataFrames, pandas-ai style.
- Improve the data preparation with all information I could find on Databricks or Delta Lake.
- Reduce the run time of the notebook so Dolly answers questions faster inside Databricks notebooks.
- Integrate MLFlow serving for asking questions and receiving answers in production.
- Improve the FastAPI service so it can be reached from anywhere.
- Include chat history in the conversation.
Built With
- chromadb
- dolly
- fastapi
- langchain
- notebook
- python