chat for text-to-sql

Inspiration

I had seen some text-to-SQL applications before, but I wasn't clear on how they were implemented. Out of curiosity, I decided to develop a text-to-SQL application for my own use and explore its internal workings.

What it does

I have implemented two common functionalities:

Question answering: the application performs vector semantic similarity matching on my question, provides the matching table structures to the LLM, and answers my SQL question.
Trigger/execute: the application can run the related SQL statements, except for DROP and CREATE.
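
The question-answering flow can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's actual code: the real application relies on TiDB's vector search, while here the similarity is computed in plain Python. The knowledge base entries, the three-dimensional embeddings, and the names retrieve_schema and build_prompt are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical knowledge base: (stored question, toy embedding, table structure).
# Real embeddings would come from an embedding model and have many more dimensions.
KNOWLEDGE_BASE = [
    ("How many orders did each customer place?", [0.9, 0.1, 0.0],
     "CREATE TABLE orders (id INT, customer_id INT, total DECIMAL(10, 2))"),
    ("Which products are out of stock?", [0.1, 0.9, 0.1],
     "CREATE TABLE products (id INT, name VARCHAR(64), stock INT)"),
]


def retrieve_schema(question_embedding, knowledge_base=KNOWLEDGE_BASE):
    """Return the table structure of the most similar stored question."""
    best = max(
        knowledge_base,
        key=lambda entry: cosine_similarity(question_embedding, entry[1]),
    )
    return best[2]


def build_prompt(question, schema):
    """Combine the retrieved table structure and the question into an LLM prompt."""
    return (
        "You are a SQL assistant. Use only the table below.\n"
        f"Table structure:\n{schema}\n"
        f"Question: {question}\n"
        "Answer with a single SQL statement."
    )
```

The resulting prompt string would then be sent to the LLM; that call is left out because it depends on the API client and key configured for the project.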

How we built it

You can build it by following the steps in the README file. Just set up the relevant TiDB Serverless connection information and a ChatGPT API key. I have prepared the basic SQL structure and test data. Simply run python example.py prepare to get started.

Challenges we ran into

This was my first time working with TiDB, and I found very little documentation on using it from Python, which made trial and error quite costly.
I could not lower the similarity threshold for vector matching, so the only option was to run the query and then pass the result directly to the large model.
Not every question needs a knowledge base search, because some user questions are just conversation. I therefore apply keyword template matching first and search the knowledge base only when a question matches.
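
The keyword-based routing described in the last point might look something like the sketch below. The patterns in SQL_PATTERNS and the function names are hypothetical, since the actual keyword templates are not listed in this write-up.

```python
import re

# Hypothetical keyword templates; the project's real list may differ.
SQL_PATTERNS = [
    r"\bsql\b",
    r"\bquery\b",
    r"\btables?\b",
    r"\bselect\b",
    r"\bhow many\b",
]


def is_sql_question(text):
    """Return True if the text matches any SQL-related keyword template."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in SQL_PATTERNS)


def route(text):
    """Send SQL questions to the knowledge base search, everything else to plain chat."""
    return "knowledge_base" if is_sql_question(text) else "chat"
```

For example, route("Write a SQL query that lists all customers") selects the knowledge base path, while route("Hello, how are you today?") falls through to normal conversation. A fixed keyword list like this misses paraphrased questions, which is the weakness a semantic matcher would address.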

Accomplishments that we're proud of

Vector search can be performed based on the user's SQL questions, while other questions will be answered through normal conversation.
SQL statements provided by the user can be executed, and the results will be returned.
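
As an illustration of the execution check, here is a minimal keyword-based guard that rejects DROP and CREATE before anything is sent to the database. It is a sketch under stated assumptions rather than the project's implementation: the names first_keyword and is_allowed are hypothetical, statements are split naively on semicolons, and any statement that does not start with a plain keyword (for example, one starting with a comment) is rejected to stay on the safe side.

```python
import re

# Statement types that must never be executed, per the description above.
BLOCKED_KEYWORDS = {"DROP", "CREATE"}


def first_keyword(statement):
    """Return the leading SQL keyword in upper case, or '' if there is none."""
    match = re.match(r"\s*([A-Za-z]+)", statement)
    return match.group(1).upper() if match else ""


def is_allowed(sql):
    """Allow execution only if every statement starts with a non-blocked keyword."""
    # Naive split: a semicolon inside a string literal would be split too.
    statements = [part for part in sql.split(";") if part.strip()]
    if not statements:
        return False
    for statement in statements:
        keyword = first_keyword(statement)
        if keyword == "" or keyword in BLOCKED_KEYWORDS:
            return False
    return True
```

A caller would run is_allowed on the user's input and pass it to the database connection only when the result is True. Because the check looks only at leading keywords, it stays simple but can misjudge unusual input.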

What we learned

I delved into TiDB's vector search capabilities.
I implemented a complete text-to-SQL process myself and became familiar with the related technologies.

What's next for chat for text-to-sql

The stored vectors are currently based only on questions I wrote myself. The next step is to use a large model to help generate related questions to match against.
Question identification currently relies only on internal keyword matching. I hope to introduce vector semantic matching to identify SQL questions more accurately.
The check performed during SQL execution is also still based on keyword matching; I plan to optimize it to improve accuracy.