Vector Based Fuzzy Name Search

Inspiration

There are dedicated vector databases such as Milvus, Pinecone, Weaviate, and ChromaDB. Now that TiDB is adding support for vector storage and retrieval, we get the best of both worlds in a single product. The aim of this project is a case study: store vectors in TiDB, run similarity searches against them, and share the experience.

What it does

Rather than using FastText or another general-purpose NLP embedding library, the aim is to build a custom vector embedding function designed exclusively for name search, one that is considerably more accurate for this task than the established neural-network libraries for NLP.
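The write-up does not show the embedding function itself, so the following is only a minimal sketch of what a name-specific embedding could look like. It assumes one slot per letter a-z (which would match the vector size of 26 mentioned later) filled with normalized letter counts. The names `name_embedding` and `cosine_distance` are hypothetical, not the project's actual code.

```python
import math
import string

DIM = 26  # assumption: one slot per letter a-z, matching the stated vector size


def name_embedding(name: str) -> list[float]:
    """Map a name to a 26-dimensional, L2-normalized letter-count vector."""
    counts = [0.0] * DIM
    for ch in name.lower():
        if ch in string.ascii_lowercase:  # ignore spaces, digits, punctuation
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    # A name with no ASCII letters maps to the zero vector.
    return [c / norm for c in counts] if norm else counts


def cosine_distance(u: list[float], v: list[float]) -> float:
    """1 - cosine similarity; inputs are assumed to be unit-length or zero."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


# Compare a misspelled query against a few candidate names.
query = name_embedding("Jonh Smith")  # transposed letters
for candidate in ("John Smith", "Joan Smyth", "Maria Garcia"):
    dist = cosine_distance(query, name_embedding(candidate))
    print(f"{candidate}: {dist:.3f}")
```

With this particular sketch, a transposition such as "Jonh" has the same letter counts as "John", so the distance is approximately zero, whereas a `LIKE '%Jonh%'` pattern would not match at all. The trade-off is that anagrams become indistinguishable, so a real embedding would likely need some positional information as well.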

How we built it

We wrote a custom Python function that generates a compact, accurate vector embedding for names (people, cities, or anything else you like). We used the Faker library to generate dummy data and ran the custom embedding function over it to produce the embeddings. We loaded the result with the Bulk Import option and queried the table with a vector similarity function instead of the SQL LIKE operator, which gave better results.
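The data preparation step can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the helper names `vector_literal` and `write_import_csv` are hypothetical, the `'[v1,v2,...]'` text form is assumed to be what the import accepts for a vector column, and the file is written without a header row. In the project the names came from Faker (for example `Faker().name()`); the sketch uses hard-coded names and a toy stand-in embedding so that it runs without extra dependencies.

```python
import csv
import io
from typing import Callable, Iterable, TextIO


def vector_literal(vec: Iterable[float]) -> str:
    """Serialize a vector as '[v1,v2,...]' text (assumed import format)."""
    return "[" + ",".join(f"{x:.6g}" for x in vec) + "]"


def write_import_csv(
    names: Iterable[str],
    embed: Callable[[str], Iterable[float]],
    out: TextIO,
) -> int:
    """Write header-less rows of (id, name, vector literal); return the row count."""
    writer = csv.writer(out)
    count = 0
    for count, name in enumerate(names, start=1):
        writer.writerow([count, name, vector_literal(embed(name))])
    return count


def toy_embed(name: str) -> list[float]:
    """Toy stand-in for the custom embedding: (length, number of spaces)."""
    return [float(len(name)), float(name.count(" "))]


buf = io.StringIO()
rows = write_import_csv(["Ada Lovelace", "Alan Turing"], toy_embed, buf)
print(rows)
print(buf.getvalue())
```

Passing the embedding function in as a parameter keeps the file-writing logic independent of whichever embedding is used, so the same helper would work for the 26-dimensional name embedding.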

Challenges we ran into

During import we hit strange errors related to the column headers. We skipped the headers, and the import then ran successfully.

Accomplishments that we're proud of

1. A custom embedding that is compact (vector size 26)
2. Completed the project in a few hours

What we learned

1. Having vector search inside an RDBMS brings the best of both worlds. There is no need for two different products.
2. TiDB avoided the problems we ran into with dedicated vector databases, where the following were hard:
      1. Using a custom vector embedding (ChromaDB): [link](https://medium.com/@shreyavishwalingam/custom-vector-embedding-with-chroma-db-vector-database-a44594d54da0)
      2. Getting up and running (Milvus)
      3. Performance problems (Pinecone)
      4. Peculiar errors (Weaviate): the custom embedding worked when we inserted but failed when we retrieved. We reported the error to Weaviate.
      5. Importing vector data is more transparent than in vector databases, where we had to experiment a lot to get the vector into the right "format" before it could be inserted successfully. An example of such a conversion is given below.

Function to generate a vector embedding for an image (Python, requires pip install imgbeddings):

    import requests
    from PIL import Image
    from imgbeddings import imgbeddings

    def VectorOfImage(strURL):
        # Download the image and open it with Pillow
        image = Image.open(requests.get(strURL, stream=True).raw)
        # Compute the embedding and flatten it into a plain Python list
        ibed = imgbeddings()
        embedding = ibed.to_embeddings(image)
        vector = embedding.flatten().tolist()
        return vector

6. The size of the vector is defined up front. Enforcing data quality this way is a big plus compared to regular vector databases.
7. We can see the status of indexing, which is an additional plus compared to regular vector databases.
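To make the fixed vector size and the similarity query concrete, here is a rough sketch of the SQL involved, held in Python strings. It assumes TiDB's `VECTOR(n)` column type and `VEC_COSINE_DISTANCE` function. The table name `person_names`, its columns, and the helper `search_params` are hypothetical and do not come from the project.

```python
DIM = 26  # vector size used by the custom name embedding

# Assumed TiDB syntax: a VECTOR column with a fixed dimension.
CREATE_TABLE_SQL = f"""
CREATE TABLE person_names (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    embedding VECTOR({DIM})
)
"""

# Nearest-neighbour search by cosine distance instead of a LIKE pattern.
# The %s placeholders follow the DB-API "format" paramstyle used by
# common MySQL-compatible Python drivers.
SEARCH_SQL = """
SELECT id, name, VEC_COSINE_DISTANCE(embedding, %s) AS distance
FROM person_names
ORDER BY distance
LIMIT %s
"""


def search_params(query_vec: list[float], limit: int = 5) -> tuple[str, int]:
    """Validate the query vector and build the parameters for SEARCH_SQL."""
    if len(query_vec) != DIM:
        # Mirror the database-side check: reject vectors of the wrong size early.
        raise ValueError(f"expected {DIM} dimensions, got {len(query_vec)}")
    literal = "[" + ",".join(str(float(x)) for x in query_vec) + "]"
    return literal, limit


params = search_params([0.0] * DIM, limit=3)
print(SEARCH_SQL.strip())
print(params)
```

With a database connection, these would be passed to a cursor as `cursor.execute(SEARCH_SQL, params)`. Checking the dimension on the client side as well simply surfaces a malformed vector before the round trip to the server.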

What's next for Vector Based Fuzzy Name Search

Comparing TiDB's vector functionality with other SQL databases built specifically for vector workloads:

1. MyScaleDB: developed from the ClickHouse code base, with vector functionality added.
2. pgvectorscale: from the makers of TimescaleDB, built on top of pgvector for PostgreSQL. The product claims the following features, which are not commonly available in vector databases:
      1. A new index type called StreamingDiskANN, inspired by the DiskANN algorithm and based on research from Microsoft.
      2. Statistical Binary Quantization: developed by Timescale researchers, this compression method improves on standard Binary Quantization.

We also want to see how custom vector embeddings for a large dataset can be generated asynchronously and in parallel using Polars, to tune performance even further.

Built With

faker
python

Updates

Balachandar Ganesan started this project — Aug 14, 2024 12:36 PM EDT
