Inspiration
There are dedicated vector databases such as Milvus, Pinecone, Weaviate, and ChromaDB. Now that TiDB is adding support for vector storage and retrieval, we get the best of both worlds in a single product. The aim is to do a case study: store vectors in TiDB, perform similarity search, and share the experience.
What it does
Rather than using FastText or any other NLP vector embedding library, the aim is to build a custom vector embedding function exclusively for name search, one that is a lot more accurate for names than the established neural-network libraries for NLP.
How we built it
We wrote a custom function in Python that generates a more compact and accurate vector embedding for names (person, city, or anything else you want).
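The write-up does not show the embedding function itself. As a rough illustration only, here is a minimal sketch that assumes the 26-dimension embedding is a normalized letter-frequency vector (one slot per letter, a to z). The names `name_embedding`, `cosine_distance`, and `closest_names` are hypothetical, and the function we actually used may work differently.

```python
import math
import string

ALPHABET = string.ascii_lowercase  # 26 letters -> 26 dimensions


def name_embedding(name: str) -> list[float]:
    """Assumed scheme: count each letter a-z, then L2-normalize."""
    counts = [0.0] * len(ALPHABET)
    for ch in name.lower():
        idx = ALPHABET.find(ch)
        if idx >= 0:  # ignore spaces, digits, and punctuation
            counts[idx] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0.0:
        return counts  # the name had no letters; return the zero vector
    return [c / norm for c in counts]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes inputs are unit length (or zero)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def closest_names(query: str, candidates: list[str], k: int = 3) -> list[str]:
    """Rank candidate names by cosine distance to the query name."""
    q = name_embedding(query)
    ranked = sorted(candidates, key=lambda n: cosine_distance(q, name_embedding(n)))
    return ranked[:k]
```

For example, `closest_names("Jonh Smith", ["John Smith", "Mary Jones"], k=1)` picks "John Smith". Under this assumed scheme letter order is ignored, so a transposition typo maps to the same vector as the correct spelling.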
We used the Faker library to generate dummy data and the custom embedding function to generate embeddings for that data. We imported the data with the Bulk Import option and queried the table with a vector similarity function instead of the SQL LIKE operator, which gave better results.
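To make the contrast with LIKE concrete, the sketch below builds the kind of statements involved. The `VECTOR(26)` column type and the `VEC_COSINE_DISTANCE` function come from TiDB's vector search feature; the table name `people`, the column names, the `%s` placeholders (as used by MySQL-compatible Python drivers), and the helper functions are assumptions for illustration. The exact queries run for this project are not shown in this write-up.

```python
DIMENSIONS = 26  # size of the custom name embedding

# Hypothetical table layout: the vector column has a fixed size.
CREATE_TABLE_SQL = """
CREATE TABLE people (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    embedding VECTOR(26)
)
"""

# Pattern matching: only finds names containing the exact substring.
LIKE_SQL = "SELECT name FROM people WHERE name LIKE %s"

# Similarity search: returns the nearest names even with typos.
SIMILARITY_SQL = """
SELECT name, VEC_COSINE_DISTANCE(embedding, %s) AS distance
FROM people
ORDER BY distance
LIMIT %s
"""


def vector_literal(vec):
    """Format a list of numbers as the '[x,y,...]' text form of a vector."""
    return "[" + ",".join(str(float(v)) for v in vec) + "]"


def similarity_params(query_vec, limit=5):
    """Build the parameter tuple for SIMILARITY_SQL, checking the size first."""
    if len(query_vec) != DIMENSIONS:
        raise ValueError(f"expected {DIMENSIONS} values, got {len(query_vec)}")
    return (vector_literal(query_vec), limit)
```

With a MySQL-compatible driver, the call would look something like `cursor.execute(SIMILARITY_SQL, similarity_params(query_vec))`, where `query_vec` is the embedding of the name being searched for.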
Challenges we ran into
During import, we ran into strange errors related to the column headers. We skipped the headers and the import then ran successfully.
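One way to sidestep header problems is to write the import file without a header row at all. The sketch below is an assumption about how such a file could be produced, not the exact procedure we followed; `write_rows_without_header` is a hypothetical helper and the three-column layout (id, name, embedding) is illustrative.

```python
import csv
import io


def write_rows_without_header(rows, fileobj):
    """Write (id, name, vector) rows as CSV with no header line.

    The vector is serialized as '[x,y,...]' text; the csv module quotes
    that field automatically because it contains commas.
    """
    writer = csv.writer(fileobj)
    for row_id, name, vec in rows:
        literal = "[" + ",".join(str(float(v)) for v in vec) + "]"
        writer.writerow([row_id, name, literal])


# Usage: write to an in-memory buffer (a real run would open a file
# with newline="" as the csv module recommends).
buffer = io.StringIO()
write_rows_without_header([(1, "Alice", [1.0, 0.0])], buffer)
csv_text = buffer.getvalue()
```

Whether the importer expects a header row depends on the import settings, so this is only one possible workaround.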
Accomplishments that we're proud of
1. A custom embedding that is compact in size (26 dimensions)
2. Completed the project in a few hours
What we learned
1. Having vector search inside an RDBMS brings the best of both worlds. There is no need for two different products.
2. It is easier to use than dedicated vector databases, where we found the following hard:
   1. Custom vector embedding (ChromaDB),
   [link](https://medium.com/@shreyavishwalingam/custom-vector-embedding-with-chroma-db-vector-database-a44594d54da0)
   2. Getting up and running (Milvus)
   3. Performance problems (Pinecone)
   4. Peculiar errors (Weaviate): the custom embedding worked when we inserted but failed when we retrieved. The error was reported to Weaviate.
5. More transparent importing of vector data, unlike vector databases, where we had to experiment a lot to get the vector into the right "format" before it could be inserted successfully. Given below is a function that generates a vector embedding for an image.
Python
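# Install the dependency first: pip install imgbeddings
import requests
from PIL import Image
from imgbeddings import imgbeddings

def VectorOfImage(strURL):
    # Download the image and open it with Pillow
    image = Image.open(requests.get(strURL, stream=True).raw)
    # Generate the embedding for the image
    ibed = imgbeddings()
    embedding = ibed.to_embeddings(image)
    # Flatten to a plain Python list so it can be stored as a vector
    vector = embedding.flatten().tolist()
    return vector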
6. The size of the vector is defined up front, which helps ensure data quality. This is a big plus compared to regular vector databases.
7. We can see the status of indexing, which is an additional plus compared to regular vector databases.
What's next for Vector Based Fuzzy Name Search
Comparing TiDB vector functionality with other SQL databases built exclusively for vector functionality:
1. MyScaleDB: developed from the ClickHouse code base, with vector functionality.
2. pgvectorscale: from Timescale, built on top of pgvector for PostgreSQL. The product claims the following features, which are not commonly available in vector databases:
   - A new index type called StreamingDiskANN, inspired by the DiskANN algorithm and based on research from Microsoft.
   - Statistical Binary Quantization: developed by Timescale researchers, this compression method improves on standard Binary Quantization.

See how custom vector embeddings for a large dataset can be generated asynchronously and in parallel using Polars to tune performance even further.
Built With
- faker
- python