What is a Vector Database, How it Works, and How Can You Create One From Scratch!

Harjot Pahwa
15 min read · May 20, 2024



What comes to your mind when you think of unstructured data? Probably a massive repository of images, videos, audio files, and the like. Unstructured data is the oil for AI model development, and handling it effectively becomes a priority.

One way to organise this data would be through human-assigned labels. But manual labelling is tedious, limited, and not scalable. Unstructured data comes with several challenges, such as:

  • Difficulty in searching and analysing: Unstructured data lacks a predefined schema, making it hard to search using traditional methods like keyword queries.
  • Limited insights: Unstructured data can be rich in information, but it’s often locked away because traditional databases can’t process it effectively.
  • Scalability issues: As the amount of unstructured data grows, traditional databases can struggle to keep up.

Contents:

  1. Introduction
  2. Vector Embeddings
  3. Comparison with Traditional Databases
  4. Popularity and Need for Vector Databases
  5. How a Vector Database Works
  6. Creating Vector Database from Scratch
  7. Conclusion

1. Introduction


Vector databases, or Vector DBs for short, are a class of database systems designed to efficiently handle vector space operations, which are crucial in an era dominated by AI. Unlike traditional databases that store and manage data in rows and tables, vector databases are optimised for operations on vectors: one-dimensional arrays of numbers that represent data points in a high-dimensional space.

Here's an analogy: imagine a library full of books without any categorisation system. Searching for a specific topic would be very difficult. A vector database acts like a new way to organise the books by their content, allowing you to find similar books (data points) even if the titles or exact wording are different.

1.1. The Crux of Vector DB

  • The core idea behind vector databases is to support similarity search, which is the ability to find elements similar to a query item, rather than exact matches.
  • This is particularly useful for tasks such as finding the most similar images, documents, or even audio files, where the notion of similarity doesn’t translate well into traditional SQL queries.

1.2. The Need for Vector DB

  • As data grows in complexity and volume, traditional databases struggle to keep up with the demands of near-instantaneous, high-dimensional data retrieval. Vector databases step in to fill this gap, offering a specialised solution that can handle the intricacies of unstructured data that’s often represented as vectors in machine learning models.
  • Their rise to prominence is closely tied to the explosion of data-driven applications that require rapid processing of complex, unstructured data. From recommendation systems that suggest products to users, to facial recognition software that needs to sift through millions of images, vector databases offer performance and scalability that traditional databases simply cannot match for these specific tasks.

1.3. What’s in it for you?

In this blog post, I’ll explore the ins and outs of vector databases: why they are becoming a necessity in certain fields, how they function under the hood, and what advantages they bring to the table. I’ll also walk you through the process of creating your own vector database from scratch, providing insights into the challenges and rewards of this exciting technology.

Whether you're a seasoned database professional, a data scientist, an ML engineer, or simply a curious technologist, understanding vector databases is becoming increasingly important as the landscape of data storage and retrieval evolves. So, let's dive in and demystify the world of vector databases.

2. Vector Embeddings

What if there was a way to bring order and structure to these massive amounts of unstructured data? Well, if you think about it, everything can be represented using vector embeddings. I have drawn a parallel between vector embeddings and colours in this blog, which should give you an intuitive understanding of vector embeddings.

Vector embeddings are a type of mathematical representation for data, particularly unstructured data like text, images, or audio. They essentially convert these complex data types into numerical vectors, which are lists of numbers.

The key thing about these vectors is that their positions in a high-dimensional space correspond to the meaning or relationships between the data they represent.

2.1. Breakdown

  • Capturing meaning: Imagine a vector embedding for the word “king.” The numbers within the vector would encode information about the meaning of “king,” such as its relation to words like “queen,” “crown,” or “royal.”
  • Similarity in space: Words with similar meanings will have vector embeddings that are closer together in this high-dimensional space. Going back to the “king” example, the embedding for “queen” would likely be very close to “king” in this space.
  • Unlocking analysis: By using vector similarity, machines can process and analyse unstructured data more effectively. For tasks like recommendation systems or natural language processing, vector embeddings allow machines to understand the relationships between different data points.

Think of it like this — Vector embeddings are a way to translate the complexities of human language, images, or audio into a numerical language that machines can understand and work with. This allows them to find similar data points and extract meaningful insights from unstructured information.
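To make this concrete, here is a minimal sketch using made-up 3-dimensional vectors (real embeddings from a model like BERT have hundreds of dimensions; the numbers below are purely illustrative):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: closer to 1.0 means
    # the vectors point in more similar directions
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy "embeddings" (illustrative numbers, not from a real model)
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.82, 0.15])
apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))  # high: related meanings
print(cosine_similarity(king, apple))  # low: unrelated meanings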

3. Comparison with Traditional Databases


When we think about databases, the image that typically comes to mind is that of a traditional relational database. These databases are designed around a structured schema where data is stored in tables consisting of rows and columns. Each row represents a record with a unique identifier, and each column represents a different data attribute. This model is excellent for a wide range of applications, but it has its limits, especially when dealing with complex, unstructured data.

3.1. Data Structure Differences (Vectors vs. Tables)

  • Vector databases depart from the tabular data model by focusing on vectors. In machine learning, vectors are used to encode features of unstructured data such as text, images, and audio. This allows for a representation that captures the nuances of the data in a form that can be easily processed by algorithms.
  • While traditional databases can store vectors as BLOBs or serialised strings, they are not optimised to perform operations on them. Vector databases, on the other hand, are built from the ground up to handle these vectors efficiently. They can quickly compute distances and similarities between vectors, which is a fundamental operation for tasks such as content-based retrieval, clustering, and classification.

3.2. Query Processing (Similarity Search vs. Exact Match)

  • Traditional databases excel at exact match queries. For instance, finding a customer by their ID or retrieving all orders placed on a specific date. These operations are fast and efficient, thanks to indexing strategies like B-trees and hash maps.
  • Vector databases, however, shine when it comes to similarity search. This type of search doesn’t look for an exact match but rather for the most similar items. For example, in a vector database, you could query for images that are visually similar to a provided image. The database would then return a list of images ranked by how closely they resemble the query image based on vector similarity measures such as cosine similarity or Euclidean distance.
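To give a rough sense of what "ranked by similarity" means in code, here is a minimal sketch where random vectors stand in for stored image embeddings (the shapes and numbers are assumptions purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 128))  # stand-ins for stored image vectors
query = rng.normal(size=128)             # stand-in for the query image's vector

# Cosine similarity between the query and every stored vector
sims = database @ query / (np.linalg.norm(database, axis=1) * np.linalg.norm(query))

# Indices of the stored vectors, ranked from most to least similar
ranking = np.argsort(-sims)
print(ranking[:5])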

3.3. Scalability and Performance Considerations

  • Scalability and performance are critical factors for any database. Traditional databases are designed to scale vertically, with performance improvements often coming from upgrading hardware. However, they can struggle with horizontal scaling and with the processing demands of large volumes of high-dimensional vector data.
  • Vector databases are often designed with horizontal scalability in mind, allowing them to distribute the workload across multiple nodes. This is particularly important for vector operations, which are computationally intensive and can benefit from parallel processing. Moreover, vector databases typically employ specialised indexing techniques that are geared towards high-dimensional data, such as KD-trees or ANNOY, which are not typically found in traditional databases.
  • The performance of vector databases is enhanced by these specialised indexing and search algorithms, which can process and retrieve similar vectors with remarkable speed. This makes vector databases an ideal choice for applications that require the rapid analysis of large datasets, such as real-time recommendation systems, image retrieval platforms, and natural language processing applications.

In summary, while traditional databases remain a staple for applications requiring structured data management and exact match queries, vector databases offer unique advantages for applications dealing with high-dimensional, unstructured data and the need for similarity search. The choice between a traditional database and a vector database ultimately depends on the specific needs and nature of the data being handled.

4. Popularity and Need for Vector Databases


In the past decade, there has been an unprecedented rise in the volume of data generated and the complexity of tasks that we expect our technologies to handle. Much of this complexity is driven by the rapid advancement and integration of AI into various industries. These technologies rely heavily on the ability to quickly analyse and derive insights from large datasets, many of which are unstructured and high-dimensional in nature. Herein lies the growing popularity and necessity for vector databases.

4.1. The Rise of AI and Machine Learning Workloads

AI and ML models, particularly those involving deep learning, transform raw data such as text, images, and sounds into a numerical format known as feature vectors. These vectors capture the essential aspects of the data and are used by models to make predictions, classify data, or find patterns. As AI and ML become increasingly pervasive, the need to store, search, and manage these vectors efficiently has become critical. Vector databases are purpose-built to address these needs, providing the foundation for AI-driven applications to perform at scale.

4.2. Use Cases Driving the Popularity

Several use cases are driving the adoption of vector databases, including:

  • Recommendation Systems: Personalised recommendations, whether for products on e-commerce sites or for content on streaming platforms, rely on understanding user preferences. Vector databases can quickly find items that are most similar to a user’s past behaviour or preferences by comparing feature vectors.
  • Image and Video Search: Services like reverse image search require the comparison of visual content at a scale that traditional databases cannot handle. Vector databases can process and compare image representations to find matches or similar items almost instantaneously.
  • Natural Language Processing (NLP): Applications like sentiment analysis, text classification, and language translation work with text converted into vector form. Vector databases facilitate the efficient storage and retrieval of semantic representations of text.
  • Fraud Detection: In financial services, vector databases can help to identify unusual patterns of behaviour by comparing transactions against a baseline of normal activity.
  • Bioinformatics: In the field of genomics, vector databases can store and analyse complex biological data, enabling researchers to match genetic markers and understand genetic variations at scale.

Vector databases are built to efficiently handle the complexity and computational demands of high-dimensional data, which is why they are increasingly becoming a cornerstone of AI and ML infrastructure.

5. How a Vector Database Works


Vector databases are engineered to facilitate efficient operations on high-dimensional data, which is predominantly represented in the form of vectors. Understanding how they work requires an exploration of vector space, indexing mechanisms, and search algorithms that are fundamental to their performance.

5.1. Overview of Vector Space and High-Dimensional Data

A vector space can be thought of as a mathematical n-dimensional space where each dimension represents a feature of the data. For instance, in a 3-dimensional space, a point could be represented as a vector [x, y, z], but in high-dimensional spaces, vectors have many more dimensions, sometimes in the hundreds or thousands. This is typical in machine learning, where each dimension can correspond to a feature learned from the data.

High-dimensional vectors encapsulate a rich amount of information, which makes them incredibly powerful for representing complex data. However, manipulating and searching through these high-dimensional spaces is not trivial, due to the computational intensity involved and the “curse of dimensionality”, which makes traditional search methods inefficient.
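A quick numerical sketch makes this curse visible (the exact numbers will vary with the random seed; the trend is what matters):

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    points = rng.random((1_000, d))
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    # As dimensionality grows, the nearest point is barely closer than the
    # farthest one, so naive distance-based pruning loses its effectiveness
    print(f"d={d}: nearest/farthest distance ratio = {dists.min() / dists.max():.2f}")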

5.2. Indexing Mechanisms

To handle the complexity of high-dimensional vector spaces, vector databases use specialised indexing mechanisms that are designed to partition the space in a way that makes search operations more efficient. Some of these indexing mechanisms include:

  • KD-trees (k-dimensional trees): KD-trees are a type of binary tree that recursively partitions the space into hyper-rectangles. At each level of the tree, all the points are split into two groups along a dimension, which helps narrow down the search space for queries. However, KD-trees can become less effective as the number of dimensions grows due to the curse of dimensionality.
  • HNSW (Hierarchical Navigable Small World): HNSW is a graph-based approach that creates layers of connected nodes, with each layer having a subset of nodes from the layer below. Searches can quickly navigate through the layers to find the closest nodes, making it highly efficient for nearest neighbour searches in high-dimensional spaces.
  • ANNOY (Approximate Nearest Neighbours Oh Yeah): ANNOY uses a forest of trees to facilitate the approximate nearest neighbour search. It builds up multiple random projection trees where each tree is constructed by randomly choosing a hyperplane to split the dataset into two partitions, and then recursively splitting the partitions until each one is small enough.
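To give a feel for how ANNOY builds each tree, here is a simplified sketch of a single random split (real ANNOY picks the splitting hyperplane based on two randomly sampled points, but the idea is the same):

import numpy as np

rng = np.random.default_rng(42)
points = rng.normal(size=(1_000, 64))  # toy dataset of 64-dimensional vectors

# Choose a random hyperplane through the origin via its normal vector
normal = rng.normal(size=64)

# Each point lands on one side of the hyperplane depending on the sign of
# its projection; ANNOY recurses on each half until the leaves are small
left = points[points @ normal < 0]
right = points[points @ normal >= 0]
print(len(left), len(right))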

5.3. Nearest Neighbour Search Algorithms

The primary operation that vector databases are optimised for is the nearest neighbour search, which finds the closest vectors to a given query vector. The closeness is often determined by distance metrics like Euclidean distance or cosine similarity. Nearest neighbour search algorithms in vector databases include:

  • Exact Nearest Neighbour Search: This method computes the distance between the query vector and every other vector in the database to find the closest match. While it guarantees the most accurate results, it is computationally intensive and not practical for large datasets.
  • Approximate Nearest Neighbour Search (ANN): To overcome the limitations of exact searches, ANN algorithms are used to find close, if not exact, neighbours more quickly. These algorithms, such as ANNOY, Locality-Sensitive Hashing (LSH) and tree-based partitioning, trade off a small amount of accuracy for a substantial gain in speed and are the backbone of many vector database operations.
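To ground the exact method above: a brute-force nearest neighbour search is only a few lines of NumPy, and its cost, one distance computation per stored vector for every query, is plain to see (a minimal sketch with random stand-in data):

import numpy as np

def exact_nearest_neighbours(query, vectors, k=3):
    # Distance from the query to every stored vector: O(n) work per query,
    # which is exactly the cost that approximate indexes are built to avoid
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(10_000, 128))
query = rng.normal(size=128)
print(exact_nearest_neighbours(query, vectors))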

By leveraging these specialised indexing mechanisms and search algorithms, vector databases can efficiently process and retrieve relevant high-dimensional data. This makes them uniquely suited for a wide range of applications where speed and scalability are crucial for handling complex datasets.

6. Creating Vector Database from Scratch


Creating a fully functional vector database is quite a challenging task, one that requires meticulous planning and careful engineering. For the purposes of this blog, I will create a sample database that you can store in and retrieve from a local directory, without building any scaling architecture or endpoint APIs.

We will be creating a database that stores text data and can find sentences similar to a given query sentence. For this, we will use a BERT model to create embeddings for the text data, and the ANNOY algorithm for indexing and searching. So, let's get started.

6.1. Install the required libraries

We will be using the Hugging Face transformers library to load the BERT model, and the annoy package for indexing and search. Since we run the model with PyTorch here, we install torch as well.

pip install transformers torch annoy

6.2. Load the libraries

Now, we will import all the required packages.

from transformers import BertTokenizer, BertModel
import torch
from annoy import AnnoyIndex
import numpy as np

6.3. Initialise BERT Model

This will fetch the BERT model from Hugging Face's repository and load it into the tokenizer and model variables. Tokenisation is an important part of NLP, as it converts text into a format that can be easily understood and processed by machine learning models.

# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Let’s also set the embedding dimension from the model because that is a crucial variable in building the ANNOY index. This must be done for ANNOY to know how to interpret the vectors it receives and how to organise them within its internal data structures.

# Choose the number of dimensions for the embeddings
# BERT model embeddings are of 768 dimensions for the base model
embedding_dim = model.config.hidden_size

6.4. Initialise ANNOY Index

To initialise the ANNOY index, we must specify the embedding dimensions of the input vectors (that we stored in previous step), and the distance metric for calculation.

# Build an Annoy index
# Use 'angular' as the metric
annoy_index = AnnoyIndex(embedding_dim, metric='angular')

  • Angular distance is a measure of the angle between two vectors in the vector space. When using the "angular" metric, ANNOY is essentially measuring the cosine of the angle between two vectors to determine their similarity. A smaller angle (and thus a higher cosine value) indicates greater similarity.
  • In the context of ANNOY, the angular distance is based on cosine similarity but is transformed so that similar vectors have a small angular distance.
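Concretely, ANNOY's angular distance is the Euclidean distance between the L2-normalised vectors, which works out to sqrt(2 * (1 - cosine similarity)). Here is a small sketch to verify that relationship (this is also why the distances printed in step 6.7 can be larger than 1):

import numpy as np

def annoy_angular_distance(u, v):
    # ANNOY's 'angular' metric: Euclidean distance between the
    # L2-normalised vectors, i.e. sqrt(2 * (1 - cosine similarity))
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.sqrt(2.0 * (1.0 - cos_sim))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])  # 45 degrees apart
print(annoy_angular_distance(u, v))  # ~0.765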

6.5. Adding vectors to ANNOY index

Now, we will convert the input data into vector embeddings using the BERT model, and add them to the ANNOY index we initialised earlier.

# Assume we have some texts to embed
texts = [
    "Hello, how are you?",
    "Hope you are loving this blog so far",
    "Subscribe to my feed for more such useful content on AI and ML",
    # ... add more texts as needed
]

We define a function that creates the vector embeddings.

def get_bert_embedding(text):
    # Tokenise the text and convert it to tensors
    inputs = tokenizer(text, return_tensors='pt',
                       padding=True, truncation=True)
    # Get embeddings from the BERT model (no gradients needed for inference)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the mean of the last layer hidden states as the sentence embedding
    embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
    return embeddings[0]

Now, for every text, we will call the above function and add the resultant embedding to the index.

# Add items to the index
for i, text in enumerate(texts):
    vector = get_bert_embedding(text)
    annoy_index.add_item(i, vector)

6.6. Building the ANNOY index

Once we are done adding items, it's time to build and save the index.

# Build the index with a forest of 10 trees
annoy_index.build(10)

# Save the index to disk
annoy_index.save('my_vector_database.ann')

# Later, the saved index can be loaded into a fresh AnnoyIndex
# annoy_index = AnnoyIndex(embedding_dim, metric='angular')
# annoy_index.load('my_vector_database.ann')

ANNOY uses a forest of trees, where each tree is a binary tree constructed by randomly splitting the space with hyperplanes.

  • Construction: The dataset is recursively divided by these random hyperplanes until each leaf node of the tree contains a small number of points. Each tree in the forest is built using different random choices, which increases the chances that the true nearest neighbours are found somewhere in the forest.
  • Searching: During a query, ANNOY traverses the trees in the forest to find the leaf nodes that the query vector falls into. It then searches within these nodes and their neighbours. The randomness in the construction ensures that the search covers different parts of the space, providing a good approximation of the nearest neighbours.

6.7. Find the nearest neighbours to a query

Finally, it’s time to find the nearest neighbours to a given query from our ANNOY Vector Database.

# Now you can find the nearest neighbours to a given query
query = "useful blog on AI and ML"
query_vector = get_bert_embedding(query)

# Find the top 3 nearest neighbours
nearest_neighbors = annoy_index.get_nns_by_vector(query_vector, 3, include_distances=True)

# Print the nearest neighbours
for neighbor_id, distance in zip(*nearest_neighbors):
    print(f"Neighbor ID: {neighbor_id}, Distance: {distance}, Text: {texts[neighbor_id]}")

Running all this code will give us:

Neighbor ID: 2, Distance: 0.7112755179405212, Text: Subscribe to my feed for more such useful content on AI and ML
Neighbor ID: 1, Distance: 0.8993874788284302, Text: Hope you are loving this blog so far
Neighbor ID: 0, Distance: 1.0436733961105347, Text: Hello, how are you?

As you can see, the distance is smallest for the text with ID 2, which is the most similar to the given query!

7. Conclusion

Throughout this exploration, we’ve delved into the intricate world of vector databases, unveiling their pivotal role in powering modern applications that handle complex, unstructured data. Vector databases represent a significant departure from traditional relational databases, providing specialised capabilities for managing high-dimensional vectors that are essential for AI and machine learning tasks.

The importance of vector databases cannot be overstated. They are the engines behind the scenes of services we use daily — from recommending the next product we might like to buy, to instantly fetching relevant images based on a search query, to understanding natural language in real-time interactions. As data continues to grow in volume and complexity, the relevance and necessity of vector databases will only increase.

For developers, engineers, and organisations looking to stay ahead in a data-driven landscape, there is a clear call to action: explore and experiment with creating custom vector databases. The tools and technologies available today make it more accessible than ever to build a vector database tailored to your specific needs. Whether leveraging existing libraries like ANNOY and FAISS or developing bespoke solutions, the potential to innovate and improve data retrieval processes is vast.

Well, I hope this tutorial was useful for you. If you want to stay updated with more such articles, do show some love by leaving as many claps as you can, follow my page, and subscribe to my feed :)

Find me here: https://bento.me/harjotpahwa



Harjot Pahwa

AI Engineer | Integrating AI into businesses and everyday workflows | Mentor