vector columns. The vector column type is provided by the pgvector extension. A common way to store vectors is alongside the data they represent. For example, to store embeddings for documents, a common table structure is shown below.
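Here is a minimal sketch of such a table (the names document_embedding and document, and the 1536-dimension vector size, are illustrative assumptions, not fixed conventions):

```sql
-- Illustrative sketch: table/column names and the vector dimension
-- (1536, matching OpenAI's ada-002) are assumptions, not requirements.
CREATE TABLE document_embedding (
    id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    document_id BIGINT REFERENCES document (id), -- the thing being embedded
    metadata JSONB,                              -- metadata about the chunk
    contents TEXT,                               -- the chunk of text embedded
    embedding VECTOR(1536)                       -- the pgvector vector type
);
```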
This structure contains an ID, a foreign key to the document being embedded, metadata, the embedded text (the contents column), and the embedded vector.
This may seem like a bit of a weird design: why aren’t the embeddings simply a separate column in the document table? The answer has to do with the context length limits of embedding models and LLMs. When embedding data, there is a limit to the length of content you can embed (for example, OpenAI’s ada-002 has a limit of 8191 tokens), so if you are embedding a long piece of text, you have to break it up into smaller chunks and embed each chunk individually. At the database layer, this means there is usually a one-to-many relationship between the thing being embedded and its embeddings, which is represented by a foreign key from the embedding to the thing.
Of course, if you do not want to store the original data in the database and are storing only the embeddings, that’s totally fine too: just omit the foreign key from the table. Another popular alternative is to put the foreign key into the metadata JSONB, as sketched below.
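For instance, with the table sketched above, a chunk’s parent document could be recorded under a key in metadata and filtered on directly (the document_id key name here is hypothetical):

```sql
-- Hypothetical example: filter chunks by a parent-document reference
-- stored in the metadata JSONB rather than in a foreign key column.
SELECT id, contents
FROM document_embedding
WHERE metadata->>'document_id' = '42';
```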
Querying vectors using pgvector
The canonical query for vectors finds the stored embeddings closest to an embedding of the user’s query. This is also known as finding the K nearest neighbors. In the example query below, $1 is a parameter taking a query embedding, and the <=> operator calculates the distance between the query embedding and the embedding vectors stored in the database (and returns a float value).
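A sketch of that query, reusing the document_embedding table assumed above (the LIMIT of 10 is an arbitrary choice):

```sql
-- Return the 10 stored rows whose embeddings are closest to the
-- query embedding $1, ordered by cosine distance (<=>).
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;
```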
Because this is ordinary SQL, the query can be freely combined with WHERE clauses (such as filters on the metadata), joins, etc.
Vector distance types
The query shown above uses cosine distance (the <=> operator) as a measure of how similar two embeddings are. But there are multiple ways to quantify how far apart two vectors are from each other. In practice, the choice of distance measure doesn’t matter much, and it is recommended to simply stick with cosine distance for most applications.
Description of cosine distance, negative inner product, and Euclidean distance
Here’s a succinct description of three common vector distance measures:
- Cosine distance a.k.a. angular distance: This measures the cosine of the angle between two vectors. It is derived from cosine similarity, which is not a true “distance” in the mathematical sense but a similarity measure, where a smaller angle corresponds to a higher similarity. It is particularly useful in high-dimensional spaces where the magnitude of the vectors (their length) is less important, such as in text analysis or information retrieval. Cosine similarity ranges from -1 (exactly opposite) to 1 (exactly the same), with 0 typically indicating orthogonality (no similarity); the cosine distance is 1 minus the cosine similarity, so it ranges from 0 to 2.
- Negative inner product: This is simply the negative of the inner product (also known as the dot product) of two vectors. The inner product measures vector similarity based on the vectors’ magnitudes and the cosine of the angle between them. A higher inner product indicates greater similarity. However, it’s important to note that, unlike cosine similarity, the magnitude of the vectors influences the inner product.
- Euclidean distance: This is the “ordinary” straight-line distance between two points in Euclidean space. In terms of vectors, it’s the square root of the sum of the squared differences between corresponding elements of the vectors. This measure is sensitive to the magnitude of the vectors and is widely used in various fields such as clustering and nearest neighbor search.
These measures are closely related on unit vectors (vectors of length 1):
- The cosine distance is 1 − dot product.
- The negative inner product is −dot product.
- The Euclidean distance is related to the dot product: the squared Euclidean distance is 2(1 − dot product).
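In pgvector, each of these measures has its own operator. Here is a quick sketch comparing them on the same pair of orthogonal unit vectors (the literal vectors are arbitrary):

```sql
-- pgvector provides one operator per distance measure:
--   <->  Euclidean (L2) distance
--   <#>  negative inner product
--   <=>  cosine distance
SELECT
    '[1,0,0]'::vector <-> '[0,1,0]'::vector AS euclidean_distance,      -- sqrt(2)
    '[1,0,0]'::vector <#> '[0,1,0]'::vector AS negative_inner_product,  -- 0
    '[1,0,0]'::vector <=> '[0,1,0]'::vector AS cosine_distance;         -- 1
```

The results match the unit-vector relations above: the dot product is 0, so the cosine distance is 1 and the squared Euclidean distance is 2(1 − 0) = 2.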
Recommended vector distance
Using cosine distance, especially on unit vectors, is recommended. This follows OpenAI’s recommendation and, as shown above, on unit vectors the different distance measures all produce the same ranking.

Vector search indexing (approximate nearest neighbor search)
In PostgreSQL and other relational databases, indexing is a way to speed up queries. For vector data, indexes speed up the similarity search query shown above, where you find the embeddings most similar to a given query embedding, i.e., the K nearest neighbors.

Note that the term “index” in the context of vector databases has multiple meanings: it can refer both to the storage mechanism for your data and to the tool that enhances query efficiency. These docs use the latter meaning.
| Algorithm | Build Speed | Query Speed | Need to rebuild after updates |
|---|---|---|---|
| StreamingDiskANN | Fast | Fastest | No |
| HNSW | Fast | Fast | No |
| ivfflat | Fastest | Slowest | Yes |
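As a sketch, here is how the two pgvector-provided index types from the table can be created (StreamingDiskANN comes from a separate extension and is omitted here). The document_embedding table name is the assumption from earlier, and the vector_cosine_ops operator class matches the cosine-distance (<=>) queries above:

```sql
-- HNSW index for cosine-distance queries.
CREATE INDEX ON document_embedding
    USING hnsw (embedding vector_cosine_ops);

-- ivfflat alternative; 'lists' is a tuning parameter controlling the
-- number of clusters (100 here is illustrative, not a recommendation).
CREATE INDEX ON document_embedding
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```

Both index types perform approximate nearest neighbor search, trading a small amount of recall for query speed.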