timescale_vector is the Python interface you use to interact with pgai on Timescale programmatically.

Before you get started with timescale_vector:

- Sign up for pgai on Timescale: Get 90 days free to try pgai on Timescale.
- Follow the Get Started Tutorial: Learn how to use pgai on Timescale for semantic search on a real-world dataset.
## Prerequisites
timescale_vector depends on the source distribution of psycopg2 and adheres
to best practices for psycopg2.
Before you install timescale_vector:
- Follow the psycopg2 build prerequisites.
## Install

To interact with pgai on Timescale using Python:

- Install timescale_vector.
- Install dotenv. In these examples, you use dotenv to pass secrets and keys.
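The install steps above might look like the following; the PyPI package names `timescale-vector` and `python-dotenv` are the commonly published ones, but verify them against the current docs:

```shell
pip install timescale-vector
pip install python-dotenv
```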
## Basic usage of the timescale_vector library

First, import all the necessary libraries and load the connection details from a .env file:
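A minimal sketch of the imports and environment loading. The environment variable name `TIMESCALE_SERVICE_URL` is an assumption; use whatever key your .env file defines for the connection string:

```python
import os
import uuid
from datetime import datetime, timedelta

from dotenv import load_dotenv, find_dotenv
from timescale_vector import client

# Load secrets (such as the service URL) from the .env file.
_ = load_dotenv(find_dotenv())

# Assumed variable name; adjust to match your .env file.
service_url = os.environ["TIMESCALE_SERVICE_URL"]
```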
Next, create the client. The constructor takes the following arguments:

| name | description |
|---|---|
| service_url | URL / connection string |
| table_name | Name of the table to use for storing the embeddings. Think of this as the collection name |
| num_dimensions | Number of dimensions in the vector |
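Creating a synchronous client might look like this sketch; the table name and dimension count are placeholders:

```python
# Placeholder collection name and dimension count.
vec = client.Sync(service_url, "my_embeddings", 3)

# Create the table (and supporting structures) for the collection.
vec.create_tables()
```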
Next, insert some data. Each record consists of:

- A UUID to uniquely identify the embedding
- A JSON blob of metadata about the embedding
- The text the embedding represents
- The embedding itself
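Inserting a record with those four fields could look like the following sketch, assuming a client `vec` created with `client.Sync` for a 3-dimensional collection; the metadata and text are placeholders:

```python
vec.upsert([
    (
        uuid.uuid1(),                      # UUID identifying the embedding
        {"author": "Sam", "year": 2021},   # JSON metadata
        "the brown fox",                   # text the embedding represents
        [1.0, 1.2, 1.3],                   # the embedding itself
    )
])
```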
You can then run similarity searches; the full set of options is described in the Advanced usage section. A simple search example that returns one item using a similarity search constrained by a metadata filter is shown below:
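A sketch of such a search, assuming a client `vec` as above; the query vector and metadata values are placeholders:

```python
# Return the single closest match whose metadata has author == "Sam".
results = vec.search([1.0, 2.0, 3.0], limit=1, filter={"author": "Sam"})
```

The returned records contain the fields listed in the table below.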
| name | description |
|---|---|
| id | The UUID of the record |
| metadata | The JSON metadata associated with the record |
| contents | The text content that was embedded |
| embedding | The vector embedding |
| distance | The distance between the query embedding and the vector |
## Advanced usage

This section goes into more detail about the Python interface. It covers:

- Search filter options - how to narrow your search by additional constraints
- Indexing - how to speed up your similarity queries
- Time-based partitioning - how to optimize similarity queries that filter on time
- Setting different distance types to use in distance calculations
### Search options

The search function is very versatile and allows you to search for the right vector in a wide variety of ways. This section describes the search options in three parts:

- Basic similarity search.
- How to filter your search based on the associated metadata.
- Filtering on time when time-partitioning is enabled.
#### Narrowing your search by metadata

There are two main ways to filter results by metadata:

- filters for equality matches on metadata.
- predicates for more complex conditions on metadata.
#### Using filters for equality matches

You can specify a match on the metadata as a dictionary where all keys have to match the provided values (keys not in the filter are unconstrained). For example, a filter of `{"author": "Sam"}` matches only records whose author metadata key equals "Sam".

#### Using predicates for more advanced filtering on metadata
Predicates allow for more complex search conditions. For example, you can use greater-than and less-than conditions on numeric values.

Predicates objects are defined by the name of the metadata key, an operator, and a value. The supported operators are: ==, !=, <, <=, >, >=.
The type of the value determines the type of comparison to perform. For example, passing in "Sam" (a string) performs a string comparison, 10 (an int) performs an integer comparison, and 10.0 (a float) performs a float comparison. Note that the value "10" performs a string comparison as well, so be sure to use the right type. Supported Python types are str, int, and float.
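For example, a greater-than condition on a numeric metadata key might look like this sketch (the key name, value, and query vector are placeholders):

```python
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    predicates=client.Predicates("year", ">", 2020),  # integer comparison
)
```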
One more example with a string comparison:
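A sketch of a string-comparison predicate, with a placeholder key and value:

```python
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    predicates=client.Predicates("author", "==", "Sam"),  # string comparison
)
```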
Predicates can be combined using the & operator (for combining predicates with AND semantics) and the | operator (for combining using OR semantics). So you can do:
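A sketch of combining predicates (keys and values are placeholders):

```python
# AND: author is Sam and year is after 2020.
p = client.Predicates("author", "==", "Sam") & client.Predicates("year", ">", 2020)

# OR: author is Sam or author is Ana.
q = client.Predicates("author", "==", "Sam") | client.Predicates("author", "==", "Ana")

results = vec.search([1.0, 2.0, 3.0], limit=4, predicates=p)
```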
You can also pass multiple 3-tuples to Predicates; they are combined with AND semantics:
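For example (a sketch with placeholder keys and values, assuming Predicates accepts multiple 3-tuples):

```python
# Equivalent to combining the two conditions with &.
p = client.Predicates(("author", "==", "Sam"), ("year", ">", 2020))
results = vec.search([1.0, 2.0, 3.0], limit=4, predicates=p)
```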
#### Filter your search by time

When using time-partitioning (see below), you can very efficiently filter your search by time. Time-partitioning associates the timestamp embedded in a UUID-based ID with an embedding. First, create a collection with time partitioning and insert some data (one item from January 2018 and another from January 2019):
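A sketch of creating a time-partitioned collection and inserting the two items; the collection name, metadata, and vectors are placeholders:

```python
vec = client.Sync(
    service_url,
    "time_partitioned_collection",   # placeholder name
    3,
    time_partition_interval=timedelta(days=30),
)
vec.create_tables()

vec.upsert([
    # The timestamp is encoded into the UUID-based ID.
    (client.uuid_from_time(datetime(2018, 1, 1)), {"author": "Sam"},
     "a story from 2018", [1.0, 1.2, 1.3]),
    (client.uuid_from_time(datetime(2019, 1, 1)), {"author": "Ana"},
     "a story from 2019", [1.0, 1.1, 1.4]),
])
```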
Then, filter your search by time using the uuid_time_filter argument of the search call:
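For example, to return only items from January 2018 (a sketch assuming the collection above):

```python
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    uuid_time_filter=client.UUIDTimeRange(
        start_date=datetime(2018, 1, 1),
        end_date=datetime(2018, 2, 1),
    ),
)
```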
A UUIDTimeRange can specify a start_date or an end_date or both. Specifying only the start_date or end_date leaves the other end unconstrained.
You can control whether the range endpoints are included with the start_inclusive and end_inclusive parameters. Setting start_inclusive to true results in comparisons using the >= operator, whereas setting it to false applies the > operator. By default, the start date is inclusive, while the end date is exclusive.
One example:
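A sketch of an exclusive start date (placeholder dates and query vector):

```python
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    uuid_time_filter=client.UUIDTimeRange(
        start_date=datetime(2018, 1, 1),
        end_date=datetime(2019, 1, 1),
        start_inclusive=False,  # use > instead of >= for the start date
    ),
)
```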
This example uses the start_inclusive=False option because the first row has the exact timestamp specified by start_date.
It is also easy to integrate time filters using the filter and
predicates parameters described above using special reserved key names
to make it appear that the timestamps are part of your metadata. This
is useful when integrating with other systems that just want to
specify a set of filters (often these are “auto retriever” type
systems). The reserved key names are __start_date and __end_date for
filters and __uuid_timestamp for predicates. Some examples below:
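A sketch of both reserved-key styles (dates and query vectors are placeholders):

```python
# Time range via filters, using the reserved __start_date/__end_date keys.
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    filter={"__start_date": datetime(2018, 1, 1),
            "__end_date": datetime(2018, 2, 1)},
)

# Time condition via predicates, using the reserved __uuid_timestamp key.
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    predicates=client.Predicates("__uuid_timestamp", ">=", datetime(2018, 1, 1)),
)
```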
### Indexing

Indexing speeds up queries over your data. By default, the system creates indexes to query your data by the UUID and the metadata. To speed up similarity search based on the embeddings, you have to create additional indexes.

Note that if you perform a query without an index, you always get an exact result, but the query is slow (it has to read all of the data you store for every query). With an index, your queries are orders of magnitude faster, but the results are approximate (because there are no known indexing techniques that are exact).

Luckily, pgai on Timescale provides three excellent approximate indexing algorithms: StreamingDiskANN, HNSW, and ivfflat. Below are the trade-offs between these algorithms:

| Algorithm | Build speed | Query speed | Need to rebuild after updates |
|---|---|---|---|
| StreamingDiskANN | Fast | Fastest | No |
| HNSW | Fast | Faster | No |
| ivfflat | Fastest | Slowest | Yes |
For details on the distance measures these indexes use, see the distance type section below.
Each of these indexes has a set of build-time options for controlling
the speed/accuracy trade-off when creating the index and an additional
query-time option for controlling accuracy during a particular query. The
library uses smart defaults for all of these options. The
details for how to adjust these options manually are below.
#### StreamingDiskANN index

The StreamingDiskANN index is a graph-based index that uses the DiskANN algorithm. You can read more about it in the blog post announcing its release. The index accepts the following build-time parameters:

| Parameter name | Description | Default value |
|---|---|---|
| num_neighbors | Sets the maximum number of neighbors per node. Higher values increase accuracy but make the graph traversal slower. | 50 |
| search_list_size | This is the S parameter used in the greedy search algorithm during construction. Higher values improve graph quality at the cost of slower index builds. | 100 |
| max_alpha | The alpha parameter in the algorithm. Higher values improve graph quality at the cost of slower index builds. | 1.0 |
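Creating the index might look like the following sketch, assuming a client `vec` as above and the library's `create_embedding_index`/`DiskAnnIndex` names:

```python
# Build a StreamingDiskANN index with explicit build-time parameters
# (omit the arguments to use the defaults).
vec.create_embedding_index(client.DiskAnnIndex(
    num_neighbors=50, search_list_size=100, max_alpha=1.0))
```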
You can also set query-time parameters in the search() function using the query_params argument. You can set search_list_size (default: 100). This is the number of additional candidates considered during the graph search at query time. Higher values improve query accuracy while making the query slower.
You can specify this value during search as follows:
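A sketch, assuming the `DiskAnnIndexParams` query-params class:

```python
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    query_params=client.DiskAnnIndexParams(search_list_size=10),
)
```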
#### pgvector HNSW index

Pgvector provides a graph-based indexing algorithm based on the popular HNSW algorithm. The index accepts the following build-time parameters:

| Parameter name | Description | Default value |
|---|---|---|
| m | Represents the maximum number of connections per layer. Think of these connections as edges created for each node during graph construction. Increasing m increases accuracy but also increases index build time and size. | 16 |
| ef_construction | Represents the size of the dynamic candidate list for constructing the graph. It influences the trade-off between index quality and construction speed. Increasing ef_construction enables more accurate search results at the expense of lengthier index build times. | 64 |
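Creating the index might look like this sketch, assuming the `HNSWIndex` class:

```python
# Build a pgvector HNSW index with explicit build-time parameters.
vec.create_embedding_index(client.HNSWIndex(m=16, ef_construction=64))
```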
You can also set query-time parameters in the search() function using the query_params argument. You can set ef_search (default: 40). This parameter specifies the size of the dynamic candidate list used during search. Higher values improve query accuracy while making the query slower.
You can specify this value during search as follows:
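A sketch, assuming the `HNSWIndexParams` query-params class:

```python
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    query_params=client.HNSWIndexParams(ef_search=10),
)
```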
#### pgvector ivfflat index

Pgvector provides a clustering-based indexing algorithm. The blog post describes how it works in detail. It provides the fastest index-build speed but the slowest query speed of any indexing algorithm.

The index has a lists parameter that is automatically set with a smart default based on the number of rows in your table. If you know that you'll have a different table size, you can specify the number of records to use for calculating the lists parameter as follows:
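A sketch, assuming the `IvfflatIndex` class and its `num_records` parameter:

```python
# Calculate the lists parameter as if the table had one million rows.
vec.create_embedding_index(client.IvfflatIndex(num_records=1_000_000))
```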
You can also set the lists parameter directly:
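A sketch, assuming a `num_lists` parameter on `IvfflatIndex`:

```python
vec.create_embedding_index(client.IvfflatIndex(num_lists=100))
```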
You can also set query-time parameters in the search() function using the query_params argument. You can set probes. This parameter specifies the number of clusters searched during a query. It is recommended to set this parameter to sqrt(lists), where lists is the value used during index creation. Higher values improve query accuracy while making the query slower.
You can specify this value during search as follows:
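A sketch, assuming the `IvfflatIndexParams` query-params class:

```python
results = vec.search(
    [1.0, 2.0, 3.0],
    limit=4,
    query_params=client.IvfflatIndexParams(num_probes=10),
)
```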
### Time partitioning

In many use cases where you have many embeddings, time is an important component associated with the embeddings. For example, when embedding news stories, you often search by time as well as similarity (for example, stories related to Bitcoin in the past week, or stories about Clinton in November 2016).

Yet, traditionally, searching by the two components "similarity" and "time" is challenging for Approximate Nearest Neighbor (ANN) indexes and makes the similarity-search index less effective.

One approach to solving this is partitioning the data by time and creating ANN indexes on each partition individually. Then, during search, you can:

- Step 1: filter out partitions that don't match the time predicate.
- Step 2: perform the similarity search on all matching partitions.
- Step 3: combine all the results from each partition in step 2, re-rank, and filter out results by time.
To use time partitioning, create IDs for your embeddings with the uuid_from_time function, so that each embedding's timestamp is encoded in its UUID, and then pass a uuid_time_filter in the search call:
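An end-to-end sketch; the collection name, partition interval, metadata, and vectors are placeholders:

```python
# Create a collection partitioned into 7-day chunks.
tpvec = client.Sync(
    service_url,
    "news_stories",                  # placeholder name
    3,
    time_partition_interval=timedelta(days=7),
)
tpvec.create_tables()

# Encode the story's timestamp into its UUID-based ID.
tpvec.upsert([
    (client.uuid_from_time(datetime(2016, 11, 10)), {"topic": "politics"},
     "a story about Clinton", [1.0, 0.5, 0.0]),
])

# Search by similarity, restricted to November 2016.
results = tpvec.search(
    [1.0, 0.5, 0.0],
    limit=4,
    uuid_time_filter=client.UUIDTimeRange(
        start_date=datetime(2016, 11, 1),
        end_date=datetime(2016, 12, 1),
    ),
)
```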
### Distance metrics

By default, cosine distance is used to measure how similar an embedding is to a given query. In addition to cosine distance, Euclidean/L2 distance is also supported. The distance type is set when creating the client using the distance_type parameter. For example, to use the Euclidean distance metric, you can create the client with:
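A sketch, assuming the distance_type constructor argument takes the string "euclidean" (table name and dimensions are placeholders):

```python
vec = client.Sync(service_url, "my_embeddings", 3, distance_type="euclidean")
```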
The valid values for distance_type are cosine and euclidean.
It is important to note that you should use consistent distance types on
clients that create indexes and perform queries. That is because an
index is only valid for one particular type of distance measure.
Note that the StreamingDiskANN index only supports cosine distance at
this time.