- Designate any text column for embedding using customizable rules (or, if you are embedding binary documents such as PDFs, you can see our guide for embedding documents)
- Automatically generate and maintain searchable embedding tables
- Keep embeddings continuously synchronized with source data (asynchronously)
- Utilize a convenient view that seamlessly joins base tables with their embeddings
- Select an embedding provider and set up your API Keys
- Define a vectorizer
- Query an embedding
- Inject context into vectorizer chunks
- Improve query performance on your Vectorizer
- Control vectorizer run time
- The embedding storage table
- Monitor a vectorizer
Select an embedding provider and set up your API Keys
Vectorizer supports several vector embedding providers as first-party integrations, and additional providers through the LiteLLM integration. When using an external embedding service, you need to set up your API keys to access the service. To store several API keys, you give each key a name and reference it in the embedding section of the Vectorizer configuration. The default API key
names match the embedding provider's default name.
The default key names are:
| Provider | Key name |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Voyage AI | VOYAGE_API_KEY |
- Tiger Cloud

  1. In Tiger Console > Project Settings, click AI Model API Keys.
  2. Click Add AI Model API Keys, add your key, then click Add API key.

- Self-hosted Postgres

  Set an environment variable that is the same as your API key name.
For example:
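The following is a minimal sketch, assuming you use OpenAI and start the vectorizer worker from a shell; adjust the key name and value to match your provider:

```bash
# Export the API key under the provider's default key name
# before starting the vectorizer worker.
export OPENAI_API_KEY="sk-..."
```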
Define a vectorizer
You can configure the system to automatically generate and update embeddings for a table's data. Let's consider the following example table:
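A minimal sketch of such a table (the column names are assumptions used throughout this page):

```sql
CREATE TABLE blog (
    id       SERIAL PRIMARY KEY,
    title    TEXT,
    authors  TEXT,
    contents TEXT,
    metadata JSONB
);
```

To generate embeddings for the contents column, you can define a vectorizer roughly like this (exact argument names can vary between pgai versions, so treat this as a sketch):

```sql
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading     => ai.loading_column('contents'),
    embedding   => ai.embedding_ollama('nomic-embed-text', 768),
    destination => ai.destination_table('blog_contents_embeddings')
);
```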
This example uses the nomic-embed-text embedding model hosted on a local Ollama instance. Vectorizer supports other embedding providers; for more details,
consult the embedding configuration
section of the vectorizer API reference.
The loading parameter specifies the source of the data to generate embeddings from, for example the contents column.
Vectorizer supports other loaders, such as
ai.loading_uri, which loads external documents from local files or remote buckets such as S3.
For more details, check the loading configuration section
of the vectorizer API reference or our guide for embedding documents.
Additionally, if the contents field is lengthy, it is split into multiple chunks,
resulting in several embeddings for a single blog post. Chunking helps
ensure that each embedding is semantically coherent, typically representing a
single thought or concept. A useful mental model is to think of embedding one
paragraph at a time.
However, splitting text into chunks can sometimes lead to losing context. To
mitigate this, you can reintroduce context into each chunk. For instance, you
might want to repeat the blog post’s title in every chunk. This is easily
achieved using the formatting parameter, which allows you to inject row data
into each chunk:
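For example, a vectorizer that chunks the contents column and prepends the title to every chunk might look roughly like this sketch (the chunking and formatting argument names may differ slightly between pgai versions):

```sql
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading     => ai.loading_column('contents'),
    embedding   => ai.embedding_ollama('nomic-embed-text', 768),
    destination => ai.destination_table('blog_contents_embeddings'),
    -- split long contents into overlapping chunks
    chunking    => ai.chunking_recursive_character_text_splitter(chunk_size => 700, chunk_overlap => 150),
    -- repeat the post title in front of every chunk
    formatting  => ai.formatting_python_template('$title: $chunk')
);
```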
Query an embedding
The create_vectorizer command generates a view with the same name as the
specified destination. This view contains all the embeddings for the blog table.
Note that you’ll typically have multiple rows in the view for each blog entry,
as multiple embeddings are usually generated for each source document.
The view includes all columns from the blog table plus the following additional columns:
| Column | Type | Description |
|---|---|---|
| embedding_uuid | UUID | Unique identifier for the embedding |
| chunk | TEXT | The text segment that was embedded |
| embedding | VECTOR | The vector representation of the chunk |
| chunk_seq | INT | Sequence number of the chunk within the document, starting at 0 |
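A basic semantic-search query against the view might look like the following sketch (the view and model names match the earlier example and are assumptions):

```sql
SELECT chunk,
       embedding <=> ai.ollama_embed('nomic-embed-text', 'vector databases') AS distance
FROM blog_contents_embeddings
ORDER BY distance
LIMIT 10;
```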
The <=> operator calculates the distance between the query embedding and each
row's embedding vector. This is a simple way to do semantic search.
Tip: You can use the ai.ollama_embed function in our PostgreSQL extension to generate an embedding for a user-provided query right inside the database.
You can combine this with metadata filters by adding a WHERE clause:
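A sketch of a filtered query, assuming the source table has a metadata JSONB column as in the example above:

```sql
SELECT chunk,
       embedding <=> ai.ollama_embed('nomic-embed-text', 'vector databases') AS distance
FROM blog_contents_embeddings
WHERE metadata->>'category' = 'engineering'
ORDER BY distance
LIMIT 10;
```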
Inject context into vectorizer chunks
Formatting allows you to inject additional information into each chunk. This is needed because splitting the text into chunks can lead to losing important context. For instance, you might want to include the authors and title with each chunk. This is achieved using Python template strings, which have access to all columns in the row and a special $chunk variable containing the chunk's text.
You may need to reduce the chunk size to ensure the formatted text fits within
token limits. Adjust the chunk_size parameter of the text_splitter
accordingly:
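A sketch that injects the title and authors while reducing the chunk size to leave room for the injected text (argument names are assumptions and may vary by pgai version):

```sql
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading    => ai.loading_column('contents'),
    embedding  => ai.embedding_ollama('nomic-embed-text', 768),
    -- smaller chunks so the formatted text stays within token limits
    chunking   => ai.chunking_recursive_character_text_splitter(chunk_size => 500, chunk_overlap => 100),
    formatting => ai.formatting_python_template('$title by $authors: $chunk')
);
```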
Note that the formatting template must always include $chunk.
Improve query performance on your Vectorizer
A vector index on the embedding column improves query performance. On Tiger Cloud, a vectorscale index is automatically created after 100,000 rows of vector data are present. This behaviour is configurable; you can also specify other vector index types. The following example uses an HNSW index:
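A sketch of a vectorizer configured to create an HNSW index once enough rows are present (the indexing argument names are assumptions that may differ between pgai versions):

```sql
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading   => ai.loading_column('contents'),
    embedding => ai.embedding_ollama('nomic-embed-text', 768),
    -- create an HNSW index on the embedding column once 100,000 rows exist
    indexing  => ai.indexing_hnsw(min_rows => 100000, opclass => 'vector_cosine_ops')
);
```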
Control the vectorizer run time

When you use Vectorizer on Tiger Cloud, you use scheduling to control when vectorizers run. A scheduled job checks whether there is work to be done and, if so, runs the cloud function to embed the data. By default, scheduling uses TimescaleDB background jobs running every five minutes. Once the table is large enough, scheduling also handles index creation on the embedding column.

When you self-host Vectorizer, the vectorizer worker uses a polling mechanism to check whether there is work to be done, so scheduling is not needed and is deactivated by default. Note: when scheduling is disabled, the index is not created automatically; you need to create it manually.
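For example, a self-hosted setup can disable scheduling explicitly; the following is a sketch using the scheduling_none configuration function from recent pgai versions:

```sql
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading    => ai.loading_column('contents'),
    embedding  => ai.embedding_ollama('nomic-embed-text', 768),
    -- the self-hosted worker polls for work, so no scheduled job is needed
    scheduling => ai.scheduling_none()
);
```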
The embedding storage table

The view is based on a table storing blog embeddings, named blog_contents_embeddings_store. You can query this table directly for
potentially more efficient queries. The table structure is as follows:
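The following is a sketch of the structure created for the example above; the columns mirror those of the view described earlier, but the exact constraints may differ by pgai version:

```sql
CREATE TABLE blog_contents_embeddings_store (
    embedding_uuid UUID NOT NULL PRIMARY KEY DEFAULT gen_random_uuid(),
    id             INT  NOT NULL REFERENCES blog (id) ON DELETE CASCADE,
    chunk_seq      INT  NOT NULL,
    chunk          TEXT NOT NULL,
    embedding      VECTOR(768) NOT NULL,
    UNIQUE (id, chunk_seq)
);
```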
Destination Options for Embeddings
Vectorizer supports two different ways to store your embeddings. Choose between them based on whether:

- You need multiple embeddings per source row because of chunking. This is the common case. You should choose a table destination.
- You need a single embedding per source row. This happens if you are either embedding small text fragments (for example, a single sentence) or if you have already chunked the document and the source table contains the chunks. In this case, you should choose a column destination.
1. Table Destination (Default)
The default approach creates a separate table to store embeddings and a view that joins it with the source table (a configuration sketch follows the list below). Use a table destination when:

- You need multiple embeddings per row (chunking)
- For large text fields that need to be split
- You are vectorizing documents (which typically require chunking)
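A sketch using an explicit table destination, matching the blog_contents_embeddings view used throughout this page:

```sql
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading     => ai.loading_column('contents'),
    embedding   => ai.embedding_ollama('nomic-embed-text', 768),
    -- stores chunks in blog_contents_embeddings_store and exposes the blog_contents_embeddings view
    destination => ai.destination_table('blog_contents_embeddings')
);
```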
2. Column Destination
For simpler cases, you can add an embedding column directly to the source table. This can only be used when the vectorizer does not perform chunking, because it requires a one-to-one relationship between the source data and the embedding. It is useful when you know the source text is short (as is common if the chunking has already been done upstream in your data pipeline). The workflow is that your application inserts data into the table with a NULL in the embedding column; the vectorizer then reads the row, generates the embedding, and updates the row with the correct value in the embedding column. Use a column destination when (a configuration sketch follows this list):

- You need exactly one embedding per row
- For shorter text that doesn’t require chunking
- When your application already takes care of the chunking before inserting into the database
- When you want to avoid creating additional database objects
Note that a column destination requires ai.chunking_none(), since it can only store one embedding per row.
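A sketch of a column-destination vectorizer (the destination_column and chunking_none functions are from recent pgai versions and the column name is an assumption):

```sql
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading     => ai.loading_column('contents'),
    embedding   => ai.embedding_ollama('nomic-embed-text', 768),
    -- one embedding per row, stored directly on the source table
    chunking    => ai.chunking_none(),
    destination => ai.destination_column('contents_embedding')
);
```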
Monitor a vectorizer
Since embeddings are created asynchronously, a delay may occur before they become available. Use the ai.vectorizer_status view to monitor the vectorizer's
status:
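For example, a quick check from psql (assuming a default install where the pgai objects live in the ai schema):

```sql
SELECT * FROM ai.vectorizer_status;
```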
| id | source_table | target_table | view | pending_items |
|---|---|---|---|---|
| 1 | public.blog | public.blog_contents_embeddings_store | public.blog_contents_embeddings | 1 |
The pending_items column indicates the number of items still awaiting embedding creation.
If the number of pending items exceeds 10,000, we return the maximum value of a bigint (9223372036854775807)
instead of exhaustively counting the items. This is done for performance.
Alternatively, you can call the ai.vectorizer_queue_pending function to get the count of pending items
for a single vectorizer. The exact_count parameter defaults to false; passing true
exhaustively counts the exact number of pending items.
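For example, assuming the vectorizer id shown in the status output above:

```sql
SELECT ai.vectorizer_queue_pending(1, exact_count => true);
```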