What is a vectorizer?
A vectorizer automates the entire embedding workflow:
- Automated embedding generation: Create embeddings for table data automatically
- Automatic synchronization: Triggers keep embeddings in sync with source data
- Background processing: Async processing minimizes impact on database operations
- Scalability: Batch processing handles large datasets efficiently
- Highly configurable: Customize embedding models, chunking, formatting, indexing, and scheduling
Key features
- Multiple AI providers: OpenAI, Ollama, Cohere, Voyage AI, and LiteLLM support
- Efficient storage: Separate tables with appropriate indexing for similarity searches
- View creation: Automatic views join source data with embeddings
- Access control: Fine-grained permissions for vectorizer objects
- Monitoring: Built-in tools to track queue status and performance
Quick start
Create a basic vectorizer
Table destination (separate embeddings table)
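A minimal sketch of a vectorizer that writes embeddings to a separate table. The `blog` table, its `contents` column, the destination name `blog_contents_embeddings`, and the model and dimension values are assumptions chosen for illustration. The `ai` schema prefix is also assumed; adjust it if the functions are installed elsewhere.

```sql
-- Hypothetical source table: blog(id, title, contents)
SELECT ai.create_vectorizer(
    'blog'::regclass,
    -- read the text to embed from the contents column
    loading     => ai.loading_column('contents'),
    -- model name and dimension count are example values
    embedding   => ai.embedding_openai('text-embedding-3-small', 768),
    -- store chunks and embeddings in a separate table (the default destination type)
    destination => ai.destination_table('blog_contents_embeddings')
);
```

The call should return the id of the new vectorizer, which the management and monitoring functions take as an argument.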
Column destination (embedding in source table)
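A minimal sketch of a vectorizer that stores the embedding in a column of the source table. The `blog` table, the `contents` column, and the `contents_embedding` column name are hypothetical. `ai.chunking_none()` is not listed on this page; it is included on the assumption that a column destination holds exactly one embedding per row and therefore cannot be chunked, so check the API reference before relying on it.

```sql
-- Hypothetical source table: blog(id, title, contents)
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading     => ai.loading_column('contents'),
    embedding   => ai.embedding_openai('text-embedding-3-small', 768),
    -- assumption: one embedding per row, so no chunking
    chunking    => ai.chunking_none(),
    -- write the embedding into a column on the source table
    destination => ai.destination_column('contents_embedding')
);
```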
Configuration functions
Core functions
- create_vectorizer(): create and configure a new vectorizer
- drop_vectorizer(): remove a vectorizer and clean up resources
Destination configuration
- destination_table(): store embeddings in a separate table (default)
- destination_column(): store embeddings in the source table
Loading configuration
- loading_column(): load data from a column
- loading_uri(): load data from a file URI
Parsing configuration
- parsing_auto(): auto-detect document format (default)
- parsing_none(): no parsing for text data
- parsing_docling(): parse documents with Docling
- parsing_pymupdf(): parse PDFs with PyMuPDF
Chunking configuration
- chunking_character_text_splitter(): split by character count
- chunking_recursive_character_text_splitter(): recursive splitting (default)
Embedding configuration
- embedding_openai(): OpenAI embedding models
- embedding_ollama(): local Ollama models
- embedding_litellm(): unified API for 100+ providers
- embedding_voyageai(): Voyage AI models
Formatting configuration
- formatting_python_template(): format with Python templates
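Chunking and formatting work together: each chunk is passed through the template before it is embedded, so context such as a title can be prepended. The sketch below assumes a hypothetical `blog` table with `title` and `contents` columns. The argument names `chunk_size` and `chunk_overlap`, their values, and the `$chunk` placeholder are assumptions to verify against the API reference.

```sql
SELECT ai.create_vectorizer(
    'blog'::regclass,
    loading     => ai.loading_column('contents'),
    embedding   => ai.embedding_openai('text-embedding-3-small', 768),
    -- split long text into overlapping chunks (example sizes)
    chunking    => ai.chunking_recursive_character_text_splitter(
                       chunk_size    => 800,
                       chunk_overlap => 200
                   ),
    -- prepend the row's title to every chunk before embedding
    formatting  => ai.formatting_python_template('$title: $chunk'),
    destination => ai.destination_table('blog_contents_embeddings')
);
```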
Indexing configuration
- indexing_default(): default HNSW indexing
- indexing_diskann(): DiskANN indexing
- indexing_hnsw(): HNSW indexing with options
- indexing_none(): no automatic indexing
Scheduling configuration
- scheduling_default(): run every 5 minutes
- scheduling_timescaledb(): use TimescaleDB job scheduling
- scheduling_none(): disable automatic scheduling
Processing configuration
- processing_default(): default processing settings
Access control
- grant_to(): specify user permissions
Management functions
- enable_vectorizer_schedule(): resume automatic processing
- disable_vectorizer_schedule(): pause automatic processing
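A short sketch of pausing and resuming a vectorizer's schedule, for example around a bulk load. The id `1` is a placeholder; substitute the id of your own vectorizer. The `ai` schema prefix is assumed.

```sql
-- Pause automatic processing for vectorizer 1 (placeholder id)
SELECT ai.disable_vectorizer_schedule(1);

-- ... perform the bulk load or maintenance work here ...

-- Resume automatic processing
SELECT ai.enable_vectorizer_schedule(1);
```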
Monitoring
- vectorizer_status: view vectorizer status and statistics
- vectorizer_queue_pending(): check pending work items
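A minimal sketch of checking on a vectorizer, assuming both objects live in the `ai` schema and that `1` is the id of an existing vectorizer (a placeholder here).

```sql
-- Overview of all vectorizers, including their pending work
SELECT * FROM ai.vectorizer_status;

-- Number of items still waiting in the queue for vectorizer 1
SELECT ai.vectorizer_queue_pending(1);
```

A pending count that keeps growing may indicate that the worker is not running or cannot keep up with changes to the source table.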