Recursively split text into chunks using multiple separators, tried in order. This gives finer-grained control over chunking and helps preserve semantic meaning, because text is split at the most natural boundary that keeps each chunk within the size limit.

Purpose

  • Recursively split text using multiple separators
  • Preserve more semantic meaning in chunks
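  • Try separators in order (by default: paragraphs, then lines, sentences, words, and finally individual characters)
  • Default configuration (800-character chunks with a 400-character overlap) balances context preservation and chunk size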

How it works

The splitter tries each separator in order. If a piece is still longer than chunk_size after splitting on one separator, it is split again using the next separator in the list. This keeps chunks aligned with natural text boundaries such as paragraphs and sentences wherever possible.
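
The first step of this process can be illustrated with plain PostgreSQL functions. This is only a sketch of the idea, not the pgai implementation: the sample text and the 40-character limit are made up for the illustration. The text is split on the first separator (a paragraph break), and any piece that is still too long is flagged as needing the next separator.

```sql
-- Illustration only: plain PostgreSQL, not the pgai implementation.
-- Split on the first separator (a paragraph break), then flag pieces that
-- are still longer than a hypothetical 40-character chunk size.
SELECT
    piece,
    length(piece)      AS chars,
    length(piece) > 40 AS needs_next_separator
FROM unnest(
    string_to_array(
        E'Short paragraph.\n\nA much longer second paragraph. It has two sentences in it.',
        E'\n\n'
    )
) AS piece;
```

Here the first piece fits within the limit, while the second is flagged and would be split again on the next separator in the list.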

Samples

Default recursive splitting

Use the default separator hierarchy:
SELECT ai.create_vectorizer(
    'blog_posts'::regclass,
    loading => ai.loading_column('content'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter()
);

Custom chunk size and overlap

Pass chunk_size and chunk_overlap positionally (256-character chunks with a 20-character overlap):
SELECT ai.create_vectorizer(
    'documents'::regclass,
    loading => ai.loading_column('content'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter(256, 20)
);

Custom separator hierarchy

Try paragraph breaks first, then single newlines, then spaces, and finally individual characters:
SELECT ai.create_vectorizer(
    'text_data'::regclass,
    loading => ai.loading_column('text'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter(
        chunk_size => 512,
        chunk_overlap => 50,
        separators => array[E'\n\n', E'\n', ' ', '']
    )
);

Arguments

| Name | Type | Default | Required | Description |
|------|------|---------|----------|-------------|
| chunk_size | int | 800 | No | Maximum number of characters per chunk |
| chunk_overlap | int | 400 | No | Number of characters to overlap between chunks |
| separators | text[] | array[E'\n\n', E'\n', '.', '?', '!', ' ', ''] | No | Array of separators to try in order |
| is_separator_regex | bool | false | No | Set to true if separators are regular expressions |
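
None of the samples above uses is_separator_regex, so the following sketch shows how it might be combined with custom separators. The table name notes and column name body are hypothetical, and the patterns assume that the splitter accepts common regular-expression syntax such as \n{2,} and \s+; check them against your installed version before relying on them.

```sql
SELECT ai.create_vectorizer(
    'notes'::regclass,                     -- hypothetical table
    loading => ai.loading_column('body'),  -- hypothetical column
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_recursive_character_text_splitter(
        chunk_size => 512,
        chunk_overlap => 50,
        -- Assumed patterns: blank-line runs, sentence endings, whitespace,
        -- then the empty string as the last resort.
        separators => array[E'\\n{2,}', E'[.?!]\\s+', E'\\s+', ''],
        is_separator_regex => true
    )
);
```

Treating separators as patterns lets a single entry cover variations, such as two or more consecutive newlines, that would otherwise need several literal separators.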

Returns

A JSON configuration object to pass as the chunking argument of ai.create_vectorizer().
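
To see what the configuration looks like, the function can be called on its own. This is a minimal sketch that assumes the pgai extension is installed in the current database; the exact keys in the result may vary between versions, so no output is shown here.

```sql
-- Returns the chunking configuration without creating a vectorizer.
SELECT ai.chunking_recursive_character_text_splitter(256, 20);
```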