Split text into chunks based on a specified separator, with control over chunk size and overlap between chunks.

Purpose

  • Split text into chunks based on a specified separator
  • Control the chunk size and amount of overlap between chunks
  • Simple, predictable chunking strategy

Samples

Basic character splitting

Split content into 128-character chunks with 10-character overlap:
SELECT ai.create_vectorizer(
    'blog_posts'::regclass,
    loading => ai.loading_column('content'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_character_text_splitter(128, 10)
);

Custom separator

Split on newlines:
SELECT ai.create_vectorizer(
    'documents'::regclass,
    loading => ai.loading_column('content'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_character_text_splitter(512, 50, E'\n')
);

Regex separator

Split using a regular expression:
SELECT ai.create_vectorizer(
    'text_data'::regclass,
    loading => ai.loading_column('text'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    chunking => ai.chunking_character_text_splitter(
        chunk_size => 800,
        chunk_overlap => 100,
        separator => E'\\n\\n|\\. ',
        is_separator_regex => true
    )
);

Arguments

Name                Type  Default  Required  Description
chunk_size          int   800      No        Maximum number of characters in a chunk
chunk_overlap       int   400      No        Number of characters to overlap between chunks
separator           text  E'\n\n'  No        String or character used to split the text
is_separator_regex  bool  false    No        Set to true if separator is a regular expression
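
Since every argument has a default, PostgreSQL named notation can be used to override a single argument and leave the rest alone. The following is a minimal sketch, assuming a hypothetical table `notes` with a text column `body`; the loading and embedding settings simply mirror the samples above.

```sql
-- Hypothetical table `notes` with a text column `body` (illustration only).
SELECT ai.create_vectorizer(
    'notes'::regclass,
    loading => ai.loading_column('body'),
    embedding => ai.embedding_openai('text-embedding-3-small', 768),
    -- Only chunk_overlap is overridden; chunk_size, separator, and
    -- is_separator_regex keep their default values.
    chunking => ai.chunking_character_text_splitter(chunk_overlap => 100)
);
```

Named notation also keeps the call readable when only the later arguments, such as separator, need to change.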

Returns

A JSON configuration object to pass as the chunking argument of ai.create_vectorizer().
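
To check the configuration before wiring it into a vectorizer, the function can be called on its own in a query. This is a sketch only; the exact keys of the returned object are not listed here because they may vary with the installed version.

```sql
-- Call the chunking function directly to inspect the configuration it returns.
SELECT ai.chunking_character_text_splitter(
    chunk_size => 128,
    chunk_overlap => 10
);
```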