Introduction

Although framework-agnostic in theory, most RAG pipelines use vector databases to store and retrieve data. And while plenty of good options are already out there, such as Pinecone, Chroma, and Weaviate, in this post I’d like to explore a bit of an unusual contender: DuckDB.

Also, disclaimer: I’m not associated with DuckDB in any way :P

DuckDB as a vector store?

DuckDB’s vector store capabilities are still a bit … “experimental” - they are now implemented as a core extension - but, in my opinion, DuckDB makes a great vector DB for certain use cases:

  1. DuckDB ships as a “dependency-free” single binary which makes setup and integration into existing applications very easy - especially considering that it runs ‘in-memory’ by default!

  2. Similar to e.g. SQLite, it’s also portable and can be stored in a single file. This means that it can be used as a way to store and share document embeddings (vectors) together with metadata, text chunks etc. in a single file - or, better yet, together with the original file.

  3. DuckDB itself is already quite established, with a large community and user base - the number of people familiar with DuckDB is way bigger than the number familiar with any particular vector database, or with vector databases in general, really.

  4. I know this is a bit of a personal thing but I really don’t like the often “cryptic” or overly complex output from vector databases. Gimme that simple, flat, tabular output and I’m happy :)

Now, before you roll your eyes trying to come up with something snarky: Yes, I’m aware that points 1 and 2 also apply to vector databases like Chroma.

However, if you decide one day that RAG is no longer the way to go, with DuckDB you’d still have a perfectly fine general-purpose database system, i.e. something that’s useful for a ton of other things - which can’t be said about pure vector databases, whose use case is pretty narrow.

Also, if DuckDB is already part of your tech stack, consider using it over vector DBs before you introduce yet another dependency ;)

How does all that work?

With that out of the way, let’s find out how DuckDB can be used as part of a RAG pipeline.

However, before we dive into the practical side of things, let’s quickly review our setup:

  1. Models: I’ll be using the following local models via ollama: mxbai-embed-large for creating embeddings and hf.co/unsloth/gemma-3-12b-it-qat-GGUF:Q6_K for generating answers

  2. We’ll use the official duckdb Python package to work with DuckDB

  3. Also, while the plan is to get our hands dirty, we don’t want to reinvent the wheel at every step, so we’ll use langchain and langchain-ollama wherever possible to make our lives easier.

  4. Unit tests will be conducted with pytest

Also, you can find all of the code from this article on my GitHub.

Data Ingestion

The general data ingestion workflow for RAG is pretty straightforward:

Step 1 - Load data from source as plain text documents

I’ll keep things simple and assume that we’re only dealing with textual data, so here’s our simple load function:

from dataclasses import dataclass
from pathlib import Path


@dataclass
class Document:
    name: str
    content: str


def load(path: Path) -> list[Document]:
    def load_file(file_path: Path) -> Document:
        return Document(name=file_path.name, content=file_path.read_text())

    if path.is_dir():
        # Load every regular file in the directory (non-recursively)
        return [load_file(item) for item in path.iterdir() if item.is_file()]
    if path.is_file():
        return [load_file(path)]
    raise ValueError("Path must point to either a file or a directory!")

Now, in reality, the load data step is likely going to be way more complicated. But no matter what the process actually looks like, it’s important to note that, in the end, we want our text documents to be as plain-text-ish as possible, i.e. free of structural artifacts like HTML tags or JSON leftovers.
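To make that a bit more concrete, here’s a minimal (and entirely hypothetical) clean-up helper for HTML sources, built only on Python’s standard library - it’s not part of our pipeline, just a sketch of what “plain-text-ish” could mean in practice:

# A minimal, hypothetical clean-up step for HTML sources - just an illustration
# of what "as plain-text-ish as possible" could mean, not part of the pipeline.
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        # Collect only the visible text between tags
        self.parts.append(data)


def strip_html(raw: str) -> str:
    parser = _TextExtractor()
    parser.feed(raw)
    # Drop the tags, collapse whitespace
    return " ".join("".join(parser.parts).split())


print(strip_html("<p>Ducks are <b>cool</b>.</p>"))  # -> Ducks are cool.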

Step 2 - Split documents into text chunks

After we’ve loaded our text documents, we’re going to split them up into smaller chunks of text. The reason for doing this is that it makes it easier to select only the relevant parts of a text rather than the whole document (which may contain a lot of irrelevant stuff).

While “context is king” in RAG, too much context can actually be detrimental.

from langchain_text_splitters import RecursiveCharacterTextSplitter


def split(documents: list[Document]) -> list[Document]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    results: list[Document] = []
    for document in documents:
        # Each chunk keeps the name of the document it was taken from
        for chunk in splitter.split_text(document.content):
            results.append(Document(name=document.name, content=chunk))
    return results

We’re using langchain’s RecursiveCharacterTextSplitter here to split the text up.
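If you want to get a feeling for what the splitter actually does, a quick, purely illustrative sanity check could look like this (the dummy document is made up, obviously):

# Purely illustrative: feed a long dummy document through split() and check
# that it comes back as multiple chunks of at most 'chunk_size' characters.
dummy = Document(name="dummy.txt", content="quack " * 2000)  # ~12,000 characters
chunks = split([dummy])

print(len(chunks))                                  # more than one chunk
print(max(len(chunk.content) for chunk in chunks))  # each chunk stays <= 1000 characters
print(all(chunk.name == "dummy.txt" for chunk in chunks))  # source name is preserved -> True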

Step 3 - Create embeddings for those chunks with the help of an embedding model

Now, things get a bit more interesting because, as mentioned in the first part of the article, DuckDB’s vector store capabilities are still a tad “experimental” and lacking in features.

One of those missing features is embedding: most specialized vector databases handle this “out-of-the-box”, i.e. you just provide the text and they take care of the whole “embed-and-insert” part.

With DuckDB, we’re out of luck - or rather: we’ve got the opportunity to build our very own little embedding function - rejoice!

from langchain_ollama import OllamaEmbeddings


@dataclass
class Embedding:
    document: Document
    vector: list[float]


def embedding_model() -> OllamaEmbeddings:
    return OllamaEmbeddings(model='mxbai-embed-large')


def embed(documents: list[Document]) -> list[Embedding]:
    texts = [document.content for document in documents]
    embeddings = embedding_model().embed_documents(texts)
    results: list[Embedding] = []
    for i, document in enumerate(documents):
        results.append(Embedding(document=document, vector=embeddings[i]))
    return results

Ok, admittedly, most of the work is done by langchain’s OllamaEmbeddings, which handles the actual “text chunk to vector embedding” conversion. But still.

The Embedding class is meant to combine the text chunk and its corresponding embedding vector into a single object, making it easier to keep them together.
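By the way, a quick and purely illustrative way to see which embedding dimension we’re dealing with - the number that has to match the array size of the table we’ll create in the next step - is to embed a throwaway document and look at the length of its vector:

# Illustrative sanity check: the length of the returned vector is the embedding
# dimension, which has to match the FLOAT[...] array size of the table below
# (1024 in the case of mxbai-embed-large).
probe = embed([Document(name="probe.txt", content="Just a short probe sentence.")])
print(len(probe[0].vector))  # -> 1024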

Step 4 - Insert text chunks and their embeddings into DuckDB

Now that we’ve converted our text chunks to vectors, we can finally get to the meat of the matter and do something with DuckDB.

I think that the easiest way would be to present you with the code and then walk you through it - so, here we go:

import duckdb


def add_embeddings_and_chunks_to_database(
    embeddings: list[Embedding], database: str
) -> int:
    # Connect to database and create table for text chunks and embeddings
    connection = duckdb.connect(database)
    connection.execute("""
    CREATE TABLE IF NOT EXISTS embeddings (
    document VARCHAR,
    chunk VARCHAR,
    enumeration INTEGER,
    embedding FLOAT[1024],
    PRIMARY KEY (document, chunk));
    """)
    # Insert text chunks and embeddings
    records = []
    for i, embedding in enumerate(embeddings):
        name = embedding.document.name
        content = embedding.document.content
        vector = embedding.vector
        records.append((name, content, i, vector))
    connection.executemany("""
    INSERT OR REPLACE
    INTO embeddings (document, chunk, enumeration, embedding)
    VALUES (?, ?, ?, ?)
    """, records)
    # Create HNSW index that allows us to perform vector similarity searches
    connection.execute("""
    INSTALL vss;
    LOAD vss;
    SET hnsw_enable_experimental_persistence = true;
    CREATE INDEX cosine_index ON embeddings USING HNSW (embedding)
    WITH (metric = 'cosine');
    """)
    # Return the number of rows now stored in the table
    if result := connection.execute("SELECT COUNT(*) FROM embeddings").fetchone():
        return result[-1]
    return 0

First, we connect to our DuckDB instance and create a table that we’ll use to store our embeddings - aptly called embeddings:

CREATE TABLE IF NOT EXISTS embeddings (
    document VARCHAR, -- name of the source document
    chunk VARCHAR, -- the plain text chunk itself
    enumeration INTEGER, -- enumeration number that encodes the "order" of chunks in the original text
    embedding FLOAT[1024], -- vector representation of the text chunk - must be an array column
    PRIMARY KEY (document, chunk) -- ensure we're not adding the same chunks multiple times
);

The array size of 1024 for our embedding column is determined by the embedding model that we used to create our embedding vectors - mxbai-embed-large produces 1024-dimensional vectors.

Also, the reason why we enumerate the embedding items is to preserve the order in which they appear(ed) in the original document. We’ll make use of this later on in the RAG part of the process.
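If you’d like to convince yourself that the chunks and their order actually made it into the table, a quick, purely illustrative peek could look like this (the database file name is just a placeholder for whatever you passed to the function above):

# Illustrative only: peek into the 'embeddings' table and print a short preview
# of the first few chunks in their original order.
import duckdb

connection = duckdb.connect("your-database.db")  # placeholder file name
rows = connection.execute("""
    SELECT document, enumeration, substr(chunk, 1, 60) AS preview
    FROM embeddings
    ORDER BY document, enumeration
    LIMIT 5
""").fetchall()
for document, enumeration, preview in rows:
    print(f"{document} #{enumeration}: {preview} ...")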

Now, let’s take a step back for a moment and realize that we haven’t really done anything vector-DB-specific up to this point. Instead, it was all plain old relational database stuff - things we’ve been doing for years, and things we’re already familiar with.

I wanted to mention this here because the next part is going to be rather vector store specific: We’ll be using DuckDB’s vss (vector similarity search) extension to create an HNSW index for our embeddings table:

INSTALL vss;
LOAD vss;
-- This is optional but needed if you want to save embeddings to file
SET hnsw_enable_experimental_persistence = true;
-- Create an HNSW index on column 'embedding' ...
CREATE INDEX cosine_index ON embeddings USING HNSW (embedding)
-- ... expecting 'cosine similarity' as a distance metric
WITH (metric = 'cosine');

Again, if you’re familiar with SQL, you should roughly understand what’s going on here. What’s new here is that the vss extension allows us to create indices that operate on - you guessed it - vector similarity instead of things like primary keys.
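To double-check that the index was actually created, we can ask DuckDB’s catalog via its duckdb_indexes() function - again purely illustrative, and the database file name is a placeholder:

# Illustrative only: the new index should show up in DuckDB's catalog.
import duckdb

connection = duckdb.connect("your-database.db")  # placeholder file name
connection.execute("INSTALL vss; LOAD vss;")      # make sure the vss extension is available
print(connection.execute("""
    SELECT index_name, table_name
    FROM duckdb_indexes()
    WHERE index_name = 'cosine_index'
""").fetchall())
# -> [('cosine_index', 'embeddings')] if everything went well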

The next step would be to verify that everything is working as expected, so let’s define a function that captures our little ingestion pipeline …

def pipeline(path: Path, database: str) -> tuple[int, str | None]:
    # Ensure our database file has an "appropriate" suffix
    database = database.removesuffix(".db") + ".db"
    documents = load(path=path)
    chunks = split(documents=documents)
    embeddings = embed(documents=chunks)
    inserted = add_embeddings_and_chunks_to_database(
        embeddings=embeddings, database=database
    )
    # Just to make sure ...
    assert inserted == len(chunks), (
        "It seems that not all documents have been added to the database: "
        f"expected={len(chunks)}, inserted={inserted}"
    )
    return inserted, database

… and test it!

Unit Test - Use semantic search to test document retrieval

I know, unit tests can be pretty boring, so let’s make this one a (very simple) semantic search scenario instead!

We’ll start by generating a couple of (very short and very silly) test documents, embed and then insert them into DuckDB:

# ... import statements omitted ...

TEST_DATABASE = "test-rag.db"
TEST_DOCUMENTS = [
    (1, Document("a.txt", "This is a document about ducks.")),
    (
        2,
        Document(
            "b.txt",
            "Confucius says: Ducks are a kind of birds, but they prefer to live in the water.",
        ),
    ),
    (3, Document("c.txt", "This is a document about the secrets of taming cats.")),
    (4, Document("d.txt", "Confucius says: Ducks are kind of cool by the way.")),
    (
        5,
        Document(
            "e.txt",
            "A text about dogs. Because there has to be something about dogs, for good measure.",
        ),
    ),
]


def test_semantic_search(monkeypatch):
    # Clean up leftovers from previous tests
    Path(TEST_DATABASE).unlink(missing_ok=True)
    # Don't "load" documents since we already have them as a 'list' (drop the numbering)
    monkeypatch.setattr(ingestion, "load", lambda path: [doc for _, doc in TEST_DOCUMENTS])
    # Also, don't split them since they're too short anyway
    monkeypatch.setattr(ingestion, "split", lambda documents: documents)

    inserted, database = pipeline(Path("does/not/exist"), database=TEST_DATABASE)
    query = "What does Confucius say about ducks?"
    results = retrieve(query=query, database=TEST_DATABASE)

    assert TEST_DATABASE == database
    assert len(TEST_DOCUMENTS) == inserted
    assert {document.name for _, document in results} == {"a.txt", "b.txt", "d.txt"}

You may have noticed a function in there that we haven’t seen yet: retrieve. We’ll implement it in the next section; for now, you can assume that retrieve does the following:

  1. Takes as input the string "What does Confucius say about ducks?"
  2. Creates an embedding of this string using the same embedding model that we used to embed our test documents
  3. Uses vector similarity to find embeddings in the database that are similar to the embedding of the question.

If we inspect our test documents closely, we’ll find that documents a.txt, b.txt and d.txt all contain the terms Confucius and / or duck(s). Therefore, their embeddings should be similar (or very similar) to the embedding of our question string - which is exactly what the last assert statement in our test_semantic_search function checks:

assert {document.name for _, document in results} == {"a.txt", "b.txt", "d.txt"}

Running this with pytest should then indeed confirm our assumptions.

Talking with your documents, DuckDB edition

Thanks to the unit test at the end of the last part, we can now be relatively certain that our DuckDB setup and integration work, so let’s put it to the test (again) in a more realistic scenario: “Talking with our documents”.

I decided to use an English translation of Thucydides’ “History of the Peloponnesian War” (in plain text from gutenberg.org) as our “test document”. The text is long enough to exercise the splitting step this time as well (although we’re relying mostly on langchain’s RecursiveCharacterTextSplitter), and its well-known story makes it easy for us to verify the “correctness” of the LLM-generated responses.

By the way, the good thing about the “RAG” acronym is that it basically serves as a “reminder” of the main steps comprising this process:

Retrieve > Augment > Generate

Let’s create a Python function for each of these real quick:

from langchain import hub
from langchain.chat_models import init_chat_model
from langchain_core.prompt_values import ChatPromptValue


def retrieve(query: str, database: str, limit: int = 3) -> list[tuple[int, Document]]:
    # Embed the query with the same model we used for the documents
    vector = embedding_model().embed_query(query)
    vector_length = len(vector)
    embedded_query = f"{vector}::FLOAT[{vector_length}]"
    search_statement = f"SELECT * FROM embeddings ORDER BY array_cosine_distance(embedding, {embedded_query}) LIMIT {limit}"
    results = duckdb.connect(database).execute(search_statement)
    documents: list[tuple[int, Document]] = []
    for document, content, enumeration, _ in results.fetchall():
        documents.append((enumeration, Document(name=document, content=content)))
    return documents


def augment(query: str, documents: list[tuple[int, Document]]) -> ChatPromptValue:
    prompt = hub.pull("rlm/rag-prompt")
    # Restore the original order of the chunks before building the context
    documents.sort(key=lambda t: t[0])
    context = "\n\n".join(doc[1].content for doc in documents)
    return prompt.invoke({"question": query, "context": context})


def generate(prompt: ChatPromptValue, model: str) -> str:
    llm = init_chat_model(model=model, model_provider="ollama")
    response = llm.invoke(prompt)
    return response.text()

Most of the code should be pretty straightforward - the “boring but reliable” part of the pipeline. The interesting bit is the SQL SELECT statement in the retrieve function:

SELECT * FROM embeddings 
ORDER BY array_cosine_distance(embedding, <query>) 
LIMIT <number>

The ORDER BY clause uses the array_cosine_distance function to calculate the cosine distance between the embedding of the query and each document embedding in the database - exactly the kind of query the HNSW index we created via the vss extension is there to speed up.

Mathematically speaking, cosine similarity measures how aligned two vectors are, which we’ll interpret as “finding text chunks that are semantically similar to the query”. The cosine distance used in the query above is simply 1 minus that similarity.

To put it another way: since ORDER BY sorts by distance in ascending order, the closer an embedding is to the query, the higher up it’ll appear in the results.
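In case you’d like to see the math in action, here’s a tiny worked example of cosine distance (i.e. 1 minus cosine similarity, which is what array_cosine_distance computes) in plain Python - no DuckDB involved:

# A tiny worked example of cosine distance (1 - cosine similarity) in plain Python.
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 -> same direction, i.e. "very similar"
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 -> orthogonal, i.e. "unrelated"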

We’ve actually seen retrieve in action before: it was part of the unit test “pipeline” from the last section, where we used it to search for documents that are similar to our input query - we’re using the same principle here as well!

Also, the augment part of our pipeline is pretty simple and doesn’t involve a lot of the context augmentation techniques that you’d typically see here, like summarization, ranking, trimming etc.

What we did do, however, is reorder the text chunks “chronologically”, i.e. in the same order that they would appear in the original text, regardless of their similarity score, because I’m convinced that coherence matters more than raw similarity when it comes to context engineering (fight me about it!)

But now, with all parts of the pipeline set up, let’s build the actual RAG pipeline:

# file 'main.py'

# ... import statements go here ...

def answer(query: str, model: str, database: str, limit: int = 3) -> str:
    documents = retrieve(query=query, database=database, limit=limit)
    prompt = augment(query=query, documents=documents)
    return generate(prompt=prompt, model=model)
    

query = "Who was allied to Athens before the war?"
model = "hf.co/unsloth/gemma-3-12b-it-qat-GGUF:Q6_K"

print(f"Ollama model name: {model}")
print(f"Question: {query}")

# This is just a "convenience" function that loads our example text from file
document_path = peloponnesian_war_ebook()
# 'pipeline' is our data ingestion function; it also returns the name of the database file
_, database = pipeline(document_path, database=document_path.stem)
# Perform some RAG magic here
response = answer(query=query, model=model, database=database, limit=10)

print("")
print(f"Answer: {response}")

Again, nothing unusual or complicated here, so let’s briefly talk about the wording of our user query:

Who was allied to Athens before the war?

The question is kept intentionally vague and does not specify what “war” we’re talking about - this is important because it forces the LLM to rely on the context from our database rather than what it might know based on its training data.

If we had mentioned the Peloponnesian War in the question, then there’s a (high) chance that the answer could’ve been generated from the LLM’s internal knowledge instead of the retrieved context.

Let’s now run it and see what we’ll get back:

$ uv run python main.py
Ollama model name: hf.co/unsloth/gemma-3-12b-it-qat-GGUF:Q6_K
Question: Who was allied to Athens before the war?
Answer: Athens was allied with various groups, including the 
Chians, Lesbians, Plataeans, Messenians in Naupactus, most of 
the Acarnanians, the Corcyraeans, Zacynthians, and cities in 
Caria, Ionia, and the Hellespont. Additionally, they formed 
alliances with the Argives, Eleans, and Mantineans for a hundred 
years, and later with the Larisaeans, Pharsalians, and other 
Thracian groups like Sitalces and Perdiccas. Ancient alliances 
also brought the Thessalians to aid Athens.

Process finished with exit code 0

It worked! It correctly listed some allies of the Athenians from the Peloponnesian War - according to the text at hand!

And since this post turned out to be much longer than anticipated, I’ll wrap it up here - happy coding!