Duck on Deck: Docker, DuckDB & MotherDuck
- Crescent Team
- Apr 8
- 4 min read
Are you drowning in data spread across your Dockerized applications? You're not alone. Imagine this: you're running a Docker Compose stack with several apps – maybe a web front-end, an API, and a back-end service – all interacting with users and diligently logging data to a PostgreSQL database. You know there's valuable information hidden in those logs: insights into user behavior, performance bottlenecks, error patterns... but actually getting to that information feels like swimming upstream.
That's the problem. Quick, iterative data analysis during development (and in production) – the kind that gives you immediate feedback – becomes a major headache. In our Compose stack, a few containers are exposed to clients and users, and they all feed data into a PostgreSQL container. From this data, we need to answer questions like: "Which user actions are leading to the most errors?" or "How are users interacting with that new feature we just launched?"
Traditional tools like pgAdmin (as we experienced) often fall short. They can be slow, cumbersome to set up within a Docker environment, and don't integrate easily into a rapid development workflow. We end up writing complex SQL queries in a separate tool, constantly switching contexts, and losing precious development time. As our data grows, we want to pull it out of our users' and developers' workflows and start querying fast.
DuckDB is a blazing-fast, embedded database engine. Think of it as a tiny, incredibly efficient data analyst that lives inside your Docker environment. It lets you query your PostgreSQL data directly, without the overhead of a separate database server or complex configuration. This means you can get those crucial insights instantly, right where you need them – in your development workflow. Sounds like a great solution, but how do I get access to my data from anywhere, and run useful data exploration tasks? MotherDuck has the answer: it extends DuckDB’s power to the cloud, allowing for more data exploration, complex analysis, collaboration, and anywhere-access to your data, using a fast and simple connection.
DuckDB 🔗 Postgres
Setting up a DuckDB connection in your stack takes less than three minutes. Because DuckDB is embedded, you can either install it directly within an existing container (e.g., your application container) or use a very lightweight DuckDB-specific image. You can run it as follows, or configure this as an entrypoint in your docker-compose.yaml:
curl https://install.duckdb.org | sh
duckdb  # or /root/.duckdb/cli/latest/duckdb
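If you'd rather bake this into the Compose stack, a minimal sketch could look like the following – the service names, image tags, and credentials here are illustrative assumptions, not our actual configuration:

```yaml
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: user_interactions
      POSTGRES_USER: root
      POSTGRES_PASSWORD: password

  duckdb:
    image: ubuntu:24.04  # any slim base image with a shell works
    depends_on:
      - postgres
    # install curl, fetch the DuckDB CLI, then keep an interactive shell open
    entrypoint: ["/bin/sh", "-c",
      "apt-get update && apt-get install -y curl && curl https://install.duckdb.org | sh && exec /bin/sh"]
    stdin_open: true
    tty: true
```

Because both services share the default Compose network, the DuckDB container can reach Postgres at the hostname `postgres` used in the ATTACH string below.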
DuckDB automatically loads the Postgres extension! Then attach DuckDB to the Postgres database using
ATTACH 'dbname=user_interactions user=root password=password host=postgres' AS postgres_db (TYPE POSTGRES, READ_ONLY);
SHOW ALL TABLES;
If I want to copy the data into DuckDB (in-memory) for faster analytics, I use
CREATE TABLE duckdb_table AS FROM postgres_db.customer_data;
Or directly query the table I am analyzing
SELECT * FROM postgres_db.customer_data;
In this case, I want to find all messages from an AI agent to my users that resulted in a workflow error:
SELECT * FROM postgres_db.customer_chat_errors WHERE message LIKE '%ai%';
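The same attached table can answer the "which user actions lead to the most errors?" question from earlier with a simple aggregation – note that the `action` column name here is a hypothetical stand-in for whatever your schema actually uses:

```sql
-- Rank user actions by how many errors they produced.
-- 'action' is an assumed column name; adjust to your schema.
SELECT action, COUNT(*) AS error_count
FROM postgres_db.customer_chat_errors
GROUP BY action
ORDER BY error_count DESC
LIMIT 10;
```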

Scaling to Cloud ☁️
For more complex analytics and data exploration – and to access the data easily from anywhere – I load the data into MotherDuck. This seamless integration between DuckDB and MotherDuck makes it even faster to get answers from our data. Using the simple
ATTACH 'md:';
our Docker Compose stack data can be loaded into the MotherDuck cloud. The DuckDB CLI automatically loads the MotherDuck extension and launches a connection. I only have to verify the connection once (three clicks and one copy-paste in total) and I'm immediately good to go. I can now copy my data to MotherDuck cloud storage using
INSERT INTO MD_database.chat_histories SELECT * FROM postgres_db.chat_histories;
That's it! Literally all you need to load your data into cloud storage and analyze it anywhere is two SQL commands. I then open the MotherDuck app (the column explorer makes exploratory data analysis easy) and get querying!
Data exploration and quick queries aren't the only reasons we use MotherDuck here. JSON string handling within DuckDB is so seamless that it has become our internal go-to for processing certain sets of data. For example, some of our data consists of user interactions with an LLM-enabled bot. The output saved in the chat_histories table looks like this:
{
  "human": "hi there what are you working on now?",
  "AI": "Hello! I am currently running the WebSearcher tool to extract more information about Elon Musk. The tool has been running since 18:56:43 and has been called a total of 5 times. Is there anything specific you would like to know about the current run?",
  "with": {
    "tool_calls": "5",
    "tool_access": "xxxxx",
    "user_id": "xxxxxxx",
    "run_date": "xxxxxxx"
  }
}
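DuckDB's JSON functions make pulling individual fields out of payloads like this a one-liner. A rough sketch, assuming the JSON above lives in a column called `chat` (the column name and the integer cast are assumptions about the schema):

```sql
-- Extract scalar fields from the JSON payload with json_extract_string;
-- 'chat' is an assumed column name holding the JSON shown above.
SELECT
  json_extract_string(chat, '$.human')                    AS user_message,
  CAST(json_extract_string(chat, '$.with.tool_calls') AS INT) AS tool_calls,
  json_extract_string(chat, '$.with.user_id')             AS user_id
FROM MD_database.chat_histories;
```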
along with more elaborate information about histories, tool calls, and API requests. I can quickly write a query that answers a question like "How many API calls have been made to the XYZ API in the last 3 days?" to calculate the associated cost. I run MotherDuck's prompt() function with an expected output structure defined via the struct parameter. That's it – with MotherDuck I can get answers to my questions within seconds. We believe this question-to-insight time is truly the best benchmark!
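A prompt() call with structured output might be sketched roughly as follows – the `chat` column name and the struct fields are illustrative assumptions, and the exact signature should be checked against the MotherDuck documentation:

```sql
-- Sketch: ask the LLM to summarize API usage from a chat record,
-- forcing a typed result via the struct parameter.
-- 'chat', 'api_name', and 'call_count' are hypothetical names.
SELECT prompt(
  'From this chat record, report how many API calls were made per API: ' || chat,
  struct := {api_name: 'VARCHAR', call_count: 'INT'}
) AS usage_summary
FROM MD_database.chat_histories;
```

Because the result comes back as a typed STRUCT rather than free text, it can be aggregated or filtered in the same query as ordinary columns.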