A lean data stack idea

Simbu
10 min read · Dec 12, 2024


Hello data folks! How’re you?

As 2024 comes to an end, I am finally publishing my first tech blog (one of this year's resolutions, which I have been mulling over for quite a while). I'll start with a little about me: my name is Simbu, and I have been working in data engineering for more than a decade. I'm passionate about learning anything new in the data industry, especially now that the AI boom driven by OpenAI/Nvidia is rapidly expanding the demand for data engineering across the entire Data and AI landscape.

Intro

Today, I'm thrilled to share an interesting data stack I put together with open-source tools, MinIO (object storage) and DuckDB, inspired by this blog post. I have been wanting to get my hands on DuckDB for a while, and with my recent exposure to MinIO, I thought it was the perfect concoction for a great recipe. The limitations of my lightweight laptop, with only 8GB of RAM, prompted me to explore this setup for storing and processing data on a single node. My past struggles with resource-heavy Docker images, which require substantial RAM, further motivated this choice. With cost-effective storage readily available today, MinIO is an excellent choice for developers seeking an S3-compatible object store for both local and cloud setups. DuckDB, on the other hand, shines in OLAP scenarios, offering a combination of speed and simplicity.

Harnessing the power of MinIO and DuckDB

So what is MinIO?

MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features. MinIO is built to deploy anywhere — public or private cloud, BareMetal infrastructure, orchestrated environments, and edge infrastructure.

This means we can use MinIO locally by downloading the minio.exe file for Windows, setting it up to run on localhost, and accessing it through a web browser. This effectively simulates the S3 API locally, providing a comparable experience.

# Navigate to the folder containing minio.exe; the path argument is the directory where MinIO stores its data
.\minio.exe server <path\to\data-directory> --console-address ":9001"
We can start the MinIO server from PowerShell/terminal with the above command.

If you've configured everything successfully, you'll see a login page (http://127.0.0.1:9001) similar to the image below, where you can use the default credentials minioadmin/minioadmin. You can keep that login or create a new user to manage your activities.

This is the console login page served as a result of the above command

In this use case, I created three users more or less at random (under Identity → Users), each with varying access to the MinIO buckets. To simplify: the developer role can read and write data in both buckets (that's the role I'm using for this exploration), the consumer has read-only access, and the producer has write-only access (I never actually used these two, though).

Under the Identity option, selecting Users takes you to this page, where you can create or edit users and vary their permissions via S3 bucket policies
This image lists the current users

Now, to create a bucket, navigate to the Buckets tab (under the Administrator section, so this requires super-user privileges) and click the "Create Bucket +" button at the top. I created two buckets with the configuration shown below: one to store the raw data as-is, and a second to store the transformed output (the raw-dataset bucket holds CSV files, whereas the compressed-zone bucket holds Parquet files), which we'll use as the source for analysis through DuckDB.

Configuration applied when creating these buckets through the MinIO console

Now that the buckets are in place, I manually uploaded some files to the raw-dataset bucket through the console UI. For this exercise, I downloaded the Formula 1 World Championship public dataset (I've been following F1 since 2005) from Kaggle.

While all of the above was done through the MinIO console (I suggest exploring its many other features to get familiar with it), I was also able to achieve the same through the MinIO API for the most frequent use cases: creating/removing a bucket, listing or removing its objects, and uploading data (files and folders) to these buckets. I used Python 3.9 for this exercise, and the entire code is available in this GitHub repo. Do note, however, that in order to get your hands dirty with this code, it's important to set up your access keys (similar to the access key and secret access key in AWS) to connect to the MinIO server from your machine. We can generate the keys through the browser and store them locally in the codebase, via environment variables, or somewhere like a vault/secrets manager.

Access key with its secret to access the MinIO server
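With an access key and secret in hand, the bucket and object operations mentioned above can be scripted with the MinIO Python SDK. Here is a minimal sketch (the endpoint, credentials, and file paths are placeholders; the actual code in the repo is organised differently):

from minio import Minio

# Connect to the local MinIO server (placeholders for the generated credentials)
client = Minio(
    "127.0.0.1:9000",
    access_key="<your-access-key>",
    secret_key="<your-secret-key>",
    secure=False,  # the local server runs over plain HTTP
)

# Create the bucket if it does not exist yet
if not client.bucket_exists("raw-dataset"):
    client.make_bucket("raw-dataset")

# Upload a CSV file into the inputs/ prefix
client.fput_object("raw-dataset", "inputs/circuits.csv", "D:/f1/circuits.csv")

# List the objects under that prefix
for obj in client.list_objects("raw-dataset", prefix="inputs/", recursive=True):
    print(obj.object_name, obj.size)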

Clicking the pencil icon on the far right lets me edit the policies associated with this user (the developer, who has both read and write access to the buckets), as follows,

In the above JSON, we can edit the Actions under Allow and the Resource to restrict access at the attribute level

Diving into the raw-dataset bucket, we see a number of CSV files, each holding specific information; these are converted to Parquet files using DuckDB and stored in the secondary bucket, compressed-zone.

raw-dataset bucket and its content (CSV files)
compressed-zone bucket and its content — post transformation with DuckDB

With this, let's dive into DuckDB to transform this data into a compressed form, using a simple, SQLite-like in-process database approach where the computation happens in-memory.

And DuckDB? As per its official site,

DuckDB is a fast, simple, in-process, analytical, open-source, portable DBMS supporting a feature-rich SQL dialect complemented with deep integrations into client APIs.

So why is this efficient on a single node? There is no separate DBMS server software to install, update, and maintain. DuckDB does not run as a separate process; it is embedded entirely within a host process. For the analytical use cases DuckDB targets, this has the additional advantage of high-speed data transfer to and from the database, and in some cases DuckDB can process foreign data without copying it (loosely comparable to Snowflake's zero-copy capabilities). Unlike Snowflake, though, DuckDB operates entirely in-process, with no remote metadata store to depend on, which makes it lightweight and a great fit for local or embedded analytics. Running locally on your PC/laptop, DuckDB can be used in two modes: in-memory or persistent.

By default, a connection uses the in-memory ("memory") option,

import duckdb
con1 = duckdb.connect(":memory:") # in-memory mode
con2 = duckdb.connect(r"D:/DuckDB/my_database.duckdb") # persistent to disk

In the snippet above, con1 is an in-memory connection and con2 is a persistent connection, meaning the database objects are stored at and referenced from that path when we use the latter mode.
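Either connection can also query external files in place, which is the "process foreign data without copying" behaviour mentioned earlier. A small illustrative sketch (the file path is a placeholder):

import duckdb

con = duckdb.connect(":memory:")

# Query a Parquet (or CSV) file directly, without importing it into a table first
rows = con.execute(
    "SELECT COUNT(*) AS row_count FROM 'D:/f1/circuits.parquet'"
).fetchall()
print(rows)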

A simple DuckDB SQL command to convert a CSV file to Parquet looks like this,

COPY (SELECT * FROM 's3://raw-dataset/inputs/circuits.csv')
TO 's3://compressed-zone/inputs/circuits.parquet' (FORMAT 'PARQUET');

That's how easy it is to convert files from one format to another using DuckDB. Care must be taken to set up DuckDB so it can reach the local MinIO server; below is the code I used to connect DuckDB while reading/writing data from/to MinIO.

def connect(self):
    """
    Establishes a connection for the duckdb client
    """
    try:
        self.con = duckdb.connect(database=self.db_path, read_only=False)
        self.con.execute(f"""
            INSTALL httpfs;
            LOAD httpfs;
            SET s3_endpoint='{self.minio_host}';
            SET s3_access_key_id='{self.minio_access_key}';
            SET s3_secret_access_key='{self.minio_secret_key}';
            SET s3_use_ssl=false;
            SET s3_url_style='path';
        """)
        return self.con
    ...
  • INSTALL/LOAD httpfs; — this DuckDB extension lets DuckDB talk to remote resources such as S3 (over HTTP/HTTPS), which is how we reach the local MinIO installation.
  • SET s3_use_ssl=false — disables SSL, since the local MinIO server is serving plain HTTP and no certificates are involved.
  • SET s3_url_style='path' — connects to S3 resources using path-style URLs (endpoint/bucket/key), the same way MinIO exposes them; it could be set to 'vhost' for virtual-hosted-style S3 URIs instead.

With this setup, we can iterate over all the source CSV files under the prefix s3://raw-dataset/inputs using the MinIO list_objects iterator (similar to S3's), convert them one by one to Parquet with DuckDB, and store them in the destination bucket, as shown below. (DuckDB can read CSV files directly as well, but I wanted to try this out, and Parquet is the better fit for the analytical queries that are the forte of this setup and for future activities.) Once the data is ready, we can start with some basic questions, like which driver or constructor has the most points, wins, poles, etc.
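A minimal sketch of that conversion loop, combining the MinIO client and the DuckDB connection from above (bucket names follow this post; the client and con objects are assumed to be configured as shown earlier, and the repo's code is structured differently):

import os

# Iterate over the CSV objects in the raw bucket and convert each one to Parquet
for obj in client.list_objects("raw-dataset", prefix="inputs/", recursive=True):
    if not obj.object_name.endswith(".csv"):
        continue
    base = os.path.splitext(os.path.basename(obj.object_name))[0]
    con.execute(f"""
        COPY (SELECT * FROM 's3://raw-dataset/{obj.object_name}')
        TO 's3://compressed-zone/inputs/{base}.parquet' (FORMAT 'PARQUET');
    """)
    print(f"Converted {obj.object_name} -> inputs/{base}.parquet")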

To make sense of the source files, I created two schemas, raw_data and derived_data (yes, DuckDB supports schemas as well, although grants and user management are not yet supported). I created tables from these Parquet files in the raw_data schema, since they hold mostly dimensional data such as drivers, races, constructors, status, etc. The source data model looks like this,

A simplified logical data model of the source data — included only necessary tables for this analysis
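Creating the schemas and loading the Parquet files into tables takes only a couple of statements per table. A hedged sketch of what that looks like (table names are assumed to mirror the file names, and only a few are shown):

# Create the two schemas and register a few of the Parquet files as tables
con.execute("CREATE SCHEMA IF NOT EXISTS raw_data;")
con.execute("CREATE SCHEMA IF NOT EXISTS derived_data;")

for table in ["drivers", "constructors", "races", "circuits", "status", "results", "sprint_results"]:
    con.execute(f"""
        CREATE OR REPLACE TABLE raw_data.{table} AS
        SELECT * FROM 's3://compressed-zone/inputs/{table}.parquet';
    """)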

As we can see from the above, the results and sprint_results tables act as the central fact tables, while drivers, constructors, circuits, and status are dimension tables, each with attributes that seldom change (and when they do, historical values are not preserved, so they follow an SCD type 1 model); together they form the classic star schema of a typical warehouse. Tables like driver_standings, constructor_standings, and constructor_results are quasi-fact tables: they help derive some metrics, such as number of wins or points per driver/team, but they don't help with finer-grained metrics like the number of drivers winning from pole, the drivers who won from the widest range of grid positions, or fastest laps per driver/constructor.

To extract those finer metrics from raw_data, I created the derived_data schema and modelled a resultFacts table with the definition below,

USE raw_data;

--- Note that this derived table is created in derived_data schema
DROP TABLE IF EXISTS derived_data.resultFacts;
CREATE TABLE derived_data.resultFacts AS
SELECT res1.resultId, rac1.raceId, 'Normal race' AS race_type,
       rac1.name||' '||rac1.year AS race_name,
       dr1.forename||' '||dr1.surname AS driver_name,
       cons1.name AS constructor_name,
       res1.grid AS starting_grid, res1.position,
       res1.positionText AS finishPosition, res1.positionOrder AS finishOrder,
       res1.points, res1.laps,
       res1.time||'.'||res1.milliseconds AS raced_time,
       res1.fastestLap, res1.rank, res1.fastestLapTime,
       res1.fastestLapSpeed AS fastestLapSpeed_kmph, st1.status
FROM results res1
JOIN status st1 USING (statusId)
JOIN races rac1 USING (raceId)
JOIN drivers dr1 USING (driverId)
JOIN constructors cons1 USING (constructorId)
UNION ALL
SELECT res2.resultId, rac2.raceId, 'Sprint race' AS race_type,
       rac2.name||' '||rac2.year AS race_name,
       dr2.forename||' '||dr2.surname AS driver_name,
       cons2.name AS constructor_name,
       res2.grid AS starting_grid, res2.position,
       res2.positionText AS finishPosition, res2.positionOrder AS finishOrder,
       res2.points, res2.laps,
       res2.time||'.'||res2.milliseconds AS raced_time,
       res2.fastestLap, NULL AS rank, res2.fastestLapTime,
       NULL AS fastestLapSpeed_kmph, st2.status
FROM sprint_results res2
JOIN status st2 USING (statusId)
JOIN races rac2 USING (raceId)
JOIN drivers dr2 USING (driverId)
JOIN constructors cons2 USING (constructorId);

This materialized result set is cleaned and transformed from the base dimensional data and is loaded at an atomic granularity on the combination of driver, constructor, and race, so any aggregate metric on race results, drivers, or constructors can be derived from it as needed. For example, to get the most points per driver over a season,

SELECT driver_name, SUM(points) AS driver_points
FROM derived_data.resultFacts --- includes both Normal races & Sprints
WHERE RIGHT(race_name, 4) = '2021'
GROUP BY driver_name
ORDER BY driver_points DESC, driver_name;
DuckDB driver points data — 2021 season
F1 site driver points data — 2021 season

This result includes the points scored by drivers in Sprint races as well, since the resultFacts table covers both 'Normal race' and 'Sprint race' rows via the UNION ALL.

Another interesting feature I'd like to point out here is the IMPORT and EXPORT DATABASE commands. EXPORT DATABASE simply writes the schema definitions and data to the local path we provide, as below,

-- Exports the in-memory database to disk
EXPORT DATABASE '/path/to/export'
    (FORMAT PARQUET,
     COMPRESSION ZSTD,
     ROW_GROUP_SIZE 100_000);

-- Imports the data from disk to memory
IMPORT DATABASE '/path/of/exported/folder';

Most fastest laps per driver over a career,

Data derived from our dataset
Data from Google
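For reference, the numbers from our dataset came from a query along these lines (a sketch, not the exact query from the repo; it assumes the dataset's rank column equals 1 when a driver set the fastest lap of that race):

# Count, per driver, the races in which they set the fastest lap.
# TRY_CAST guards against the '\N' placeholders the source CSVs use for missing values.
fastest_laps = con.execute("""
    SELECT driver_name, COUNT(*) AS fastest_laps
    FROM derived_data.resultFacts
    WHERE TRY_CAST(rank AS INTEGER) = 1
    GROUP BY driver_name
    ORDER BY fastest_laps DESC
    LIMIT 10;
""").fetchall()
print(fastest_laps)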

While the result above is far from reality for drivers of the past, I found that in the source data itself fastest-lap metrics aren't captured for races prior to 2004, so Michael Schumacher is heavily disadvantaged here. This is essentially an issue with the dataset, not with DuckDB or MinIO.

This is a result row from the 2003 Australian GP; we can see the fastestLap#, fastestLapTime, and rank columns default to \N

I have collated some of the simpler metrics and their computations in queries.sql, where I have also defined the schemas for the views and tables. I would like to deep dive into the analytical prowess of DuckDB, such as PIVOT/UNPIVOT, window functions with QUALIFY, and FILTER clauses, and keep updating this file.
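As a small taste of those features, here is a hedged sketch using a window function with QUALIFY on the resultFacts table built above, picking each driver's best finishing position (column names follow the earlier table definition):

# Best (lowest) finishing order per driver, one row each, via a window function + QUALIFY
best_finish = con.execute("""
    SELECT driver_name, race_name, finishOrder
    FROM derived_data.resultFacts
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY driver_name
        ORDER BY finishOrder
    ) = 1
    ORDER BY driver_name;
""").fetchall()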

Pros and Cons — MinIO + DuckDB

Pros:

  • Lightweight and easy to set up (both MinIO and DuckDB) and open-source (for individual use); need I say more?
  • The MinIO API is very close to S3's, which makes porting to S3 in the future straightforward.
  • MinIO positions itself as one of the fastest object storage solutions available, keeping data access and retrieval snappy.
  • When integrated with DuckDB, that performance carries through to data analytics and processing.
  • DuckDB's zero-copy style of reading sources like Parquet files allows high-speed query execution directly on the data, without duplicating it.

Cons:

  • While DuckDB is super fast, it is not suited to live/transactional workloads, a.k.a. high-frequency writes.
  • DuckDB's schema support is basic compared to other databases, and user management is outside its current scope due to its design as an in-process DBMS.

Next steps?

  • Keeping this simplicity in mind, I would like to explore lakehouse table formats (Hudi/Iceberg) and how far DuckDB can read from and write to them.
  • Exploring more features inside MinIO, viz., replication, encryption (SSE-S3, KMS), IAM, lifecycle management, and versioning, since securing this setup is equally important.
  • Scouting DuckDB features like extensions, the ADBC protocol, indexing, and benchmarking against bulkier data volumes (in persistent mode).

Outro

With this post, I wanted to highlight the lightweight nature of this setup, along with its advantages and disadvantages and whether we can counter the latter (or at least try to) in an optimal way.

Suggestions are welcome, and your feedback on improving this stack is hugely appreciated. Thank you!

