Hello data folks! How’re you?
As 2024 is about to end, I am finally publishing my first tech blog (one of this year's resolutions, which I had been mulling over for quite a while). I'll start with a little about me: my name is Simbu, and I have been working in the field of data engineering for more than a decade. I'm passionate about learning anything new in the data industry, and especially since the AI boom sparked by OpenAI/Nvidia, the demand for data engineering has been expanding rapidly across the entire Data and AI landscape.
Intro
Today, I'm thrilled to share an interesting data stack I put together with open source tools, involving MinIO (object storage) and DuckDB, inspired by this blog post. I have been wanting to get my hands on DuckDB for a while, and with my recent exposure to MinIO, I thought it was the perfect concoction for a great recipe. The limitations of my lightweight laptop, with only 8GB of RAM, prompted me to explore this setup for storing and processing data on a single node. My past struggles with resource-heavy Docker images, which demand substantial RAM, further motivated this choice. With the availability of cost-effective storage solutions today, MinIO is an excellent choice for developers seeking an S3-compatible object store for both local and cloud setups. DuckDB, on the other hand, shines in OLAP scenarios, offering a combination of speed and simplicity.
So what is MinIO?
MinIO is an object storage solution that provides an Amazon Web Services S3-compatible API and supports all core S3 features. MinIO is built to deploy anywhere — public or private cloud, BareMetal infrastructure, orchestrated environments, and edge infrastructure.
This means we can use MinIO locally by downloading the minio.exe file for Windows, setting it up to run on localhost, and accessing it through a web browser. This effectively simulates the S3 APIs, providing a comparable experience.
## navigate to the folder where minio.exe was downloaded; the path argument below is the directory where MinIO stores its data
.\minio.exe server <path/to/data-dir> --console-address ":9001"
If you've configured it successfully, you'll be greeted by a login page (http://127.0.0.1:9001) similar to the image below, where you can use the default login minioadmin/minioadmin. You could keep that login or create a new user to manage your activities.
For this use case, I created 3 different users (which can be created under Identity → Users) with varying access to the MinIO buckets. To simplify: the developer role can read and write data to both buckets (that's the role I'm using for this exploration), the consumer has only read access to the buckets, and the producer has only write access (I never used these two, though).
Now, to create a bucket, navigate to the Buckets tab (under the Administrator section, so this requires super-user privileges) and click the “Create Bucket +” button at the top. I created two buckets with the configurations shown below: one to store the raw data as-is, and a second one to store the transformations (the raw-dataset bucket holds csv files, whereas the compressed-zone holds parquet files), which we'll use as the source for analysis through DuckDB.
Now that the buckets are in place, I manually uploaded some files to the raw-dataset bucket through the console UI. For this activity, I downloaded the Formula 1 World Championship public dataset (I have been following F1 since 2005) from Kaggle.
While all of the above was done through the MinIO console (I suggest you explore its many other features and get yourself familiarized), I was also able to achieve the same through the MinIO API for the most frequent use cases: creating/removing a bucket, listing or removing its objects, and uploading data (files and folders) to these buckets. I used Python 3.9 for this exercise, and the entire code is available in this GitHub repo. Do note, however, that to get your hands dirty with this code, it's important to set up your access keys (similar to the access key and secret access key in AWS) to connect to the MinIO server from your machine. You can generate the keys through the browser and store them locally in the codebase, via environment variables, or in something like a vault/secrets manager.
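For reference, here is a minimal sketch of those frequent operations with the MinIO Python SDK; the endpoint, bucket, prefix and file names are the ones used in this post or placeholders, and the keys are read from environment variables rather than hard-coded:

import os
from minio import Minio

# Connect to the local MinIO server (API on the default port 9000; secure=False since it is plain HTTP)
client = Minio(
    "127.0.0.1:9000",
    access_key=os.environ["MINIO_ACCESS_KEY"],
    secret_key=os.environ["MINIO_SECRET_KEY"],
    secure=False,
)

# Create the bucket if it does not exist yet
if not client.bucket_exists("raw-dataset"):
    client.make_bucket("raw-dataset")

# Upload a local csv file under the inputs/ prefix
client.fput_object("raw-dataset", "inputs/circuits.csv", "downloads/circuits.csv")

# List (and, if needed, remove) the objects under that prefix
for obj in client.list_objects("raw-dataset", prefix="inputs/", recursive=True):
    print(obj.object_name, obj.size)
    # client.remove_object("raw-dataset", obj.object_name)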
Clicking on the pencil icon at the far right allows me to edit the policies associated with this user (developer, who has both read and write access to the buckets), as follows:
Diving into the contents of the raw-dataset bucket, we see many csv files, each carrying specific information; these are converted to parquet files using DuckDB and stored in the secondary bucket, compressed-zone.
With this, let's dive into DuckDB for transforming this data into a compressed form with a simplified, SQLite-like, in-process database approach where the computations happen in-memory.
And DuckDB? As per its official site,
DuckDB is a fast, simple, in-process, analytical, open-source, portable DBMS supporting a feature-rich SQL dialect complemented with deep integrations into client APIs.
So why is this efficient to run on a single node? There is no separate DBMS server software to install, update and maintain. DuckDB does not run as a separate process; it is completely embedded within a host process. For the analytical use cases that DuckDB targets, this has the additional advantage of high-speed data transfer to and from the database, and in some cases DuckDB can process foreign data without copying it (similar to Snowflake's zero-copy feature). But unlike Snowflake, DuckDB operates entirely in-process, avoiding remote metadata storage dependencies. This design makes it lightweight and perfect for local or embedded analytics scenarios. Running locally on your PC/laptop, DuckDB can be used in two modes: in-memory or persistent.
By default, it connects with the ":memory:" option:
import duckdb
con1 = duckdb.connect(":memory:") # in-memory mode
con2 = duckdb.connect(r"D:/DuckDB/my_database.duckdb") # persistent to disk
In the above code snippet, con1 is an in-memory connection and con2 is a persistent connection, meaning the db objects are read from and written to that file path when we use the latter mode.
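To make the difference concrete, here is a tiny sketch (the file name is just an example): a table created on the persistent connection survives the process, while the in-memory one disappears with it.

import duckdb

# In-memory: objects live (and die) with this process
mem = duckdb.connect(":memory:")
mem.execute("CREATE TABLE t AS SELECT 42 AS answer")

# Persistent: the table is written into the .duckdb file on disk
disk = duckdb.connect("my_database.duckdb")
disk.execute("CREATE OR REPLACE TABLE t AS SELECT 42 AS answer")
disk.close()

# Re-opening the same file later still finds the table
disk = duckdb.connect("my_database.duckdb")
print(disk.execute("SELECT * FROM t").fetchall())   # [(42,)]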
A simple DuckDB SQL command to convert a csv file to parquet is shown below:
COPY (SELECT * FROM 's3://raw-dataset/inputs/circuits.csv')
TO 's3://compressed-zone/inputs/circuits.parquet' (FORMAT 'PARQUET');
That's how easy it is to convert files from one format to another using DuckDB. Care must be taken to set up DuckDB so it can reach the local MinIO server; below is the code I used to connect DuckDB while reading/writing data from/to MinIO.
def connect(self):
    """
    Establishes a connection for the duckdb client
    """
    try:
        self.con = duckdb.connect(database=self.db_path, read_only=False)
        self.con.execute(f"""
            INSTALL httpfs;
            LOAD httpfs;
            SET s3_endpoint='{self.minio_host}';
            SET s3_access_key_id='{self.minio_access_key}';
            SET s3_secret_access_key='{self.minio_secret_key}';
            SET s3_use_ssl=false;
            SET s3_url_style='path';
        """)
        return self.con
    ...
- INSTALL httpfs; / LOAD httpfs; installs and loads the DuckDB extension used to connect to remote resources like S3 (or anything served over HTTP(S)); this is how we reach the local MinIO installation.
- SET s3_use_ssl=false; disables SSL for the connection, since the local MinIO server is served over plain HTTP and no certificates are involved.
- SET s3_url_style='path'; lets us address objects with path-style URLs, the same way we define paths within S3 (it could instead be set to 'vhost' for virtual-host-style S3 URIs).
With this setup, we were able to iterate over all the source csv files under the prefix s3://raw-dataset/inputs and convert them one by one to parquet, driving DuckDB from the MinIO list_objects iterator (similar to S3's) and storing the output in the destination bucket effortlessly (a sketch of this loop follows below). DuckDB can read csv files directly as well, but parquet is better suited for the analytical queries that are the forte of this setup, and for how it can be leveraged in future activities. Once this data is ready, we can start with some basic queries, like which driver or constructor has the most points, wins, poles, etc.
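Here is that conversion loop as a minimal sketch, assuming the connect() helper above and the MinIO client sketched earlier; the bucket and prefix names are the ones used in this post:

import os

# client: MinIO client from earlier; con: DuckDB connection returned by connect()
for obj in client.list_objects("raw-dataset", prefix="inputs/", recursive=True):
    if not obj.object_name.endswith(".csv"):
        continue  # skip anything that is not a csv source file
    stem = os.path.splitext(os.path.basename(obj.object_name))[0]
    con.execute(f"""
        COPY (SELECT * FROM 's3://raw-dataset/{obj.object_name}')
        TO 's3://compressed-zone/inputs/{stem}.parquet' (FORMAT 'PARQUET');
    """)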
To understand the data in the source files, I created two schemas, raw_data and derived_data (yes, DuckDB supports schemas as well; however, grants and user management features are not yet supported). I created tables from these parquet files in the raw_data schema, as they hold some dimensional data, like drivers, races, constructors, status, etc. The source data model looks like this:
As we can see from the above, the results & sprint_results tables look like central fact tables, while drivers, constructors, circuits and status are dimension tables, each with its own attributes that seldom change (and when they do, the historical values are not preserved, so these follow the SCD type 1 model); together they form the classic star schema of a typical warehouse. Tables like driver_standings, constructor_standings & constructor_results are quasi-fact tables: they do help in finding some metrics, like number of wins or points scored by a driver/team, but they don't help in extracting fine-grained metrics such as the number of drivers winning from pole, the drivers who won from the most different grid positions, or fastest laps per driver/constructor.
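For completeness, registering those parquet files as tables in the raw_data schema took nothing more than statements along these lines (a sketch; only two of the tables are shown, and the object paths are the ones assumed in this post):

CREATE SCHEMA IF NOT EXISTS raw_data;

-- Each parquet object becomes a table in the raw_data schema
CREATE TABLE raw_data.drivers AS
    SELECT * FROM 's3://compressed-zone/inputs/drivers.parquet';
CREATE TABLE raw_data.results AS
    SELECT * FROM 's3://compressed-zone/inputs/results.parquet';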
For extracting those finer metrics from raw_data, I'm going to create the derived_data schema and model this resultFacts table, with the definition below:
USE raw_data;
-- Note that this derived table is created in the derived_data schema
DROP TABLE IF EXISTS derived_data.resultFacts;
CREATE TABLE derived_data.resultFacts AS
SELECT res1.resultId, rac1.raceId, 'Normal race' AS race_type,
       rac1.name||' '||rac1.year AS race_name,
       dr1.forename||' '||dr1.surname AS driver_name, cons1.name AS constructor_name,
       res1.grid AS starting_grid, res1.position, res1.positionText AS finishPosition,
       res1.positionOrder AS finishOrder, res1.points, res1.laps,
       res1.time||'.'||res1.milliseconds AS raced_time,
       res1.fastestLap, res1.rank, res1.fastestLapTime,
       res1.fastestLapSpeed AS fastestLapSpeed_kmph, st1.status
FROM results res1
JOIN status st1 USING (statusId)
JOIN races rac1 USING (raceId)
JOIN drivers dr1 USING (driverId)
JOIN constructors cons1 USING (constructorId)
UNION ALL
SELECT res2.resultId, rac2.raceId, 'Sprint race' AS race_type,
       rac2.name||' '||rac2.year AS race_name,
       dr2.forename||' '||dr2.surname AS driver_name, cons2.name AS constructor_name,
       res2.grid AS starting_grid, res2.position, res2.positionText AS finishPosition,
       res2.positionOrder AS finishOrder, res2.points, res2.laps,
       res2.time||'.'||res2.milliseconds AS raced_time,
       res2.fastestLap, NULL AS rank, res2.fastestLapTime,
       NULL AS fastestLapSpeed_kmph, st2.status
FROM sprint_results res2
JOIN status st2 USING (statusId)
JOIN races rac2 USING (raceId)
JOIN drivers dr2 USING (driverId)
JOIN constructors cons2 USING (constructorId);
This materialization of the resulting cube is cleaned and transformed from the base dimensional tables, and is loaded at an atomic granularity on the combination of driver, constructor and race, so any aggregate metrics needed on race results, drivers or constructors can be derived from it as applicable. For example, to arrive at the most points per driver over a season:
SELECT driver_name, SUM(points) AS driver_points
FROM derived_data.resultFacts -- includes both Normal races & Sprints
WHERE RIGHT(race_name, 4) = '2021'
GROUP BY driver_name
ORDER BY driver_points DESC, driver_name;
This result includes the points scored by drivers in Sprint races as well, since the resultFacts table combines ‘Normal race’ and ‘Sprint race’ rows via the UNION ALL clause.
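If we want the classic championship view without the sprint points, the race_type column makes the filter trivial; a quick variant of the query above:

-- Season points from Grands Prix only (sprints excluded)
SELECT driver_name, SUM(points) AS driver_points
FROM derived_data.resultFacts
WHERE RIGHT(race_name, 4) = '2021'
  AND race_type = 'Normal race'
GROUP BY driver_name
ORDER BY driver_points DESC, driver_name;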
Another interesting feature I'd like to point out here is the IMPORT and EXPORT DATABASE commands. The export command simply writes the schema definitions and data to the local path we give it, as below:
-- Exports the in-memory database to disk
EXPORT DATABASE '/path/to/export'
    (FORMAT PARQUET,
     COMPRESSION ZSTD,
     ROW_GROUP_SIZE 100_000);

-- Imports the data from disk to memory
IMPORT DATABASE '/path/of/exported/folder';
Most fastest laps per driver over a career:
While the result above is far from reality for drivers of the past, I found that in the source data itself, fastest-lap metrics aren't captured for races prior to 2004, and hence Michael Schumacher is heavily disfavored here. So this is essentially an issue with the dataset, not with DuckDB or MinIO.
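For context, the metric above comes from a query along these lines; a sketch assuming that rank = 1 in the source results marks the fastest lap of a race, with TRY_CAST guarding against the text placeholders the raw csv uses for missing values:

SELECT driver_name, COUNT(*) AS fastest_laps
FROM derived_data.resultFacts
WHERE race_type = 'Normal race'
  AND TRY_CAST(rank AS INTEGER) = 1
GROUP BY driver_name
ORDER BY fastest_laps DESC;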
I have collated some of the simpler metrics and their computations in queries.sql, where I have defined the schemas for the views and tables. I would like to deep-dive into the analytical prowess of DuckDB, like the pivot/unpivot, window/qualify and filter clauses, and keep updating this file.
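As a small taste of those clauses, DuckDB's QUALIFY lets us filter directly on a window function; for example, the top-scoring driver per constructor in a season could be sketched as:

-- Top-scoring driver per constructor for the 2021 season
SELECT constructor_name, driver_name, SUM(points) AS season_points
FROM derived_data.resultFacts
WHERE RIGHT(race_name, 4) = '2021'
GROUP BY constructor_name, driver_name
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY constructor_name
    ORDER BY SUM(points) DESC) = 1
ORDER BY season_points DESC;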
Pros and Cons — MinIO + DuckDB
- Lightweight and easy to set up (both MinIO and DuckDB) and open source (for individuals); need I say more?
- The MinIO APIs are quite similar to S3's, which makes it easy to port this setup to S3 in the future.
- MinIO sets the bar as one of the fastest object storage solutions available, ensuring data access and retrieval are lightning fast.
- When integrated with DuckDB, this performance boost enhances the efficiency of data analytics and processing.
- DuckDB’s zero-copy integration with data sources like Parquet files allows for high-speed query execution directly without duplicating data.
- While DuckDB is super fast, it is not suitable for live/transactional workloads, a.k.a. high-frequency writes.
- With DuckDB, schema support is basic compared to other databases, and user management is outside its current scope due to its design as an in-process DBMS.
Next steps?
- Keeping this simplicity in mind, I would like to explore the lakehouse table formats (Hudi/Iceberg) that DuckDB supports for reading/writing.
- Exploring the features inside MinIO, viz., replication, encryption (SSE-S3, KMS), IAM, lifecycle management and versioning, as securing this setup is equally important.
- Scouting DuckDB features like extensions, the ADBC protocol, indexing, and benchmarking with bulkier data volumes (in persistent mode).
Outro
With this post, I'd like to highlight the lightweight nature of this setup, along with its advantages and disadvantages, and whether we can counter the latter (or at least try to) optimally.
Suggestions are welcome, and your feedback on improving this stack is hugely appreciated. Thank you!