An intro to Hudi with MinIO — I

Simbu

MinIO + Hudi

As I keep diving into the open-source wonders of the modern data stack, this time, I landed on something fascinating — Apache Hudi. It’s a complex, yet powerful open lakehouse table format (OLTF), built for lightning-fast upserts, supporting multi-modal and record-level indexing, and top-notch concurrency for reads and writes (across multiple engines!). In this post, I’ll walk you through the fundamentals, features, setup, and a hands-on use case — running Apache Hudi locally with MinIO for storage and Spark for compute.
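To set expectations for the hands-on part, here's roughly what the Spark-side wiring for Hudi on MinIO looks like. Treat it as a minimal sketch: the bundle version, endpoint, bucket, and credentials below are placeholders for a local setup and will depend on your Spark/Hudi versions.

```python
# A minimal sketch of a Spark session wired up for Hudi + MinIO (versions,
# endpoint, and credentials are placeholders for a local setup).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-minio-demo")
    # Hudi's Spark bundle; pick the artifact matching your Spark/Scala versions
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    # Point the s3a filesystem at the local MinIO server
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)
```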

First things first, what is a lakehouse? According to this awesome blog¹ by Dipankar, “A data lakehouse is a hybrid data architecture that combines the best attributes of data warehouses and data lakes to address their respective limitations. This innovative approach to data management brings the transactional capabilities of data warehouses to cloud-based data lakes, offering scalability at lower costs.” In simple terms, it's a perfect concoction: the almost infinitely scalable cloud storage of data lakes, with the data management features of a typical data warehouse layered on top, essentially giving us ACID guarantees, partitioning, clustering, time-travel, concurrency controls, etc.

And what about table formats? A key component of a lakehouse architecture is the table format, which acts as a metadata layer above file formats like Apache Parquet or Apache ORC. Essentially, this layer allows different engines to concurrently read and write the same dataset while supporting ACID transactions. Notable table formats include Apache Hudi, Apache Iceberg, and Delta Lake; since they are open source (Delta Lake also has a proprietary flavour), they're known as open table formats.

Layers within the typical data lakehouse¹

Comparison of OLTFs

As the world of open lakehouse table formats evolves, three major players stand out: Apache Hudi, Delta Lake, and Apache Iceberg. Each of these table formats aims to bring ACID transactions, schema evolution, and time-travel capabilities to data lakes, but they differ in approach and use cases.

Feature comparison between Hudi, Iceberg and Delta Lake

Why Hudi?

If you’re fascinated by fast upserts, incremental data processing, and real-time analytics, then Apache Hudi is the way to go! Its strong record-level indexing and CDC support make it a great choice for dynamic datasets that require frequent updates. Of course, Delta Lake and Iceberg are strong in their own domains, but if your goal is to efficiently manage constantly evolving datasets without rewriting entire partitions, Hudi stands out!

Introduction to Apache Hudi

Apache Hudi¹ (pronounced “Hoodie”), short for “Hadoop Upserts Deletes and Incrementals”, is an open-source transactional data lakehouse platform built around a database kernel. It provides table-level abstractions over open file formats like Apache Parquet and ORC, thereby delivering core warehouse and database functionality directly on the data lake, with full transactional support.

Hudi format layout² — a pictorial representation of read/write workflow

Components of Hudi tables

  • Base files
    — actual data files, usually Parquet or ORC (see the storage listing sketch after this list)
  • Log files
    — these are the delta files, usually stored in Avro format
    — they capture the incremental changes, aka upserts
    — applicable only for Merge On Read (MOR) table types
  • Metadata files
    — a catalog that keeps track of data and how it’s organized
    — includes information like file paths, record keys, and partition details
  • Timeline
    — this is a time-based log of all actions performed on the table
    — helps Hudi maintain ACID transactions
    — enables features like time travel
  • Indexes
    — mapping of record keys to their physical locations (base or log files)
    — helps Hudi quickly locate records during upserts and deletes
    — supports multi-modal indexing (including non-PK columns), expression-based indexes, and record-level indexing
  • Partitions
    — logical divisions of data, often based on a date/categorical column
    — help organize data for faster queries and better management.
  • Table services
    — these are services that run asynchronously (configurable)
    — Compaction: merges upsert log files into base files, optimizing storage and query performance
    — Cleaning: removes old or unused files to free up space
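If you're curious how these components actually look on storage, here's a rough way to list a Hudi table's path on MinIO with boto3. The bucket and table prefix are hypothetical; you should see the .hoodie/ timeline and metadata files, partition folders with Parquet base files, and (for MOR tables) .log delta files.

```python
# Peek at what Hudi lays down on MinIO; bucket name and table prefix are made up.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",       # local MinIO endpoint
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

resp = s3.list_objects_v2(Bucket="warehouse", Prefix="hudi/trips/")
for obj in resp.get("Contents", []):
    print(obj["Key"])   # e.g. .hoodie/ timeline files, partition folders, base/log files
```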

There are two table types supported by Hudi (a minimal write sketch follows this list), viz.,

  • Copy On Write (COW)
    — Updates and deletes rewrite the entire data file.
    — Best for read-heavy workloads where query performance is critical.
    — Hence, table services like compaction aren't necessary!
  • Merge On Read (MOR)
    — Updates and deletes are stored in separate log files.
    — Best for write-heavy workloads where fast writes are more important than immediate query performance.
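Here's a minimal write sketch, assuming the Spark session from earlier and a made-up trips dataset, showing where the table type is chosen. The table name, fields, and path are hypothetical; the option keys are the standard Hudi datasource write options.

```python
# Upsert a tiny DataFrame into a Hudi table on MinIO; switch the table type
# between MERGE_ON_READ and COPY_ON_WRITE to compare the two layouts.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",     # or "COPY_ON_WRITE"
    "hoodie.datasource.write.recordkey.field": "trip_id",      # record key used by the index
    "hoodie.datasource.write.partitionpath.field": "trip_date",
    "hoodie.datasource.write.precombine.field": "updated_at",  # picks the latest version on upsert
    "hoodie.datasource.write.operation": "upsert",
}

df = spark.createDataFrame(
    [("t1", "2024-05-01", "2024-05-01 10:00:00", 12.5)],
    ["trip_id", "trip_date", "updated_at", "fare"],
)

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3a://warehouse/hudi/trips"))
```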

Contents of Hoodie Metadata

For a Hudi table with insert, update, and delete transactions, its timeline would look like this:

Internal representation of a Hudi metadata²

You can see that every transaction has its own timeline-based log entries, which record whether that transaction committed successfully (and capture concurrency conflicts when there are multiple concurrent writers). For example, below is a screenshot from my Hudi table, taken after I first created the table with some records and then applied an update.

Inside the hoodie metadata folder

This Hudi table is of the MOR type, so we see three files per transaction, which together determine whether the transaction went through without any conflicts. These are essential when multiple writers are writing to a table (say, from multiple engines), to ensure each transaction is processed without conflicts. If a conflict is detected, Hudi automatically rolls back the transactions that couldn't be committed, preserving the ACID nature of a classic OLTP database model. (A quick way to relate these instants back to the data from Spark follows the list below.)

  • <transaction_id>.[replace/delta]commit.requested — A request to replace old / insert new data files has been initiated.
  • <transaction_id>.[replace/delta]commit.inflight — The replacement is in progress and the new data files are being written. Think of it like a lock in a data warehouse, but with added features.
  • <transaction_id>.[replace/delta]commit — The transaction is complete and marked successful in the metadata/catalog, and the new data files are now active and visible to queries.
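A quick way to relate these instants back to the records themselves is through Hudi's metadata columns, which every table carries. A minimal sketch, assuming the hypothetical table path from the earlier write example:

```python
# Each Hudi record carries metadata columns; ordering by _hoodie_commit_time shows
# which timeline instant (initial insert vs. later update) produced each record version.
snapshot = spark.read.format("hudi").load("s3a://warehouse/hudi/trips")

(snapshot
 .select("_hoodie_commit_time", "_hoodie_record_key", "_hoodie_partition_path", "fare")
 .orderBy("_hoodie_commit_time")
 .show(truncate=False))
```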

And the base files (Parquet) are stored in the partition folder (if applicable) along with their metadata:

A base file with its partition metadata

Hudi automatically creates a metadata file named “.hoodie_partition_metadata” inside each partition; it records information about the partition, such as the commit time that created it and the depth of the partition:

.hoodie_partition_metadata file contents
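If you want to pull that file out of MinIO yourself, a small boto3 sketch does it; the bucket, table prefix, and partition value here are hypothetical:

```python
# Fetch and print the partition metadata file for one (made-up) partition.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

obj = s3.get_object(
    Bucket="warehouse",
    Key="hudi/trips/2024-05-01/.hoodie_partition_metadata",
)
print(obj["Body"].read().decode("utf-8"))
```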

Query types that are supported by Hudi

  • Snapshot query
    — Shows the latest snapshot of the table as of the latest commit.
    — Similar to SQL queries running on a table in a database.
    — Hudi storage engine accelerates these snapshot queries with indexes whenever possible, on supported query engines.
  • Incremental query (Latest state)
    — Returns only the new data written to the table since a given instant on the timeline.
    — Provides the latest value of records inserted/updated since a given point (i.e., one output record per record key).
    — Can be used to “diff” table states between two points in time.
    — Can save a lot of cost in incremental refresh use cases (read sketches for these query types follow this list)
  • Incremental query (CDC)
    — Provides database-like change data capture streams out of Hudi tables (slightly different from the previous type).
    — The output of a CDC query contains records inserted, updated, or deleted since a point in time or between two points in time.
  • Read optimized query (only MOR)
    — Provides excellent snapshot query performance by reading only the columnar base files.
    — Typically, a compaction strategy aligns with a transaction boundary, to provide older but consistent views of tables/partitions.
    — Useful for integrating Hudi tables with data warehouses that query only the columnar base files as external tables.
  • Time-travel query
    — Shows a snapshot of a table as of a given instant in the past.
    — Helps access multiple versions of a table, at instants in the active timeline or at save-points in the past.
    — Can be driven by either a timestamp or a commit version.
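To make these concrete, here are read sketches for most of the query types above, using the standard Hudi Spark read options. The table path and instant timestamps are placeholders; I've left the CDC flavour out since it also needs CDC enabled on the write side.

```python
# Sketches of Hudi query types via the Spark datasource (path and instants are placeholders).
path = "s3a://warehouse/hudi/trips"

# 1. Snapshot query (the default): latest state of the table as of the latest commit
snapshot_df = spark.read.format("hudi").load(path)

# 2. Incremental query: only records written after a given instant on the timeline
incremental_df = (spark.read.format("hudi")
                  .option("hoodie.datasource.query.type", "incremental")
                  .option("hoodie.datasource.read.begin.instanttime", "20240501000000")
                  .load(path))

# 3. Read-optimized query (MOR only): serves data from the compacted base files
read_optimized_df = (spark.read.format("hudi")
                     .option("hoodie.datasource.query.type", "read_optimized")
                     .load(path))

# 4. Time-travel query: the table as it looked at a past instant
time_travel_df = (spark.read.format("hudi")
                  .option("as.of.instant", "20240501103000")
                  .load(path))
```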

And that’s a wrap for Part 1 of our deep dive into Apache Hudi! But we’re just getting warmed up. Next time, we’ll roll up our sleeves and get hands-on — setting up a Hudi table, unraveling what really happens when you upsert an existing key, and decoding how Hudi tracks metadata, specifically for MOR tables. Oh, and we’ll also see how this whole process shakes things up for COW tables. Stay tuned — it’s about to get interesting!
