How to Build a Lakehouse Using Delta Lake & Databricks
A comprehensive guide to building a modern data lakehouse architecture using Delta Lake and Databricks, combining the best of data lakes and data warehouses.
The lakehouse is not a new buzzword — it is a concrete architectural pattern that solves the fundamental tension between data lakes (cheap, flexible, unreliable) and data warehouses (expensive, rigid, reliable). Delta Lake on Databricks is the most mature implementation of this pattern available today.
This guide covers building a production lakehouse from scratch: storage design, medallion architecture, governance, and operational concerns.
Why Lakehouse, Not Warehouse
Before committing to the architecture, it's worth being precise about the trade-offs:
- Data warehouse (Snowflake, Redshift, BigQuery): ACID transactions, fast query performance, strong governance, expensive storage, proprietary format, limited ML integration.
- Data lake (raw S3/GCS + Spark): cheap storage, schema flexibility, supports any compute engine, no ACID guarantees, poor query performance on analytical workloads, data swamps at scale.
- Lakehouse: open storage format (Parquet + Delta transaction log), ACID transactions, fast query via data skipping and caching, native ML integration, separation of compute and storage.
The critical enabler is Delta Lake's transaction log — a JSON-based commit log that sits alongside Parquet files and provides atomicity, isolation, and consistency without a proprietary catalog.
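The log's mechanics are easy to inspect: each commit file under `_delta_log/` is newline-delimited JSON, one action per line, and readers reconstruct table state by replaying the actions. A minimal pure-Python sketch (the sample commit entry is illustrative, not taken from a real table):

```python
import json

# An illustrative Delta commit file (e.g. _delta_log/00000000000000000001.json):
# newline-delimited JSON, one "action" object per line.
commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE", "timestamp": 1700000000000}}),
    json.dumps({"add": {"path": "part-00000.parquet", "size": 1048576,
                        "dataChange": True}}),
])

def live_files(commit_text: str) -> set:
    """Replay add/remove actions to compute the table's current file set."""
    files = set()
    for line in commit_text.splitlines():
        action = json.loads(line)
        if "add" in action:
            files.add(action["add"]["path"])
        elif "remove" in action:
            files.discard(action["remove"]["path"])
    return files

print(sorted(live_files(commit)))
```

A real reader also applies checkpoints and protocol/metadata actions, but the replay idea is the same: the Parquet files stay open-format, and the log alone defines which of them constitute the table.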
Storage Design
Every lakehouse needs a deliberate storage topology. The pattern that scales:
```
s3://your-bucket/
├── bronze/     # Raw, immutable, append-only
│   ├── events/
│   ├── orders/
│   └── users/
├── silver/     # Cleaned, deduplicated, typed
│   ├── events/
│   ├── orders/
│   └── users/
└── gold/       # Aggregated, domain-modeled, query-optimized
    ├── finance/
    ├── product/
    └── marketing/
```
Key decisions:
- Partition strategy: partition by ingestion date at bronze, by business date at silver and gold. Avoid high-cardinality partitions (user_id) — they create millions of small files.
- File sizing: target 128MB–1GB Parquet files. Run OPTIMIZE regularly to compact small files.
- Retention: set delta.logRetentionDuration and delta.deletedFileRetentionDuration based on your time-travel and audit requirements.
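To see why file sizing matters, a quick back-of-the-envelope helper (pure Python; the 512MB target is an assumed midpoint of the 128MB–1GB range above):

```python
import math

def compaction_plan(total_bytes: int, current_files: int,
                    target_file_bytes: int = 512 * 1024**2) -> dict:
    """Estimate how many files compaction should produce for a given target size."""
    target_files = max(1, math.ceil(total_bytes / target_file_bytes))
    return {
        "current_files": current_files,
        "target_files": target_files,
        "reduction_factor": current_files / target_files,
    }

# 100 GB spread across 50,000 small files (~2 MB each) compacts to ~200 files,
# a 250x reduction in per-query file-listing and open overhead.
plan = compaction_plan(100 * 1024**3, 50_000)
print(plan)
```

The reduction factor is the point: every file carries fixed metadata and open costs, so a query over 50,000 tiny files pays that cost 250 times more often than one over properly sized files.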
Medallion Architecture Implementation
Bronze Layer — Raw Ingestion
Bronze is append-only. Never transform at this layer. The goal is a durable, replayable record of what arrived and when.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, current_timestamp, lit
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

def ingest_bronze(source_path: str, target_path: str, source_name: str):
    df = spark.read.json(source_path)

    # Add audit columns — never modify source columns at bronze
    df = df \
        .withColumn("_ingested_at", current_timestamp()) \
        .withColumn("_ingested_date", current_date()) \
        .withColumn("_source", lit(source_name)) \
        .withColumn("_source_path", lit(source_path))

    df.write \
        .format("delta") \
        .mode("append") \
        .partitionBy("_ingested_date") \
        .save(target_path)
```
Silver Layer — Cleansing and Deduplication
Silver is the canonical, trusted version of each entity. Apply schema enforcement, type casting, deduplication, and null handling here.
```python
from pyspark.sql.functions import col, desc, row_number, to_date, to_timestamp
from pyspark.sql.window import Window

def process_silver_orders(bronze_path: str, silver_path: str):
    bronze_df = spark.read.format("delta").load(bronze_path)

    # Type casting and null handling
    typed_df = bronze_df \
        .withColumn("order_total", col("order_total").cast("decimal(18,2)")) \
        .withColumn("created_at", to_timestamp(col("created_at"))) \
        .withColumn("created_date", to_date(col("created_at"))) \
        .filter(col("order_id").isNotNull()) \
        .filter(col("order_total") >= 0)

    # Deduplication — keep latest record per order_id
    window = Window.partitionBy("order_id").orderBy(desc("_ingested_at"))
    deduped_df = typed_df \
        .withColumn("rn", row_number().over(window)) \
        .filter(col("rn") == 1) \
        .drop("rn")

    # Merge into silver (upsert)
    if DeltaTable.isDeltaTable(spark, silver_path):
        silver_table = DeltaTable.forPath(spark, silver_path)
        silver_table.alias("target").merge(
            deduped_df.alias("source"),
            "target.order_id = source.order_id"
        ) \
            .whenMatchedUpdateAll() \
            .whenNotMatchedInsertAll() \
            .execute()
    else:
        deduped_df.write.format("delta").partitionBy("created_date").save(silver_path)
```
Gold Layer — Aggregation and Domain Modeling
Gold serves analysts and dashboards. Optimize for read performance: pre-aggregate, pre-join, and cluster on query-critical columns.
```python
from pyspark.sql.functions import avg, col, count, countDistinct, sum as sum_

def build_gold_daily_revenue(silver_orders_path: str, gold_path: str,
                             processing_date: str):
    orders = spark.read.format("delta").load(silver_orders_path)

    daily_revenue = orders \
        .filter(col("status") == "completed") \
        .filter(col("created_date") == processing_date) \
        .groupBy("created_date", "region", "product_category") \
        .agg(
            sum_("order_total").alias("total_revenue"),
            count("order_id").alias("order_count"),
            avg("order_total").alias("avg_order_value"),
            countDistinct("customer_id").alias("unique_customers")
        )

    # replaceWhere requires every written row to satisfy the predicate,
    # hence the created_date filter above
    daily_revenue.write \
        .format("delta") \
        .mode("overwrite") \
        .option("replaceWhere", f"created_date = '{processing_date}'") \
        .save(gold_path)
```
Performance Optimization
Z-Ordering
Z-ordering collocates related data in the same files, reducing the number of files scanned for selective queries.
```sql
OPTIMIZE silver.orders
ZORDER BY (customer_id, created_date);
```
Use Z-ordering on columns that appear frequently in WHERE clauses but have too high cardinality to partition on. Effective for: customer_id, product_id, region.
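The intuition is easy to demonstrate without Spark: interleaving the bits of two key columns yields a single sort key (a Morton/Z-order curve) that keeps rows close in both dimensions at once. A simplified pure-Python illustration, not Delta's actual implementation:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one Morton (Z-order) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z

# Sorting by the interleaved key clusters neighbours in BOTH dimensions,
# so a selective filter on either column touches fewer files.
rows = [(3, 7), (3, 8), (200, 7), (4, 7)]
rows.sort(key=lambda r: z_value(*r))
print(rows)
```

Note how (3, 7) and (4, 7) end up adjacent even though a plain sort on the first column would separate them from (3, 8): that is the collocation property Z-ordering exploits when assigning rows to files.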
Data Skipping
Delta Lake maintains column-level min/max statistics in the transaction log. Queries with range predicates on partitioned or Z-ordered columns skip files automatically. Monitor skipping effectiveness:
```sql
DESCRIBE DETAIL silver.orders;
-- Check: numFiles, sizeInBytes, partitionColumns
```
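The pruning decision itself is simple interval logic: a file can be skipped whenever its [min, max] range for the filtered column does not overlap the predicate. A sketch with made-up per-file statistics (hypothetical file names and dates, mirroring the stats Delta keeps in the log):

```python
# Hypothetical per-file min/max stats for an order_date column.
file_stats = {
    "part-0001.parquet": ("2024-01-01", "2024-01-31"),
    "part-0002.parquet": ("2024-02-01", "2024-02-29"),
    "part-0003.parquet": ("2024-03-01", "2024-03-31"),
}

def files_to_scan(lo: str, hi: str) -> list:
    """Keep only files whose [min, max] range overlaps the predicate [lo, hi]."""
    return [path for path, (fmin, fmax) in file_stats.items()
            if fmax >= lo and fmin <= hi]

# WHERE order_date BETWEEN '2024-02-10' AND '2024-02-20' scans 1 of 3 files.
print(files_to_scan("2024-02-10", "2024-02-20"))
```

This is why partitioning and Z-ordering compound with data skipping: both narrow the min/max ranges within each file, making the overlap test eliminate more of them.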
Caching
For gold-layer tables accessed repeatedly by dashboards, enable Delta Caching on Databricks compute:
```python
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
```
Governance with Unity Catalog
Unity Catalog provides centralized access control, lineage, and auditing across all Databricks workspaces.
Key setup:
```sql
-- Create a catalog per environment
CREATE CATALOG IF NOT EXISTS prod;
CREATE CATALOG IF NOT EXISTS dev;

-- Create schemas per domain
CREATE SCHEMA IF NOT EXISTS prod.finance;
CREATE SCHEMA IF NOT EXISTS prod.product;

-- Grant access by role (hyphenated group names must be backquoted)
GRANT SELECT ON SCHEMA prod.finance TO `finance-analysts`;
GRANT ALL PRIVILEGES ON SCHEMA prod.finance TO `data-engineers`;

-- Column-level masking for PII
CREATE OR REPLACE FUNCTION prod.mask_email(email STRING)
RETURNS STRING
RETURN CASE WHEN is_account_group_member('pii-access')
            THEN email
            ELSE CONCAT(LEFT(email, 2), '@.*')
       END;
```
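The masking rule is worth unit-testing outside SQL before rolling it out. A pure-Python equivalent of the function above (the two-character prefix and `@.*` placeholder simply mirror the illustrative SQL mask, not any PII standard):

```python
def mask_email(email: str, has_pii_access: bool) -> str:
    """Return the raw email for privileged readers, a masked form otherwise."""
    if has_pii_access:
        return email
    # Mirrors CONCAT(LEFT(email, 2), '@.*') in the SQL function above.
    return email[:2] + "@.*"

print(mask_email("alice@example.com", False))
print(mask_email("alice@example.com", True))
```

Keeping the masking logic this small is deliberate: the function runs on every row the unprivileged reader scans, so complex masks show up directly in query latency.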
Operational Runbook
Weekly maintenance:

```sql
-- Compact small files
OPTIMIZE bronze.events;
OPTIMIZE silver.orders ZORDER BY (customer_id);

-- Remove files no longer referenced by Delta
VACUUM silver.orders RETAIN 168 HOURS; -- 7 days

-- Analyze table statistics for query optimization
ANALYZE TABLE gold.daily_revenue COMPUTE STATISTICS;
```
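The VACUUM retention arithmetic is worth internalizing: only files whose removal from the log predates the retention window are eligible for deletion, because anything newer may still be needed for time travel or in-flight readers. A small sketch (pure Python, illustrative file names and timestamps):

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=168)  # matches RETAIN 168 HOURS above

def vacuum_candidates(removed_files: dict, now: datetime) -> list:
    """Files whose removal predates now - retention are safe to delete."""
    cutoff = now - RETENTION
    return sorted(path for path, removed_at in removed_files.items()
                  if removed_at < cutoff)

now = datetime(2024, 6, 15)
removed = {
    "old-part.parquet": now - timedelta(days=10),    # beyond the 7-day window
    "recent-part.parquet": now - timedelta(days=2),  # still needed for time travel
}
print(vacuum_candidates(removed, now))
```

The practical consequence: a RETAIN window shorter than your longest time-travel query or slowest consumer will break them, which is why the delta.deletedFileRetentionDuration setting from the storage-design section should align with this schedule.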
Monitoring:
- _delta_log size growth: a steadily growing log directory indicates frequent small commits.

The lakehouse pattern delivers when it is operated, not just deployed. The architecture is sound; the discipline is in the maintenance.