Portfolio

Projects

Production-grade data engineering systems built to handle enterprise-scale workloads. Each project solves real business problems with measurable impact.

Sub-5s end-to-end latency (down from 8 min), 10M+ events/day throughput, zero data loss with exactly-once delivery, 99.9% uptime

10M Events/Day Kafka → Spark Streaming Pipeline

Production Kafka → Spark Structured Streaming pipeline processing 10M+ events/day with exactly-once delivery to Delta Lake. Watermark-based late-event handling, idempotent MERGE upserts, and dead-letter queue with automatic replay. Reduced end-to-end latency from 8 minutes to under 5 seconds.
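The three mechanisms named above (watermarking, idempotent upserts, and a replayable dead-letter queue) can be illustrated with a small framework-free sketch. This is not the pipeline's code: in Spark the equivalents would be a watermark on the event-time column and a Delta Lake MERGE, and the watermark would advance per micro-batch rather than per event. The `Event` and `Sink` names, the 600-second lateness, and the choice to route late events to the dead-letter queue are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    event_time: int  # seconds since epoch
    payload: dict


@dataclass
class Sink:
    """Hypothetical in-memory stand-in for the streaming sink."""
    allowed_lateness: int = 600          # assumed lateness budget, in seconds
    max_event_time: int = 0              # highest event time seen so far
    table: dict = field(default_factory=dict)
    dead_letters: list = field(default_factory=list)

    def _upsert(self, event):
        # Idempotent merge: keyed by event_id, newest event_time wins,
        # so a redelivered event never creates a second row.
        current = self.table.get(event.event_id)
        if current is None or event.event_time >= current.event_time:
            self.table[event.event_id] = event

    def process(self, event):
        # Events older than the watermark are parked instead of dropped.
        watermark = self.max_event_time - self.allowed_lateness
        if event.event_time < watermark:
            self.dead_letters.append(event)
            return False
        self.max_event_time = max(self.max_event_time, event.event_time)
        self._upsert(event)
        return True

    def replay_dead_letters(self):
        # Replay bypasses the watermark check; the upsert keeps it idempotent.
        pending, self.dead_letters = self.dead_letters, []
        for event in pending:
            self._upsert(event)
        return len(pending)
```

Because every write goes through the keyed upsert, processing the same event twice, or replaying the dead-letter queue twice, leaves the table unchanged after the first application.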

Apache Kafka, Spark Structured Streaming, Delta Lake, PySpark, AWS EMR
60% pipeline runtime reduction, 50+ sources unified, 0 schema conflicts, full data lineage with Unity Catalog

Enterprise Lakehouse — Databricks Medallion Architecture

Unified 50+ isolated AWS Glue jobs into a Databricks Delta Lake medallion architecture (Bronze/Silver/Gold). Unity Catalog for governance, dbt for schema contracts, Photon-powered Gold layer. Achieved 60% pipeline runtime reduction and eliminated schema conflicts across 8 engineering teams.

Databricks, Delta Lake, PySpark, Unity Catalog, dbt
70% query performance improvement (p95: 42s → 11s), 40% cost reduction, 100+ TB migrated with zero downtime

100TB Warehouse Migration — Redshift & Oracle → Snowflake + BigQuery

Led migration of 100+ TB from on-premises Oracle and legacy AWS Redshift to Snowflake and BigQuery using a dual-write validation strategy. Re-modeled the physical layer with micro-partition clustering and incremental ELT using dbt. Achieved 70% query performance improvement (p95: 42s → 11s) and 40% cost reduction with zero-downtime cutover.
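One way to picture the validation half of a dual-write strategy is a keyed row-hash comparison between the legacy and target tables. The sketch below is illustrative only and assumes rows arrive as dictionaries with a shared primary-key column; the function names and the `id` key are hypothetical, and a real comparison would first normalize types that differ between warehouses (for example decimals and timestamps).

```python
import hashlib


def row_hashes(rows, key):
    """Map each primary key to a hash of the row's column/value pairs."""
    hashes = {}
    for row in rows:
        # Sort columns so the hash does not depend on column order.
        canonical = repr(sorted(row.items())).encode("utf-8")
        hashes[row[key]] = hashlib.sha256(canonical).hexdigest()
    return hashes


def compare_tables(legacy_rows, target_rows, key="id"):
    """Report keys that are missing, unexpected, or different in the target."""
    legacy = row_hashes(legacy_rows, key)
    target = row_hashes(target_rows, key)
    shared = legacy.keys() & target.keys()
    return {
        "missing_in_target": sorted(legacy.keys() - target.keys()),
        "unexpected_in_target": sorted(target.keys() - legacy.keys()),
        "mismatched": sorted(k for k in shared if legacy[k] != target[k]),
    }
```

An empty report across all three lists for a given table is the kind of signal that could gate cutover of that table.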

Snowflake, BigQuery, dbt, Apache Airflow, Airbyte
1,000+ features centralized, 4 ML teams served, p99 < 8ms online latency, 0 training-serving skew, feature dev time: days → hours

ML Feature Store — 1,000+ Features, p99 < 8ms Online Serving

Centralized dual-mode feature platform on Databricks: Delta Lake offline store (point-in-time correct for training) and Redis online store (p99 < 8ms for inference). Eliminated training-serving skew across 4 ML teams, reduced feature engineering time from days to hours.
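"Point-in-time correct" means a training row only sees the feature value that was known at the label's timestamp, never a later one. A minimal sketch of that lookup, assuming each entity's feature history is a list of `(timestamp, value)` pairs sorted by timestamp; the function names and data shapes are hypothetical and stand in for what the offline store does at scale.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists.

    history: list of (timestamp, value) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


def build_training_rows(labels, feature_history):
    """Join each (entity_id, label_ts, label) to the feature value as of label_ts."""
    rows = []
    for entity_id, label_ts, label in labels:
        history = feature_history.get(entity_id, [])
        rows.append((entity_id, label_ts, point_in_time_value(history, label_ts), label))
    return rows
```

Serving the same "latest value as of now" semantics from the online store is what keeps training and inference consistent.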

Databricks Feature Store, MLflow, PySpark, Redis, Delta Lake
Decisioning latency: 48h → < 2min, 100K+ applications/day, 95%+ model accuracy maintained, real-time fraud detection

Real-Time Credit Decisioning — 48h Batch → < 2min Streaming

Replaced overnight batch credit scoring with a Kafka-driven real-time pipeline. PySpark micro-batch feature engineering computes 200+ credit risk signals and feeds a REST model-serving layer. Reduced decisioning latency from 48 hours to under 2 minutes while maintaining 95%+ model accuracy at 100K+ applications/day.

PySpark, Apache Kafka, PostgreSQL, AWS, MLflow
40% platform spend reduction in 90 days, 90% reduction in idle resource costs, full cost attribution per team

Cost Engineering Framework — 40% Platform Spend Reduction

Automated framework for Spark cluster rightsizing, S3 → Glacier storage tiering, and cross-workspace cost anomaly detection using an Isolation Forest model. Built centralized cost analytics aggregating AWS Cost Explorer, Databricks, and Snowflake usage. Achieved 40% platform spend reduction in 90 days.
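To show the shape of a daily cost anomaly check without external dependencies, the sketch below swaps in a simpler technique: a robust z-score based on the median absolute deviation (MAD). It is not Isolation Forest and not the framework's actual model; the function name, the input shape (day label to cost), and the 3.5 threshold are assumptions.

```python
from statistics import median


def flag_cost_anomalies(daily_costs, threshold=3.5):
    """Return the days whose cost deviates strongly from the median.

    daily_costs: dict mapping a day label to that day's spend.
    Uses the modified z-score 0.6745 * |x - median| / MAD.
    """
    values = list(daily_costs.values())
    if not values:
        return []
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values (or most of them) are identical; nothing to rank against.
        return []
    return sorted(
        day
        for day, cost in daily_costs.items()
        if 0.6745 * abs(cost - med) / mad > threshold
    )
```

Unlike this single-metric check, an Isolation Forest can score several usage dimensions together, which is presumably why it suits cross-workspace data.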

Python, Terraform, Databricks, AWS Cost Explorer, GCP