Senior Data Engineer building reliable batch & streaming data platforms

I build large-scale data platforms, real-time streaming systems, and AI-ready pipelines using Databricks, PySpark, Snowflake, BigQuery, Delta Lake, AWS, GCP, Airflow, and Kafka.

Databricks · PySpark · Snowflake · BigQuery · AWS · GCP · Delta Lake · Airflow · Kafka · Python · SQL · Docker · Kubernetes

⚡ Designed data platforms processing 1B+ records

⚡ Reduced ETL cost by 40% with optimized Delta pipelines

⚡ Built real-time streaming pipelines with Kafka & Spark

About Me

I design, build, and operate production-grade data platforms that teams trust.

I am a Senior Data Engineer with 6+ years of experience designing, building, and optimizing large-scale data platforms across AWS, GCP, Databricks, and modern lakehouse architectures.

I specialize in building real-time and batch data systems using PySpark, Delta Lake, Snowflake, BigQuery, Airflow, and Kafka.

I am passionate about data architecture, fintech systems, streaming pipelines, and MLOps.

What I Do

  • Build large-scale data platforms
  • Develop real-time pipelines (Kafka + Spark)
  • Design lakehouse architectures

What I Am Focusing On

  • Fintech AI and credit risk pipelines
  • LLM and vector database engineering
  • MLOps architecture

Skills & Technologies

Data Engineering

  • Python
  • SQL
  • PostgreSQL
  • MySQL
  • MongoDB
  • Apache Spark (PySpark)
  • Apache Kafka
  • Apache Airflow
  • Hadoop
  • HDFS
  • Hive
  • ETL / ELT
  • Data Warehousing
  • Data Modeling
  • Data Pipelines
  • Batch & Streaming Pipelines
  • Delta Lake
  • Databricks
  • Snowflake
  • dbt
  • Distributed Systems

AWS Cloud

  • AWS S3
  • AWS EC2
  • AWS Lambda
  • AWS Glue
  • AWS Redshift
  • AWS EMR
  • AWS IAM
  • AWS VPC
  • AWS CloudWatch
  • AWS SNS / SQS
  • AWS RDS

Google Cloud (GCP)

  • GCP BigQuery
  • GCP Dataproc
  • GCP Dataflow
  • GCP Cloud Storage
  • GCP Pub/Sub
  • GCP Composer (Managed Airflow)
  • GCP IAM
  • GCP VPC Networking
  • GCP Monitoring

Experience

Senior Data Engineer

2020 – Present

  • Designed and operated large-scale batch and streaming pipelines processing 1B+ records using PySpark, Databricks, and Delta Lake.
  • Built real-time streaming systems using Kafka and Spark Structured Streaming.
  • Reduced cloud cost by 40% via lakehouse migration.
  • Implemented CI/CD with Docker, GitHub Actions, and Terraform.
  • Modernized data warehouses on Snowflake and BigQuery.

Data Engineer

2018 – 2020

  • Built ETL workflows using Airflow, Python, and SQL.
  • Optimized BigQuery with partitioning and clustering (60% faster queries).
  • Developed cloud-native ingestion pipelines.
  • Collaborated on data modeling and warehouse design.

Featured Projects

Production-grade data engineering projects showcasing end-to-end expertise

Real-time Streaming Pipeline

End-to-end real-time data pipeline processing millions of events per second with Kafka, Spark Structured Streaming, and Delta Lake for real-time analytics and ML feature engineering.

Key Highlights

  • Processes 10M+ events/second with sub-second latency
  • Automated schema evolution and data quality checks
  • Cost-optimized architecture with auto-scaling
Apache Kafka · Spark Streaming · Delta Lake · AWS S3 · Python · Docker

Enterprise Lakehouse on Databricks

Modern data lakehouse architecture built on Databricks with Delta Lake, enabling ACID transactions, time travel, and unified batch/streaming workloads.

Key Highlights

  • Unified 50+ data sources into a single lakehouse
  • Reduced pipeline runtime by 60%
  • Implemented governance with Unity Catalog
Databricks · Delta Lake · PySpark · Unity Catalog · AWS · Terraform

Snowflake / BigQuery Modernization

Migrated legacy warehouse workloads to Snowflake and BigQuery with optimized data models and automated ELT pipelines.

Key Highlights

  • Migrated 100+ TB of historical data
  • Improved query performance by 70%
  • Automated transformations with dbt
Snowflake · BigQuery · dbt · Airflow · Python · GitHub Actions

MLOps Feature Store Architecture

Production-grade feature store supporting real-time and batch feature computation, lineage, and versioning.

Key Highlights

  • Served 1,000+ features with millisecond latency
  • Automated feature pipelines
  • Full feature lineage and versioning
Databricks Feature Store · MLflow · PySpark · Delta Lake · FastAPI

Fintech Credit Risk Pipeline

Real-time credit risk scoring system with ML-driven decisioning and streaming feature engineering.

Key Highlights

  • Reduced approval time from days to minutes
  • Processed 100K+ applications daily
  • Achieved 95%+ model accuracy
PySpark · Kafka · PostgreSQL · FastAPI · AWS Lambda · Docker

Cloud Cost Optimization Framework

Automated framework for monitoring, anomaly detection, and dynamic resource scaling.

Key Highlights

  • Reduced cloud costs by 40%
  • Automated rightsizing
  • Real-time anomaly detection
Python · AWS Cost Explorer · Terraform · Lambda · CloudWatch

Get In Touch

Let's talk about data engineering, cloud platforms, real-time systems, and AI-ready architectures.

About

Senior Data Engineer specializing in scalable batch & streaming platforms, cloud-native data systems, and AI-ready architectures.

Expertise

  • Databricks / PySpark
  • Kafka / Airflow / Delta Lake
  • Snowflake / BigQuery / PostgreSQL
  • AWS & GCP Data Platforms

Connect

© 2026 Vasudev Rao · Built with precision, scaled for impact