NADEBlg!
Featured
Setting Up Data Lineage in Spark on AWS EMR (From Scratch)
Quick apology up front: this post is late. I meant to ship it last week, but a production incident reminded me (again) why lineage matters. So here it is: a practical guide to implementing data lineage from scratch in a Spark on AWS EMR environment. This is Part 3 of...
Common Data Lineage Mistakes (and How to Fix Them)
I wrote the first post in this series after a messy incident. This second post is written after the third time I saw the same lineage mistakes repeat at a different company. The pattern is always the same: lineage is treated as a dashboard, not as part of the pipeline...
What Is End-to-End Data Lineage? (And Why It’s Not the Same as an Ontology)
I first started caring about end-to-end data lineage on a Friday night incident. A dashboard metric moved, everyone panicked, and the only honest question in the room was: where did that number actually come from? This post is Part 1 of 4 in a short lineage series. We start with...
The Databricks Serverless Hybrid Model: Running Both Classic and Serverless Data Planes (and When You Actually Need To)
Databricks has evolved into a platform with two compute “planes” that can execute workloads: the Classic compute plane (customer-managed networking and compute running in your own cloud account/VPC, e.g. an AWS account within the customer’s AWS organization, or the equivalent on another CSP), and the Serverless compute plane (Databricks-managed compute that you consume on-demand),...
Configuring Databricks Apps: app.yaml, Resources, and Fast Troubleshooting When Things Go Sideways
Databricks Apps is at its best when it’s boring: you deploy an app, wire it to the right Databricks resources, assign permissions via groups, and it just works. But when it doesn’t work (the app is slow, unresponsive, or “sort of works for me but not for anyone else”),...
Databricks Apps Review: What It Is, What It Isn’t, and How to Roll It Out Safely
Databricks Apps is one of those platform features that looks deceptively simple: “deploy an app next to the data.” In practice, it changes how teams deliver internal tools (dashboards, forms, RAG UIs, lightweight workflows) because you can ship an application inside the Databricks security and governance boundary without standing up...
Walkthrough: Deploying My Personal EMR Clusters with Terraform + Ansible (Reference Repo)
In my previous post I covered Terraform + Ansible best practices for a secure, production-ready EMR environment. In this post, I’m going to do the practical follow-up: a walkthrough of the reference repo I use to deploy my personal EMR clusters. The example repo for this post is: GitHub repo:...
Terraform + Ansible Best Practices for a Secure, Production-Ready AWS EMR Environment
I get asked all the time: “How do I securely automate my EMR deployment so I only need to write the code once to create a repeatable, production-ready environment?” If you want EMR to feel boring in production (the best outcome), you need Infrastructure as Code (IaC) for the...
How One Skewed Join Key Turned a 12-Minute Spark Job into a 2-Hour Job (and How to Fix It Without Resizing the Cluster)
In my last post, I showed how to diagnose skew in Spark UI quickly. In this post, I want to show the real-world version: one skewed join key took a job that normally ran in ~12 minutes and pushed it past 2 hours. The punchline: you don’t need to resize...
Diagnose Skew, Spill, and Too Many Small Tasks in Spark UI (EMR) in Under 10 Minutes
In my last post, I promised a Spark UI walkthrough you can actually use under pressure. This is that post. The goal is simple: open the Spark UI (on EMR) and identify whether your job is suffering from: Skew (a few tasks do all the work) Spill (not enough memory,...
PySpark Best Practices on AWS EMR
If you’ve ever asked “Why is my PySpark job slow on EMR?” the honest answer is usually: it’s not one thing. It’s a handful of small decisions that compound: cluster sizing, file layout, shuffle tuning, join strategy, and the never-ending battle with small files on S3. This post is my “battle...
Using Referential Integrity
I am not sure about you, but tax season is a busy time of year for my teams. With that, I have jumped into the mix to assist with code reviews, PR approvals, and branch merging to free up some of my Senior Data Engineers to do more...
Competitive Advantage
I recently started reading Tomasz Tunguz and Frank Bien’s Winning with Data: Transform Your Culture, Empower Your People, and Shape the Future. For many of us in the data management field, whether in Data Engineering, Business Intelligence, Data Architecture, Database Administration, or even Software Engineering, understanding and extending the usage of...