NADEBlg!
Featured
Setting Up Data Lineage in Spark on AWS EMR (From Scratch)
Quick apology up front: this post is late. I meant to ship it last week, but a production incident reminded me (again) why lineage matters. So here it is: a practical guide to implementing data lineage from scratch in a Spark on AWS EMR environment. This is Part 3 of...
Common Data Lineage Mistakes (and How to Fix Them)
I wrote the first post in this series after a messy incident. This second post is written after the third time I saw the same lineage mistakes repeat at a different company. The pattern is always the same: lineage is treated as a dashboard, not as part of the pipeline...
What Is End-to-End Data Lineage? (And Why It’s Not the Same as an Ontology)
I first started caring about end-to-end data lineage on a Friday night incident. A dashboard metric moved, everyone panicked, and the only honest question in the room was: where did that number actually come from? This post is Part 1 of 4 in a short lineage series. We start with...
The Databricks Serverless Hybrid Model: Running Both Classic and Serverless Data Planes (and When You Actually Need To)
Databricks has evolved into a platform with two compute “planes” that can execute workloads: the Classic compute plane (customer-managed networking and compute running in your own cloud account/VPC, e.g. an AWS account within the customer’s AWS organization, or the equivalent on another CSP), and the Serverless compute plane (Databricks-managed compute that you consume on-demand),...
Configuring Databricks Apps: app.yaml, Resources, and Fast Troubleshooting When Things Go Sideways
Databricks Apps is at its best when it’s boring: you deploy an app, wire it to the right Databricks resources, assign permissions via groups, and it just works. But when it doesn’t work (the app is slow, unresponsive, or “sort of works for me but not for anyone else”),...
Databricks Apps Review: What It Is, What It Isn’t, and How to Roll It Out Safely
Databricks Apps is one of those platform features that looks deceptively simple: “deploy an app next to the data.” In practice, it changes how teams deliver internal tools (dashboards, forms, RAG UIs, lightweight workflows) because you can ship an application inside the Databricks security and governance boundary without standing up...
Walkthrough: Deploying My Personal EMR Clusters with Terraform + Ansible (Reference Repo)
In my previous post I covered Terraform + Ansible best practices for a secure, production-ready EMR environment. In this post, I’m going to do the practical follow-up: a walkthrough of the reference repo I use to deploy my personal EMR clusters. The example repo for this post is: GitHub repo:...
Terraform + Ansible Best Practices for a Secure, Production-Ready AWS EMR Environment
I get asked all the time: “How do I securely automate my EMR deployment so I only need to write the code once to create a repeatable, production-ready environment?” If you want EMR to feel boring in production (the best outcome), you need Infrastructure as Code (IaC) for the...
How One Skewed Join Key Turned a 12-Minute Spark Job into a 2-Hour Job (and How to Fix It Without Resizing the Cluster)
In my last post, I showed how to diagnose skew in Spark UI quickly. In this post, I want to show the real-world version: one skewed join key took a job that normally ran in ~12 minutes and pushed it past 2 hours. The punchline: you don’t need to resize...
Diagnose Skew, Spill, and Too Many Small Tasks in Spark UI (EMR) in Under 10 Minutes
In my last post, I promised a Spark UI walkthrough you can actually use under pressure. This is that post. The goal is simple: open the Spark UI (on EMR) and identify whether your job is suffering from: Skew (a few tasks do all the work) Spill (not enough memory,...
PySpark Best Practices on AWS EMR
If you’ve ever asked “Why is my PySpark job slow on EMR?” the honest answer is usually: it’s not one thing. It’s a handful of small decisions that compound: cluster sizing, file layout, shuffle tuning, join strategy, and the never-ending battle with small files on S3. This post is my “battle...
Using Referential Integrity
I am not sure about you, but tax season is a busy time of year for my teams. With that, I have jumped into the mix to assist with code reviews, PR approvals, and branch merging to free up some of my Senior Data Engineers to do more...
Competitive Advantage
I recently started reading Tomasz Tunguz and Frank Bien’s Winning with Data: Transform Your Culture, Empower Your People, and Shape the Future. For many of us in the data management field, whether in Data Engineering, Business Intelligence, Data Architecture, Database Administration, or even Software Engineering, understanding and extending the usage of...