Why normalization matters for your data model
Normalization is the discipline of organizing a database so that each fact is stored in exactly one place. When the same fact lives in two rows or two tables, updating it means updating it in multiple places — and eventually, someone will forget one.
The result is inconsistent data: two rows describing the same customer with different cities, or a product price that differs across invoice records. Normalization eliminates these anomalies by design rather than by discipline.
Mental Model
Normalization answers: where does this fact belong? If customer email appears in three tables, it belongs in exactly one — customers. Every other table should reference it by foreign key. One source of truth, many references.
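The "one source of truth, many references" idea can be exercised directly. A minimal sketch in SQLite (table and column names here are illustrative, not from the schemas discussed later): email lives only in customers, orders reference it by foreign key, and a single UPDATE is visible through every join.

```python
import sqlite3

# Illustrative schema: email is stored exactly once, in customers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email TEXT
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id)
    );
    INSERT INTO customers VALUES (1, 'ada@example.com');
    INSERT INTO orders VALUES (101, 1), (102, 1);
""")

# One UPDATE in one place; every order sees the change via the join.
conn.execute(
    "UPDATE customers SET email = 'ada@new.example' WHERE customer_id = 1"
)
rows = conn.execute("""
    SELECT o.order_id, c.email
    FROM orders o JOIN customers c USING (customer_id)
    ORDER BY o.order_id
""").fetchall()
print(rows)  # both orders now show 'ada@new.example'
```

Had the email been copied into each order row, the same change would have required touching every copy, and missing one would produce exactly the inconsistency described above.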
First Normal Form: eliminate repeating groups and ensure atomicity
A table is in First Normal Form (1NF) when every column contains atomic (indivisible) values and there are no repeating groups. The most common 1NF violations are storing multiple values in a single column — like comma-separated tags — or using numbered columns like item_1, item_2, item_3 to store a list.
The fix is always the same: one value per cell, one fact per row. Move repeating groups into their own table and join them back with a foreign key.
-- BEFORE 1NF: repeating groups as columns
CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer VARCHAR,
  item_1 VARCHAR, -- repeating groups!
  item_2 VARCHAR,
  item_3 VARCHAR,
  qty_1 INT,
  qty_2 INT,
  qty_3 INT
);
-- Problem: item_4 requires a schema change; NULL padding wastes storage.
-- AFTER 1NF: one fact per row, no repeating groups
CREATE TABLE orders (
  order_id INT,
  customer VARCHAR, -- still depends only on order_id; 2NF addresses this
  item_name VARCHAR,
  quantity INT,
  PRIMARY KEY (order_id, item_name)
);
-- Now adding a fourth item is just a new row, not a schema change.
Second Normal Form: remove partial dependencies
A table is in Second Normal Form (2NF) when it is already in 1NF and every non-key column depends on the entire primary key — not just part of it. Partial dependencies only appear in tables with composite primary keys.
The classic example: an order_items table with a composite key of (order_id, product_id), where product_name depends only on product_id and not on order_id. That attribute belongs in a products table, not in order_items.
-- BEFORE 2NF: product_name depends on product_id only
-- (partial dependency on a composite PK)
CREATE TABLE order_items (
  order_id INT,
  product_id INT,
  product_name VARCHAR, -- depends only on product_id!
  quantity INT,
  PRIMARY KEY (order_id, product_id)
);
-- AFTER 2NF: product attributes moved to their own table
-- (products must be created first so the foreign key can reference it)
CREATE TABLE products (
  product_id INT PRIMARY KEY,
  product_name VARCHAR,
  unit_price NUMERIC(10,2)
);
CREATE TABLE order_items (
  order_id INT,
  product_id INT REFERENCES products(product_id),
  quantity INT,
  PRIMARY KEY (order_id, product_id)
);
-- Now renaming a product is one UPDATE in one table, not thousands.
Third Normal Form: eliminate transitive dependencies
A table is in Third Normal Form (3NF) when it is in 2NF and no non-key column depends on another non-key column. This is called a transitive dependency. The most common example: storing city and state alongside zip code in a customers table, when city and state are actually determined by zip.
If you update a customer's zip code but forget to update their city and state, the record becomes internally inconsistent. The fix is to move the transitive dependency into its own lookup table.
-- BEFORE 3NF: city and state depend on zip, not customer_id
CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  name VARCHAR,
  email VARCHAR,
  zip CHAR(5),
  city VARCHAR, -- depends on zip, not customer_id!
  state CHAR(2) -- depends on zip, not customer_id!
);
-- AFTER 3NF: transitive dependency extracted to zip_codes
-- (zip_codes must be created first so the foreign key can reference it)
CREATE TABLE zip_codes (
  zip CHAR(5) PRIMARY KEY,
  city VARCHAR,
  state CHAR(2)
);
CREATE TABLE customers (
  customer_id INT PRIMARY KEY,
  name VARCHAR,
  email VARCHAR,
  zip CHAR(5) REFERENCES zip_codes(zip)
);
-- Updating a city for a zip code now fixes every customer at once.
When to intentionally denormalize for analytics
Normalization is the right default for transactional databases where write correctness matters. For analytical workloads, deliberate denormalization often improves query performance by reducing joins at read time — this is exactly what dimensional modeling and wide tables do.
The key word is deliberate. A denormalized analytical table is a design decision, not a mistake. Document why it is denormalized, what the source-of-truth table is, and how updates flow from the normalized model to the denormalized one.
Watch Out
Denormalization is a deliberate trade-off, not a shortcut. If you denormalize without documenting it, the next engineer will normalize it — and break every query pattern that depended on the wide format.
- Denormalize when join cost dominates query runtime on large fact tables.
- Always maintain a normalized source and derive denormalized outputs from it.
- Document the denormalization decision in the schema's data dictionary.
- Use dbt to automate the derivation so the denormalized table stays in sync.
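The "maintain a normalized source, derive the wide table" pattern can be sketched in a few lines. This is an illustrative SQLite example, not a dbt model: rebuild_wide_orders stands in for whatever job or model materializes the denormalized output, and the table names are hypothetical.

```python
import sqlite3

# Normalized source of truth; the wide table is derived, never edited by hand.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_items (
        order_id INTEGER,
        product_id INTEGER REFERENCES products(product_id),
        quantity INTEGER,
        PRIMARY KEY (order_id, product_id)
    );
    INSERT INTO products VALUES (1, 'widget');
    INSERT INTO order_items VALUES (500, 1, 3);
""")

def rebuild_wide_orders(conn):
    # Re-derive the denormalized table from the normalized model, the way
    # a scheduled job (or a dbt model) would materialize it.
    conn.executescript("""
        DROP TABLE IF EXISTS wide_order_items;
        CREATE TABLE wide_order_items AS
        SELECT oi.order_id, oi.product_id, p.product_name, oi.quantity
        FROM order_items oi JOIN products p USING (product_id);
    """)

rebuild_wide_orders(conn)

# A rename happens once, in the normalized model; the wide table is rebuilt.
conn.execute("UPDATE products SET product_name = 'gadget' WHERE product_id = 1")
rebuild_wide_orders(conn)
row = conn.execute("SELECT product_name FROM wide_order_items").fetchone()
print(row)  # ('gadget',)
```

Because updates flow one way, from the normalized tables into the derived one, the wide format never becomes a second source of truth that can drift out of sync.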