TL;DR: Every data analysis rests on a plethora of assumptions that you need to test, either manually or explicitly as part of your workflow. The latter is usually the better approach, because explicit data tests enforce and document your assumptions while making your analysis repeatable. In this article, I show you how to explicitly test your data assumptions with PostgreSQL.
It's Friday at 16:59 and you're getting ready for the weekend. Suddenly, your phone rings: it's your manager Bill! He tells you he needs statistics on the number of purchases the customer Acme, Inc. has made outside of Sweden, and he needs it yesterday, otherwise the company will lose its largest revenue stream.
Being a loyal, diligent, and naive data analyst, you tell him you'll have it ready within the hour. As you connect to the data warehouse running Postgres version 15.4, you ask yourself: how hard can it be?
This should require no more than a simple `GROUP BY` query. You glance through the documentation and find a table named `purchase`.
Creating the purchase table:
CREATE TABLE purchase (
-- Primary key
id SERIAL PRIMARY KEY,
-- Unique ID of the customer
-- Not a foreign key because this is a
-- data warehouse with no customer table
customer_id INTEGER,
-- Human readable name of the customer
customer_name VARCHAR(50),
-- The country of the purchase
customer_country VARCHAR(50),
-- When the purchase row was updated
updated_at TIMESTAMP
-- ... other columns ...
);
Populating the purchase table with test data:
INSERT INTO purchase (customer_id, customer_name, customer_country, updated_at)
VALUES
(2, 'Acme, Inc.', 'Germany', '2021-01-02 00:00:00'),
(2, 'Acme, Inc.', 'Germany', '2021-12-02 00:00:00'),
(2, 'Acme, Inc.', 'Germany', '2022-05-02 00:00:00'),
(33, 'Acme, Inc.', 'Germany', '2023-12-02 00:00:00'),
(33, 'Acme, Inc.', 'France', '2023-11-08 00:00:00'),
(33, 'Acme, Inc.', 'Italy', '2023-11-09 00:00:00'),
(33, 'Acme, Inc.', 'Sweden', '2023-12-10 00:00:00'),
(33, 'Acme, Inc.', 'Sweden', '2023-10-11 00:00:00'),
(33, 'Acme, Inc.', 'Norway', '2023-09-12 00:00:00'),
(33, 'Acme, Inc.', 'Denmark', '2023-09-13 00:00:00'),
(33, 'Acme, Inc.', 'Tjotahejti', CURRENT_DATE),
(33, 'Acme, Inc.', 'Germany', '2023-12-02 00:00:00'),
(101, 'Whatever AB', 'Sweden', '2023-12-01 00:00:00'),
(102, 'Entrepreneur Inc.', 'Norway', '2023-11-02 00:00:00'),
(2, 'Acme, Inc.', 'Denmark', '2023-10-03 00:00:00'),
(104, 'Our Home Country Is The World Inc.', NULL, '2023-11-04 00:00:00'),
(105, 'Sauna Inc.', 'Finland', '2023-12-01 00:00:00'),
(NULL, 'Non-paying Inc.', NULL, '2023-12-03 00:00:00')
;
Ah, this is going to be a breeze. You look up a few more details and create the following analysis query:
SELECT customer_id, COUNT(*)
FROM purchase
WHERE
customer_id = 33
AND
customer_country != 'Sweden'
GROUP BY customer_id;
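Against the test data above, the query gives you something like this:

 customer_id | count
-------------+-------
          33 |     7
(1 row)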
Ha, you must be a genius after all, coding at this speed!
As you prepare to send the result to Bill, you start to think about the assumptions you're making...
- What if the customer has more than one `customer_id`? You're assuming that all purchases belonging to the customer are marked with `customer_id = 33`.
- What if `customer_country IS NULL`? The query won't return those rows, so you're assuming that `customer_country` is always set for the relevant purchases.
- What if `purchase` doesn't contain the latest data? You recall a conversation from two months ago about lacking monitoring of the data pipeline and there being no time to prioritize setting it up.
- What if some rows in `purchase` do not represent a purchase? You recall someone creating a bunch of sub-purchases in the `purchase` table to solve an urgent problem last month. Was this customer affected by that?1
- What if `purchase` is not the right table to use? Maybe another table was used for purchases before, or there is some other split.

OK, even a query as easy as this one makes a plethora of assumptions about the underlying data. Disgruntled, you realise that you have to confirm these assumptions before you send anything to Bill.
You consider testing your assumptions by exploring the data, building some intuition, and then manually drawing conclusions.
This is a good start for understanding the data. But you're faced with a challenge: you don't trust your own short-term memory and you certainly don't trust your ability to interpret the result of complex queries without fault this late on a Friday.
One option to circumvent this is to write down and document your tests. That's a step in the right direction, but you realize this comes with its own set of problems. For example, what if you draw conclusions about the data at the time you run your checks, and the data then changes before you run the actual analysis query?
Ugh, you don't know what to do, so you go out to get some air. As you stare into empty space while buying a protein bar at 7-Eleven, you realise something: what if you were to bake the data assumptions into the actual code? If you made the assumptions explicit in the code, there would be a number of benefits: the assumptions would be enforced on every run, they would be documented for anyone reading the analysis, and the analysis itself would become repeatable.
Eureka! You run back to the office to get back to work.
You want to use a standalone Postgres script, so you opt for running a simple `.sql` file via the classic `psql` command. As a first step to test your assumptions, you start exploring the data.
-- Assumption 1: Manually inspecting the customer names to ensure
-- all relevant purchases are `customer_id = 33`.
SELECT customer_name, customer_id
FROM purchase
GROUP BY customer_name, customer_id
ORDER BY customer_name, customer_id;
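Against the test data above, you get something like this back:

            customer_name            | customer_id
-------------------------------------+-------------
 Acme, Inc.                          |           2
 Acme, Inc.                          |          33
 Entrepreneur Inc.                   |         102
 Non-paying Inc.                     |
 Our Home Country Is The World Inc.  |         104
 Sauna Inc.                          |         105
 Whatever AB                         |         101
(7 rows)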
It turns out that "Acme, Inc." has a few purchases where `customer_id = 2`. Why is that? After skimming through old issues in JIRA, you find that someone created a new `customer_id` for the customer a year ago because an immutable property needed to be changed.2
OK, so you need to update your analysis query with this finding and test the updated assumption in code. To do that, you decide to use anonymous code blocks in Postgres. You read the docs and find that they are used to execute one-time code in their own scope with their own variables.
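A minimal sketch of what such a block looks like (the variable and the message here are just placeholders):

DO $$
DECLARE
-- This variable only exists inside the block
greeting TEXT := 'Hello from an anonymous code block';
BEGIN
-- RAISE NOTICE prints a message to the client
RAISE NOTICE '%', greeting;
END$$;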
You also decide to make use of the `ASSERT` statement in Postgres to ensure execution is aborted if an assumption turns out to be false.3 You then write out the code:
DO $$
DECLARE
-- Declare variables prefixed with their assumption number to
-- avoid accidentally using the wrong one
a1_n_potentially_overlooked_rows INTEGER;
BEGIN
---------------
-- Assumption 1: (Updated from initial) All of the customer's
-- purchases can be found with `customer_id IN (2, 33)`
---------------
-- Copy result of COUNT(*) INTO the declared integer variable
SELECT COUNT(*)
INTO a1_n_potentially_overlooked_rows
FROM purchase
WHERE
-- Exclude purchases we know are relevant
customer_id NOT IN (2, 33)
AND
-- Below we make the assumption that other purchase rows
-- with the same `customer_name` may represent the customer.
-- Note that this is not watertight: what if someone named
-- the customer something slightly differently? Then you are
-- in trouble. But you cannot test for every possible
-- circumstance and have to make trade-offs as to how and
-- to what degree you test the data, so you make a
-- qualitative judgement and deem this good enough.
customer_name = 'Acme, Inc.'
;
-- Make the assumption check!
-- If the check fails, we may be missing relevant purchase
-- rows and should inspect the data anew
ASSERT
a1_n_potentially_overlooked_rows = 0,
'Assumption 1 is false'
;
-- Subsequent code inside this block (i.e., code before END)
-- will not execute if an assertion fails
-- ... other assumption tests here ...
END$$;
Wow! This exercise really forced you to think about what you were assuming and on what premises. What's more, you've ensured that subsequent code in the block won't be executed if an assumption turns out to be false.
You continue to explore the data and write tests for the remaining assumptions.
DO $$
DECLARE
-- Declare variables prefixed with their assumption number to
-- avoid accidentally using the wrong one
a1_n_potentially_overlooked_rows INTEGER;
a2_customer_country_is_null INTEGER;
a3_n_purchases_from_today INTEGER;
BEGIN
---------------
-- Assumption 1: (Updated from initial) All of the customer's
-- purchases can be found with `customer_id IN (2, 33)`
---------------
-- Copy result of COUNT(*) INTO the declared integer variable
SELECT COUNT(*)
INTO a1_n_potentially_overlooked_rows
FROM purchase
WHERE
-- Exclude purchases we know are relevant
customer_id NOT IN (2, 33)
AND
-- Below we make the assumption that other purchase rows with
-- the same `customer_name` may represent the customer.
-- Note that this is not watertight: what if someone named
-- the customer something slightly differently? Then you are
-- in trouble. But you cannot test for every possible circumstance
-- and have to make trade-offs as to how and to what degree you test
-- the data, so you make a qualitative judgement and deem this
-- good enough for this purpose.
customer_name = 'Acme, Inc.'
;
-- Test the assumption!
-- If the check fails, we may be missing relevant purchase rows
-- and should inspect the data anew
ASSERT a1_n_potentially_overlooked_rows = 0, 'Assumption 1 is false';
---------------
-- Assumption 2: `customer_country IS NOT NULL` for the customer
-- Assumes assumption 1 is true
---------------
SELECT COUNT(*)
INTO a2_customer_country_is_null
FROM purchase
WHERE
customer_id IN (2, 33)
AND
customer_country IS NULL
;
-- Test assumption!
ASSERT a2_customer_country_is_null = 0, 'Assumption 2 is false';
---------------
-- Assumption 3: We have the latest data.
---------------
SELECT COUNT(*)
INTO a3_n_purchases_from_today
FROM purchase
WHERE
-- Only get rows that were updated today
date(updated_at) = CURRENT_DATE;
-- Test by assuming that we have the latest data if
-- we have purchases that were updated today
ASSERT a3_n_purchases_from_today > 0, 'Assumption 3 is false';
---------------
-- Assumption 4: Every row represents a purchase
---------------
-- Difficult to test in code. Confirmed by looking
-- through the documentation. See Confluence page C-123
-- version 7 under headline "The Curious Case of `purchase`"
-- for more information.
ASSERT true, 'Assumption 4 is false';
---------------
-- Assumption 5: `purchase` is the right table to look at
---------------
-- Difficult to test in code, need to investigate by reading the
-- documentation.
-- Note how we explicitly fail the assertion here because we have
-- not yet confirmed the assumption to be true. Once we have done
-- that, we will document our findings and make it pass.
ASSERT false, 'Assumption 5 is false';
END$$;
You note how you were not able to test all assumptions in code. For example, you could not express a test for assumption 5:
DO $$
DECLARE
-- ... declared variables ...
BEGIN
-- ... other assumptions ...
---------------
-- Assumption 5: `purchase` is the right table to look at
---------------
-- Difficult to test in code, need to investigate by reading the
-- documentation.
-- Note how we explicitly fail the assertion here because we have
-- not yet confirmed the assumption to be true. Once we have done
-- that, we will document our findings and make it pass.
ASSERT false, 'Assumption 5 is false';
-- ... other assumptions ...
END$$;
So instead you opt for writing a manual `ASSERT false` statement for that one while you look up the answer in your documentation. Once you find the answer, you'll change the statement to `ASSERT true` and document your findings as a comment, with references to the sources you used to confirm it. All to make it as explicit and documented as possible, for your current and future self as well as your colleagues.
This is great, with one caveat: how do you output the result of the analysis query? It turns out that this is not straightforward in an anonymous code block, because a query executed there does not return values the same way as a top-level query does.
-- script.sql
DO $$
BEGIN
SELECT 'Result from final query';
END$$;
-- Error from running `psql -f script.sql`:
-- psql:script.sql:4: ERROR: query has no destination for result data
-- HINT: If you want to discard the results of a SELECT, use PERFORM instead.
-- CONTEXT: PL/pgSQL function inline_code_block line 3 at SQL statement
To get around this you could create a temporary table and insert the output there, but that won't work now because you only have READ access to the database. Besides, you don't want to make manual changes to the data warehouse, as you might break something or expose data to the wrong person.
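For the record, that rejected detour could have looked something like the sketch below (the table name is made up, and it assumes a session that is actually allowed to create temporary tables):

-- Hypothetical: collect the result in a temporary table...
CREATE TEMPORARY TABLE analysis_result AS
SELECT COUNT(*) AS n_purchases
FROM purchase
WHERE
customer_id IN (2, 33)
AND
customer_country != 'Sweden';

-- ...and read it back out at the end of the script
SELECT n_purchases FROM analysis_result;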
The most feasible option left4 is to place the analysis query outside the anonymous code block, at the top level at the bottom of the script, like so:
DO $$
BEGIN
-- Some assumption is false, so we want to stop execution
ASSERT false, 'Assumption is false!';
END$$;
-- Will this be executed if the assumption above is false?
SELECT 'Result from final query';
You try this and find that the analysis query was run even though an assumption was false! What's going on here?
It turns out that each block/statement is evaluated by psql separately, so the failed assertion inside the anonymous code block does not abort execution of the query outside and below the block.
Being a SQL veteran, you don't let this discourage you. Instead, you decide to execute everything inside a SQL transaction, because you know that once a statement fails inside a transaction, all subsequent statements in that transaction are rejected until the transaction ends.
-- Start a read-only transaction because we are not
-- changing anything
BEGIN TRANSACTION READ ONLY;
DO $$
BEGIN
-- Some assumption is false, so we want to stop execution
ASSERT false, 'Assumption incorrect!';
END$$;
-- Does not execute this time, because the transaction is
-- aborted, exactly as expected!
SELECT 'Result from final query';
-- We use ROLLBACK to end the SQL transaction, to be sure that no
-- changes are made to the database. Could theoretically use
-- COMMIT instead since we are not making any changes and because
-- we are in a read-only transaction, but why open up the
-- remote possibility for human mistake?
ROLLBACK;
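Running this, psql reports the failed assertion from the DO block and then refuses the standalone query with Postgres's standard aborted-transaction error, roughly like this (line numbers will vary):

ERROR:  Assumption incorrect!
CONTEXT:  PL/pgSQL function inline_code_block line 4 at ASSERT
ERROR:  current transaction is aborted, commands ignored until end of transaction block
ROLLBACK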
This actually works! An added bonus is that you can now separate each assumption test into its own anonymous code block, making the code easier to read and reducing the risk of re-using variables by mistake. You do that and end up with a final version of the script:
-- Start read-only transaction
BEGIN TRANSACTION READ ONLY;
---------------
-- Assumption 1: (Updated from initial) All of the customer's
-- purchases can be found with `customer_id IN (2, 33)`
---------------
DO $$
DECLARE
-- Declare result variable
n_potentially_overlooked_rows INTEGER;
BEGIN
-- Copy result of COUNT(*) INTO the result variable
SELECT COUNT(*)
INTO n_potentially_overlooked_rows
FROM purchase
WHERE
-- Exclude purchases we know are relevant
customer_id NOT IN (2, 33)
AND
-- Below we make the assumption that other purchase rows with
-- the same `customer_name` may represent the customer.
-- Note that this is not watertight: what if someone named
-- the customer something slightly differently? Then you are
-- in trouble. But you cannot test for every possible circumstance
-- and have to make trade-offs as to how and to what degree you test
-- the data, so you make a qualitative judgement and deem this
-- good enough for this purpose.
customer_name = 'Acme, Inc.'
;
-- Test the assumption!
-- If the check fails, we may be missing relevant purchase rows
-- and should inspect the data anew
ASSERT
n_potentially_overlooked_rows = 0,
'Assumption 1 is false'
;
END$$;
---------------
-- Assumption 2: `customer_country IS NOT NULL` for the customer
-- Assumes assumption 1 is true
---------------
DO $$
DECLARE
-- Declare result variable
customer_country_is_null INTEGER;
BEGIN
SELECT COUNT(*)
INTO customer_country_is_null
FROM purchase
WHERE
customer_id IN (2, 33)
AND
customer_country IS NULL
;
-- Test the assumption!
ASSERT customer_country_is_null = 0, 'Assumption 2 is false';
END$$;
---------------
-- Assumption 3: We have the latest data.
---------------
DO $$
DECLARE
-- Declare result variable
n_purchases_from_today INTEGER;
BEGIN
SELECT COUNT(*)
INTO n_purchases_from_today
FROM purchase
WHERE
-- Only get rows that were updated today
date(updated_at) = CURRENT_DATE;
-- Test by assuming that we have the latest data if
-- we have purchases that were updated today
ASSERT
n_purchases_from_today > 0,
'Assumption 3 is false'
;
END$$;
---------------
-- Assumption 4: Every row represents a purchase
---------------
DO $$
BEGIN
-- Difficult to test in code. Confirmed by looking
-- through the documentation. See Confluence page C-123
-- version 7 under headline "The Curious Case of `purchase`"
-- for more information.
ASSERT true, 'Assumption 4 is false';
END$$;
---------------
-- Assumption 5: `purchase` is the right table to look at
---------------
DO $$
BEGIN
-- Difficult to test in code. Confirmed via discussion in
-- issue T-456 from JIRA project 'In the Biz of Data'
ASSERT true, 'Assumption 5 is false';
END$$;
---------------
-- Analysis query: Get the number of purchases 'Acme, Inc.' has
-- made outside of Sweden
---------------
SELECT COUNT(*)
FROM purchase
WHERE
customer_id IN (2, 33)
AND
customer_country != 'Sweden'
;
-- End transaction
ROLLBACK;
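Against the test data above, every assertion passes and the final query returns something like:

 count
-------
    11
(1 row)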
Success! And it only took double the estimated time to complete.
After committing the script to Git and documenting the result in JIRA, you pick up the phone to tell Bill the good news. He doesn't pick up. Classic Bill, so busy working. You send him a text with a link to the JIRA issue, telling yourself this could not have waited until Monday...
You decide that is quite enough thinking for one day and instead rejoice in another Friday night spent creating shareholder value... Or at least you try to.
You make a note to yourself to give a passive-aggressive stare on Monday to the "hero" who came up with that "smart" fix... ↩
You swear as you wonder why the hell someone enforced immutability on a property that evidently changes from time to time. ↩
Another alternative would be to `RAISE EXCEPTION`, which would be just as valid, but you opt for `ASSERT` because of its simplicity and familiarity for those used to writing unit tests in other programming languages. ↩
Another option you think about is using something like `RAISE NOTICE 'Result: %', result_variable` to print the output from the query. This works well for simple queries with one result row, but not as well for queries that return many rows. It would work in this case because your analysis query only returns one row. Even still, you decide not to go this route because you want the security of seeing the full output from the analysis query. ↩