Testing Data Assumptions with PostgreSQL

Wed Dec 06 2023

TL;DR: Every data analysis rests on a plethora of assumptions that you need to test, either manually or explicitly as part of your workflow. The latter is usually the better approach, because explicit data tests enforce and document your assumptions while making your analysis repeatable. In this article, I show you how to explicitly test your data assumptions with PostgreSQL.

1. Problem

It's Friday at 16:59 and you're getting ready for the weekend. Suddenly, your phone rings: it's your manager, Bill! He tells you he needs statistics on the number of purchases the customer Acme, Inc. has made outside of Sweden, and he needs it yesterday; otherwise, the company will lose its largest revenue stream.

Being a loyal, diligent, and naive data analyst, you tell him you'll have it ready within the hour. As you connect to the data warehouse running Postgres version 15.4, you ask yourself: How hard can it be?

2. Initial Solution

This should require no more than a simple GROUP BY query. You glance through the documentation and find a table named purchase.

Creating the purchase table:

CREATE TABLE purchase (
    -- Primary key
    id SERIAL PRIMARY KEY,
    -- Unique ID of the customer
    -- Not a foreign key because this is a
    -- data warehouse with no customer table
    customer_id INTEGER,
    -- Human readable name of the customer
    customer_name VARCHAR(50),
    -- The country of the purchase
    customer_country VARCHAR(50),
    -- When the purchase row was updated
    updated_at TIMESTAMP
    -- ... other columns ...
);

Populating the purchase table with test data:

INSERT INTO purchase
    (customer_id, customer_name, customer_country, updated_at)
VALUES
    (2, 'Acme, Inc.', 'Germany', '2021-01-02 00:00:00'),
    (2, 'Acme, Inc.', 'Germany', '2021-12-02 00:00:00'),
    (2, 'Acme, Inc.', 'Germany', '2022-05-02 00:00:00'),
    (33, 'Acme, Inc.', 'Germany', '2023-12-02 00:00:00'),
    (33, 'Acme, Inc.', 'France', '2023-11-08 00:00:00'),
    (33, 'Acme, Inc.', 'Italy', '2023-11-09 00:00:00'),
    (33, 'Acme, Inc.', 'Sweden', '2023-12-10 00:00:00'),
    (33, 'Acme, Inc.', 'Sweden', '2023-10-11 00:00:00'),
    (33, 'Acme, Inc.', 'Norway', '2023-09-12 00:00:00'),
    (33, 'Acme, Inc.', 'Denmark', '2023-09-13 00:00:00'),
    (33, 'Acme, Inc.', 'Tjotahejti', CURRENT_DATE),
    (33, 'Acme, Inc.', 'Germany', '2023-12-02 00:00:00'),
    (101, 'Whatever AB', 'Sweden', '2023-12-01 00:00:00'),
    (102, 'Entrepreneur Inc.', 'Norway', '2023-11-02 00:00:00'),
    (2, 'Acme, Inc.', 'Denmark', '2023-10-03 00:00:00'),
    (104, 'Our Home Country Is The World Inc.', NULL, '2023-11-04 00:00:00'),
    (105, 'Sauna Inc.', 'Finland', '2023-12-01 00:00:00'),
    (NULL, 'Non-paying Inc.', NULL, '2023-12-03 00:00:00')
;

Ah, this is going to be a breeze. You look up a few more details and write the following analysis query:

SELECT
    customer_id,
    COUNT(*)
FROM purchase
WHERE
    customer_id = 33 AND
    customer_country != 'Sweden'
GROUP BY customer_id;
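Against the test data above, this returns a single row: nine purchases with customer_id = 33, minus the two in Sweden.

 customer_id | count
-------------+-------
          33 |     7
(1 row)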

Ha, you must be a genius after all, coding at this speed!

3. Underlying Assumptions

As you prepare to send the result to Bill, you start to think about the assumptions you're making...

  1. What if the customer has several customer_ids? You're assuming that all purchases belonging to the customer are marked with customer_id = 33.
  2. What if customer_country IS NULL? The query won't return those rows (see the snippet after this list), so you're assuming that customer_country is always set for the relevant purchases.
  3. What if purchase doesn't contain the latest data? You recall a conversation from two months ago about the lack of monitoring of the data pipeline, and that there was no time to prioritize setting it up.
  4. What if certain rows in purchase do not represent a purchase? You recall someone creating a bunch of sub-purchases in the purchase table to solve an urgent problem last month. Was this customer affected by that?[1]
  5. What if purchase is not the right table to use? Maybe another table was used for purchases before or there is some other split.
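To make assumption 2 concrete: in SQL, a comparison against NULL yields NULL rather than TRUE, so the WHERE clause silently drops rows with a missing country. A quick illustration:

-- NULL != 'Sweden' evaluates to NULL, not TRUE, so the WHERE
-- clause silently drops purchases with an unknown country
SELECT NULL != 'Sweden';  -- returns NULL

SELECT COUNT(*)
FROM (VALUES ('Germany'), (NULL)) AS t(country)
WHERE country != 'Sweden';  -- returns 1; the NULL row is excluded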

OK, even a query as easy as this one makes a plethora of assumptions about the underlying data. Disgruntled, you realise that you have to confirm these assumptions before you send anything to Bill.

4. Method for Assumption Testing

You consider testing your assumptions by exploring the data, building some intuition, and then manually drawing conclusions.

This is a good start for understanding the data. But you're faced with a challenge: you don't trust your own short-term memory and you certainly don't trust your ability to interpret the result of complex queries without fault this late on a Friday.

One option to circumvent this is to write down and document your tests. That's a step in the right direction, but you realise it comes with its own set of problems. For example, what if you draw a conclusion while exploring the data, and the data then changes before you run the analysis query?

Ugh, you don't know what to do, so you go out to get some air. As you stare into empty space while buying a protein bar at 7-Eleven, you realise something: What if you were to bake the data assumptions into the actual code? If you made the assumptions explicit in the code, they would be:

  1. Enforced: the analysis query only runs if all assumptions hold.
  2. Documented: you or someone else can review the assumptions now or in the future, for example to understand what the hell went wrong.
  3. Repeatable: the "one-off" query can be run again whenever needed.

Eureka! You run back to the office to get back to work.

5. Performing Assumption Testing

You want to use a standalone Postgres script, so you opt for running a simple .sql file via the classic psql command. As a first step to test your assumptions, you start exploring the data.

-- Assumption 1: Manually inspecting the customer names to ensure
-- all relevant purchases are `customer_id = 33`.
SELECT
    customer_name,
    customer_id
FROM purchase
GROUP BY customer_name, customer_id
ORDER BY customer_name, customer_id;

It turns out that "Acme, Inc." has a few purchases where customer_id = 2. Why is that? After skimming through old issues in JIRA, you find that someone created a new customer_id for the customer a year ago because an immutable property needed to be changed.[2]

OK, so you need to update your analysis query with this finding and test the updated assumption in code. To do that, you decide to use anonymous code blocks (DO statements) in Postgres. You read the docs and find that they execute one-time code in their own scope, with their own variables.
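Before writing the real tests, you sketch a minimal block to get a feel for the syntax (the variable and message here are just illustrative):

DO $$
DECLARE
    -- This variable only exists inside the block
    greeting TEXT := 'Hello from an anonymous block';
BEGIN
    -- RAISE NOTICE prints a message to the client
    RAISE NOTICE '%', greeting;
END$$;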

You also decide to make use of PL/pgSQL's ASSERT statement to ensure execution is aborted if an assumption turns out to be false.[3] You then write out the code:

DO $$
DECLARE
    -- Declare variables prefixed with their assumption number to
    -- avoid accidentally using the wrong one
    a1_n_potentially_overlooked_rows INTEGER;
BEGIN
    ---------------
    -- Assumption 1: (Updated from initial) All of the customer's
    -- purchases can be found with `customer_id IN (2, 33)`
    ---------------
    -- Copy result of COUNT(*) INTO the declared integer variable
    SELECT COUNT(*)
    INTO a1_n_potentially_overlooked_rows
    FROM purchase
    WHERE
        -- Exclude purchases we know are relevant
        customer_id NOT IN (2, 33) AND
        -- Below we make the assumption that other purchase rows
        -- with the same `customer_name` may represent the customer.
        -- Note that this is not waterproof: what if someone named
        -- the customer something slightly differently? Then you are
        -- in trouble. But you cannot test for every possible
        -- circumstance and have to make trade-offs as to how and
        -- to what degree you test the data, so you make a
        -- qualitative judgement and deem this good enough.
        customer_name = 'Acme, Inc.'
    ;

    -- Make the assumption check!
    -- If the check fails, we may be missing relevant purchase
    -- rows and should inspect the data anew
    ASSERT a1_n_potentially_overlooked_rows = 0,
        'Assumption 1 is false'
    ;

    -- Subsequent code inside this block (i.e., code before END)
    -- will not execute if an assertion fails

    -- ... other assumption tests here ...
END$$;

Wow! This exercise really forced you to think about what you were assuming, and on what grounds. What's more, you've ensured that subsequent code in the block won't be executed if an assumption turns out to be false.

You continue to explore the data and write tests for the remaining assumptions.

DO $$
DECLARE
    -- Declare variables prefixed with their assumption number to
    -- avoid accidentally using the wrong one
    a1_n_potentially_overlooked_rows INTEGER;
    a2_customer_country_is_null INTEGER;
    a3_n_purchases_from_today INTEGER;
BEGIN
    ---------------
    -- Assumption 1: (Updated from initial) All of the customer's
    -- purchases can be found with `customer_id IN (2, 33)`
    ---------------
    -- Copy result of COUNT(*) INTO the declared integer variable
    SELECT COUNT(*)
    INTO a1_n_potentially_overlooked_rows
    FROM purchase
    WHERE
        -- Exclude purchases we know are relevant
        customer_id NOT IN (2, 33) AND
        -- Below we make the assumption that other purchase rows with
        -- the same `customer_name` may represent the customer.
        -- Note that this is not waterproof: what if someone named
        -- the customer something slightly differently? Then you are
        -- in trouble. But you cannot test for every possible
        -- circumstance and have to make trade-offs as to how and to
        -- what degree you test the data, so you make a qualitative
        -- judgement and deem this good enough for this purpose.
        customer_name = 'Acme, Inc.'
    ;

    -- Test the assumption!
    -- If the check fails, we may be missing relevant purchase rows
    -- and should inspect the data anew
    ASSERT a1_n_potentially_overlooked_rows = 0, 'Assumption 1 is false';

    ---------------
    -- Assumption 2: `customer_country IS NOT NULL` for the customer
    -- Assumes assumption 1 is true
    ---------------
    SELECT COUNT(*)
    INTO a2_customer_country_is_null
    FROM purchase
    WHERE
        customer_id IN (2, 33) AND
        customer_country IS NULL
    ;

    -- Test the assumption!
    ASSERT a2_customer_country_is_null = 0, 'Assumption 2 is false';

    ---------------
    -- Assumption 3: We have the latest data
    ---------------
    SELECT COUNT(*)
    INTO a3_n_purchases_from_today
    FROM purchase
    WHERE
        -- Only get rows that were updated today
        date(updated_at) = CURRENT_DATE;

    -- Test by assuming that we have the latest data if
    -- we have purchases that were updated today
    ASSERT a3_n_purchases_from_today > 0, 'Assumption 3 is false';

    ---------------
    -- Assumption 4: Every row represents a purchase
    ---------------
    -- Difficult to test in code. Confirmed by looking
    -- through the documentation. See Confluence page C-123
    -- version 7 under headline "The Curious Case of `purchase`"
    -- for more information.
    ASSERT true, 'Assumption 4 is false';

    ---------------
    -- Assumption 5: `purchase` is the right table to look at
    ---------------
    -- Difficult to test in code, need to investigate by reading the
    -- documentation.
    -- Note how we explicitly fail the assertion here because we have
    -- not yet confirmed the assumption to be true. Once we have done
    -- that, we will document our findings and make it pass.
    ASSERT false, 'Assumption 5 is false';
END$$;

You note how you were not able to test all assumptions in code. For example, you could not express a test for assumption 5:

DO $$
DECLARE
    -- ... declared variables ...
BEGIN
    -- ... other assumptions ...

    ---------------
    -- Assumption 5: `purchase` is the right table to look at
    ---------------
    -- Difficult to test in code, need to investigate by reading the
    -- documentation.
    -- Note how we explicitly fail the assertion here because we have
    -- not yet confirmed the assumption to be true. Once we have done
    -- that, we will document our findings and make it pass.
    ASSERT false, 'Assumption 5 is false';

    -- ... other assumptions ...
END$$;

So instead you opted for writing a manual ASSERT false statement for that one while you look up the answer in your documentation. Once you find the answer, you'll change the statement to ASSERT true and document your findings as a comment, with references to sources you used to confirm it. All for making it as explicit and documented as possible, for your current and future self as well as your colleagues.

6. Reading the Output From the Analysis Query

This is great, with one caveat: how do you output the result of the analysis query? It turns out that this is not straightforward in an anonymous code block, because a query executed there does not return rows to the client the way a top-level query does.

-- script.sql
DO $$
BEGIN
    SELECT 'Result from final query';
END$$;

-- Error from running `psql -f script.sql`:
--
-- psql:script.sql:4: ERROR: query has no destination for result data
-- HINT: If you want to discard the results of a SELECT, use PERFORM instead.
-- CONTEXT: PL/pgSQL function inline_code_block line 3 at SQL statement

To get around this, you could create a temporary table and insert the output there, but that won't work here because you only have read access to the database. Besides, you don't want to make manual changes to the data warehouse, as you might break something or expose data to the wrong person.
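For reference, a sketch of what that approach could have looked like with write access (the staging table is hypothetical):

-- Hypothetical: stage the result in a temporary table
-- (not possible here with read-only access)
CREATE TEMPORARY TABLE analysis_result (n_purchases INTEGER);

DO $$
BEGIN
    -- ... assumption tests here ...

    INSERT INTO analysis_result
    SELECT COUNT(*)
    FROM purchase
    WHERE customer_id IN (2, 33) AND customer_country != 'Sweden';
END$$;

-- Read the staged result with a top-level query
SELECT n_purchases FROM analysis_result;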

The most feasible option left[4] is to place the analysis query outside the anonymous code block, at the top level at the bottom of the script, like so:

DO $$
BEGIN
    -- Some assumption is false, so we want to stop execution
    ASSERT false, 'Assumption is false!';
END$$;

-- Will this be executed if the assumption above is false?
SELECT 'Result from final query';

You try this and find that the analysis query was run even though an assumption was false! What's going on here?

It turns out that psql evaluates each statement in the script separately and, by default, keeps going after an error (unless you run it with ON_ERROR_STOP=on). The failed assertion inside the anonymous code block therefore does not abort execution of the query outside and below the block.

Being a SQL veteran, you don't let this discourage you. Instead, you decide to execute everything inside a SQL transaction, because you know that once a statement fails inside a transaction, Postgres aborts the transaction and refuses to run the subsequent statements.

-- Start a read-only transaction because we are not
-- changing anything
BEGIN TRANSACTION READ ONLY;

DO $$
BEGIN
    -- Some assumption is false, so we want to stop execution
    ASSERT false, 'Assumption incorrect!';
END$$;

-- Does not execute this time, because the transaction is
-- aborted, exactly as expected!
SELECT 'Result from final query';

-- We use ROLLBACK to end the SQL transaction, to be sure that no
-- changes are made to the database. Could theoretically use
-- COMMIT instead since we are not making any changes and because
-- we are in a read-only transaction, but why open up the
-- remote possibility for human mistake?
ROLLBACK;

This actually works! An added bonus is that you can now separate each assumption test into its own anonymous code block, making the code easier to read and reducing the risk of re-using variables by mistake. You do that and end up with a final version of the script:

-- Start read-only transaction
BEGIN TRANSACTION READ ONLY;

---------------
-- Assumption 1: (Updated from initial) All of the customer's
-- purchases can be found with `customer_id IN (2, 33)`
---------------
DO $$
DECLARE
    -- Declare result variable
    n_potentially_overlooked_rows INTEGER;
BEGIN
    -- Copy result of COUNT(*) INTO the result variable
    SELECT COUNT(*)
    INTO n_potentially_overlooked_rows
    FROM purchase
    WHERE
        -- Exclude purchases we know are relevant
        customer_id NOT IN (2, 33) AND
        -- Below we make the assumption that other purchase rows with
        -- the same `customer_name` may represent the customer.
        -- Note that this is not waterproof: what if someone named
        -- the customer something slightly differently? Then you are
        -- in trouble. But you cannot test for every possible
        -- circumstance and have to make trade-offs as to how and to
        -- what degree you test the data, so you make a qualitative
        -- judgement and deem this good enough for this purpose.
        customer_name = 'Acme, Inc.'
    ;

    -- Test the assumption!
    -- If the check fails, we may be missing relevant purchase rows
    -- and should inspect the data anew
    ASSERT n_potentially_overlooked_rows = 0, 'Assumption 1 is false';
END$$;

---------------
-- Assumption 2: `customer_country IS NOT NULL` for the customer
-- Assumes assumption 1 is true
---------------
DO $$
DECLARE
    -- Declare result variable
    customer_country_is_null INTEGER;
BEGIN
    SELECT COUNT(*)
    INTO customer_country_is_null
    FROM purchase
    WHERE
        customer_id IN (2, 33) AND
        customer_country IS NULL
    ;

    -- Test the assumption!
    ASSERT customer_country_is_null = 0, 'Assumption 2 is false';
END$$;

---------------
-- Assumption 3: We have the latest data
---------------
DO $$
DECLARE
    -- Declare result variable
    n_purchases_from_today INTEGER;
BEGIN
    SELECT COUNT(*)
    INTO n_purchases_from_today
    FROM purchase
    WHERE
        -- Only get rows that were updated today
        date(updated_at) = CURRENT_DATE;

    -- Test by assuming that we have the latest data if
    -- we have purchases that were updated today
    ASSERT n_purchases_from_today > 0, 'Assumption 3 is false';
END$$;

---------------
-- Assumption 4: Every row represents a purchase
---------------
DO $$
BEGIN
    -- Difficult to test in code. Confirmed by looking
    -- through the documentation. See Confluence page C-123
    -- version 7 under headline "The Curious Case of `purchase`"
    -- for more information.
    ASSERT true, 'Assumption 4 is false';
END$$;

---------------
-- Assumption 5: `purchase` is the right table to look at
---------------
DO $$
BEGIN
    -- Difficult to test in code. Confirmed via discussion in
    -- issue T-456 from JIRA project 'In the Biz of Data'
    ASSERT true, 'Assumption 5 is false';
END$$;

---------------
-- Analysis query: Get the number of purchases 'Acme, Inc.' has
-- made outside of Sweden
---------------
SELECT COUNT(*)
FROM purchase
WHERE
    customer_id IN (2, 33) AND
    customer_country != 'Sweden'
;

-- End transaction
ROLLBACK;
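You run the script with psql -f script.sql one last time. Every assertion passes, and the analysis query prints the answer against the test data:

 count
-------
    11
(1 row)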

Success! And it only took double the estimated time to complete.

After committing the script to Git and documenting the result in JIRA, you pick up the phone to tell Bill the good news. He doesn't pick up. Classic Bill, so busy working. You send him a text with a link to the JIRA issue, telling yourself this could not have waited until Monday...

You decide that is quite enough of thinking for one day and instead rejoice in another Friday night spent creating shareholder value... Or at least you try to.


Footnotes

  1. You make a note to yourself to give a passive aggressive stare on Monday to the "hero" who came up with that "smart" fix...

  2. You swear as you wonder why the hell someone enforced immutability on a property that evidently changes from time to time.

  3. Another alternative would be to RAISE EXCEPTION, which would be just as valid, but you opt for ASSERT because of its simplicity and familiarity for those used to writing unit tests in other programming languages. See the sketch below the footnotes.

  4. Another option you think about is using something like RAISE NOTICE 'Result: %', result_variable to print the output from the query. This works well for simple queries with one result row, but not as well for queries that return many rows. It would work in this case because your analysis query only returns one row. Even still, you decide not to go this route because you want the security of seeing the full output from the analysis query. See the sketch below the footnotes.
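For reference, minimal sketches of the alternatives from footnotes 3 and 4, assuming the same purchase table as in the article (the checks shown are illustrative):

-- Footnote 3: RAISE EXCEPTION instead of ASSERT
DO $$
BEGIN
    -- Same effect as a failed ASSERT: abort the block with an error
    IF EXISTS (
        SELECT 1
        FROM purchase
        WHERE customer_id IN (2, 33) AND customer_country IS NULL
    ) THEN
        RAISE EXCEPTION 'Assumption 2 is false';
    END IF;
END$$;

-- Footnote 4: RAISE NOTICE to print a single-row result from
-- inside a block
DO $$
DECLARE
    n_purchases INTEGER;
BEGIN
    SELECT COUNT(*)
    INTO n_purchases
    FROM purchase
    WHERE customer_id IN (2, 33) AND customer_country != 'Sweden';

    -- Prints 'NOTICE: Result: 11' for the test data in the article
    RAISE NOTICE 'Result: %', n_purchases;
END$$;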