How Can I Fill Sequential NULL Values in a PostgreSQL Table Using the Previous Not-NULL Value?

When working with datasets in PostgreSQL, particularly transaction records or time series data, it’s common to encounter missing values (NULLs). For some analyses you need to replace each NULL with the most recent non-NULL value in the dataset. Let’s look at how to do this with PostgreSQL’s window functions.

Original Attempt with Issues

Here’s the situation: I have a table named orders with columns offer_id and date. I need to fill all NULL cases in the column offer_id using the previous non-NULL value. My initial approach used the LAG() function, a window function provided by PostgreSQL for accessing data from a previous row. Here’s what the initial code looked like:

WITH orders AS (
    SELECT 2 AS offer_id, '2021-01-01'::date AS date UNION ALL
    SELECT 3 AS offer_id, '2021-01-02'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-03'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-04'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-05'::date AS date UNION ALL
    SELECT 4 AS offer_id, '2021-01-07'::date AS date UNION ALL
    SELECT 5 AS offer_id, '2021-01-08'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-09'::date AS date UNION ALL
    SELECT 8 AS offer_id, '2021-01-10'::date AS date UNION ALL
    SELECT 9 AS offer_id, '2021-01-11'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-12'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-13'::date AS date UNION ALL
    SELECT 13 AS offer_id, '2021-01-14'::date AS date UNION ALL
    SELECT 13 AS offer_id, '2021-01-15'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-16'::date
)
SELECT
    *,
    CASE
        WHEN offer_id IS NULL THEN LAG(offer_id) OVER (ORDER BY date)
        ELSE offer_id
    END AS updated_offer_id
FROM orders;

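The problem with this code is that LAG() returns the value from the immediately preceding row, whether or not that value is NULL; it does not “skip” over NULLs. As a result, only the first NULL in each run gets filled. The second and later NULLs in a run look back at a row that is itself NULL, so they stay NULL. This makes LAG() on its own unsuitable for sequences of multiple contiguous NULL values.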
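To make the limitation concrete, here is a small sketch that runs the same CASE/LAG() expression on a five-row slice of the sample data. The slice is chosen only for illustration; the comments show what each row should produce.

```sql
-- Five-row slice of the sample data, used only to illustrate the LAG() problem.
WITH orders (offer_id, date) AS (
    VALUES
        (3,    '2021-01-02'::date),
        (NULL, '2021-01-03'::date),
        (NULL, '2021-01-04'::date),
        (NULL, '2021-01-05'::date),
        (4,    '2021-01-07'::date)
)
SELECT
    offer_id,
    date,
    CASE
        WHEN offer_id IS NULL THEN LAG(offer_id) OVER (ORDER BY date)
        ELSE offer_id
    END AS updated_offer_id
FROM orders
ORDER BY date;

-- updated_offer_id by row:
--   2021-01-02 -> 3     (original value kept)
--   2021-01-03 -> 3     (previous row holds 3)
--   2021-01-04 -> NULL  (previous row is itself NULL)
--   2021-01-05 -> NULL  (previous row is itself NULL)
--   2021-01-07 -> 4     (original value kept)
```

Only the first NULL after a real value is filled; the rest of the run is left untouched.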

Correcting With a Comprehensive Approach

To effectively fill multiple NULL values using the last known non-NULL value, we need a method to carry the last non-NULL value through successive rows until it encounters another non-NULL value. Here’s how we can do it:

WITH orders AS (
    SELECT 2 AS offer_id, '2021-01-01'::date AS date UNION ALL
    SELECT 3 AS offer_id, '2021-01-02'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-03'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-04'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-05'::date AS date UNION ALL
    SELECT 4 AS offer_id, '2021-01-07'::date AS date UNION ALL
    SELECT 5 AS offer_id, '2021-01-08'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-09'::date AS date UNION ALL
    SELECT 8 AS offer_id, '2021-01-10'::date AS date UNION ALL
    SELECT 9 AS offer_id, '2021-01-11'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-12'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-13'::date AS date UNION ALL
    SELECT 13 AS offer_id, '2021-01-14'::date AS date UNION ALL
    SELECT 13 AS offer_id, '2021-01-15'::date AS date UNION ALL
    SELECT NULL AS offer_id, '2021-01-16'::date
)
, extended AS (
    SELECT
        offer_id,
        date,
        MAX(offer_id) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS updated_offer_id
    FROM orders
)
SELECT * FROM extended;

In this revised code, instead of LAG(), I used the MAX() window function with the frame clause ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. For each row, this takes the maximum offer_id over all rows up to and including the current one, ordered by date. MAX() ignores NULLs, so a NULL row receives the largest value seen so far. In this dataset offer_id never decreases as date increases, so the largest value seen so far is also the most recent non-NULL value, and it is carried down through each run of NULLs until a new non-NULL value appears. That assumption matters: if offer_id could decrease over time, a running MAX() would return an older, larger ID rather than the most recent one.
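If the IDs are not guaranteed to increase, one common alternative is to group each non-NULL row with the NULL rows that follow it and then take the single non-NULL value in each group. The sketch below is not part of the original solution; the sample rows and the names grouped and grp are made up for illustration. It relies on the fact that COUNT(offer_id) skips NULLs, so a running count only advances on non-NULL rows.

```sql
-- Hypothetical sample data where offer_id goes down (7, then 3),
-- so a running MAX() would wrongly fill the last rows with 7.
WITH orders (offer_id, date) AS (
    VALUES
        (7,    '2021-01-01'::date),
        (NULL, '2021-01-02'::date),
        (3,    '2021-01-03'::date),
        (NULL, '2021-01-04'::date),
        (NULL, '2021-01-05'::date)
),
grouped AS (
    SELECT
        offer_id,
        date,
        -- COUNT(offer_id) ignores NULLs, so this running count increases
        -- only on non-NULL rows; each NULL row shares the count of the
        -- most recent non-NULL row before it.
        COUNT(offer_id) OVER (ORDER BY date) AS grp
    FROM orders
)
SELECT
    offer_id,
    date,
    -- Each group contains at most one non-NULL value, so MAX() just picks it.
    MAX(offer_id) OVER (PARTITION BY grp) AS updated_offer_id
FROM grouped
ORDER BY date;

-- updated_offer_id by row: 7, 7, 3, 3, 3
```

Rows that come before the first non-NULL value fall into group 0, which has no value to carry forward, so they remain NULL.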

Hopefully, this explanation clarifies how to fill sequential NULL values in PostgreSQL using window functions.

